* [PATCH 0/3] TLB flush multiple pages per IPI v6
@ 2015-06-09 17:31 Mel Gorman
2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
` (3 more replies)
0 siblings, 4 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
Changelog since V5
o Split series to first do a full TLB flush and then targeted flushing
Changelog since V4
o Rebase to 4.1-rc6
Changelog since V3
o Drop batching of TLB flush from migration
o Redo how larger batching is managed
o Batch TLB flushes when writable entries exist
When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.
There already is a window between when a page is unmapped and when it is
TLB flushed. This series uses the window so multiple pages can be flushed
using a single IPI. This should be safe or the kernel is hosed already.
Patch 1 simply made the rest of the series easier to write as ftrace
could identify all the senders of TLB flush IPIs.
Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
to flush the entire TLB.
Patch 3 tracks when there potentially are writable TLB entries that
need to be batched differently
Patch 4 notes that a full TLB flush could clear active entries and
incur a penalty in the near future while the TLB is being
refilled. The IPI flushes just the individual PFNs which
incurs a direct cost to avoid an indirect cost.
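The end result across the series is, in outline, the flow below. This is
only an illustrative sketch using the function and flag names added by the
patches; the real reclaim loop, locking and error handling are omitted.

/*
 * Sketch only: reclaim with batched TLB flushing (heavily simplified
 * from mm/vmscan.c and mm/rmap.c as modified by this series).
 */
static void batched_unmap_sketch(struct list_head *page_list)
{
	struct page *page;

	list_for_each_entry(page, page_list, lru) {
		/*
		 * With TTU_BATCH_FLUSH, try_to_unmap() clears the PTE but
		 * only records which CPUs (and, after patch 4, which PFNs)
		 * may still cache a TLB entry instead of sending an IPI
		 * for every page.
		 */
		try_to_unmap(page, TTU_UNMAP | TTU_BATCH_FLUSH);
	}

	/*
	 * One batched flush then covers everything unmapped above. It
	 * must complete before the pages are freed or written out.
	 */
	try_to_unmap_flush();
}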
The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/mm/tlb.c | 1 +
include/linux/mm_types.h | 1 +
include/linux/rmap.h | 3 +
include/linux/sched.h | 31 +++++++++++
include/trace/events/tlb.h | 3 +-
init/Kconfig | 8 +++
kernel/fork.c | 5 ++
kernel/sched/core.c | 3 +
mm/internal.h | 15 +++++
mm/rmap.c | 118 +++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 33 ++++++++++-
13 files changed, 220 insertions(+), 4 deletions(-)
--
2.3.5
* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
@ 2015-06-09 17:31 ` Mel Gorman
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
arch/x86/mm/tlb.c | 1 +
include/linux/mm_types.h | 1 +
include/trace/events/tlb.h | 3 ++-
3 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
info.flush_end = end;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+ trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
if (is_uv_system()) {
unsigned int cpu;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8d37e26a1007..86ad9f902042 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -534,6 +534,7 @@ enum tlb_flush_reason {
TLB_REMOTE_SHOOTDOWN,
TLB_LOCAL_SHOOTDOWN,
TLB_LOCAL_MM_SHOOTDOWN,
+ TLB_REMOTE_SEND_IPI,
NR_TLB_FLUSH_REASONS,
};
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \
EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \
EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \
- EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" )
+ EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \
+ EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" )
/*
* First define the enums in TLB_FLUSH_REASON to be exported to userspace
--
2.3.5
* [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
@ 2015-06-09 17:31 ` Mel Gorman
2015-06-09 20:01 ` Rik van Riel
` (3 more replies)
2015-06-09 17:31 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
2015-06-09 17:31 ` [PATCH 4/4] mm: Send one IPI per CPU to TLB flush pages that were recently unmapped Mel Gorman
3 siblings, 4 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accessed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to
a running process as kswapd and the task are likely running on separate CPUs.
On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that potentially
have TLB entries for pages being unmapped. When the unmapping is complete,
the full TLB is flushed on the assumption that a refill cost is lower than
flushing individual entries.
Architectures wishing to do this must give the following guarantee.
If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.
This is essentially what the kernel already depends on but the window is much
larger with this patch applied and is worth highlighting. The architecture
should consider whether the cost of the full TLB flush is higher than
sending an IPI to flush each individual entry. An additional architecture
helper may be required to flush the local TLB but it is expected this will
be a trivial alias of an internal function in most cases. In this case,
the existing x86 helper was used.
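As a rough illustration, the architecture-side hook amounts to something
like the following. The helper name here is hypothetical; on x86 this
patch simply reuses the existing local_flush_tlb().

/*
 * Hypothetical arch helper: flush the local TLB when the batched-flush
 * IPI is received.  An architecture selecting
 * ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH must also guarantee that a write
 * through a stale, clean TLB entry for an already-unmapped PTE traps.
 */
static inline void arch_batched_unmap_tlb_flush(void)
{
	local_flush_tlb();	/* x86: flush all non-global entries */
}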
The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.
Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6r5
User 568.96 614.68
System 6085.61 4226.61
Elapsed 164.24 122.17
This is showing that the readers completed 25.83% faster with 30% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test and the patched kernel was interrupted roughly 180K times per second.
The impact is lower on a single socket machine.
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
User 59.14 58.99
System 109.15 77.84
Elapsed 27.32 22.31
It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.
The patch will have no impact on workloads with no memory pressure or with
relatively few mapped pages. It will have an unpredictable impact
on the workload running on the CPU being flushed as it'll depend on how
many TLB entries need to be refilled and how long that takes. Worst case,
the TLB will be completely cleared of active entries when the target PFNs
were not resident at all.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/Kconfig | 1 +
include/linux/rmap.h | 3 ++
include/linux/sched.h | 16 ++++++++
init/Kconfig | 10 +++++
kernel/fork.c | 5 +++
kernel/sched/core.c | 3 ++
mm/internal.h | 11 ++++++
mm/rmap.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 26 ++++++++++++-
9 files changed, 176 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..0810703bdc9a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,6 +44,7 @@ config X86
select ARCH_DISCARD_MEMBLOCK
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARCH_WANT_FRAME_POINTERS
+ select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select HAVE_DMA_ATTRS
select HAVE_DMA_CONTIGUOUS
select HAVE_KRETPROBES
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c89c53a113a8..29446aeef36e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+ TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
+ * and caller guarantees they will
+ * do a final flush if necessary */
};
#ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e6122734..d891e01f0445 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,18 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+ /*
+ * Each bit set is a CPU that potentially has a TLB entry for one of
+ * the PFNs being flushed. See set_tlb_ubc_flush_pending().
+ */
+ struct cpumask cpumask;
+
+ /* True if any bit in cpumask is set */
+ bool flush_required;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1648,6 +1660,10 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
struct rcu_head rcu;
/*
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..6e6fa4842250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,6 +904,16 @@ config ARCH_SUPPORTS_NUMA_BALANCING
bool
#
+# For architectures that prefer to flush all TLBs after a number of pages
+# are unmapped instead of sending one IPI per page to flush. The architecture
+# must provide guarantees on what happens if a clean TLB cache entry is
+# written after the unmap. Details are in mm/rmap.c near the check for
+# should_defer_flush. The architecture should also consider if the full flush
+# and the refill costs are offset by the savings of sending fewer IPIs.
+config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ bool
+
+#
# For architectures that know their GCC __int128 support is sound
#
config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..3fb3e776cfcf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ kfree(tsk->tlb_ubc);
+ tsk->tlb_ubc = NULL;
+#endif
+
if (!profile_handoff_task(tsk))
free_task(tsk);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 123673291ffb..d58ebdf4d759 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
}
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..465e621b86b1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#define ALLOC_FAIR 0x100 /* fair zone allocation */
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..4cadb60df74a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
#include <asm/tlbflush.h>
+#include <trace/events/tlb.h>
+
#include "internal.h"
static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,88 @@ vma_address(struct page *page, struct vm_area_struct *vma)
return address;
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+ /*
+ * All TLB entries are flushed on the assumption that it is
+ * cheaper to flush all TLBs and let them be refilled than
+ * flushing individual PFNs. Note that we do not track mm's
+ * to flush as that might simply be multiple full TLB flushes
+ * for no gain.
+ */
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ local_flush_tlb();
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+ int cpu;
+
+ if (!tlb_ubc || !tlb_ubc->flush_required)
+ return;
+
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, -1UL);
+
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+ percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+ if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+ smp_call_function_many(&tlb_ubc->cpumask,
+ percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+ }
+ cpumask_clear(&tlb_ubc->cpumask);
+ tlb_ubc->flush_required = false;
+ put_cpu();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+ cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+ tlb_ubc->flush_required = true;
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ bool should_defer = false;
+
+ if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+ return false;
+
+ /* If remote CPUs need to be flushed then defer batch the flush */
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+ should_defer = true;
+ put_cpu();
+
+ return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ return false;
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* At what user virtual address is page expected in vma?
* Caller should check the page is actually part of the vma.
@@ -1213,7 +1297,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially a remote
+ * CPU could still be writing to the page. If the entry was
+ * previously clean then the architecture must guarantee that
+ * a clear->dirty transition on a cached TLB entry is written
+ * through and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pte);
+
+ /* Potentially writable TLBs must be flushed before IO */
+ if (pte_dirty(pteval))
+ flush_tlb_page(vma, address);
+ else
+ set_tlb_ubc_flush_pending(mm, page);
+ } else {
+ pteval = ptep_clear_flush(vma, address, pte);
+ }
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..f16e07aaef59 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ switch (try_to_unmap(page,
+ ttu_flags|TTU_BATCH_FLUSH)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
}
mem_cgroup_uncharge_list(&free_pages);
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);
list_splice(&ret_pages, page_list);
@@ -2118,6 +2120,26 @@ out:
}
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ * A GFP_KERNEL allocation from this context is normally not advised but
+ * we are depending on PF_MEMALLOC (set by direct reclaim or kswapd) to
+ * limit the depth of the call.
+ */
+static void alloc_tlb_ubc(void)
+{
+ if (!current->tlb_ubc)
+ current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
+ GFP_KERNEL | __GFP_NOWARN);
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2152,6 +2174,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);
+ alloc_tlb_ubc();
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
--
2.3.5
* [PATCH 3/4] mm: Defer flush of writable TLB entries
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
@ 2015-06-09 17:31 ` Mel Gorman
2015-06-09 20:02 ` Rik van Riel
2015-06-10 7:50 ` Ingo Molnar
2015-06-09 17:31 ` [PATCH 4/4] mm: Send one IPI per CPU to TLB flush pages that were recently unmapped Mel Gorman
3 siblings, 2 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. This
patch defers flushing of potentially writable TLBs as long as possible.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 7 +++++++
mm/internal.h | 4 ++++
mm/rmap.c | 28 +++++++++++++++++++++-------
mm/vmscan.c | 7 ++++++-
4 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d891e01f0445..6b787a7f6c38 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1299,6 +1299,13 @@ struct tlbflush_unmap_batch {
/* True if any bit in cpumask is set */
bool flush_required;
+
+ /*
+ * If true then the PTE was dirty when unmapped. The entry must be
+ * flushed before IO is initiated or a stale TLB entry potentially
+ * allows an update without redirtying the page.
+ */
+ bool writable;
};
struct task_struct {
diff --git a/mm/internal.h b/mm/internal.h
index 465e621b86b1..4bbe774497b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -438,10 +438,14 @@ struct tlbflush_unmap_batch;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
#else
static inline void try_to_unmap_flush(void)
{
}
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 4cadb60df74a..1e36b2fb3e95 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -623,16 +623,34 @@ void try_to_unmap_flush(void)
}
cpumask_clear(&tlb_ubc->cpumask);
tlb_ubc->flush_required = false;
+ tlb_ubc->writable = false;
put_cpu();
}
+/* Flush iff there are potentially writable TLB entries that can race with IO */
+void try_to_unmap_flush_dirty(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+ if (tlb_ubc && tlb_ubc->writable)
+ try_to_unmap_flush();
+}
+
static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
- struct page *page)
+ struct page *page, bool writable)
{
struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
tlb_ubc->flush_required = true;
+
+ /*
+ * If the PTE was dirty then it's best to assume it's writable. The
+ * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
+ * before the page any IO is initiated.
+ */
+ if (writable)
+ tlb_ubc->writable = true;
}
/*
@@ -655,7 +673,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
}
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
- struct page *page)
+ struct page *page, bool writable)
{
}
@@ -1307,11 +1325,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pte);
- /* Potentially writable TLBs must be flushed before IO */
- if (pte_dirty(pteval))
- flush_tlb_page(vma, address);
- else
- set_tlb_ubc_flush_pending(mm, page);
+ set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
} else {
pteval = ptep_clear_flush(vma, address, pte);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f16e07aaef59..c329eca98edf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1065,7 +1065,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (!sc->may_writepage)
goto keep_locked;
- /* Page is dirty, try to write it out here */
+ /*
+ * Page is dirty. Flush the TLB if a writable entry
+ * potentially exists to avoid CPU writes after IO
+ * starts and then write it out here
+ */
+ try_to_unmap_flush_dirty();
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
--
2.3.5
* [PATCH 4/4] mm: Send one IPI per CPU to TLB flush pages that were recently unmapped
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
` (2 preceding siblings ...)
2015-06-09 17:31 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-06-09 17:31 ` Mel Gorman
3 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
When unmapping pages, an IPI is sent to flush all TLB entries on CPUs that
potentially have a valid TLB entry. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.
This forces processes running on the affected CPUs to refill their TLB entries.
This is an unpredictable cost as it heavily depends on the workloads,
the timing and the exact CPU used.
This patch uses a structure similar in principle to a pagevec to collect
a list of PFNs and CPUs that require flushing. It then sends one IPI per
CPU that was mapping any of those pages to flush the list of PFNs. A new
TLB flush helper is required for this and one is added for x86. Other
architectures will need to decide if batching like this is both safe and
worth the overhead.
There is a direct cost to tracking the PFNs both in memory and the cost of
the individual PFN flushes. In the absolute worst case, the kernel flushes
individual PFNs and none of the active TLB entries were being used. Hence,
these results reflect the full cost without any of the benefit of preserving
existing entries.
On a 4-socket machine the results were
4.1.0-rc6 4.1.0-rc6
batchdirty-v6 batchunmap-v6
Ops lru-file-mmap-read-elapsed 121.27 ( 0.00%) 118.79 ( 2.05%)
4.1.0-rc6 4.1.0-rc6
batchdirty-v6 batchunmap-v6
User 620.84 608.48
System 4245.35 4152.89
Elapsed 122.65 120.15
In this case the workload completed faster and there was less CPU overhead
but as it's a NUMA machine there are a lot of factors at play. It's easier
to quantify on a single socket machine;
4.1.0-rc6 4.1.0-rc6
batchdirty-v6 batchunmap-v6
Ops lru-file-mmap-read-elapsed 20.35 ( 0.00%) 21.52 ( -5.75%)
4.1.0-rc6 4.1.0-rc6
batchdirty-v6r5 batchunmap-v6r5
User 58.02 60.70
System 77.57 81.92
Elapsed 22.14 23.16
That shows the workload takes 5.75% longer to complete with a similar
increase in the system CPU usage.
It is expected that there is overhead to tracking the PFNs and flushing
individual pages. This can be quantified but we cannot quantify the
indirect savings due to active unrelated TLB entries being preserved.
Whether this matters depends on whether the workload was using those
entries and if they would be used before a context switch but targeting
the TLB flushes is the conservative and safer choice.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/include/asm/tlbflush.h | 2 ++
include/linux/sched.h | 12 ++++++++++--
init/Kconfig | 10 ++++------
mm/rmap.c | 25 +++++++++++++------------
4 files changed, 29 insertions(+), 20 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
* and page-granular flushes are available only on i486 and up.
*/
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
#ifndef CONFIG_SMP
/* "_up" is for UniProcessor.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b787a7f6c38..4dbffe0a1868 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,9 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
/* Track pages that require TLB flushes */
struct tlbflush_unmap_batch {
/*
@@ -1297,8 +1300,13 @@ struct tlbflush_unmap_batch {
*/
struct cpumask cpumask;
- /* True if any bit in cpumask is set */
- bool flush_required;
+ /*
+ * The number and list of pfns to be flushed. PFNs are tracked instead
+ * of struct pages to avoid multiple page->pfn lookups by each CPU that
+ * receives an IPI in percpu_flush_tlb_batch_pages.
+ */
+ unsigned int nr_pages;
+ unsigned long pfns[BATCH_TLBFLUSH_SIZE];
/*
* If true then the PTE was dirty when unmapped. The entry must be
diff --git a/init/Kconfig b/init/Kconfig
index 6e6fa4842250..095b3d470c3f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,12 +904,10 @@ config ARCH_SUPPORTS_NUMA_BALANCING
bool
#
-# For architectures that prefer to flush all TLBs after a number of pages
-# are unmapped instead of sending one IPI per page to flush. The architecture
-# must provide guarantees on what happens if a clean TLB cache entry is
-# written after the unmap. Details are in mm/rmap.c near the check for
-# should_defer_flush. The architecture should also consider if the full flush
-# and the refill costs are offset by the savings of sending fewer IPIs.
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
bool
diff --git a/mm/rmap.c b/mm/rmap.c
index 1e36b2fb3e95..0085b0eb720c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,15 +586,12 @@ vma_address(struct page *page, struct vm_area_struct *vma)
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
static void percpu_flush_tlb_batch_pages(void *data)
{
- /*
- * All TLB entries are flushed on the assumption that it is
- * cheaper to flush all TLBs and let them be refilled than
- * flushing individual PFNs. Note that we do not track mm's
- * to flush as that might simply be multiple full TLB flushes
- * for no gain.
- */
+ struct tlbflush_unmap_batch *tlb_ubc = data;
+ unsigned int i;
+
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
- local_flush_tlb();
+ for (i = 0; i < tlb_ubc->nr_pages; i++)
+ flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
}
/*
@@ -608,10 +605,10 @@ void try_to_unmap_flush(void)
struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
int cpu;
- if (!tlb_ubc || !tlb_ubc->flush_required)
+ if (!tlb_ubc || !tlb_ubc->nr_pages)
return;
- trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, -1UL);
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
cpu = get_cpu();
if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
@@ -622,7 +619,7 @@ void try_to_unmap_flush(void)
percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
}
cpumask_clear(&tlb_ubc->cpumask);
- tlb_ubc->flush_required = false;
+ tlb_ubc->nr_pages = 0;
tlb_ubc->writable = false;
put_cpu();
}
@@ -642,7 +639,8 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
- tlb_ubc->flush_required = true;
+ tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+ tlb_ubc->nr_pages++;
/*
* If the PTE was dirty then it's best to assume it's writable. The
@@ -651,6 +649,9 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
*/
if (writable)
tlb_ubc->writable = true;
+
+ if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+ try_to_unmap_flush();
}
/*
--
2.3.5
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
@ 2015-06-09 20:01 ` Rik van Riel
2015-06-10 7:47 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2015-06-09 20:01 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, H Peter Anvin,
Ingo Molnar, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
Linux-MM, LKML
On 06/09/2015 01:31 PM, Mel Gorman wrote:
> An IPI is sent to flush remote TLBs when a page is unmapped that was
> potentially accessed by other CPUs. There are many circumstances where
> this happens but the obvious one is kswapd reclaiming pages belonging to
> a running process as kswapd and the task are likely running on separate CPUs.
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
>
> The patch will have no impact on workloads with no memory pressure or
> have relatively few mapped pages. It will have an unpredictable impact
> on the workload running on the CPU being flushed as it'll depend on how
> many TLB entries need to be refilled and how long that takes. Worst case,
> the TLB will be completely cleared of active entries when the target PFNs
> were not resident at all.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 3/4] mm: Defer flush of writable TLB entries
2015-06-09 17:31 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-06-09 20:02 ` Rik van Riel
2015-06-10 7:50 ` Ingo Molnar
1 sibling, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2015-06-09 20:02 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, H Peter Anvin,
Ingo Molnar, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
Linux-MM, LKML
On 06/09/2015 01:31 PM, Mel Gorman wrote:
> If a PTE is unmapped and it's dirty then it was writable recently. Due
> to deferred TLB flushing, it's best to assume a writable TLB cache entry
> exists. With that assumption, the TLB must be flushed before any IO can
> start or the page is freed to avoid lost writes or data corruption. This
> patch defers flushing of potentially writable TLBs as long as possible.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
2015-06-09 20:01 ` Rik van Riel
@ 2015-06-10 7:47 ` Ingo Molnar
2015-06-10 8:14 ` Mel Gorman
2015-06-10 8:26 ` Ingo Molnar
2015-06-10 8:33 ` Ingo Molnar
3 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-10 7:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> perf_nr_task_contexts,
> };
>
> +/* Track pages that require TLB flushes */
> +struct tlbflush_unmap_batch {
> + /*
> + * Each bit set is a CPU that potentially has a TLB entry for one of
> + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> + */
> + struct cpumask cpumask;
> +
> + /* True if any bit in cpumask is set */
> + bool flush_required;
> +};
> +
> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1648,6 +1660,10 @@ struct task_struct {
> unsigned long numa_pages_migrated;
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> + struct tlbflush_unmap_batch *tlb_ubc;
> +#endif
Please embed this constant size structure in task_struct directly so that the
whole per task allocation overhead goes away:
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +/*
> + * Allocate the control structure for batch TLB flushing. An allocation
> + * failure is harmless as the reclaimer will send IPIs where necessary.
> + * A GFP_KERNEL allocation from this context is normally not advised but
> + * we are depending on PF_MEMALLOC (set by direct reclaim or kswapd) to
> + * limit the depth of the call.
> + */
> +static void alloc_tlb_ubc(void)
> +{
> + if (!current->tlb_ubc)
> + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> + GFP_KERNEL | __GFP_NOWARN);
> +}
> +#else
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -2152,6 +2174,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> sc->priority == DEF_PRIORITY);
>
> + alloc_tlb_ubc();
> +
> blk_start_plug(&plug);
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
the whole patch series will become even simpler.
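For illustration, the suggested embedding would look something like this
(a sketch, not a tested patch):

 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	struct tlbflush_unmap_batch *tlb_ubc;
+	struct tlbflush_unmap_batch tlb_ubc;	/* embedded, no kmalloc */
 #endif

alloc_tlb_ubc() and the kfree() in __put_task_struct() would then go away,
with callers testing tlb_ubc.flush_required instead of a NULL pointer.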
Thanks,
Ingo
* Re: [PATCH 3/4] mm: Defer flush of writable TLB entries
2015-06-09 17:31 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
2015-06-09 20:02 ` Rik van Riel
@ 2015-06-10 7:50 ` Ingo Molnar
2015-06-10 8:17 ` Mel Gorman
1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-10 7:50 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> +
> + /*
> + * If the PTE was dirty then it's best to assume it's writable. The
> + * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
> + * before the page any IO is initiated.
> + */
Speling nit: "before the page any IO is initiated" does not parse for me.
> + /*
> + * Page is dirty. Flush the TLB if a writable entry
> + * potentially exists to avoid CPU writes after IO
> + * starts and then write it out here
> + */
s/here/here.
or:
s/here/here:
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 7:47 ` Ingo Molnar
@ 2015-06-10 8:14 ` Mel Gorman
2015-06-10 8:21 ` Ingo Molnar
0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2015-06-10 8:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > perf_nr_task_contexts,
> > };
> >
> > +/* Track pages that require TLB flushes */
> > +struct tlbflush_unmap_batch {
> > + /*
> > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > + */
> > + struct cpumask cpumask;
> > +
> > + /* True if any bit in cpumask is set */
> > + bool flush_required;
> > +};
> > +
> > struct task_struct {
> > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > void *stack;
> > @@ -1648,6 +1660,10 @@ struct task_struct {
> > unsigned long numa_pages_migrated;
> > #endif /* CONFIG_NUMA_BALANCING */
> >
> > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > + struct tlbflush_unmap_batch *tlb_ubc;
> > +#endif
>
> Please embed this constant size structure in task_struct directly so that the
> whole per task allocation overhead goes away:
>
That puts a structure (72 bytes in the config I used) within the task struct
even when it's not required. On a lightly loaded system direct reclaim
will not be active and for some processes, it'll never be active. It's
very wasteful.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/4] mm: Defer flush of writable TLB entries
2015-06-10 7:50 ` Ingo Molnar
@ 2015-06-10 8:17 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-10 8:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 09:50:34AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > +
> > + /*
> > + * If the PTE was dirty then it's best to assume it's writable. The
> > + * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
> > + * before the page any IO is initiated.
> > + */
>
> Speling nit: "before the page any IO is initiated" does not parse for me.
>
> > + /*
> > + * Page is dirty. Flush the TLB if a writable entry
> > + * potentially exists to avoid CPU writes after IO
> > + * starts and then write it out here
> > + */
>
> s/here/here.
>
> or:
>
> s/here/here:
>
Both fixed, thanks.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:14 ` Mel Gorman
@ 2015-06-10 8:21 ` Ingo Molnar
2015-06-10 8:51 ` Mel Gorman
0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
> >
> > * Mel Gorman <mgorman@suse.de> wrote:
> >
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > > perf_nr_task_contexts,
> > > };
> > >
> > > +/* Track pages that require TLB flushes */
> > > +struct tlbflush_unmap_batch {
> > > + /*
> > > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > > + */
> > > + struct cpumask cpumask;
> > > +
> > > + /* True if any bit in cpumask is set */
> > > + bool flush_required;
> > > +};
> > > +
> > > struct task_struct {
> > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > > void *stack;
> > > @@ -1648,6 +1660,10 @@ struct task_struct {
> > > unsigned long numa_pages_migrated;
> > > #endif /* CONFIG_NUMA_BALANCING */
> > >
> > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > + struct tlbflush_unmap_batch *tlb_ubc;
> > > +#endif
> >
> > Please embed this constant size structure in task_struct directly so that the
> > whole per task allocation overhead goes away:
> >
>
> That puts a structure (72 bytes in the config I used) within the task struct
> even when it's not required. On a lightly loaded system direct reclaim will not
> be active and for some processes, it'll never be active. It's very wasteful.
For certain values of 'very'.
- 72 bytes suggests that you have NR_CPUS set to 512 or so? On a kernel sized to
such large systems with 1000 active tasks we are talking about +72K of
RAM...
- Furthermore, by embedding it, it gets packed better with neighboring task_struct
fields, while by allocating it dynamically it's a separate cache line wasted.
- Plus by allocating it separately you spend two cachelines on it: each slab will
be at least cacheline aligned, and 72 bytes will allocate 128 bytes. So when
this gets triggered you've just wasted some more RAM.
- I mean, if it had dynamic size, or was arguably huge. But this is just a
cpumask and a boolean!
- The cpumask will be dynamic if you increase the NR_CPUS count any more than
that - in which case embedding the structure is the right choice again.
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
2015-06-09 20:01 ` Rik van Riel
2015-06-10 7:47 ` Ingo Molnar
@ 2015-06-10 8:26 ` Ingo Molnar
2015-06-10 9:58 ` Mel Gorman
2015-06-10 8:33 ` Ingo Molnar
3 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> On a 4-socket machine the results were
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> Ops lru-file-mmap-read-elapsed 121.27 ( 0.00%) 118.79 ( 2.05%)
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> User 620.84 608.48
> System 4245.35 4152.89
> Elapsed 122.65 120.15
>
> In this case the workload completed faster and there was less CPU overhead
> but as it's a NUMA machine there are a lot of factors at play. It's easier
> to quantify on a single socket machine;
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> Ops lru-file-mmap-read-elapsed 20.35 ( 0.00%) 21.52 ( -5.75%)
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6r5 batchunmap-v6r5
> User 58.02 60.70
> System 77.57 81.92
> Elapsed 22.14 23.16
>
> That shows the workload takes 5.75% longer to complete with a similar
> increase in the system CPU usage.
Btw., do you have any stddev noise numbers?
The batching speedup is brutal enough to not need any noise estimations, it's a
clear winner.
But this PFN tracking patch is more difficult to judge as the numbers are pretty
close to each other.
> It is expected that there is overhead to tracking the PFNs and flushing
> individual pages. This can be quantified but we cannot quantify the indirect
> savings due to active unrelated TLB entries being preserved. Whether this
> matters depends on whether the workload was using those entries and if they
> would be used before a context switch but targeting the TLB flushes is the
> conservative and safer choice.
So this is how I picture a realistic TLB flushing 'worst case': a workload that
uses about 80% of the TLB cache in a 'fast' function and trashes memory in a
'slow' function, and does alternate calls to the two functions from the same task.
Typical dTLB sizes on x86 are a couple of hundred entries (you can see the precise
count in x86info -c), up to 1024 entries on the latest uarchs.
A cached TLB miss will take about 10-20 cycles (progressively more if the lookup
chain misses in the cache) - but that cost is partially hidden if the L1 data
cache was missed (which is likely for most TLB-flush intense workloads), and will
be almost completely hidden if it goes out to the L3 cache or goes to RAM. (It
takes up cache/memory bandwidth though, but unless the access patters are totally
sparse, it should be a small fraction.)
A single INVLPG with its 200+ cycles cost is equivalent to about 10-20 TLB misses.
That's a lot.
So this kind of workload should trigger the TLB flushing 'worst case': with say
512 dTLB entries you could see up to 5k-10k cycles of hidden/indirect cost, but
potentially parallelized with other misses going on with the same data accesses.
The current limit for INVLPG flushing is 33 entries: that's 10k-20k cycles max
with an INVLPG cost of 250 cycles - this could explain the results you got.
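A back-of-the-envelope version of that arithmetic, using the rough costs
assumed above (estimates only, not measurements):

#include <stdio.h>

int main(void)
{
	const int tlb_miss_lo = 10, tlb_miss_hi = 20;	/* cached TLB miss, cycles */
	const int invlpg_cost = 200;			/* one INVLPG, cycles      */
	const int dtlb_entries = 512;			/* mid-range dTLB size     */

	printf("one INVLPG ~= %d-%d TLB misses\n",
	       invlpg_cost / tlb_miss_hi, invlpg_cost / tlb_miss_lo);
	printf("refilling a fully flushed dTLB ~= %d-%d cycles\n",
	       dtlb_entries * tlb_miss_lo, dtlb_entries * tlb_miss_hi);
	return 0;
}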
But the problem is: AFAICS you can only decrease the INVLPG count by decreasing
the batching size - the additional IPI costs will overwhelm any TLB preservation
benefits. So depending on the cost relationship between INVLPG, TLB miss cost and
IPI cost, it might not be possible to see a speedup even in the worst-case.
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
` (2 preceding siblings ...)
2015-06-10 8:26 ` Ingo Molnar
@ 2015-06-10 8:33 ` Ingo Molnar
2015-06-10 8:59 ` Mel Gorman
3 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6r5
> User 568.96 614.68
> System 6085.61 4226.61
> Elapsed 164.24 122.17
>
> This is showing that the readers completed 25.83% faster with 30% less
> system CPU time. From vmstats, it is known that the vanilla kernel was
> interrupted roughly 900K times per second during the steady phase of the
> test and the patched kernel was interrupted roughly 180K times per second.
>
> The impact is lower on a single socket machine.
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> User 59.14 58.99
> System 109.15 77.84
> Elapsed 27.32 22.31
>
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
Btw., I tried to compare your previous (v5) pfn-tracking numbers with these
full-flushing numbers, and found that the IRQ rate appears to be the same:
> > From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > times per second during the steady phase of the test and the patched kernel
> > was interrupted roughly 180K times per second.
> > It's still a noticeable improvement with vmstat showing interrupts went from
> > roughly 500K per second to 45K per second.
... is that because the batching limit in the pfn-tracking case was high enough to
not be noticeable in the vmstat?
In the full-flushing case (v6 without patch 4) the batching limit is 'infinite',
we'll batch as long as possible, right?
Or have I managed to get confused somewhere ...
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:21 ` Ingo Molnar
@ 2015-06-10 8:51 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-10 8:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:21:07AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
> > >
> > > * Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > > > perf_nr_task_contexts,
> > > > };
> > > >
> > > > +/* Track pages that require TLB flushes */
> > > > +struct tlbflush_unmap_batch {
> > > > + /*
> > > > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > > > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > > > + */
> > > > + struct cpumask cpumask;
> > > > +
> > > > + /* True if any bit in cpumask is set */
> > > > + bool flush_required;
> > > > +};
> > > > +
> > > > struct task_struct {
> > > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > > > void *stack;
> > > > @@ -1648,6 +1660,10 @@ struct task_struct {
> > > > unsigned long numa_pages_migrated;
> > > > #endif /* CONFIG_NUMA_BALANCING */
> > > >
> > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > + struct tlbflush_unmap_batch *tlb_ubc;
> > > > +#endif
> > >
> > > Please embed this constant size structure in task_struct directly so that the
> > > whole per task allocation overhead goes away:
> > >
> >
> > That puts a structure (72 bytes in the config I used) within the task struct
> > even when it's not required. On a lightly loaded system direct reclaim will not
> > be active and for some processes, it'll never be active. It's very wasteful.
>
> For certain values of 'very'.
>
> - 72 bytes suggests that you have NR_CPUS set to 512 or so? On a kernel sized to
> such large systems with 1000 active tasks we are talking about about +72K of
> RAM...
>
The NR_CPUS is based on the openSUSE 13.1 distro config so yes, it's large but I also
expect it to be a common configuration.
> - Furthermore, by embedding it it gets packed better with neighboring task_struct
> fields, while by allocating it dynamically it's a separate cache line wasted.
>
A separate cache line that is only used during direct reclaim when the
process is taking a large hit anyway
> - Plus by allocating it separately you spend two cachelines on it: each slab will
> be at least cacheline aligned, and 72 bytes will allocate 128 bytes. So when
> this gets triggered you've just wasted some more RAM.
>
> - I mean, if it had dynamic size, or was arguably huge. But this is just a
> cpumask and a boolean!
>
It gets larger with enterprise configs.
> - The cpumask will be dynamic if you increase the NR_CPUS count any more than
> that - in which case embedding the structure is the right choice again.
>
Enterprise configurations are larger. The most recent one I checked defined
NR_CPUS as 8192. If it's embedded in the structure, it means that we need
to call cpumask_clear on every fork even if it's never used. That adds
constant overhead to a fast path to avoid an allocation and a few cache
misses in a direct reclaim path. Are you certain you want that trade-off?
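As a hypothetical sketch of what the embedded variant would add to the
fork path (with NR_CPUS=8192 the cpumask alone is 1KB per task):

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	/*
	 * Hypothetical __sched_fork() cost if tlb_ubc were embedded:
	 * every fork clears a NR_CPUS-sized mask whether or not the
	 * task ever enters direct reclaim.
	 */
	cpumask_clear(&p->tlb_ubc.cpumask);
	p->tlb_ubc.flush_required = false;
#endif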
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:33 ` Ingo Molnar
@ 2015-06-10 8:59 ` Mel Gorman
2015-06-11 15:02 ` Ingo Molnar
0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2015-06-10 8:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:33:32AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6r5
> > User 568.96 614.68
> > System 6085.61 4226.61
> > Elapsed 164.24 122.17
> >
> > This is showing that the readers completed 25.83% faster with 30% less
> > system CPU time. From vmstats, it is known that the vanilla kernel was
> > interrupted roughly 900K times per second during the steady phase of the
> > test and the patched kernel was interrupted roughly 180K times per second.
> >
> > The impact is lower on a single socket machine.
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > User 59.14 58.99
> > System 109.15 77.84
> > Elapsed 27.32 22.31
> >
> > It's still a noticeable improvement with vmstat showing interrupts went
> > from roughly 500K per second to 45K per second.
>
> Btw., I tried to compare your previous (v5) pfn-tracking numbers with these
> full-flushing numbers, and found that the IRQ rate appears to be the same:
>
That's expected because the number of IPIs sent is the same. What
changes is the tracking of the PFNs and then the work within the IPI
itself.
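For anyone skimming, a rough sketch of what the receiving CPU ends up doing
in the two cases. The primitive names are the usual x86 ones; the array and
count are placeholders, not code from the series:

	/* Full flush (patch 2): trivial bookkeeping, over-broad flush */
	local_flush_tlb();			/* drops all non-global entries */

	/* PFN tracking (patch 4): same number of IPIs, more work per IPI */
	for (i = 0; i < nr_entries; i++)
		__flush_tlb_single(addr[i]);	/* INVLPG one entry at a time */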
> > > From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > > times per second during the steady phase of the test and the patched kernel
> > > was interrupted roughly 180K times per second.
>
> > > It's still a noticeable improvement with vmstat showing interrupts went from
> > > roughly 500K per second to 45K per second.
>
> ... is that because the batching limit in the pfn-tracking case was high enough to
> not be noticeable in the vmstat?
>
It's just that the single-socket machine has fewer cores and less activity
overall.
> In the full-flushing case (v6 without patch 4) the batching limit is 'infinite',
> we'll batch as long as possible, right?
>
No because we must flush before pages are freed so the maximum batching
is related to SWAP_CLUSTER_MAX. If we free a page before the flush then
in theory the page can be reallocated and a stale TLB entry can allow
access to unrelated data. It would be almost impossible to trigger
corruption this way but it's a concern.
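Put differently, the ordering constraint on the reclaim side looks roughly
like this. It is a sketch of the constraint, not the actual shrink_page_list()
code; try_to_unmap_flush() is the helper this series adds:

	/* the batch is bounded: at most SWAP_CLUSTER_MAX pages per pass */
	list_for_each_entry_safe(page, next, &page_list, lru) {
		try_to_unmap(page, ttu_flags);	/* PTE cleared, flush deferred */
		list_move(&page->lru, &free_pages);
	}
	try_to_unmap_flush();				/* IPIs go out here ...        */
	free_hot_cold_page_list(&free_pages, true);	/* ... before any page is freed */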
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:26 ` Ingo Molnar
@ 2015-06-10 9:58 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-10 9:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:26:40AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > On a 4-socket machine the results were
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > Ops lru-file-mmap-read-elapsed 121.27 ( 0.00%) 118.79 ( 2.05%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > User 620.84 608.48
> > System 4245.35 4152.89
> > Elapsed 122.65 120.15
> >
> > In this case the workload completed faster and there was less CPU overhead
> > but as it's a NUMA machine there are a lot of factors at play. It's easier
> > to quantify on a single socket machine;
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > Ops lru-file-mmap-read-elapsed 20.35 ( 0.00%) 21.52 ( -5.75%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6r5 batchunmap-v6r5
> > User 58.02 60.70
> > System 77.57 81.92
> > Elapsed 22.14 23.16
> >
> > That shows the workload takes 5.75% longer to complete with a similar
> > increase in the system CPU usage.
>
> Btw., do you have any stddev noise numbers?
>
4.1.0-rc6 4.1.0-rc6 4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6r5 batchdirty-v6r5 batchunmap-v6r5
Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%) 20.35 ( 19.98%) 21.52 ( 15.38%)
Ops lru-file-mmap-read-time_stddv 0.32 ( 0.00%) 0.32 ( -1.30%) 0.39 (-23.00%) 0.45 (-40.91%)
flushfull -- patch 2
batchdirty -- patch 3
batchunmap -- patch 4
So the impact of tracking the PFNs is outside the noise and there is a
definite direct cost to it. That was expected for both the PFN tracking and
the individual flushes.
> The batching speedup is brutal enough to not need any noise estimations, it's a
> clear winner.
>
Agreed.
> But this PFN tracking patch is more difficult to judge as the numbers are pretty
> close to each other.
>
It's definitely measurable; there is no doubt about it and there never was.
The concern was always the refill cost of unnecessarily flushing TLB entries
that are still active. According to https://lkml.org/lkml/2014/7/31/825 that
cost can be high: it states that refilling a 512-entry DTLB takes about
22,000 cycles, which is more than the individual flushes cost. However, that
is an estimate and it will always be a case of "it depends". It's been
asserted that the refill costs are really low, so let's just go with that,
drop patch 4 and wait and see who complains.
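Spelling that estimate out, where the per-entry figure is simple division on
the numbers quoted in that thread rather than a new measurement:

	512 dTLB entries refilled in ~22,000 cycles
	=> 22,000 / 512 ≈ 43 cycles per re-walked entry

Whether that beats the direct cost of flushing each PFN individually depends
on how many of the flushed entries were still hot, which is why it will
always be "it depends".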
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:59 ` Mel Gorman
@ 2015-06-11 15:02 ` Ingo Molnar
2015-06-11 15:25 ` Mel Gorman
0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-06-11 15:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> > In the full-flushing case (v6 without patch 4) the batching limit is
> > 'infinite', we'll batch as long as possible, right?
>
> No because we must flush before pages are freed so the maximum batching is
> related to SWAP_CLUSTER_MAX. If we free a page before the flush then in theory
> the page can be reallocated and a stale TLB entry can allow access to unrelated
> data. It would be almost impossible to trigger corruption this way but it's a
> concern.
Well, could we say double SWAP_CLUSTER_MAX to further reduce the IPI rate?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-11 15:02 ` Ingo Molnar
@ 2015-06-11 15:25 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2015-06-11 15:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Thu, Jun 11, 2015 at 05:02:51PM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > > In the full-flushing case (v6 without patch 4) the batching limit is
> > > 'infinite', we'll batch as long as possible, right?
> >
> > No because we must flush before pages are freed so the maximum batching is
> > related to SWAP_CLUSTER_MAX. If we free a page before the flush then in theory
> > the page can be reallocated and a stale TLB entry can allow access to unrelated
> > data. It would be almost impossible to trigger corruption this way but it's a
> > concern.
>
> Well, could we say double SWAP_CLUSTER_MAX to further reduce the IPI rate?
>
We could, but it's a surprisingly subtle change. The impacts I can think
of are:
1. LRU lock hold times increase slightly because more pages are being
   isolated.
2. There are slight timing changes due to more pages having to be
   processed before they are freed. There is a slight risk that more
   pages than necessary get reclaimed, but I doubt it'll be measurable.
3. There is a risk that the too_many_isolated checks will be easier to
   trigger, resulting in a HZ/10 stall.
4. The rotation rate of active->inactive is slightly faster, but there
   should be fewer rotations before the lists get balanced, so it
   shouldn't matter.
5. More pages are reclaimed in a single pass if zone_reclaim_mode is
   active, but that thing sucks hard when it's enabled no matter what.
6. More pages are isolated for compaction, so page hold times there
   are longer while they are being copied.
There might be others. To be honest, I'm struggling to think of any serious
problems such a change would cause. The biggest risk is issue 3, but I expect
that hitting it requires the system to be getting badly hammered already.
The main downside is that it affects all page reclaim activity, not just the
mapped pages that trigger the IPIs. I'll add a patch to the series that
alters SWAP_CLUSTER_MAX with the intent of further reducing IPIs, see what
falls out, and see whether any other VM person complains.
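For reference, SWAP_CLUSTER_MAX is a compile-time constant, so the crude
version of the experiment is a one-liner. This is a sketch of the experiment
only, not the patch I intend to send:

	/* include/linux/swap.h */
	-#define SWAP_CLUSTER_MAX 32UL
	+#define SWAP_CLUSTER_MAX 64UL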
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2015-06-11 15:25 UTC | newest]
Thread overview: 19+ messages (links below jump to the message on this page)
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
2015-06-09 20:01 ` Rik van Riel
2015-06-10 7:47 ` Ingo Molnar
2015-06-10 8:14 ` Mel Gorman
2015-06-10 8:21 ` Ingo Molnar
2015-06-10 8:51 ` Mel Gorman
2015-06-10 8:26 ` Ingo Molnar
2015-06-10 9:58 ` Mel Gorman
2015-06-10 8:33 ` Ingo Molnar
2015-06-10 8:59 ` Mel Gorman
2015-06-11 15:02 ` Ingo Molnar
2015-06-11 15:25 ` Mel Gorman
2015-06-09 17:31 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
2015-06-09 20:02 ` Rik van Riel
2015-06-10 7:50 ` Ingo Molnar
2015-06-10 8:17 ` Mel Gorman
2015-06-09 17:31 ` [PATCH 4/4] mm: Send one IPI per CPU to TLB flush pages that were recently unmapped Mel Gorman