* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
@ 2015-07-06 13:39 ` Mel Gorman
2015-07-06 13:39 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
` (4 subsequent siblings)
5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML, Mel Gorman
It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
arch/x86/mm/tlb.c | 1 +
include/linux/mm_types.h | 1 +
include/trace/events/tlb.h | 3 ++-
3 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
info.flush_end = end;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+ trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
if (is_uv_system()) {
unsigned int cpu;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..84ef58543e2b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -552,6 +552,7 @@ enum tlb_flush_reason {
TLB_REMOTE_SHOOTDOWN,
TLB_LOCAL_SHOOTDOWN,
TLB_LOCAL_MM_SHOOTDOWN,
+ TLB_REMOTE_SEND_IPI,
NR_TLB_FLUSH_REASONS,
};
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \
EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \
EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \
- EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" )
+ EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \
+ EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" )
/*
* First define the enums in TLB_FLUSH_REASON to be exported to userspace
--
2.3.5
* [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
2015-07-06 13:39 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
@ 2015-07-06 13:39 ` Mel Gorman
2015-07-06 13:39 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
` (3 subsequent siblings)
5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML, Mel Gorman
An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accessed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to
a running process as kswapd and the task are likely running on separate CPUs.
On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that potentially
have TLB entries for pages being unmapped. When the unmapping is complete,
the full TLB is flushed on the assumption that a refill cost is lower than
flushing individual entries.
Architectures wishing to do this must give the following guarantee.
If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.
This is essentially what the kernel already depends on but the window
is much larger with this patch applied and is worth highlighting. The
architecture should consider whether the cost of the full TLB flush is
higher than sending an IPI to flush each individual entry. An additional
architecture helper called flush_tlb_local is required. It's a trivial
wrapper with some accounting in the x86 case.
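As an illustration only, an architecture that needs no extra accounting and
whose local flush primitive is (say) local_flush_tlb_all() could satisfy this
with something as simple as:

	static inline void flush_tlb_local(void)
	{
		local_flush_tlb_all();	/* flush this CPU's entire TLB */
	}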
The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.
Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%)
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 581.00 611.43
System 5804.93 4111.76
Elapsed 161.03 122.12
This is showing that the readers completed 24.40% faster with 29% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test while the patched kernel was interrupted roughly 180K times per second.
The impact is lower on a single socket machine.
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%)
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 58.09 57.64
System 111.82 76.56
Elapsed 27.29 22.55
It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.
The patch will have no impact on workloads with no memory pressure or
that have relatively few mapped pages. It will have an unpredictable impact
on the workload running on the CPU being flushed as it'll depend on how
many TLB entries need to be refilled and how long that takes. Worst case,
the TLB will be completely cleared of active entries when the target PFNs
were not resident at all.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlbflush.h | 6 +++
include/linux/rmap.h | 3 ++
include/linux/sched.h | 16 +++++++
init/Kconfig | 10 ++++
mm/internal.h | 11 +++++
mm/rmap.c | 103 +++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 23 ++++++++-
8 files changed, 171 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced17dc95..d646d6f26a16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,6 +41,7 @@ config X86
select ARCH_USE_CMPXCHG_LOCKREF if X86_64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_IPC_PARSE_VERSION if X86_32
select ARCH_WANT_OPTIONAL_GPIOLIB
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..6df2029405a3 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -261,6 +261,12 @@ static inline void reset_lazy_tlbstate(void)
#endif /* SMP */
+/* Not inlined due to inc_irq_stat not being defined yet */
+#define flush_tlb_local() { \
+ inc_irq_stat(irq_tlb_count); \
+ local_flush_tlb(); \
+}
+
#ifndef CONFIG_PARAVIRT
#define flush_tlb_others(mask, mm, start, end) \
native_flush_tlb_others(mask, mm, start, end)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c89c53a113a8..29446aeef36e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+ TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
+ * and caller guarantees they will
+ * do a final flush if necessary */
};
#ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..1a83fb44ab34 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1341,6 +1341,18 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+ /*
+ * Each bit set is a CPU that potentially has a TLB entry for one of
+ * the PFNs being flushed. See set_tlb_ubc_flush_pending().
+ */
+ struct cpumask cpumask;
+
+ /* True if any bit in cpumask is set */
+ bool flush_required;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1701,6 +1713,10 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ struct tlbflush_unmap_batch tlb_ubc;
+#endif
+
struct rcu_head rcu;
/*
diff --git a/init/Kconfig b/init/Kconfig
index af09b4fb43d2..0aa5be1e617d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -891,6 +891,16 @@ config ARCH_SUPPORTS_NUMA_BALANCING
bool
#
+# For architectures that prefer to flush all TLBs after a number of pages
+# are unmapped instead of sending one IPI per page to flush. The architecture
+# must provide guarantees on what happens if a clean TLB cache entry is
+# written after the unmap. Details are in mm/rmap.c near the check for
+# should_defer_flush. The architecture should also consider if the full flush
+# and the refill costs are offset by the savings of sending fewer IPIs.
+config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ bool
+
+#
# For architectures that know their GCC __int128 support is sound
#
config ARCH_SUPPORTS_INT128
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..bd6372ac5f7f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -426,4 +426,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#define ALLOC_FAIR 0x100 /* fair zone allocation */
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b68768df1..d54f47666af5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -62,6 +62,8 @@
#include <asm/tlbflush.h>
+#include <trace/events/tlb.h>
+
#include "internal.h"
static struct kmem_cache *anon_vma_cachep;
@@ -583,6 +585,88 @@ vma_address(struct page *page, struct vm_area_struct *vma)
return address;
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+ /*
+ * All TLB entries are flushed on the assumption that it is
+ * cheaper to flush all TLBs and let them be refilled than
+ * flushing individual PFNs. Note that we do not track mm's
+ * to flush as that might simply be multiple full TLB flushes
+ * for no gain.
+ */
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ flush_tlb_local();
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ int cpu;
+
+ if (!tlb_ubc->flush_required)
+ return;
+
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, -1UL);
+
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+ percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+ if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+ smp_call_function_many(&tlb_ubc->cpumask,
+ percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+ }
+ cpumask_clear(&tlb_ubc->cpumask);
+ tlb_ubc->flush_required = false;
+ put_cpu();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+
+ cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+ tlb_ubc->flush_required = true;
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ bool should_defer = false;
+
+ if (!(flags & TTU_BATCH_FLUSH))
+ return false;
+
+ /* If remote CPUs need to be flushed then defer batch the flush */
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+ should_defer = true;
+ put_cpu();
+
+ return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ return false;
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* At what user virtual address is page expected in vma?
* Caller should check the page is actually part of the vma.
@@ -1220,7 +1304,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially a remote
+ * CPU could still be writing to the page. If the entry was
+ * previously clean then the architecture must guarantee that
+ * a clear->dirty transition on a cached TLB entry is written
+ * through and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pte);
+
+ /* Potentially writable TLBs must be flushed before IO */
+ if (pte_dirty(pteval))
+ flush_tlb_page(vma, address);
+ else
+ set_tlb_ubc_flush_pending(mm, page);
+ } else {
+ pteval = ptep_clear_flush(vma, address, pte);
+ }
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e61445dce04e..e4f1df1052a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1061,7 +1061,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ switch (try_to_unmap(page,
+ ttu_flags|TTU_BATCH_FLUSH)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1212,6 +1213,7 @@ keep:
}
mem_cgroup_uncharge_list(&free_pages);
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);
list_splice(&ret_pages, page_list);
@@ -2155,6 +2157,23 @@ out:
}
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static void init_tlb_ubc(void)
+{
+ /*
+ * This deliberately does not clear the cpumask as it's expensive
+ * and unnecessary. If there happens to be data in there then the
+ * first SWAP_CLUSTER_MAX pages will send an unnecessary IPI and
+ * then will be cleared.
+ */
+ current->tlb_ubc.flush_required = false;
+}
+#else
+static inline void init_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2189,6 +2208,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);
+ init_tlb_ubc();
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
--
2.3.5
* [PATCH 3/4] mm: Defer flush of writable TLB entries
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
2015-07-06 13:39 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-07-06 13:39 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
@ 2015-07-06 13:39 ` Mel Gorman
2015-07-06 13:39 ` [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes Mel Gorman
` (2 subsequent siblings)
5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML, Mel Gorman
If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. This
patch defers flushing of potentially writable TLBs as long as possible.
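A condensed sketch of where the flushes end up in shrink_page_list() relative
to IO and freeing (illustrative only, the real context is in the diff below):

	try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH); /* TLB flush may be deferred */
	...
	try_to_unmap_flush_dirty();	/* writable entries flushed before pageout() starts IO */
	pageout(page, mapping, sc);
	...
	try_to_unmap_flush();		/* everything flushed before the batched pages are freed */
	free_hot_cold_page_list(&free_pages, true);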
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 7 +++++++
mm/internal.h | 4 ++++
mm/rmap.c | 28 +++++++++++++++++++++-------
mm/vmscan.c | 7 ++++++-
4 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a83fb44ab34..e769d5b4975c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,6 +1351,13 @@ struct tlbflush_unmap_batch {
/* True if any bit in cpumask is set */
bool flush_required;
+
+ /*
+ * If true then the PTE was dirty when unmapped. The entry must be
+ * flushed before IO is initiated or a stale TLB entry potentially
+ * allows an update without redirtying the page.
+ */
+ bool writable;
};
struct task_struct {
diff --git a/mm/internal.h b/mm/internal.h
index bd6372ac5f7f..1195dd2d6a2b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,10 +431,14 @@ struct tlbflush_unmap_batch;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
#else
static inline void try_to_unmap_flush(void)
{
}
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index d54f47666af5..85a8aea2d593 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -625,16 +625,34 @@ void try_to_unmap_flush(void)
}
cpumask_clear(&tlb_ubc->cpumask);
tlb_ubc->flush_required = false;
+ tlb_ubc->writable = false;
put_cpu();
}
+/* Flush iff there are potentially writable TLB entries that can race with IO */
+void try_to_unmap_flush_dirty(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+
+ if (tlb_ubc->writable)
+ try_to_unmap_flush();
+}
+
static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
- struct page *page)
+ struct page *page, bool writable)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
tlb_ubc->flush_required = true;
+
+ /*
+ * If the PTE was dirty then it's best to assume it's writable. The
+ * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
+ * before the page is queued for IO.
+ */
+ if (writable)
+ tlb_ubc->writable = true;
}
/*
@@ -657,7 +675,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
}
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
- struct page *page)
+ struct page *page, bool writable)
{
}
@@ -1314,11 +1332,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pte);
- /* Potentially writable TLBs must be flushed before IO */
- if (pte_dirty(pteval))
- flush_tlb_page(vma, address);
- else
- set_tlb_ubc_flush_pending(mm, page);
+ set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
} else {
pteval = ptep_clear_flush(vma, address, pte);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e4f1df1052a2..b5c5dc0997a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (!sc->may_writepage)
goto keep_locked;
- /* Page is dirty, try to write it out here */
+ /*
+ * Page is dirty. Flush the TLB if a writable entry
+ * potentially exists to avoid CPU writes after IO
+ * starts and then write it out here.
+ */
+ try_to_unmap_flush_dirty();
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
--
2.3.5
* [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
` (2 preceding siblings ...)
2015-07-06 13:39 ` [PATCH 3/4] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-07-06 13:39 ` Mel Gorman
2015-07-07 23:25 ` Andrew Morton
2015-07-06 13:45 ` [PATCH 0/4] TLB flush multiple pages per IPI v7 Ingo Molnar
2015-07-09 8:20 ` [PATCH 5/4] Documentation/features/vm: Add feature description and arch support status for batched TLB flush after unmap Mel Gorman
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML, Mel Gorman
Pages that are unmapped for reclaim must be flushed before being freed to
avoid corruption due to a page being freed and reallocated while a stale
TLB entry exists. When reclaiming mapped pages, this requires one IPI per
SWAP_CLUSTER_MAX pages. This patch increases SWAP_CLUSTER_MAX to 256 so more
pages can be flushed with a single IPI. This number was selected because
it reduced IPIs for TLB shootdowns by 40% on a workload that is dominated
by mapped pages.
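As a rough illustration of the upper bound, reclaiming 1G of mapped 4K pages
is 262144 pages, so at one flush round per batch that is

	262144 / 32  = 8192 flush rounds with the old SWAP_CLUSTER_MAX
	262144 / 256 = 1024 flush rounds with the new value

or at most an 8x reduction in IPI traffic. The next paragraph covers why the
measured reduction is closer to 40%.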
Note that it is expected that doubling SWAP_CLUSTER_MAX would not always
halve the IPIs as it is workload dependent. Reclaim efficiency was not 100%
on this workload which was picked for being IPI-intensive and was closer to
35%. More importantly, reclaim does not always isolate SWAP_CLUSTER_MAX
pages. The LRU lists for a zone may be small, the priority can be low
and even when reclaiming a lot of pages, the last isolation may not be
exactly SWAP_CLUSTER_MAX.
There are a few potential issues with increasing SWAP_CLUSTER_MAX.
1. LRU lock hold times increase slightly because more pages are being
isolated.
2. There are slight timing changes due to more pages having to be
processed before they are freed. There is a slight risk that more
pages than are necessary get reclaimed.
3. There is a risk that too_many_isolated checks will be easier to
trigger resulting in a HZ/10 stall.
4. The rotation rate of active->inactive is slightly faster but there
should be fewer rotations before the lists get balanced so it
shouldn't matter.
5. More pages are reclaimed in a single pass if zone_reclaim_mode is
active but that thing sucks hard when it's enabled no matter what
6. More pages are isolated for compaction so page hold times there
are longer while they are being copied
It's unlikely any of these will be problems but worth keeping in mind if
there are any reclaim-related bug reports in the near future.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/swap.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38874729dc5f..89b648665877 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -154,7 +154,7 @@ enum {
SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */
};
-#define SWAP_CLUSTER_MAX 32UL
+#define SWAP_CLUSTER_MAX 256UL
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
/*
--
2.3.5
* Re: [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes
2015-07-06 13:39 ` [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes Mel Gorman
@ 2015-07-07 23:25 ` Andrew Morton
2015-07-09 8:14 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2015-07-07 23:25 UTC (permalink / raw)
To: Mel Gorman
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML
On Mon, 6 Jul 2015 14:39:56 +0100 Mel Gorman <mgorman@suse.de> wrote:
> Pages that are unmapped for reclaim must be flushed before being freed to
> avoid corruption due to a page being freed and reallocated while a stale
> TLB entry exists. When reclaiming mapped pages, the requires one IPI per
> SWAP_CLUSTER_MAX. This patch increases SWAP_CLUSTER_MAX to 256 so more
> pages can be flushed with a single IPI. This number was selected because
> it reduced IPIs for TLB shootdowns by 40% on a workload that is dominated
> by mapped pages.
>
> Note that it is expected that doubling SWAP_CLUSTER_MAX would not always
> halve the IPIs as it is workload dependent. Reclaim efficiency was not 100%
> on this workload which was picked for being IPI-intensive and was closer to
> 35%. More importantly, reclaim does not always isolate in SWAP_CLUSTER_MAX
> pages. The LRU lists for a zone may be small, the priority can be low
> and even when reclaiming a lot of pages, the last isolation may not be
> exactly SWAP_CLUSTER_MAX.
>
> There are a few potential issues with increasing SWAP_CLUSTER_MAX.
>
> 1. LRU lock hold times increase slightly because more pages are being
> isolated.
> 2. There are slight timing changes due to more pages having to be
> processed before they are freed. There is a slight risk that more
> pages than are necessary get reclaimed.
> 3. There is a risk that too_many_isolated checks will be easier to
> trigger resulting in a HZ/10 stall.
> 4. The rotation rate of active->inactive is slightly faster but there
> should be fewer rotations before the lists get balanced so it
> shouldn't matter.
> 5. More pages are reclaimed in a single pass if zone_reclaim_mode is
> active but that thing sucks hard when it's enabled no matter what
> 6. More pages are isolated for compaction so page hold times there
> are longer while they are being copied
>
> It's unlikely any of these will be problems but worth keeping in mind if
> there are any reclaim-related bug reports in the near future.
Yes, this may well cause small&subtle changes which will take some time
to be noticed.
What is the overall effect on the performance improvement if this patch
is omitted?
I wonder if we should leave small systems or !SMP systems or
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=n systems with
SWAP_CLUSTER_MAX=32. If not, why didn't we change this years ago ;)
* Re: [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes
2015-07-07 23:25 ` Andrew Morton
@ 2015-07-09 8:14 ` Mel Gorman
2015-07-13 23:03 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2015-07-09 8:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML
On Tue, Jul 07, 2015 at 04:25:26PM -0700, Andrew Morton wrote:
> On Mon, 6 Jul 2015 14:39:56 +0100 Mel Gorman <mgorman@suse.de> wrote:
>
> > Pages that are unmapped for reclaim must be flushed before being freed to
> > avoid corruption due to a page being freed and reallocated while a stale
> > TLB entry exists. When reclaiming mapped pages, the requires one IPI per
> > SWAP_CLUSTER_MAX. This patch increases SWAP_CLUSTER_MAX to 256 so more
> > pages can be flushed with a single IPI. This number was selected because
> > it reduced IPIs for TLB shootdowns by 40% on a workload that is dominated
> > by mapped pages.
> >
> > Note that it is expected that doubling SWAP_CLUSTER_MAX would not always
> > halve the IPIs as it is workload dependent. Reclaim efficiency was not 100%
> > on this workload which was picked for being IPI-intensive and was closer to
> > 35%. More importantly, reclaim does not always isolate in SWAP_CLUSTER_MAX
> > pages. The LRU lists for a zone may be small, the priority can be low
> > and even when reclaiming a lot of pages, the last isolation may not be
> > exactly SWAP_CLUSTER_MAX.
> >
> > There are a few potential issues with increasing SWAP_CLUSTER_MAX.
> >
> > 1. LRU lock hold times increase slightly because more pages are being
> > isolated.
> > 2. There are slight timing changes due to more pages having to be
> > processed before they are freed. There is a slight risk that more
> > pages than are necessary get reclaimed.
> > 3. There is a risk that too_many_isolated checks will be easier to
> > trigger resulting in a HZ/10 stall.
> > 4. The rotation rate of active->inactive is slightly faster but there
> > should be fewer rotations before the lists get balanced so it
> > shouldn't matter.
> > 5. More pages are reclaimed in a single pass if zone_reclaim_mode is
> > active but that thing sucks hard when it's enabled no matter what
> > 6. More pages are isolated for compaction so page hold times there
> > are longer while they are being copied
> >
> > It's unlikely any of these will be problems but worth keeping in mind if
> > there are any reclaim-related bug reports in the near future.
>
> Yes, this may well cause small&subtle changes which will take some time
> to be noticed.
>
> What is the overall effect on the performance improvement if this patch
> is omitted?
>
For the workload that maps a lot of memory and is reclaim-intensive, the
headline performance difference is marginal, in the noise and inconclusive
as to whether it's a win -- at least on the workloads and machines I
tried. This is a representative example:
vmscale
4.2.0-rc1 4.2.0-rc1
batchdirty-v7r17 swapcluster-v7r17
Ops lru-file-mmap-read-elapsed 20.47 ( 0.00%) 20.36 ( 0.54%)
Ops lru-file-mmap-read-time_range 0.59 ( 0.00%) 0.72 (-22.03%)
Ops lru-file-mmap-read-time_stddv 0.19 ( 0.00%) 0.22 (-16.26%)
4.2.0-rc1 4.2.0-rc1
batchdirty-v7r17 swapcluster-v7r17
User 58.20 57.13
System 76.97 78.09
Elapsed 22.50 22.45
There is a slight gain in elapsed time but well within standard deviation
and an increase in system CPU usage. The number of IPIs sent is halved but
other factors dominate such as LRU processing, rmap walks, page reference
counting, IO etc.
A workload that force fragments memory and then attempts to allocate THP
reported no significant difference as a result of this patch.
Other reclaim workloads were inconclusive on whether it was a gain or a
loss. lmbench for mappings of different sizes showed little difference
but it was nice to note that reclaim activity is approximately the same.
The "stutter" workload that measures the latency of mmap in the presense
of intensive reclaim was odd for two reasons. This workload used to be a
reliable indicator if a desktop interactivity would stall during heavy
IO. First, it showed that mapping latency was higher -- 63ns stall on
average with patch applied vs 30ns without patch. Second, it showed
that compaction activity was high with many more migration attempts and
failures. It follows that COMPACT_CLUSTER_MAX should have been divorced
from SWAP_CLUSTER_MAX and separately considered.
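If it were separated, the change itself would be trivial, along the lines of
(hypothetical, untested):

	#define SWAP_CLUSTER_MAX	256UL
	#define COMPACT_CLUSTER_MAX	32UL	/* compaction keeps the old batch size */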
A workload that runs an in-memory database while doing a lot of IO in the
background showed no difference.
Overall, I would say that none of these workloads justify the patch on
its own. Reducing IPIs further is nice but we got the bulk of the
benefit from the two batching patches and after that other factors
dominate. Based on the results I have, I'd be ok with the patch being
dropped. It can be reconsidered for evaluation if someone complains
about excessive IPIs again on reclaim intensive workloads.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes
2015-07-09 8:14 ` Mel Gorman
@ 2015-07-13 23:03 ` Andrew Morton
0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2015-07-13 23:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML
On Thu, 9 Jul 2015 09:14:25 +0100 Mel Gorman <mgorman@suse.de> wrote:
> Overall, I would say that none of these workloads justify the patch on
> its own. Reducing IPIs further is nice but we got the bulk of the
> benefit from the two batching patches and after that other factors
> dominate. Based on the results I have, I'd be ok with the patch being
> dropped. It can be reconsidered for evaluation if someone complains
> about excessive IPIs again on reclaim intensive workloads.
OK, thanks. The benefit is small and there is some risk of
unanticipated problems. I think I'll park the patch in -mm for now and
will wait to see if something happens.
* Re: [PATCH 0/4] TLB flush multiple pages per IPI v7
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
` (3 preceding siblings ...)
2015-07-06 13:39 ` [PATCH 4/4] mm: Increase SWAP_CLUSTER_MAX to batch TLB flushes Mel Gorman
@ 2015-07-06 13:45 ` Ingo Molnar
2015-07-09 8:20 ` [PATCH 5/4] Documentation/features/vm: Add feature description and arch support status for batched TLB flush after unmap Mel Gorman
5 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-07-06 13:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Dave Hansen, Linus Torvalds,
Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> This is hopefully the final version that was agreed on. Ingo, you had sent
> an ack but I had to add a new arch helper after that for accounting purposes
> and there was a new patch added for the swap cluster suggestion. With the
> changes I did not include the ack just in case it was no longer valid.
The series still looks very good to me:
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Thanks Mel!
Ingo
* [PATCH 5/4] Documentation/features/vm: Add feature description and arch support status for batched TLB flush after unmap
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
` (4 preceding siblings ...)
2015-07-06 13:45 ` [PATCH 0/4] TLB flush multiple pages per IPI v7 Ingo Molnar
@ 2015-07-09 8:20 ` Mel Gorman
5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-07-09 8:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
LKML
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
---
Documentation/features/vm/TLB/arch-support.txt | 40 ++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
new file mode 100644
index 000000000000..261b92e2fb1a
--- /dev/null
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -0,0 +1,40 @@
+#
+# Feature name: batch-unmap-tlb-flush
+# Kconfig: ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+# description: arch supports deferral of TLB flush until multiple pages are unmapped
+#
+ -----------------------
+ | arch |status|
+ -----------------------
+ | alpha: | TODO |
+ | arc: | TODO |
+ | arm: | TODO |
+ | arm64: | TODO |
+ | avr32: | .. |
+ | blackfin: | TODO |
+ | c6x: | .. |
+ | cris: | .. |
+ | frv: | .. |
+ | h8300: | .. |
+ | hexagon: | TODO |
+ | ia64: | TODO |
+ | m32r: | TODO |
+ | m68k: | .. |
+ | metag: | TODO |
+ | microblaze: | .. |
+ | mips: | TODO |
+ | mn10300: | TODO |
+ | nios2: | .. |
+ | openrisc: | .. |
+ | parisc: | TODO |
+ | powerpc: | TODO |
+ | s390: | TODO |
+ | score: | .. |
+ | sh: | TODO |
+ | sparc: | TODO |
+ | tile: | TODO |
+ | um: | .. |
+ | unicore32: | .. |
+ | x86: | ok |
+ | xtensa: | TODO |
+ -----------------------
* [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
@ 2015-06-09 17:31 ` Mel Gorman
2015-06-09 20:01 ` Rik van Riel
` (3 more replies)
0 siblings, 4 replies; 22+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Peter Zijlstra, Linux-MM, LKML, Mel Gorman
An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accessed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to
a running process as kswapd and the task are likely running on separate CPUs.
On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that potentially
have TLB entries for pages being unmapped. When the unmapping is complete,
the full TLB is flushed on the assumption that a refill cost is lower than
flushing individual entries.
Architectures wishing to do this must give the following guarantee.
If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.
This is essentially what the kernel already depends on but the window is much
larger with this patch applied and is worth highlighting. The architecture
should consider whether the cost of the full TLB flush is higher than
sending an IPI to flush each individual entry. An additional architecture
helper may be required to flush the local TLB but it is expected this will
be a trivial alias of an internal function in most cases. In this case,
the existing x86 helper was used.
The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.
Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6r5
User 568.96 614.68
System 6085.61 4226.61
Elapsed 164.24 122.17
This is showing that the readers completed 25.83% faster with 30% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test while the patched kernel was interrupted roughly 180K times per second.
The impact is lower on a single socket machine.
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6
User 59.14 58.99
System 109.15 77.84
Elapsed 27.32 22.31
It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.
The patch will have no impact on workloads with no memory pressure or
that have relatively few mapped pages. It will have an unpredictable impact
on the workload running on the CPU being flushed as it'll depend on how
many TLB entries need to be refilled and how long that takes. Worst case,
the TLB will be completely cleared of active entries when the target PFNs
were not resident at all.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/Kconfig | 1 +
include/linux/rmap.h | 3 ++
include/linux/sched.h | 16 ++++++++
init/Kconfig | 10 +++++
kernel/fork.c | 5 +++
kernel/sched/core.c | 3 ++
mm/internal.h | 11 ++++++
mm/rmap.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 26 ++++++++++++-
9 files changed, 176 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..0810703bdc9a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,6 +44,7 @@ config X86
select ARCH_DISCARD_MEMBLOCK
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARCH_WANT_FRAME_POINTERS
+ select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select HAVE_DMA_ATTRS
select HAVE_DMA_CONTIGUOUS
select HAVE_KRETPROBES
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c89c53a113a8..29446aeef36e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+ TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
+ * and caller guarantees they will
+ * do a final flush if necessary */
};
#ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e6122734..d891e01f0445 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,18 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+ /*
+ * Each bit set is a CPU that potentially has a TLB entry for one of
+ * the PFNs being flushed. See set_tlb_ubc_flush_pending().
+ */
+ struct cpumask cpumask;
+
+ /* True if any bit in cpumask is set */
+ bool flush_required;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1648,6 +1660,10 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
struct rcu_head rcu;
/*
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..6e6fa4842250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,6 +904,16 @@ config ARCH_SUPPORTS_NUMA_BALANCING
bool
#
+# For architectures that prefer to flush all TLBs after a number of pages
+# are unmapped instead of sending one IPI per page to flush. The architecture
+# must provide guarantees on what happens if a clean TLB cache entry is
+# written after the unmap. Details are in mm/rmap.c near the check for
+# should_defer_flush. The architecture should also consider if the full flush
+# and the refill costs are offset by the savings of sending fewer IPIs.
+config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ bool
+
+#
# For architectures that know their GCC __int128 support is sound
#
config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..3fb3e776cfcf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ kfree(tsk->tlb_ubc);
+ tsk->tlb_ubc = NULL;
+#endif
+
if (!profile_handoff_task(tsk))
free_task(tsk);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 123673291ffb..d58ebdf4d759 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
}
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..465e621b86b1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#define ALLOC_FAIR 0x100 /* fair zone allocation */
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..4cadb60df74a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
#include <asm/tlbflush.h>
+#include <trace/events/tlb.h>
+
#include "internal.h"
static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,88 @@ vma_address(struct page *page, struct vm_area_struct *vma)
return address;
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+ /*
+ * All TLB entries are flushed on the assumption that it is
+ * cheaper to flush all TLBs and let them be refilled than
+ * flushing individual PFNs. Note that we do not track mm's
+ * to flush as that might simply be multiple full TLB flushes
+ * for no gain.
+ */
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ local_flush_tlb();
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+ int cpu;
+
+ if (!tlb_ubc || !tlb_ubc->flush_required)
+ return;
+
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, -1UL);
+
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+ percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+ if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+ smp_call_function_many(&tlb_ubc->cpumask,
+ percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+ }
+ cpumask_clear(&tlb_ubc->cpumask);
+ tlb_ubc->flush_required = false;
+ put_cpu();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+ cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+ tlb_ubc->flush_required = true;
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ bool should_defer = false;
+
+ if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+ return false;
+
+ /* If remote CPUs need to be flushed then defer batch the flush */
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+ should_defer = true;
+ put_cpu();
+
+ return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ return false;
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* At what user virtual address is page expected in vma?
* Caller should check the page is actually part of the vma.
@@ -1213,7 +1297,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially a remote
+ * CPU could still be writing to the page. If the entry was
+ * previously clean then the architecture must guarantee that
+ * a clear->dirty transition on a cached TLB entry is written
+ * through and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pte);
+
+ /* Potentially writable TLBs must be flushed before IO */
+ if (pte_dirty(pteval))
+ flush_tlb_page(vma, address);
+ else
+ set_tlb_ubc_flush_pending(mm, page);
+ } else {
+ pteval = ptep_clear_flush(vma, address, pte);
+ }
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..f16e07aaef59 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ switch (try_to_unmap(page,
+ ttu_flags|TTU_BATCH_FLUSH)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
}
mem_cgroup_uncharge_list(&free_pages);
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);
list_splice(&ret_pages, page_list);
@@ -2118,6 +2120,26 @@ out:
}
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ * A GFP_KERNEL allocation from this context is normally not advised but
+ * we are depending on PF_MEMALLOC (set by direct reclaim or kswapd) to
+ * limit the depth of the call.
+ */
+static void alloc_tlb_ubc(void)
+{
+ if (!current->tlb_ubc)
+ current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
+ GFP_KERNEL | __GFP_NOWARN);
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2152,6 +2174,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);
+ alloc_tlb_ubc();
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
--
2.3.5
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
@ 2015-06-09 20:01 ` Rik van Riel
2015-06-10 7:47 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2015-06-09 20:01 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, H Peter Anvin,
Ingo Molnar, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
Linux-MM, LKML
On 06/09/2015 01:31 PM, Mel Gorman wrote:
> An IPI is sent to flush remote TLBs when a page is unmapped that was
> potentially accesssed by other CPUs. There are many circumstances where
> this happens but the obvious one is kswapd reclaiming pages belonging to
> a running process as kswapd and the task are likely running on separate CPUs.
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
>
> The patch will have no impact on workloads with no memory pressure or
> have relatively few mapped pages. It will have an unpredictable impact
> on the workload running on the CPU being flushed as it'll depend on how
> many TLB entries need to be refilled and how long that takes. Worst case,
> the TLB will be completely cleared of active entries when the target PFNs
> were not resident at all.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
2015-06-09 20:01 ` Rik van Riel
@ 2015-06-10 7:47 ` Ingo Molnar
2015-06-10 8:14 ` Mel Gorman
2015-06-10 8:26 ` Ingo Molnar
2015-06-10 8:33 ` Ingo Molnar
3 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-06-10 7:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> perf_nr_task_contexts,
> };
>
> +/* Track pages that require TLB flushes */
> +struct tlbflush_unmap_batch {
> + /*
> + * Each bit set is a CPU that potentially has a TLB entry for one of
> + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> + */
> + struct cpumask cpumask;
> +
> + /* True if any bit in cpumask is set */
> + bool flush_required;
> +};
> +
> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1648,6 +1660,10 @@ struct task_struct {
> unsigned long numa_pages_migrated;
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> + struct tlbflush_unmap_batch *tlb_ubc;
> +#endif
Please embed this constant-size structure in task_struct directly so that the
whole per task allocation overhead goes away:
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +/*
> + * Allocate the control structure for batch TLB flushing. An allocation
> + * failure is harmless as the reclaimer will send IPIs where necessary.
> + * A GFP_KERNEL allocation from this context is normally not advised but
> + * we are depending on PF_MEMALLOC (set by direct reclaim or kswapd) to
> + * limit the depth of the call.
> + */
> +static void alloc_tlb_ubc(void)
> +{
> + if (!current->tlb_ubc)
> + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> + GFP_KERNEL | __GFP_NOWARN);
> +}
> +#else
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -2152,6 +2174,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> sc->priority == DEF_PRIORITY);
>
> + alloc_tlb_ubc();
> +
> blk_start_plug(&plug);
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
the whole patch series will become even simpler.
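For reference, a minimal sketch of the embedded form:

	#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
		/* embedded, so no kzalloc()/kfree() and no NULL checks needed */
		struct tlbflush_unmap_batch tlb_ubc;
	#endif

and alloc_tlb_ubc() then reduces to resetting tlb_ubc.flush_required.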
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 7:47 ` Ingo Molnar
@ 2015-06-10 8:14 ` Mel Gorman
2015-06-10 8:21 ` Ingo Molnar
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2015-06-10 8:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > perf_nr_task_contexts,
> > };
> >
> > +/* Track pages that require TLB flushes */
> > +struct tlbflush_unmap_batch {
> > + /*
> > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > + */
> > + struct cpumask cpumask;
> > +
> > + /* True if any bit in cpumask is set */
> > + bool flush_required;
> > +};
> > +
> > struct task_struct {
> > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > void *stack;
> > @@ -1648,6 +1660,10 @@ struct task_struct {
> > unsigned long numa_pages_migrated;
> > #endif /* CONFIG_NUMA_BALANCING */
> >
> > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > + struct tlbflush_unmap_batch *tlb_ubc;
> > +#endif
>
> Please embed this constant-size structure in task_struct directly so that the
> whole per-task allocation overhead goes away:
>
That puts a structure (72 bytes in the config I used) within the task struct
even when it's not required. On a lightly loaded system direct reclaim
will not be active and for some processes, it'll never be active. It's
very wasteful.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:14 ` Mel Gorman
@ 2015-06-10 8:21 ` Ingo Molnar
2015-06-10 8:51 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
> >
> > * Mel Gorman <mgorman@suse.de> wrote:
> >
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > > perf_nr_task_contexts,
> > > };
> > >
> > > +/* Track pages that require TLB flushes */
> > > +struct tlbflush_unmap_batch {
> > > + /*
> > > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > > + */
> > > + struct cpumask cpumask;
> > > +
> > > + /* True if any bit in cpumask is set */
> > > + bool flush_required;
> > > +};
> > > +
> > > struct task_struct {
> > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > > void *stack;
> > > @@ -1648,6 +1660,10 @@ struct task_struct {
> > > unsigned long numa_pages_migrated;
> > > #endif /* CONFIG_NUMA_BALANCING */
> > >
> > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > + struct tlbflush_unmap_batch *tlb_ubc;
> > > +#endif
> >
> > Please embed this constant-size structure in task_struct directly so that the
> > whole per-task allocation overhead goes away:
> >
>
> That puts a structure (72 bytes in the config I used) within the task struct
> even when it's not required. On a lightly loaded system direct reclaim will not
> be active and for some processes, it'll never be active. It's very wasteful.
For certain values of 'very'.
- 72 bytes suggests that you have NR_CPUS set to 512 or so? On a kernel sized to
such large systems with 1000 active tasks we are talking about +72K of
RAM...
- Furthermore, by embedding it, it gets packed better with neighboring task_struct
fields, while by allocating it dynamically it's a separate cache line wasted.
- Plus by allocating it separately you spend two cachelines on it: each slab will
be at least cacheline aligned, and 72 bytes will allocate 128 bytes. So when
this gets triggered you've just wasted some more RAM.
- I mean, if it had dynamic size, or was arguably huge. But this is just a
cpumask and a boolean!
- The cpumask will be dynamic if you increase the NR_CPUS count any more than
that - in which case embedding the structure is the right choice again.
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:21 ` Ingo Molnar
@ 2015-06-10 8:51 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-06-10 8:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:21:07AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > On Wed, Jun 10, 2015 at 09:47:04AM +0200, Ingo Molnar wrote:
> > >
> > > * Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1289,6 +1289,18 @@ enum perf_event_task_context {
> > > > perf_nr_task_contexts,
> > > > };
> > > >
> > > > +/* Track pages that require TLB flushes */
> > > > +struct tlbflush_unmap_batch {
> > > > + /*
> > > > + * Each bit set is a CPU that potentially has a TLB entry for one of
> > > > + * the PFNs being flushed. See set_tlb_ubc_flush_pending().
> > > > + */
> > > > + struct cpumask cpumask;
> > > > +
> > > > + /* True if any bit in cpumask is set */
> > > > + bool flush_required;
> > > > +};
> > > > +
> > > > struct task_struct {
> > > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > > > void *stack;
> > > > @@ -1648,6 +1660,10 @@ struct task_struct {
> > > > unsigned long numa_pages_migrated;
> > > > #endif /* CONFIG_NUMA_BALANCING */
> > > >
> > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > + struct tlbflush_unmap_batch *tlb_ubc;
> > > > +#endif
> > >
> > > Please embed this constant-size structure in task_struct directly so that the
> > > whole per-task allocation overhead goes away:
> > >
> >
> > That puts a structure (72 bytes in the config I used) within the task struct
> > even when it's not required. On a lightly loaded system direct reclaim will not
> > be active and for some processes, it'll never be active. It's very wasteful.
>
> For certain values of 'very'.
>
> - 72 bytes suggests that you have NR_CPUS set to 512 or so? On a kernel sized to
> such large systems with 1000 active tasks we are talking about +72K of
> RAM...
>
The NR_CPUS value is based on the openSUSE 13.1 distro config, so yes, it's large,
but I also expect it to be a common configuration.
> - Furthermore, by embedding it, it gets packed better with neighboring task_struct
> fields, while by allocating it dynamically it's a separate cache line wasted.
>
A separate cache line that is only used during direct reclaim, when the
process is taking a large hit anyway.
> - Plus by allocating it separately you spend two cachelines on it: each slab will
> be at least cacheline aligned, and 72 bytes will allocate 128 bytes. So when
> this gets triggered you've just wasted some more RAM.
>
> - I mean, if it had dynamic size, or was arguably huge. But this is just a
> cpumask and a boolean!
>
It gets larger with enterprise configs.
> - The cpumask will be dynamic if you increase the NR_CPUS count any more than
> that - in which case embedding the structure is the right choice again.
>
Enterprise configurations are larger. The most recent one I checked defined
NR_CPUS as 8192. If it's embedded in the structure, it means that we need
to call cpumask_clear on every fork even if it's never used. That adds
constant overhead to a fast path to avoid an allocation and a few cache
misses in a direct reclaim path. Are you certain you want that trade-off?
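Roughly, the trade-off looks like this - a userspace mock-up, not the actual
fork or reclaim code; the sizes, struct names and helpers are made up for
illustration:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS_ASSUMED 8192    /* the enterprise config mentioned above */

struct batch_mock {
        /* 1KB of cpumask at 8192 CPUs on 64-bit */
        unsigned long bits[NR_CPUS_ASSUMED / (8 * sizeof(unsigned long))];
        bool flush_required;
};

struct task_embedded { struct batch_mock tlb_ubc; };
struct task_pointer  { struct batch_mock *tlb_ubc; };

/* Embedded: every fork pays for clearing the cpumask, used or not. */
static void mock_fork(struct task_embedded *child)
{
        memset(&child->tlb_ubc, 0, sizeof(child->tlb_ubc));
}

/* Pointer: only a task that enters direct reclaim pays, and an allocation
 * failure is harmless because IPIs are sent anyway. */
static void mock_reclaim_entry(struct task_pointer *tsk)
{
        if (!tsk->tlb_ubc)
                tsk->tlb_ubc = calloc(1, sizeof(*tsk->tlb_ubc));
}

int main(void)
{
        struct task_embedded child;
        struct task_pointer reclaimer = { 0 };

        mock_fork(&child);
        mock_reclaim_entry(&reclaimer);
        free(reclaimer.tlb_ubc);
        return 0;
}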
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
2015-06-09 20:01 ` Rik van Riel
2015-06-10 7:47 ` Ingo Molnar
@ 2015-06-10 8:26 ` Ingo Molnar
2015-06-10 9:58 ` Mel Gorman
2015-06-10 8:33 ` Ingo Molnar
3 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> On a 4-socket machine the results were
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> Ops lru-file-mmap-read-elapsed 121.27 ( 0.00%) 118.79 ( 2.05%)
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> User 620.84 608.48
> System 4245.35 4152.89
> Elapsed 122.65 120.15
>
> In this case the workload completed faster and there was less CPU overhead
> but as it's a NUMA machine there are a lot of factors at play. It's easier
> to quantify on a single socket machine;
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6 batchunmap-v6
> Ops lru-file-mmap-read-elapsed 20.35 ( 0.00%) 21.52 ( -5.75%)
>
> 4.1.0-rc6 4.1.0-rc6
> batchdirty-v6r5batchunmap-v6r5
> User 58.02 60.70
> System 77.57 81.92
> Elapsed 22.14 23.16
>
> That shows the workload takes 5.75% longer to complete with a similar
> increase in the system CPU usage.
Btw., do you have any stddev noise numbers?
The batching speedup is brutal enough not to need any noise estimations; it's a
clear winner.
But this PFN tracking patch is more difficult to judge as the numbers are pretty
close to each other.
> It is expected that there is overhead to tracking the PFNs and flushing
> individual pages. This can be quantified but we cannot quantify the indirect
> savings due to active unrelated TLB entries being preserved. Whether this
> matters depends on whether the workload was using those entries and if they
> would be used before a context switch but targeting the TLB flushes is the
> conservative and safer choice.
So this is how I picture a realistic TLB flushing 'worst case': a workload that
uses about 80% of the TLB cache in a 'fast' function and trashes memory in a
'slow' function, and does alternate calls to the two functions from the same task.
Typical dTLB sizes on x86 are a couple of hundred entries (you can see the precise
count in x86info -c), up to 1024 entries on the latest uarchs.
A cached TLB miss will take about 10-20 cycles (progressively more if the lookup
chain misses in the cache) - but that cost is partially hidden if the L1 data
cache was missed (which is likely for most TLB-flush intense workloads), and will
be almost completely hidden if it goes out to the L3 cache or goes to RAM. (It
takes up cache/memory bandwidth though, but unless the access patterns are totally
sparse, it should be a small fraction.)
A single INVLPG with its 200+ cycles cost is equivalent to about 10-20 TLB misses.
That's a lot.
So this kind of workload should trigger the TLB flushing 'worst case': with say
512 dTLB entries you could see up to 5k-10k cycles of hidden/indirect cost, but
potentially parallelized with other misses going on with the same data accesses.
The current limit for INVLPG flushing is 33 entries: that's 10k-20k cycles max
with an INVLPG cost of 250 cycles - this could explain the results you got.
But the problem is: AFAICS you can only decrease the INVLPG count by decreasing
the batching size - the additional IPI costs will overwhelm any TLB preservation
benefits. So depending on the cost relationship between INVLPG, TLB miss cost and
IPI cost, it might not be possible to see a speedup even in the worst-case.
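For reference, the same back-of-the-envelope arithmetic in code form - all the
cycle counts are the rough assumptions used above, not measurements:

#include <stdio.h>

int main(void)
{
        const unsigned int dtlb_entries   = 512;  /* assumed dTLB size */
        const unsigned int miss_cost_low  = 10;   /* cycles per cached TLB miss */
        const unsigned int miss_cost_high = 20;
        const unsigned int invlpg_cost    = 200;  /* ~200+ cycles per INVLPG */

        /* Hidden/indirect refill cost after a full flush of the dTLB. */
        printf("full refill: %u-%u cycles\n",
               dtlb_entries * miss_cost_low, dtlb_entries * miss_cost_high);

        /* How many TLB misses a single INVLPG is worth at these numbers. */
        printf("one INVLPG : %u-%u misses\n",
               invlpg_cost / miss_cost_high, invlpg_cost / miss_cost_low);

        return 0;
}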
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:26 ` Ingo Molnar
@ 2015-06-10 9:58 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-06-10 9:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:26:40AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > On a 4-socket machine the results were
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > Ops lru-file-mmap-read-elapsed 121.27 ( 0.00%) 118.79 ( 2.05%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > User 620.84 608.48
> > System 4245.35 4152.89
> > Elapsed 122.65 120.15
> >
> > In this case the workload completed faster and there was less CPU overhead
> > but as it's a NUMA machine there are a lot of factors at play. It's easier
> > to quantify on a single socket machine;
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6 batchunmap-v6
> > Ops lru-file-mmap-read-elapsed 20.35 ( 0.00%) 21.52 ( -5.75%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > batchdirty-v6r5batchunmap-v6r5
> > User 58.02 60.70
> > System 77.57 81.92
> > Elapsed 22.14 23.16
> >
> > That shows the workload takes 5.75% longer to complete with a similar
> > increase in the system CPU usage.
>
> Btw., do you have any stddev noise numbers?
>
4.1.0-rc6 4.1.0-rc6 4.1.0-rc6 4.1.0-rc6
vanilla flushfull-v6r5 batchdirty-v6r5 batchunmap-v6r5
Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%) 20.35 ( 19.98%) 21.52 ( 15.38%)
Ops lru-file-mmap-read-time_stddv 0.32 ( 0.00%) 0.32 ( -1.30%) 0.39 (-23.00%) 0.45 (-40.91%)
flushfull -- patch 2
batchdirty -- patch 3
batchunmap -- patch 4
So the impact of tracking the PFNs is outside the noise and there is a
definite direct cost to it. This was expected for both the PFN tracking
and the individual flushes.
> The batching speedup is brutal enough not to need any noise estimations; it's a
> clear winner.
>
Agreed.
> But this PFN tracking patch is more difficult to judge as the numbers are pretty
> close to each other.
>
It's definitely measurable, no doubt about it, and there never was. The
concern was always the refill cost of flushing potentially active TLB
entries unnecessarily. From https://lkml.org/lkml/2014/7/31/825, that cost
is potentially high: it estimates that a full 512-entry DTLB refill takes
22,000 cycles, which is more than the cost of the individual flushes.
However, it is an estimate and it'll always be a case of "it depends".
It's been asserted that the refill costs are really low, so let's just go
with that, drop patch 4 and wait and see who complains.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-09 17:31 ` [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages Mel Gorman
` (2 preceding siblings ...)
2015-06-10 8:26 ` Ingo Molnar
@ 2015-06-10 8:33 ` Ingo Molnar
2015-06-10 8:59 ` Mel Gorman
3 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-06-10 8:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
>
> 4.1.0-rc6 4.1.0-rc6
> vanillaflushfull-v6r5
> User 568.96 614.68
> System 6085.61 4226.61
> Elapsed 164.24 122.17
>
> This is showing that the readers completed 25.83% faster with 30% less
> system CPU time. From vmstats, it is known that the vanilla kernel was
> interrupted roughly 900K times per second during the steady phase of the
> test and the patched kernel was interrupted 180K times per second.
>
> The impact is lower on a single socket machine.
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
>
> 4.1.0-rc6 4.1.0-rc6
> vanilla flushfull-v6
> User 59.14 58.99
> System 109.15 77.84
> Elapsed 27.32 22.31
>
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
Btw., I tried to compare your previous (v5) pfn-tracking numbers with these
full-flushing numbers, and found that the IRQ rate appears to be the same:
> > From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > times per second during the steady phase of the test and the patched kernel
> > was interrupted 180K times per second.
> > It's still a noticeable improvement with vmstat showing interrupts went from
> > roughly 500K per second to 45K per second.
... is that because the batching limit in the pfn-tracking case was high enough to
not be noticeable in the vmstat?
In the full-flushing case (v6 without patch 4) the batching limit is 'infinite',
we'll batch as long as possible, right?
Or have I managed to get confused somewhere ...
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:33 ` Ingo Molnar
@ 2015-06-10 8:59 ` Mel Gorman
2015-06-11 15:02 ` Ingo Molnar
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2015-06-10 8:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Wed, Jun 10, 2015 at 10:33:32AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > Ops lru-file-mmap-read-elapsed 162.88 ( 0.00%) 120.81 ( 25.83%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanillaflushfull-v6r5
> > User 568.96 614.68
> > System 6085.61 4226.61
> > Elapsed 164.24 122.17
> >
> > This is showing that the readers completed 25.83% faster with 30% less
> > system CPU time. From vmstats, it is known that the vanilla kernel was
> > interrupted roughly 900K times per second during the steady phase of the
> > test and the patched kernel was interrupted 180K times per second.
> >
> > The impact is lower on a single socket machine.
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > Ops lru-file-mmap-read-elapsed 25.43 ( 0.00%) 20.59 ( 19.03%)
> >
> > 4.1.0-rc6 4.1.0-rc6
> > vanilla flushfull-v6
> > User 59.14 58.99
> > System 109.15 77.84
> > Elapsed 27.32 22.31
> >
> > It's still a noticeable improvement with vmstat showing interrupts went
> > from roughly 500K per second to 45K per second.
>
> Btw., I tried to compare your previous (v5) pfn-tracking numbers with these
> full-flushing numbers, and found that the IRQ rate appears to be the same:
>
That's expected because the number of IPIs sent is the same. What
changes is the tracking of the PFNs and then the work within the IPI
itself.
> > > From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > > times per second during the steady phase of the test and the patched kernel
> > > was interrupted 180K times per second.
>
> > > It's still a noticeable improvement with vmstat showing interrupts went from
> > > roughly 500K per second to 45K per second.
>
> ... is that because the batching limit in the pfn-tracking case was high enough to
> not be noticeable in the vmstat?
>
It's just the case that there are fewer cores and less activity in the
machine overall.
> In the full-flushing case (v6 without patch 4) the batching limit is 'infinite',
> we'll batch as long as possible, right?
>
No because we must flush before pages are freed so the maximum batching
is related to SWAP_CLUSTER_MAX. If we free a page before the flush then
in theory the page can be reallocated and a stale TLB entry can allow
access to unrelated data. It would be almost impossible to trigger
corruption this way but it's a concern.
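To make the ordering constraint explicit, it is roughly the flow below - an
illustrative userspace sketch with made-up helper names, not the real
mm/vmscan.c code; the batch size stands in for SWAP_CLUSTER_MAX, commonly 32:

#include <stddef.h>

#define RECLAIM_BATCH 32        /* stand-in for SWAP_CLUSTER_MAX */

/* Hypothetical stand-ins for the real unmap/flush/free steps. */
static void unmap_page(unsigned long pfn) { (void)pfn; }
static void flush_tlb_batch(void) { /* one IPI per CPU with stale entries */ }
static void free_page_now(unsigned long pfn) { (void)pfn; }

static void reclaim_one_batch(const unsigned long *pfns, size_t nr)
{
        size_t i;

        for (i = 0; i < nr && i < RECLAIM_BATCH; i++)
                unmap_page(pfns[i]);

        /*
         * The flush must happen here, before any page in the batch is freed;
         * otherwise a freed and reallocated page could still be reachable
         * through a stale TLB entry on another CPU. This is why the batching
         * cannot grow past the reclaim batch size.
         */
        flush_tlb_batch();

        for (i = 0; i < nr && i < RECLAIM_BATCH; i++)
                free_page_now(pfns[i]);
}

int main(void)
{
        unsigned long pfns[RECLAIM_BATCH] = { 0 };

        reclaim_one_batch(pfns, RECLAIM_BATCH);
        return 0;
}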
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-10 8:59 ` Mel Gorman
@ 2015-06-11 15:02 ` Ingo Molnar
2015-06-11 15:25 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-06-11 15:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
* Mel Gorman <mgorman@suse.de> wrote:
> > In the full-flushing case (v6 without patch 4) the batching limit is
> > 'infinite', we'll batch as long as possible, right?
>
> No because we must flush before pages are freed so the maximum batching is
> related to SWAP_CLUSTER_MAX. If we free a page before the flush then in theory
> the page can be reallocated and a stale TLB entry can allow access to unrelated
> data. It would be almost impossible to trigger corruption this way but it's a
> concern.
Well, could we say double SWAP_CLUSTER_MAX to further reduce the IPI rate?
Thanks,
Ingo
* Re: [PATCH 2/4] mm: Send one IPI per CPU to TLB flush all entries after unmapping pages
2015-06-11 15:02 ` Ingo Molnar
@ 2015-06-11 15:25 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2015-06-11 15:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
Dave Hansen, Andi Kleen, H Peter Anvin, Linus Torvalds,
Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML
On Thu, Jun 11, 2015 at 05:02:51PM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > > In the full-flushing case (v6 without patch 4) the batching limit is
> > > 'infinite', we'll batch as long as possible, right?
> >
> > No because we must flush before pages are freed so the maximum batching is
> > related to SWAP_CLUSTER_MAX. If we free a page before the flush then in theory
> > the page can be reallocated and a stale TLB entry can allow access to unrelated
> > data. It would be almost impossible to trigger corruption this way but it's a
> > concern.
>
> Well, could we say double SWAP_CLUSTER_MAX to further reduce the IPI rate?
>
We could, but it's a surprisingly subtle change. The impacts I can think
of are:
1. LRU lock hold times increase slightly because more pages are being
isolated
2. There are slight timing changes due to more pages having to be
processed before they are freed. There is a slight risk that more
pages than are necessary get reclaimed but I doubt it'll be
measurable
3. There is a risk that too_many_isolated checks will be easier to
trigger resulting in a HZ/10 stall
4. The rotation rate of active->inactive is slightly faster but there
should be fewer rotations before the lists get balanced so it
shouldn't matter.
5. More pages are reclaimed in a single pass if zone_reclaim_mode is
active but that thing sucks hard when it's enabled no matter what
6. More pages are isolated for compaction so page hold times there
are longer while they are being copied
There might be others. To be honest, I'm struggling to think of any serious
problems such a change would cause. The biggest risk is issue 3 but I expect
that hitting that requires that the system is already getting badly hammered.
The main downside is that it affects all page reclaim activity, not just
the mapped pages which are triggering the IPIs. I'll add a patch to the
series that alters SWAP_CLUSTER_MAX with the intent to further reduce
IPIs and see what falls out and see if any other VM person complains.
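Assuming the definition still lives in include/linux/swap.h, the core of that
experiment should be little more than a one-line change along these lines (the
new value is only a guess and the hunk header is omitted):

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
-#define SWAP_CLUSTER_MAX 32UL
+#define SWAP_CLUSTER_MAX 64UL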
--
Mel Gorman
SUSE Labs