* [PATCH 0/3] TLB flush multiple pages per IPI v4
@ 2015-04-25 17:45 Mel Gorman
  2015-04-25 17:45 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Mel Gorman @ 2015-04-25 17:45 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

The big change here is that I dropped the patch that batches TLB flushes
from migration context. After V3, I realised that there are non-trivial
corner cases there that deserve treatment in their own series. It did not
help that I could not find a workload that was both migration and IPI
intensive. The common case for IPIs during reclaim is kswapd unmapping
pages that belong to running processes, which guarantees IPIs. In
migration, at least some of the pages being migrated belong to the
migrating process itself, so remote flushes are less of a certainty.

The main issue is that migration must not leave any cached TLB entries
behind once it completes. Once the migration PTE is removed, writes can
happen to the new page, and a CPU with the old TLB entry could perform
stale reads until that entry is flushed, which is different from the
reclaim case. This is difficult to get around. We cannot simply unmap in
advance because then there are no migration entries to restore and there
would be minor faults post-migration. We cannot batch-restore the
migration entries either because the page lock must be held during
migration or BUG_ONs get triggered. Batching TLB flushes safely would
require a major rethink of how migration works, so let's deal with
reclaim on its own first and revisit migration later, preferably in the
context of a workload that is both migration and IPI intensive.

The patch that increased the batching size was also removed because there
is no advantage when TLBs are flushed before freeing the page. To increase
batching we would have to alter how many pages are isolated from the LRU
which would be a different patch series.

Most reviewed-bys had to be dropped as the patches changed too much to
preserve them.

Changelog since V3
o Drop batching of TLB flush from migration
o Redo how larger batching is managed
o Batch TLB flushes when writable entries exist

When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.

There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages can
be flushed using a single IPI.
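
Roughly, the change to the unmap path looks like the sketch below. This is
a condensed illustration rather than the actual diff (see patch 2 for the
real code); the dirty-PTE case and the batch size limit are omitted:

	/* Before: flush (and potentially IPI) once per unmapped page */
	pteval = ptep_clear_flush(vma, address, pte);

	/* After: clear the PTE, record the PFN and the CPUs that may have
	 * cached it, then flush the whole batch with one IPI per CPU later.
	 */
	pteval = ptep_get_and_clear(mm, address, pte);
	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
	tlb_ubc->pfns[tlb_ubc->nr_pages++] = page_to_pfn(page);

	/* ... and at the end of the batch ... */
	try_to_unmap_flush();	/* at most one IPI per CPU in tlb_ubc->cpumask */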

Patch 1 simply made the rest of the series easier to write as ftrace
	could identify all the senders of TLB flush IPIs.

Patch 2 collects a list of PFNs and sends one IPI to flush them all

Patch 3 tracks when there potentially are writable TLB entries that
	need to be batched differently

The performance impact is documented in the changelogs but, in the
optimistic case on a 4-socket machine, the full series reduces interrupts
from 900K/second to 60K/second.

 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 arch/x86/mm/tlb.c               |   1 +
 include/linux/init_task.h       |   8 +++
 include/linux/mm_types.h        |   1 +
 include/linux/rmap.h            |   3 +
 include/linux/sched.h           |  15 +++++
 include/trace/events/tlb.h      |   3 +-
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 +
 mm/internal.h                   |  15 +++++
 mm/rmap.c                       | 119 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  45 ++++++++++++++-
 14 files changed, 224 insertions(+), 5 deletions(-)

-- 
2.3.5

Mel Gorman (3):
  x86, mm: Trace when an IPI is about to be sent
  mm: Send one IPI per CPU to TLB flush multiple pages that were
    recently unmapped
  mm: Defer flush of writable TLB entries

 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 arch/x86/mm/tlb.c               |   1 +
 include/linux/init_task.h       |   8 +++
 include/linux/mm_types.h        |   1 +
 include/linux/rmap.h            |   3 +
 include/linux/sched.h           |  15 +++++
 include/trace/events/tlb.h      |   3 +-
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 +
 mm/internal.h                   |  15 +++++
 mm/rmap.c                       | 119 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  30 +++++++++-
 14 files changed, 210 insertions(+), 4 deletions(-)

-- 
2.3.5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent
  2015-04-25 17:45 [PATCH 0/3] TLB flush multiple pages per IPI v4 Mel Gorman
@ 2015-04-25 17:45 ` Mel Gorman
  2015-04-25 17:45 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Mel Gorman @ 2015-04-25 17:45 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,
 
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-04-25 17:45 [PATCH 0/3] TLB flush multiple pages per IPI v4 Mel Gorman
  2015-04-25 17:45 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
@ 2015-04-25 17:45 ` Mel Gorman
  2015-04-27  2:48   ` Rik van Riel
  2015-04-25 17:45 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
  2015-08-31 16:20 ` [PATCH 0/3] TLB flush multiple pages per IPI v4 Sébastien Wacquiez
  3 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2015-04-25 17:45 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines
get larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI per CPU that was mapping any of those pages to flush the list of
PFNs. A new TLB flush helper is required for this and one is added for
x86. Other architectures will need to decide if batching like this is both
safe and worth the memory overhead. Specifically, the requirement is:

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on, but the window is
much larger with this patch applied, which is worth highlighting.
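
For another architecture, the opt-in amounts to selecting the new Kconfig
symbol and providing one local-flush helper. The sketch below is
illustrative only; arch_local_flush_one() is a placeholder for whatever
single-address local flush primitive the architecture already has (the
real x86 hook is in the tlbflush.h hunk further down):

	/*
	 * Sketch only: in arch/<arch>/include/asm/tlbflush.h, after adding
	 * "select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH" to the arch Kconfig.
	 * arch_local_flush_one() is a placeholder, not an existing primitive.
	 */
	#define flush_local_tlb_addr(addr) arch_local_flush_one(addr)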

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v4
lru-file-mmap-read-elapsed            166.90 (  0.00%)          119.80 ( 28.22%)

               4.0.0       4.0.0
             vanilla batchunmap-v3
User          564.25      623.74
System       6168.10     4196.53
Elapsed       168.29      121.14

This shows that the readers completed 25% faster with 30% less CPU time. From
vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test while the patched kernel
was interrupted roughly 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v4
Ops lru-file-mmap-read-elapsed        22.19 (  0.00%)            19.90 ( 10.32%)

               4.0.0       4.0.0
             vanilla  batchunmap-v4
User           33.49       32.41
System         35.29       33.23
Elapsed        23.07       21.46

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
with relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 include/linux/init_task.h       |   8 +++
 include/linux/rmap.h            |   3 ++
 include/linux/sched.h           |  14 ++++++
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 ++
 mm/internal.h                   |  11 +++++
 mm/rmap.c                       | 105 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  23 ++++++++-
 11 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
+	.tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4c559a45dc8..8d23914b219e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65770d6..5c09db02fe78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1634,6 +1644,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..f519fbb6ac35 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -889,6 +889,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..86c872fec9fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->tlb_ubc);
+	tsk->tlb_ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f53202a..9836a28d001b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..35aba439c275 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,4 +431,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..c06f2ce422e5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important that if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+	int cpu;
+
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+		smp_call_function_many(&tlb_ubc->cpumask,
+			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+	}
+	cpumask_clear(&tlb_ubc->cpumask);
+	tlb_ubc->nr_pages = 0;
+	preempt_enable();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1213,7 +1299,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	if (should_defer_flush(mm, flags)) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * previously clean then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+
+		/* Potentially writable TLBs must be flushed before IO */
+		if (pte_dirty(pteval))
+			flush_tlb_page(vma, address);
+		else
+			set_tlb_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..5121742ccb87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
 	}
 
 	mem_cgroup_uncharge_list(&free_pages);
+	try_to_unmap_flush();
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
@@ -2118,6 +2120,23 @@ out:
 	}
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ */
+static void alloc_tlb_ubc(void)
+{
+	if (!current->tlb_ubc)
+		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
+						GFP_KERNEL | __GFP_NOWARN);
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2152,6 +2171,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
 			 sc->priority == DEF_PRIORITY);
 
+	alloc_tlb_ubc();
+
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] mm: Defer flush of writable TLB entries
  2015-04-25 17:45 [PATCH 0/3] TLB flush multiple pages per IPI v4 Mel Gorman
  2015-04-25 17:45 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
  2015-04-25 17:45 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
@ 2015-04-25 17:45 ` Mel Gorman
  2015-04-27  2:50   ` Rik van Riel
  2015-08-31 16:20 ` [PATCH 0/3] TLB flush multiple pages per IPI v4 Sébastien Wacquiez
  3 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2015-04-25 17:45 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. This
patch defers flushing of potentially writable TLBs as long as possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 mm/internal.h         |  4 ++++
 mm/rmap.c             | 28 +++++++++++++++++++++-------
 mm/vmscan.c           |  7 ++++++-
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c09db02fe78..ea7c466be0bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1282,6 +1282,7 @@ enum perf_event_task_context {
 struct tlbflush_unmap_batch {
 	struct cpumask cpumask;
 	unsigned long nr_pages;
+	bool writable;
 	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 35aba439c275..ae7822bd5659 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,10 +436,14 @@ struct tlbflush_unmap_batch;
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
 }
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
 
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c06f2ce422e5..984d2c258c13 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -621,11 +621,21 @@ void try_to_unmap_flush(void)
 	}
 	cpumask_clear(&tlb_ubc->cpumask);
 	tlb_ubc->nr_pages = 0;
+	tlb_ubc->writable = false;
 	preempt_enable();
 }
 
+/* Flush iff there are potentially writable TLB entries that can race with IO */
+void try_to_unmap_flush_dirty(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	if (tlb_ubc && tlb_ubc->writable)
+		try_to_unmap_flush();
+}
+
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page)
+		struct page *page, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
 
@@ -633,6 +643,14 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
 	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
 	tlb_ubc->nr_pages++;
 
+	/*
+	 * If the PTE was dirty then it's best to assume it's writable. The
+	 * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
+	 * before any IO is initiated on the page.
+	 */
+	if (writable)
+		tlb_ubc->writable = true;
+
 	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
 		try_to_unmap_flush();
 }
@@ -657,7 +675,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page)
+		struct page *page, bool writable)
 {
 }
 
@@ -1309,11 +1327,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		pteval = ptep_get_and_clear(mm, address, pte);
 
-		/* Potentially writable TLBs must be flushed before IO */
-		if (pte_dirty(pteval))
-			flush_tlb_page(vma, address);
-		else
-			set_tlb_ubc_flush_pending(mm, page);
+		set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
 	} else {
 		pteval = ptep_clear_flush(vma, address, pte);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5121742ccb87..0055224c52d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1065,7 +1065,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (!sc->may_writepage)
 				goto keep_locked;
 
-			/* Page is dirty, try to write it out here */
+			/*
+			 * Page is dirty. Flush the TLB if a writable entry
+			 * potentially exists to avoid CPU writes after IO
+			 * starts and then write it out here
+			 */
+			try_to_unmap_flush_dirty();
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-04-25 17:45 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
@ 2015-04-27  2:48   ` Rik van Riel
  0 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2015-04-27  2:48 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/25/2015 01:45 PM, Mel Gorman wrote:
> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.

> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
> 
> The patch will have no impact on workloads with no memory pressure or
> have relatively few mapped pages.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] mm: Defer flush of writable TLB entries
  2015-04-25 17:45 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-04-27  2:50   ` Rik van Riel
  0 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2015-04-27  2:50 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/25/2015 01:45 PM, Mel Gorman wrote:
> If a PTE is unmapped and it's dirty then it was writable recently. Due
> to deferred TLB flushing, it's best to assume a writable TLB cache entry
> exists. With that assumption, the TLB must be flushed before any IO can
> start or the page is freed to avoid lost writes or data corruption. This
> patch defers flushing of potentially writable TLBs as long as possible.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
@ 2015-06-08 12:50 ` Mel Gorman
  2015-06-08 22:38   ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2015-06-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines
get larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI per CPU that was mapping any of those pages to flush the list of
PFNs. A new TLB flush helper is required for this and one is added for
x86. Other architectures will need to decide if batching like this is both
safe and worth the memory overhead. Specifically, the requirement is:

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on, but the window is
much larger with this patch applied, which is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
           4.1.0-rc6     4.1.0-rc6
             vanilla batchunmap-v5
User          577.35        618.60
System       5927.06       4195.03
Elapsed       162.21        121.31

The workload completed 25% faster with 29% less CPU time.

From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test while the patched kernel
was interrupted roughly 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
           4.1.0-rc6     4.1.0-rc6
             vanilla batchunmap-v5
User           59.14         58.86
System        109.15         83.78
Elapsed        27.32         23.14

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
with relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 include/linux/init_task.h       |   8 +++
 include/linux/rmap.h            |   3 ++
 include/linux/sched.h           |  14 ++++++
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 ++
 mm/internal.h                   |  11 +++++
 mm/rmap.c                       | 105 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  23 ++++++++-
 11 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..4e8bd86735af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -31,6 +31,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
+	.tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c89c53a113a8..29446aeef36e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e6122734..57ff61b16565 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1648,6 +1658,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..1ff5d17e518a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,6 +904,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..8ea9e8730ada 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->tlb_ubc);
+	tsk->tlb_ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 123673291ffb..936ce13aeb97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..8cbb68ccc731 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..a8dbba62398a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important that if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+	int cpu;
+
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+		smp_call_function_many(&tlb_ubc->cpumask,
+			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+	}
+	cpumask_clear(&tlb_ubc->cpumask);
+	tlb_ubc->nr_pages = 0;
+	preempt_enable();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1213,7 +1299,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	if (should_defer_flush(mm, flags)) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * previously clean then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+
+		/* Potentially writable TLBs must be flushed before IO */
+		if (pte_dirty(pteval))
+			flush_tlb_page(vma, address);
+		else
+			set_tlb_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..5121742ccb87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
 	}
 
 	mem_cgroup_uncharge_list(&free_pages);
+	try_to_unmap_flush();
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
@@ -2118,6 +2120,23 @@ out:
 	}
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ */
+static void alloc_tlb_ubc(void)
+{
+	if (!current->tlb_ubc)
+		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
+						GFP_KERNEL | __GFP_NOWARN);
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2152,6 +2171,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
 			 sc->priority == DEF_PRIORITY);
 
+	alloc_tlb_ubc();
+
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
@ 2015-06-08 22:38   ` Andrew Morton
  2015-06-09 11:07     ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-06-08 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML

On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman <mgorman@suse.de> wrote:

> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.
> 
> On small machines, this is not a significant problem but as machine
> gets larger with more cores and more memory, the cost of these IPIs can
> be high. This patch uses a structure similar in principle to a pagevec
> to collect a list of PFNs and CPUs that require flushing. It then sends
> one IPI per CPU that was mapping any of those pages to flush the list of
> PFNs. A new TLB flush helper is required for this and one is added for
> x86. Other architectures will need to decide if batching like this is both
> safe and worth the memory overhead. Specifically the requirement is;
> 
> 	If a clean page is unmapped and not immediately flushed, the
> 	architecture must guarantee that a write to that page from a CPU
> 	with a cached TLB entry will trap a page fault.
> 
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.
> 
> The impact of this patch depends on the workload as measuring any benefit
> requires both mapped pages co-located on the LRU and memory pressure. The
> case with the biggest impact is multiple processes reading mapped pages
> taken from the vm-scalability test suite. The test case uses NR_CPU readers
> of mapped files that consume 10*RAM.
> 
> vmscale on a 4-node machine with 64G RAM and 48 CPUs
>            4.1.0-rc6     4.1.0-rc6
>              vanilla batchunmap-v5
> User          577.35        618.60
> System       5927.06       4195.03
> Elapsed       162.21        121.31
> 
> The workload completed 25% faster with 29% less CPU time.
> 
> This is showing that the readers completed 25% with 30% less CPU time. From
> vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> times per second during the steady phase of the test and the patched kernel
> was interrupts 180K times per second.
> 
> The impact is much lower on a small machine
> 
> vmscale on a 1-node machine with 8G RAM and 1 CPU
>            4.1.0-rc6     4.1.0-rc6
>              vanilla batchunmap-v5
> User           59.14         58.86
> System        109.15         83.78
> Elapsed        27.32         23.14
> 
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.

Looks nice.

> The patch will have no impact on workloads with no memory pressure or
> have relatively few mapped pages.

What benefit can we expect to see to any real-world userspace?

> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
>  # define INIT_NUMA_BALANCING(tsk)
>  #endif
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> +	.tlb_ubc = NULL,
> +#else
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> +#endif
> +
>  #ifdef CONFIG_KASAN
>  # define INIT_KASAN(tsk)						\
>  	.kasan_depth = 1,
> @@ -257,6 +264,7 @@ extern struct task_group root_task_group;
>  	INIT_RT_MUTEXES(tsk)						\
>  	INIT_VTIME(tsk)							\
>  	INIT_NUMA_BALANCING(tsk)					\
> +	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
>  	INIT_KASAN(tsk)							\
>  }

We don't really need any of this - init_task starts up all-zero anyway.
Maybe it's useful for documentation reasons (dubious), but I bet we've
already missed fields.

>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1289,6 +1289,16 @@ enum perf_event_task_context {
>  	perf_nr_task_contexts,
>  };
>  
> +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
> +#define BATCH_TLBFLUSH_SIZE 32UL
> +
> +/* Track pages that require TLB flushes */
> +struct tlbflush_unmap_batch {
> +	struct cpumask cpumask;
> +	unsigned long nr_pages;
> +	unsigned long pfns[BATCH_TLBFLUSH_SIZE];

Why are we storing pfn's rather than page*'s?

I'm trying to get my head around what's actually in this structure.

Each thread has one of these, lazily allocated <under circumstances>. 
The cpumask field contains a mask of all the CPUs which have done
<something>.  The careful reader will find mm_struct.cpu_vm_mask_var
and will wonder why it didn't need documenting, sigh.

Wanna fill in the blanks?  As usual, understanding the data structure
is the key to understanding the design, so it's worth a couple of
paragraphs.  With this knowledge, the reader may understand why
try_to_unmap_flush() has preempt_disable() protection but
set_tlb_ubc_flush_pending() doesn't need it!

> +};
> +
>  struct task_struct {
>  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
>  	void *stack;
>
> ...
>
> @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>  	return address;
>  }
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +static void percpu_flush_tlb_batch_pages(void *data)
> +{
> +	struct tlbflush_unmap_batch *tlb_ubc = data;
> +	int i;

Anally speaking, this should be unsigned long (in which case it
shouldn't be called `i'!).  Or make tlbflush_unmap_batch.nr_pages int. 
But int is signed, which is silly.  Whatever ;)


> +	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +	for (i = 0; i < tlb_ubc->nr_pages; i++)
> +		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
> +}
> +
> +/*
> + * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> + * important that if a PTE was dirty when it was unmapped that it's flushed

s/important that/important /

> + * before any IO is initiated on the page to prevent lost writes. Similarly,
> + * it must be flushed before freeing to prevent data leakage.
> + */
> +void try_to_unmap_flush(void)
> +{
> +	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
> +	int cpu;
> +
> +	if (!tlb_ubc || !tlb_ubc->nr_pages)
> +		return;
> +
> +	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
> +
> +	preempt_disable();
> +	cpu = smp_processor_id();

get_cpu()

> +	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
> +		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
> +
> +	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
> +		smp_call_function_many(&tlb_ubc->cpumask,
> +			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
> +	}
> +	cpumask_clear(&tlb_ubc->cpumask);
> +	tlb_ubc->nr_pages = 0;
> +	preempt_enable();

put_cpu()

> +}
>
> ...
>
> +/*
> + * Returns true if the TLB flush should be deferred to the end of a batch of
> + * unmap operations to reduce IPIs.
> + */
> +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> +{
> +	bool should_defer = false;
> +
> +	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
> +		return false;
> +
> +	/* If remote CPUs need to be flushed then defer batch the flush */
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +		should_defer = true;
> +	put_cpu();

What did the get_cpu() protect?

> +	return should_defer;
> +}
> +#else
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
> +		struct page *page)
> +{
> +}
> +
>
> ...
>
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +/*
> + * Allocate the control structure for batch TLB flushing. An allocation
> + * failure is harmless as the reclaimer will send IPIs where necessary.
> + */
> +static void alloc_tlb_ubc(void)
> +{
> +	if (!current->tlb_ubc)
> +		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> +						GFP_KERNEL | __GFP_NOWARN);

A GFP_KERNEL allocation from deep within page reclaim?  Seems imprudent
if only from a stack-usage POV.

> +}
> +#else
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
>
> ...
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 22:38   ` Andrew Morton
@ 2015-06-09 11:07     ` Mel Gorman
  0 siblings, 0 replies; 10+ messages in thread
From: Mel Gorman @ 2015-06-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML

On Mon, Jun 08, 2015 at 03:38:13PM -0700, Andrew Morton wrote:
> On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > recently accessed by other CPUs. There are many circumstances where this
> > happens but the obvious one is kswapd reclaiming pages belonging to a
> > running process as kswapd and the task are likely running on separate CPUs.
> > 
> > On small machines, this is not a significant problem but as machine
> > gets larger with more cores and more memory, the cost of these IPIs can
> > be high. This patch uses a structure similar in principle to a pagevec
> > to collect a list of PFNs and CPUs that require flushing. It then sends
> > one IPI per CPU that was mapping any of those pages to flush the list of
> > PFNs. A new TLB flush helper is required for this and one is added for
> > x86. Other architectures will need to decide if batching like this is both
> > safe and worth the memory overhead. Specifically the requirement is;
> > 
> > 	If a clean page is unmapped and not immediately flushed, the
> > 	architecture must guarantee that a write to that page from a CPU
> > 	with a cached TLB entry will trap a page fault.
> > 
> > This is essentially what the kernel already depends on but the window is
> > much larger with this patch applied and is worth highlighting.
> > 
> > The impact of this patch depends on the workload as measuring any benefit
> > requires both mapped pages co-located on the LRU and memory pressure. The
> > case with the biggest impact is multiple processes reading mapped pages
> > taken from the vm-scalability test suite. The test case uses NR_CPU readers
> > of mapped files that consume 10*RAM.
> > 
> > vmscale on a 4-node machine with 64G RAM and 48 CPUs
> >            4.1.0-rc6     4.1.0-rc6
> >              vanilla batchunmap-v5
> > User          577.35        618.60
> > System       5927.06       4195.03
> > Elapsed       162.21        121.31
> > 
> > The workload completed 25% faster with 29% less CPU time.
> > 
> > This is showing that the readers completed 25% with 30% less CPU time. From
> > vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > times per second during the steady phase of the test and the patched kernel
> > was interrupts 180K times per second.
> > 
> > The impact is much lower on a small machine
> > 
> > vmscale on a 1-node machine with 8G RAM and 1 CPU
> >            4.1.0-rc6     4.1.0-rc6
> >              vanilla batchunmap-v5
> > User           59.14         58.86
> > System        109.15         83.78
> > Elapsed        27.32         23.14
> > 
> > It's still a noticeable improvement with vmstat showing interrupts went
> > from roughly 500K per second to 45K per second.
> 
> Looks nice.
> 
> > The patch will have no impact on workloads with no memory pressure or
> > that have relatively few mapped pages.
> 
> What benefit can we expect to see to any real-world userspace?
> 

Only a small subset of workloads will see a benefit -- ones that mmap a
lot of data with working sets larger than the size of a node. Some
streaming media servers allegedly do this.

Some numerical processing applications may hit this. Those that use glibc
get large buffers via mmap, and if the application's working set is larger
than a NUMA node, those buffers will eventually be unmapped and flushed.
Python/NumPy uses large mappings for large buffers (based on the paper
"Doubling the Performance of Python/Numpy with less than 100 SLOC"). Whether
users of NumPy hit this issue or not depends on whether kswapd is active.
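
As a rough illustration of what I mean by that (not from the patch and not
NumPy code; just a minimal sketch of glibc behaviour for large buffers,
where the 128KB mmap threshold is the usual default):

#include <malloc.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        /* Requests above the mmap threshold bypass the heap and are
         * backed by anonymous mmap() */
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);

        size_t sz = 1UL << 30;          /* a 1GB numerical buffer */
        double *buf = malloc(sz);
        if (!buf)
                return 1;

        /* Touching every page populates PTEs and TLB entries on this CPU;
         * if kswapd later unmaps these pages under memory pressure, every
         * CPU that accessed the mapping needs a TLB flush */
        memset(buf, 0, sz);

        free(buf);
        return 0;
}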

Anecdotally, I'm aware from IRC of one user who is experimenting with a
large HTTP cache and a load generator; the profile made it obvious that a
lot of time was being spent sending IPIs. I asked weeks ago that they post
the results here, which they promised to do but didn't. Unfortunately, I
don't know the person's real name to cc them. Rik might.

Anecdotally, I also believe that Intel hit this with some internal workload
but I'm basing this on idle comments at LSF/MM. They were unwilling or
unable to describe exactly what the test does and against what software.

I know this is more vague than you'd like. By and large, I'm relying on
the assumption that if we are reclaiming mapped pages from kswapd context
then sending one IPI per page is stupid.

> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
> >  # define INIT_NUMA_BALANCING(tsk)
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> > +	.tlb_ubc = NULL,
> > +#else
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> > +#endif
> > +
> >  #ifdef CONFIG_KASAN
> >  # define INIT_KASAN(tsk)						\
> >  	.kasan_depth = 1,
> > @@ -257,6 +264,7 @@ extern struct task_group root_task_group;
> >  	INIT_RT_MUTEXES(tsk)						\
> >  	INIT_VTIME(tsk)							\
> >  	INIT_NUMA_BALANCING(tsk)					\
> > +	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> >  	INIT_KASAN(tsk)							\
> >  }
> 
> We don't really need any of this - init_task starts up all-zero anyway.
> Maybe it's useful for documentation reasons (dubious), but I bet we've
> already missed fields.
> 

True. I'll look into removing it and make sure there is no adverse
impact.

> >
> > ...
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1289,6 +1289,16 @@ enum perf_event_task_context {
> >  	perf_nr_task_contexts,
> >  };
> >  
> > +/* Matches SWAP_CLUSTER_MAX but redefined to limit header dependencies */
> > +#define BATCH_TLBFLUSH_SIZE 32UL
> > +
> > +/* Track pages that require TLB flushes */
> > +struct tlbflush_unmap_batch {
> > +	struct cpumask cpumask;
> > +	unsigned long nr_pages;
> > +	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
> 
> Why are we storing pfn's rather than page*'s?
> 

Because a struct page would require a page-to-pfn lookup within the IPI
handler. That would work but it's unnecessarily wasteful.

> I'm trying to get my head around what's actually in this structure.
> 
> Each thread has one of these, lazily allocated <under circumstances>. 
> The cpumask field contains a mask of all the CPUs which have done
> <something>.  The careful reader will find mm_struct.cpu_vm_mask_var
> and will wonder why it didn't need documenting, sigh.
> 
> Wanna fill in the blanks?  As usual, understanding the data structure
> is the key to understanding the design, so it's worth a couple of
> paragraphs.  With this knowledge, the reader may understand why
> try_to_unmap_flush() has preempt_disable() protection but
> set_tlb_ubc_flush_pending() doesn't need it!
> 

Is this any help?

struct tlbflush_unmap_batch {
        /*
         * Each bit set is a CPU that potentially has a TLB entry for one of
         * the PFNs being flushed. See set_tlb_ubc_flush_pending().
         */
        struct cpumask cpumask;

        /*
         * The number and list of pfns to be flushed. PFNs are tracked instead
         * of struct pages to avoid multiple page->pfn lookups by each CPU that
         * receives an IPI in percpu_flush_tlb_batch_pages().
         */
        unsigned long nr_pages;
        unsigned long pfns[BATCH_TLBFLUSH_SIZE];

        /*
         * If true then the PTE was dirty when unmapped. The entry must be
         * flushed before IO is initiated or a stale TLB entry potentially
         * allows an update without redirtying the page.
         */
        bool writable;
};
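
To fill in the other blank, here is a rough sketch of the unmap side. This
is simplified and from memory, not the exact hunk in the patch (the
flush-when-full step in particular is an assumption about where the batch
gets drained). The point is that it only writes to current->tlb_ubc, which
is private to the task, so unlike try_to_unmap_flush() it has no per-cpu
state to protect and needs no preempt_disable():

/*
 * Sketch: record that @page, mapped by @mm, was unmapped with a
 * potentially dirty PTE and needs a deferred TLB flush.
 */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
                struct page *page, bool writable)
{
        struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;

        /* CPUs that may be caching a TLB entry for this mm */
        cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));

        /* PFNs, not struct pages, to keep the IPI handler cheap */
        tlb_ubc->pfns[tlb_ubc->nr_pages++] = page_to_pfn(page);

        /* A dirty PTE must be flushed before IO starts on the page */
        tlb_ubc->writable |= writable;

        /* Drain the batch once the pfn array is full */
        if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
                try_to_unmap_flush();
}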

> > +};
> > +
> >  struct task_struct {
> >  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
> >  	void *stack;
> >
> > ...
> >
> > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> >  	return address;
> >  }
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +static void percpu_flush_tlb_batch_pages(void *data)
> > +{
> > +	struct tlbflush_unmap_batch *tlb_ubc = data;
> > +	int i;
> 
> Anally speaking, this should be unsigned long (in which case it
> shouldn't be called `i'!).  Or make tlbflush_unmap_batch.nr_pages int. 
> But int is signed, which is silly.  Whatever ;)
> 

I can make it unsigned int which is a micro-optimisation for loop iterators
anyway.

> 
> > +	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> > +	for (i = 0; i < tlb_ubc->nr_pages; i++)
> > +		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
> > +}
> > +
> > +/*
> > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> > + * important that if a PTE was dirty when it was unmapped that it's flushed
> 
> s/important that/important /
> 

Fixed.

> > + * before any IO is initiated on the page to prevent lost writes. Similarly,
> > + * it must be flushed before freeing to prevent data leakage.
> > + */
> > +void try_to_unmap_flush(void)
> > +{
> > +	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
> > +	int cpu;
> > +
> > +	if (!tlb_ubc || !tlb_ubc->nr_pages)
> > +		return;
> > +
> > +	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
> > +
> > +	preempt_disable();
> > +	cpu = smp_processor_id();
> 
> get_cpu()
> 
> > +	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
> > +		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
> > +
> > +	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
> > +		smp_call_function_many(&tlb_ubc->cpumask,
> > +			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
> > +	}
> > +	cpumask_clear(&tlb_ubc->cpumask);
> > +	tlb_ubc->nr_pages = 0;
> > +	preempt_enable();
> 
> put_cpu()
> 

Gack. Fixed.
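
For the archive, the corrected flow I have in mind looks something like
this (a sketch, not the final diff; note it also passes the whole batch
structure to the local flush, which only worked before because cpumask is
the first member):

void try_to_unmap_flush(void)
{
        struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
        int cpu;

        if (!tlb_ubc || !tlb_ubc->nr_pages)
                return;

        trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);

        /* get_cpu() disables preemption and returns the current CPU */
        cpu = get_cpu();
        if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
                percpu_flush_tlb_batch_pages(tlb_ubc);

        if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
                smp_call_function_many(&tlb_ubc->cpumask,
                        percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);

        cpumask_clear(&tlb_ubc->cpumask);
        tlb_ubc->nr_pages = 0;
        put_cpu();
}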

> > +}
> >
> > ...
> >
> > +/*
> > + * Returns true if the TLB flush should be deferred to the end of a batch of
> > + * unmap operations to reduce IPIs.
> > + */
> > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> > +{
> > +	bool should_defer = false;
> > +
> > +	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
> > +		return false;
> > +
> > +	/* If remote CPUs need to be flushed then defer batch the flush */
> > +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> > +		should_defer = true;
> > +	put_cpu();
> 
> What did the get_cpu() protect?
> 

To safely look up the current CPU number. smp_processor_id() is only safe
in specific circumstances (such as with preemption disabled) and I did not
think raw_smp_processor_id() was justified.

> > +	return should_defer;
> > +}
> > +#else
> > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
> > +		struct page *page)
> > +{
> > +}
> > +
> >
> > ...
> >
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +/*
> > + * Allocate the control structure for batch TLB flushing. An allocation
> > + * failure is harmless as the reclaimer will send IPIs where necessary.
> > + */
> > +static void alloc_tlb_ubc(void)
> > +{
> > +	if (!current->tlb_ubc)
> > +		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> > +						GFP_KERNEL | __GFP_NOWARN);
> 
> A GFP_KERNEL allocation from deep within page reclaim?  Seems imprudent
> if only from a stack-usage POV.
> 

When we call it, we are in PF_MEMALLOC context, either set by the page
allocator for direct reclaim or because kswapd always sets it. This limits
the stack depth. It would be comparable to the depth we reach during normal
reclaim calling into shrink_page_list and everything below it, so we should
be safe. Granted, the dependency on PF_MEMALLOC is not obvious at all.
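
For anyone following along, this is roughly where that context comes from.
It paraphrases existing mm behaviour rather than anything this series adds,
with the call chains compressed into comments:

/* Direct reclaim: the page allocator flags the task around the call */
static void direct_reclaim_sketch(void)
{
        current->flags |= PF_MEMALLOC;
        /* try_to_free_pages() -> shrink_page_list() -> try_to_unmap() */
        current->flags &= ~PF_MEMALLOC;
}

/* kswapd sets the flag once and keeps it for its whole lifetime */
static void kswapd_sketch(void)
{
        current->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
        /* balance_pgdat() -> shrink_page_list() -> try_to_unmap() */
}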

> > +}
> > +#else
> > +static inline void alloc_tlb_ubc(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
> >
> > ...
> >
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v4
  2015-04-25 17:45 [PATCH 0/3] TLB flush multiple pages per IPI v4 Mel Gorman
                   ` (2 preceding siblings ...)
  2015-04-25 17:45 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-08-31 16:20 ` Sébastien Wacquiez
  3 siblings, 0 replies; 10+ messages in thread
From: Sébastien Wacquiez @ 2015-08-31 16:20 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML

On 04/25/2015 07:45 PM, Mel Gorman wrote:

> The performance impact is documented in the changelogs but in the optimistic
> case on a 4-socket machine the full series reduces interrupts from 900K
> interrupts/second to 60K interrupts/second.


Hello to the list,


This patch has a huge (positive) performance impact on my setup.

With the goal of building the best ever CDN, I run the varnish web cache on
very big boxes (dual 12-core Xeons, 256 GB RAM, 24 SSDs, 2*40G ethernet).

Without going into varnish internals, it helps to know that varnish has
multiple storage backends (memory, file, etc), and that the file backend
(the one you use when you have cache drives) doesn't use read/write
syscalls but mmap.

The raw performance of this server is very good: when using varnish with
memory storage only, it easily pushes 80Gbps of network traffic. When
reading/writing from/to the drives, you get 10GB/s of data. And you can do
both at the same time without performance loss.

Anyway, without this patch, using the file storage backend and after warmup,
the performance of the server was limited to a frustrating 14 Gbps. At the
start, varnish reads from the http backend at ~30 Gbps, caches the data in
its huge mmap, the system writes it to disk and streams it to the client, so
everything looks ok. But instead of getting quicker as the hit rate goes up
(since we already have data in the cache), it became slower and slower, and
finally froze for 4-5 seconds every 10 seconds or so.

After analysis, I found out the bottleneck is the system's capacity to find
free memory. If I understand it correctly, when you read a "swapped out"
page of an mmaped file, the kernel has to find some free memory to put the
data it will read from the drive. In my case, the disks are quick enough to
handle the churn almost in real time, so I have a lot of potentially free
memory (i.e. Inactive(file)). Actually freeing this memory (whether in
direct or background reclaim) is done relatively slowly; after some tuning
to avoid any direct reclaim (which was causing the freezes), I ended up with
2 kswapd processes (it's a dual-socket NUMA machine) eating 100% of a CPU
each for ~14 Gbps of traffic (or ~1.5 million reclaims/s).
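
For context, here is a much simplified sketch of the access pattern I am
describing; this is not varnish code, and the path and sizes are made up:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_SIZE      (4UL << 40)     /* cache spans the SSDs, far larger than RAM */

int main(void)
{
        int fd = open("/cache/objects.bin", O_RDWR);
        if (fd < 0)
                return 1;

        /* The file storage backend maps the whole cache file; hits and
         * misses go through the page cache, not read()/write() */
        char *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED)
                return 1;

        /* Serving traffic touches pages that may have been reclaimed;
         * each fault needs free memory, and under pressure kswapd unmaps
         * other mapped pages -- which, before this patch, meant one IPI
         * per unmapped page */
        for (size_t off = 0; off < CACHE_SIZE; off += 4096)
                (void)cache[off];

        munmap(cache, CACHE_SIZE);
        close(fd);
        return 0;
}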

After a chat with Rik van Riel and Mel Gorman, they suggested I try this
patch, and the limit immediately jumped to 33 Gbps, which was in fact my
upstream capacity; after a while I was able to achieve 60 Gbps without
experiencing any issue.
Even the freezing, which happens in direct reclaim mode, is a lot smoother;
on my test rig it is sufficiently quick not to be registered as
unavailability by my monitoring (which wasn't the case before).

The bad news is that after some time (around 24h) of stress testing, the
performance degrades, I guess due to some kind of fragmentation. Still,
performance stays at a higher level than with the vanilla kernel.

I suppose that this patch could also help a lot with databases (which often
mmap their data) that have to reread huge datasets frequently.


Thanks a lot to Rik and Mel for the help, and feel free to mail me if you
have questions.


Regards,


Sebastien Wacquiez

PS: the tests were conducted with a 4.0.0 kernel.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-08-31 16:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-25 17:45 [PATCH 0/3] TLB flush multiple pages per IPI v4 Mel Gorman
2015-04-25 17:45 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-04-25 17:45 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
2015-04-27  2:48   ` Rik van Riel
2015-04-25 17:45 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
2015-04-27  2:50   ` Rik van Riel
2015-08-31 16:20 ` [PATCH 0/3] TLB flush multiple pages per IPI v4 Sébastien Wacquiez
  -- strict thread matches above, loose matches on Subject: below --
2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
2015-06-08 22:38   ` Andrew Morton
2015-06-09 11:07     ` Mel Gorman

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).