From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f45.google.com (mail-wg0-f45.google.com [74.125.82.45]) by kanga.kvack.org (Postfix) with ESMTP id 5546D6B0032 for ; Mon, 8 Jun 2015 08:51:00 -0400 (EDT) Received: by wgv5 with SMTP id 5so102579596wgv.1 for ; Mon, 08 Jun 2015 05:50:59 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id wo1si4931339wjc.207.2015.06.08.05.50.58 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 08 Jun 2015 05:50:58 -0700 (PDT) From: Mel Gorman Subject: [PATCH 0/3] TLB flush multiple pages per IPI v5 Date: Mon, 8 Jun 2015 13:50:51 +0100 Message-Id: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman Changelog since V4 o Rebase to 4.1-rc6 Changelog since V3 o Drop batching of TLB flush from migration o Redo how larger batching is managed o Batch TLB flushes when writable entries exist When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series simply increases the window so multiple pages can be flushed using a single IPI. This *should* be safe or the kernel is hosed already but I've cc'd the x86 maintainers and some of the Intel folk for comment. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIS. Patch 2 collects a list of PFNs and sends one IPI to flush them all Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/mm/tlb.c | 1 + include/linux/init_task.h | 8 +++ include/linux/mm_types.h | 1 + include/linux/rmap.h | 3 + include/linux/sched.h | 15 +++++ include/trace/events/tlb.h | 3 +- init/Kconfig | 8 +++ kernel/fork.c | 5 ++ kernel/sched/core.c | 3 + mm/internal.h | 15 +++++ mm/rmap.c | 119 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 30 +++++++++- 14 files changed, 210 insertions(+), 4 deletions(-) -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id 0E70B6B0038 for ; Mon, 8 Jun 2015 08:51:01 -0400 (EDT) Received: by wgez8 with SMTP id z8so102465731wge.0 for ; Mon, 08 Jun 2015 05:51:00 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id m3si4941949wjx.191.2015.06.08.05.50.58 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 08 Jun 2015 05:50:58 -0700 (PDT) From: Mel Gorman Subject: [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Date: Mon, 8 Jun 2015 13:50:52 +0100 Message-Id: <1433767854-24408-2-git-send-email-mgorman@suse.de> In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Reviewed-by: Dave Hansen --- arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/trace/events/tlb.h | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3250f2371aea..2da824c1c140 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, info.flush_end = end; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start); if (is_uv_system()) { unsigned int cpu; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8d37e26a1007..86ad9f902042 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -534,6 +534,7 @@ enum tlb_flush_reason { TLB_REMOTE_SHOOTDOWN, TLB_LOCAL_SHOOTDOWN, TLB_LOCAL_MM_SHOOTDOWN, + TLB_REMOTE_SEND_IPI, NR_TLB_FLUSH_REASONS, }; diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h index 4250f364a6ca..bc8815f45f3b 100644 --- a/include/trace/events/tlb.h +++ b/include/trace/events/tlb.h @@ -11,7 +11,8 @@ EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \ EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \ EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \ - EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) + EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \ + EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" ) /* * First define the enums in TLB_FLUSH_REASON to be exported to userspace -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id 8A2ED900015 for ; Mon, 8 Jun 2015 08:51:03 -0400 (EDT) Received: by wiwd19 with SMTP id d19so85194414wiw.0 for ; Mon, 08 Jun 2015 05:51:03 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id h2si967945wiv.100.2015.06.08.05.50.59 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 08 Jun 2015 05:50:59 -0700 (PDT) From: Mel Gorman Subject: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Date: Mon, 8 Jun 2015 13:50:53 +0100 Message-Id: <1433767854-24408-3-git-send-email-mgorman@suse.de> In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman An IPI is sent to flush remote TLBs when a page is unmapped that was recently accessed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a structure similar in principle to a pagevec to collect a list of PFNs and CPUs that require flushing. It then sends one IPI per CPU that was mapping any of those pages to flush the list of PFNs. A new TLB flush helper is required for this and one is added for x86. Other architectures will need to decide if batching like this is both safe and worth the memory overhead. Specifically the requirement is; If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that page from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. vmscale on a 4-node machine with 64G RAM and 48 CPUs 4.1.0-rc6 4.1.0-rc6 vanilla batchunmap-v5 User 577.35 618.60 System 5927.06 4195.03 Elapsed 162.21 121.31 The workload completed 25% faster with 29% less CPU time. This is showing that the readers completed 25% with 30% less CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is much lower on a small machine vmscale on a 1-node machine with 8G RAM and 1 CPU 4.1.0-rc6 4.1.0-rc6 vanilla batchunmap-v5 User 59.14 58.86 System 109.15 83.78 Elapsed 27.32 23.14 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. Signed-off-by: Mel Gorman --- arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + include/linux/init_task.h | 8 +++ include/linux/rmap.h | 3 ++ include/linux/sched.h | 14 ++++++ init/Kconfig | 8 +++ kernel/fork.c | 5 ++ kernel/sched/core.c | 3 ++ mm/internal.h | 11 +++++ mm/rmap.c | 105 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 23 ++++++++- 11 files changed, 181 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 226d5696e1d1..4e8bd86735af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -31,6 +31,7 @@ config X86 select ARCH_MIGHT_HAVE_PC_SERIO select HAVE_AOUT if X86_32 select HAVE_UNSTABLE_SCHED_CLOCK + select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 select ARCH_SUPPORTS_INT128 if X86_64 select HAVE_IDE diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index cd791948b286..10c197a649f5 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr) * and page-granular flushes are available only on i486 and up. */ +#define flush_local_tlb_addr(addr) __flush_tlb_single(addr) + #ifndef CONFIG_SMP /* "_up" is for UniProcessor. diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 696d22312b31..0771937b47e1 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -175,6 +175,13 @@ extern struct task_group root_task_group; # define INIT_NUMA_BALANCING(tsk) #endif +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ + .tlb_ubc = NULL, +#else +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) +#endif + #ifdef CONFIG_KASAN # define INIT_KASAN(tsk) \ .kasan_depth = 1, @@ -257,6 +264,7 @@ extern struct task_group root_task_group; INIT_RT_MUTEXES(tsk) \ INIT_VTIME(tsk) \ INIT_NUMA_BALANCING(tsk) \ + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ INIT_KASAN(tsk) \ } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c89c53a113a8..29446aeef36e 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -89,6 +89,9 @@ enum ttu_flags { TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ + TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible + * and caller guarantees they will + * do a final flush if necessary */ }; #ifdef CONFIG_MMU diff --git a/include/linux/sched.h b/include/linux/sched.h index 26a2e6122734..57ff61b16565 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1289,6 +1289,16 @@ enum perf_event_task_context { perf_nr_task_contexts, }; +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ +#define BATCH_TLBFLUSH_SIZE 32UL + +/* Track pages that require TLB flushes */ +struct tlbflush_unmap_batch { + struct cpumask cpumask; + unsigned long nr_pages; + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ void *stack; @@ -1648,6 +1658,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + struct tlbflush_unmap_batch *tlb_ubc; +#endif + struct rcu_head rcu; /* diff --git a/init/Kconfig b/init/Kconfig index dc24dec60232..1ff5d17e518a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -904,6 +904,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING bool # +# For architectures that have a local TLB flush for a PFN without knowledge +# of the VMA. The architecture must provide guarantees on what happens if +# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c +# near the check for should_defer_flush. +config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + bool + +# # For architectures that know their GCC __int128 support is sound # config ARCH_SUPPORTS_INT128 diff --git a/kernel/fork.c b/kernel/fork.c index 03c1eaaa6ef5..8ea9e8730ada 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk) delayacct_tsk_free(tsk); put_signal_struct(tsk->signal); +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + kfree(tsk->tlb_ubc); + tsk->tlb_ubc = NULL; +#endif + if (!profile_handoff_task(tsk)) free_task(tsk); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 123673291ffb..936ce13aeb97 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->numa_group = NULL; #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + p->tlb_ubc = NULL; +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ } #ifdef CONFIG_NUMA_BALANCING diff --git a/mm/internal.h b/mm/internal.h index a25e359a4039..8cbb68ccc731 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, #define ALLOC_CMA 0x80 /* allow allocations from CMA areas */ #define ALLOC_FAIR 0x100 /* fair zone allocation */ +enum ttu_flags; +struct tlbflush_unmap_batch; + +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +void try_to_unmap_flush(void); +#else +static inline void try_to_unmap_flush(void) +{ +} + +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ #endif /* __MM_INTERNAL_H */ diff --git a/mm/rmap.c b/mm/rmap.c index 24dd3f9fee27..a8dbba62398a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -60,6 +60,8 @@ #include +#include + #include "internal.h" static struct kmem_cache *anon_vma_cachep; @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +static void percpu_flush_tlb_batch_pages(void *data) +{ + struct tlbflush_unmap_batch *tlb_ubc = data; + int i; + + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); + for (i = 0; i < tlb_ubc->nr_pages; i++) + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); +} + +/* + * Flush TLB entries for recently unmapped pages from remote CPUs. It is + * important that if a PTE was dirty when it was unmapped that it's flushed + * before any IO is initiated on the page to prevent lost writes. Similarly, + * it must be flushed before freeing to prevent data leakage. + */ +void try_to_unmap_flush(void) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + int cpu; + + if (!tlb_ubc || !tlb_ubc->nr_pages) + return; + + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); + + preempt_disable(); + cpu = smp_processor_id(); + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); + + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { + smp_call_function_many(&tlb_ubc->cpumask, + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); + } + cpumask_clear(&tlb_ubc->cpumask); + tlb_ubc->nr_pages = 0; + preempt_enable(); +} + +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, + struct page *page) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + + cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm)); + tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page); + tlb_ubc->nr_pages++; + + if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE) + try_to_unmap_flush(); +} + +/* + * Returns true if the TLB flush should be deferred to the end of a batch of + * unmap operations to reduce IPIs. + */ +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) +{ + bool should_defer = false; + + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) + return false; + + /* If remote CPUs need to be flushed then defer batch the flush */ + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) + should_defer = true; + put_cpu(); + + return should_defer; +} +#else +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, + struct page *page) +{ +} + +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) +{ + return false; +} +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ + /* * At what user virtual address is page expected in vma? * Caller should check the page is actually part of the vma. @@ -1213,7 +1299,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + if (should_defer_flush(mm, flags)) { + /* + * We clear the PTE but do not flush so potentially a remote + * CPU could still be writing to the page. If the entry was + * previously clean then the architecture must guarantee that + * a clear->dirty transition on a cached TLB entry is written + * through and traps if the PTE is unmapped. + */ + pteval = ptep_get_and_clear(mm, address, pte); + + /* Potentially writable TLBs must be flushed before IO */ + if (pte_dirty(pteval)) + flush_tlb_page(vma, address); + else + set_tlb_ubc_flush_pending(mm, page); + } else { + pteval = ptep_clear_flush(vma, address, pte); + } /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8eadd71bac..5121742ccb87 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, * processes. Try to unmap it here. */ if (page_mapped(page) && mapping) { - switch (try_to_unmap(page, ttu_flags)) { + switch (try_to_unmap(page, + ttu_flags|TTU_BATCH_FLUSH)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: @@ -1175,6 +1176,7 @@ keep: } mem_cgroup_uncharge_list(&free_pages); + try_to_unmap_flush(); free_hot_cold_page_list(&free_pages, true); list_splice(&ret_pages, page_list); @@ -2118,6 +2120,23 @@ out: } } +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +/* + * Allocate the control structure for batch TLB flushing. An allocation + * failure is harmless as the reclaimer will send IPIs where necessary. + */ +static void alloc_tlb_ubc(void) +{ + if (!current->tlb_ubc) + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), + GFP_KERNEL | __GFP_NOWARN); +} +#else +static inline void alloc_tlb_ubc(void) +{ +} +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ + /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ @@ -2152,6 +2171,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness, scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() && sc->priority == DEF_PRIORITY); + alloc_tlb_ubc(); + blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id 2B21F900015 for ; Mon, 8 Jun 2015 08:51:06 -0400 (EDT) Received: by wibut5 with SMTP id ut5so85172077wib.1 for ; Mon, 08 Jun 2015 05:51:05 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id gf1si1006151wib.52.2015.06.08.05.51.00 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 08 Jun 2015 05:51:00 -0700 (PDT) From: Mel Gorman Subject: [PATCH 3/3] mm: Defer flush of writable TLB entries Date: Mon, 8 Jun 2015 13:50:54 +0100 Message-Id: <1433767854-24408-4-git-send-email-mgorman@suse.de> In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman If a PTE is unmapped and it's dirty then it was writable recently. Due to deferred TLB flushing, it's best to assume a writable TLB cache entry exists. With that assumption, the TLB must be flushed before any IO can start or the page is freed to avoid lost writes or data corruption. This patch defers flushing of potentially writable TLBs as long as possible. Signed-off-by: Mel Gorman --- include/linux/sched.h | 1 + mm/internal.h | 4 ++++ mm/rmap.c | 28 +++++++++++++++++++++------- mm/vmscan.c | 7 ++++++- 4 files changed, 32 insertions(+), 8 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 57ff61b16565..827d9b123bd5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1296,6 +1296,7 @@ enum perf_event_task_context { struct tlbflush_unmap_batch { struct cpumask cpumask; unsigned long nr_pages; + bool writable; unsigned long pfns[BATCH_TLBFLUSH_SIZE]; }; diff --git a/mm/internal.h b/mm/internal.h index 8cbb68ccc731..ecf47a01420d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -438,10 +438,14 @@ struct tlbflush_unmap_batch; #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH void try_to_unmap_flush(void); +void try_to_unmap_flush_dirty(void); #else static inline void try_to_unmap_flush(void) { } +static inline void try_to_unmap_flush_dirty(void) +{ +} #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ #endif /* __MM_INTERNAL_H */ diff --git a/mm/rmap.c b/mm/rmap.c index a8dbba62398a..3c8ebacfe6ef 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -621,11 +621,21 @@ void try_to_unmap_flush(void) } cpumask_clear(&tlb_ubc->cpumask); tlb_ubc->nr_pages = 0; + tlb_ubc->writable = false; preempt_enable(); } +/* Flush iff there are potentially writable TLB entries that can race with IO */ +void try_to_unmap_flush_dirty(void) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + + if (tlb_ubc && tlb_ubc->writable) + try_to_unmap_flush(); +} + static void set_tlb_ubc_flush_pending(struct mm_struct *mm, - struct page *page) + struct page *page, bool writable) { struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; @@ -633,6 +643,14 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page); tlb_ubc->nr_pages++; + /* + * If the PTE was dirty then it's best to assume it's writable. The + * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush() + * before the page any IO is initiated. + */ + if (writable) + tlb_ubc->writable = true; + if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE) try_to_unmap_flush(); } @@ -657,7 +675,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) } #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, - struct page *page) + struct page *page, bool writable) { } @@ -1309,11 +1327,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, */ pteval = ptep_get_and_clear(mm, address, pte); - /* Potentially writable TLBs must be flushed before IO */ - if (pte_dirty(pteval)) - flush_tlb_page(vma, address); - else - set_tlb_ubc_flush_pending(mm, page); + set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval)); } else { pteval = ptep_clear_flush(vma, address, pte); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 5121742ccb87..0055224c52d4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1065,7 +1065,12 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (!sc->may_writepage) goto keep_locked; - /* Page is dirty, try to write it out here */ + /* + * Page is dirty. Flush the TLB if a writable entry + * potentially exists to avoid CPU writes after IO + * starts and then write it out here + */ + try_to_unmap_flush_dirty(); switch (pageout(page, mapping, sc)) { case PAGE_KEEP: goto keep_locked; -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f52.google.com (mail-la0-f52.google.com [209.85.215.52]) by kanga.kvack.org (Postfix) with ESMTP id D01AD6B0032 for ; Mon, 8 Jun 2015 13:45:57 -0400 (EDT) Received: by labko7 with SMTP id ko7so102599384lab.2 for ; Mon, 08 Jun 2015 10:45:57 -0700 (PDT) Received: from mail-wi0-x233.google.com (mail-wi0-x233.google.com. [2a00:1450:400c:c05::233]) by mx.google.com with ESMTPS id ck8si2687061wib.55.2015.06.08.10.45.55 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jun 2015 10:45:56 -0700 (PDT) Received: by wigg3 with SMTP id g3so60443489wig.1 for ; Mon, 08 Jun 2015 10:45:55 -0700 (PDT) Date: Mon, 8 Jun 2015 19:45:51 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608174551.GA27558@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Mel Gorman wrote: > Changelog since V4 > o Rebase to 4.1-rc6 > > Changelog since V3 > o Drop batching of TLB flush from migration > o Redo how larger batching is managed > o Batch TLB flushes when writable entries exist > > When unmapping pages it is necessary to flush the TLB. If that page was > accessed by another CPU then an IPI is used to flush the remote CPU. That > is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. > > There already is a window between when a page is unmapped and when it is > TLB flushed. This series simply increases the window so multiple pages > can be flushed using a single IPI. This *should* be safe or the kernel is > hosed already but I've cc'd the x86 maintainers and some of the Intel folk > for comment. > > Patch 1 simply made the rest of the series easier to write as ftrace > could identify all the senders of TLB flush IPIS. > > Patch 2 collects a list of PFNs and sends one IPI to flush them all > > Patch 3 tracks when there potentially are writable TLB entries that > need to be batched differently > > The performance impact is documented in the changelogs but in the optimistic > case on a 4-socket machine the full series reduces interrupts from 900K > interrupts/second to 60K interrupts/second. Yeah, so I think batching writable flushes is useful I think, but I disagree with one aspect of it: with the need to gather _pfns_ and batch them over to the remote CPU. As per my measurements the __flush_tlb_single() primitive (which you use in patch #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 pages and gets exponentially worse. It's probably done in microcode and its performance is horrible. So have you explored the possibility to significantly simplify your patch-set by only deferring the flushing, and doing a simple TLB flush on the remote CPU? As per your measurements there must be tons and tons of flushes of lots of pages, the pfn tracking simply does not make sense. That way there's no memory overhead and no complex tracking of pfns - we'd basically track a simple deferred-flush bit instead. We'd still have the benefits of batching the IPIs, which is the main win. I strongly suspect that your numbers will become even better with such a variant. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f177.google.com (mail-pd0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id 05CE16B0032 for ; Mon, 8 Jun 2015 14:30:21 -0400 (EDT) Received: by pdbnf5 with SMTP id nf5so110491390pdb.2 for ; Mon, 08 Jun 2015 11:30:20 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id pw8si5314515pdb.85.2015.06.08.11.30.19 for ; Mon, 08 Jun 2015 11:30:20 -0700 (PDT) Message-ID: <5575DD33.3000400@intel.com> Date: Mon, 08 Jun 2015 11:21:39 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> In-Reply-To: <20150608174551.GA27558@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar , Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On 06/08/2015 10:45 AM, Ingo Molnar wrote: > As per my measurements the __flush_tlb_single() primitive (which you use in patch > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > pages and gets exponentially worse. It's probably done in microcode and its > performance is horrible. I discussed this a bit in commit a5102476a2. I'd be curious what numbers you came up with. But, don't we have to take in to account the cost of refilling the TLB in addition to the cost of emptying it? The TLB size is historically increasing on a per-core basis, so isn't this refill cost only going to get worse? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 37ADD6B006C for ; Mon, 8 Jun 2015 15:52:43 -0400 (EDT) Received: by wigg3 with SMTP id g3so63670957wig.1 for ; Mon, 08 Jun 2015 12:52:42 -0700 (PDT) Received: from mail-wg0-x230.google.com (mail-wg0-x230.google.com. [2a00:1450:400c:c00::230]) by mx.google.com with ESMTPS id u3si3366163wiy.42.2015.06.08.12.52.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jun 2015 12:52:41 -0700 (PDT) Received: by wgbgq6 with SMTP id gq6so111617336wgb.3 for ; Mon, 08 Jun 2015 12:52:41 -0700 (PDT) Date: Mon, 8 Jun 2015 21:52:37 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608195237.GA15429@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5575DD33.3000400@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Dave Hansen wrote: > On 06/08/2015 10:45 AM, Ingo Molnar wrote: > > As per my measurements the __flush_tlb_single() primitive (which you use in patch > > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > > pages and gets exponentially worse. It's probably done in microcode and its > > performance is horrible. > > I discussed this a bit in commit a5102476a2. I'd be curious what > numbers you came up with. ... which for those of us who don't have sha1's cached in their brain is: a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)") ;-) So what I measured agrees generally with the comment you added in the commit: + * Each single flush is about 100 ns, so this caps the maximum overhead at + * _about_ 3,000 ns. Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity! A CR3 driven TLB flush takes less time than a single INVLPG (!): [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles it's true that a full flush has hidden costs not measured above, because it has knock-on effects (because it drops non-global TLB entries), but it's not _that_ bad due to: - there almost always being a L1 or L2 cache miss when a TLB miss occurs, which latency can be overlaid - global bit being held for kernel entries - user-space with high memory pressure trashing through TLBs typically ... and especially with caches and Intel's historically phenomenally low TLB refill latency it's difficult to measure the effects of local TLB refills, let alone measure it in any macro benchmark. Cross-CPU flushes are expensive, absolutely no argument about that - my suggestion here is to keep the batching but simplify it: because I strongly suspect that the biggest win is the batching, not the pfn queueing. We might even win a bit more performance due to the simplification. > But, don't we have to take in to account the cost of refilling the TLB in > addition to the cost of emptying it? The TLB size is historically increasing on > a per-core basis, so isn't this refill cost only going to get worse? Only if TLB refill latency sucks - but Intel's is very good and AMD's is pretty good as well. Also, usually if you miss the TLB you miss the cache line as well (you definitely miss the L1 cache, and TLB caches are sized to hold a fair chunk of your L2 cache), and the CPU can overlap the two latencies. So while it might sound counter-intuitive, a full TLB flush might be faster than trying to do software based TLB cache management ... INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as clear-cut as it might look. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 12A8F6B006C for ; Mon, 8 Jun 2015 16:03:14 -0400 (EDT) Received: by wiga1 with SMTP id a1so97951087wig.0 for ; Mon, 08 Jun 2015 13:03:13 -0700 (PDT) Received: from mail-wi0-x22b.google.com (mail-wi0-x22b.google.com. [2a00:1450:400c:c05::22b]) by mx.google.com with ESMTPS id fa2si3410348wib.27.2015.06.08.13.03.12 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jun 2015 13:03:12 -0700 (PDT) Received: by wiwd19 with SMTP id d19so97392456wiw.0 for ; Mon, 08 Jun 2015 13:03:12 -0700 (PDT) Date: Mon, 8 Jun 2015 22:03:08 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608200308.GA16978@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150608195237.GA15429@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Ingo Molnar wrote: > So what I measured agrees generally with the comment you added in the commit: > > + * Each single flush is about 100 ns, so this caps the maximum overhead at > + * _about_ 3,000 ns. > > Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity! > > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles > > it's true that a full flush has hidden costs not measured above, because it has > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > bad due to: > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > which latency can be overlaid > > - global bit being held for kernel entries > > - user-space with high memory pressure trashing through TLBs typically I also have cache-cold numbers from another (Intel) system: [ 0.176473] x86/bench:########################################################################## [ 0.185656] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ 1.234448] x86/bench: Cost of: null : 35 / 73 cycles [ ........] [ 27.930451] x86/bench:######## MM instructions: ###################################### [ 28.979251] x86/bench: Cost of: __flush_tlb() fn : 251 / 366 cycles [ 30.028795] x86/bench: Cost of: __flush_tlb_global() fn : 746 / 1795 cycles [ 31.077862] x86/bench: Cost of: __flush_tlb_one() fn : 237 / 883 cycles [ 32.127371] x86/bench: Cost of: __flush_tlb_range() fn : 312 / 1603 cycles [ 35.254202] x86/bench: Cost of: wbinvd() insn : 2491761 / 2491922 cycles Note how the numbers are even worse in the cache-cold case: the algorithmic complexity of __flush_tlb_range() versus __flush_tlb() makes it run slower (because we miss the I$), while the TLB cache-preservation argument is probably weaker, because when we are cache cold then TLB refill latency probably matters less (as it can be overlapped). So __flush_tlb_range() is software trying to beat hardware, and that's almost always a bad idea on x86. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id BA0476B0070 for ; Mon, 8 Jun 2015 17:07:59 -0400 (EDT) Received: by payr10 with SMTP id r10so105478548pay.1 for ; Mon, 08 Jun 2015 14:07:59 -0700 (PDT) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTP id tz3si5757726pbc.224.2015.06.08.14.07.58 for ; Mon, 08 Jun 2015 14:07:58 -0700 (PDT) Message-ID: <5576042E.9030001@intel.com> Date: Mon, 08 Jun 2015 14:07:58 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> In-Reply-To: <20150608195237.GA15429@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On 06/08/2015 12:52 PM, Ingo Molnar wrote: > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles How was that measured, btw? Are these instructions running in a loop? Does __flush_tlb_one() include the tracepoint? (From the commit I referenced) This was (probably) using a different method than you did, but "FULL" below is __flush_tlb() while "1" is __flush_tlb_one(). The "cycles" includes some overhead from the tracing: > FULL: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 > 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 So it looks like we've got some discrepancy, either from the test methodology or the CPU. All of the code and my methodology are in the commit. Could you share yours? > it's true that a full flush has hidden costs not measured above, because it has > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > bad due to: > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > which latency can be overlaid > > - global bit being held for kernel entries > > - user-space with high memory pressure trashing through TLBs typically > > ... and especially with caches and Intel's historically phenomenally low TLB > refill latency it's difficult to measure the effects of local TLB refills, let > alone measure it in any macro benchmark. All that you're saying there is that you need to consider how TLB misses act in _practice_ and not just measure worst-case or theoretical TLB miss cost. I completely agree with that. ... > INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as > clear-cut as it might look. It's clear as mud! I'd be very interested to see any numbers for how this affects real workloads. I've been unable to find anything that was measurably affected by invlpg vs a full flush. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id B27BF6B0038 for ; Mon, 8 Jun 2015 17:51:00 -0400 (EDT) Received: by wibdq8 with SMTP id dq8so3617446wib.1 for ; Mon, 08 Jun 2015 14:51:00 -0700 (PDT) Received: from mail-wi0-x22a.google.com (mail-wi0-x22a.google.com. [2a00:1450:400c:c05::22a]) by mx.google.com with ESMTPS id jv5si3863133wid.14.2015.06.08.14.50.58 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jun 2015 14:50:59 -0700 (PDT) Received: by wiwd19 with SMTP id d19so615635wiw.0 for ; Mon, 08 Jun 2015 14:50:58 -0700 (PDT) Date: Mon, 8 Jun 2015 23:50:54 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608215054.GB30566@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> <5576042E.9030001@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5576042E.9030001@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Dave Hansen wrote: > On 06/08/2015 12:52 PM, Ingo Molnar wrote: > > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles > > How was that measured, btw? Are these instructions running in a loop? Yes - see the x86 benchmarking patch in the big FPU submission for an earlier version. > Does __flush_tlb_one() include the tracepoint? No tracing overhead. > (From the commit I referenced) This was (probably) using a different method than > you did, but "FULL" below is __flush_tlb() while "1" is __flush_tlb_one(). The > "cycles" includes some overhead from the tracing: > > > FULL: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 > > 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 > > So it looks like we've got some discrepancy, either from the test methodology or > the CPU. All of the code and my methodology are in the commit. Could you share > yours? Yes, you can reproduce it by applying this patch from the FPU series: Subject: [PATCH 207/208] x86/fpu: Add FPU performance measurement subsystem (you were Cc:-ed to it, so it should be in your inbox.) I've got a more advanced version meanwhile, will post it in the next couple of days or so. > > it's true that a full flush has hidden costs not measured above, because it has > > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > > bad due to: > > > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > > which latency can be overlaid > > > > - global bit being held for kernel entries > > > > - user-space with high memory pressure trashing through TLBs typically > > > > ... and especially with caches and Intel's historically phenomenally low TLB > > refill latency it's difficult to measure the effects of local TLB refills, let > > alone measure it in any macro benchmark. > > All that you're saying there is that you need to consider how TLB misses act in > _practice_ and not just measure worst-case or theoretical TLB miss cost. I > completely agree with that. So I'm saying considerably more than that: I consider it likely that a full TLB flush is not nearly as costly as assumed, for the three reasons outlined above. It might even be a performance win in Mel's benchmark - although possibly not measurable within measurement noise levels. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id 693306B0038 for ; Mon, 8 Jun 2015 18:38:16 -0400 (EDT) Received: by payr10 with SMTP id r10so877507pay.1 for ; Mon, 08 Jun 2015 15:38:16 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id qm2si6081193pac.57.2015.06.08.15.38.15 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jun 2015 15:38:15 -0700 (PDT) Date: Mon, 8 Jun 2015 15:38:13 -0700 From: Andrew Morton Subject: Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Message-Id: <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> In-Reply-To: <1433767854-24408-3-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <1433767854-24408-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML On Mon, 8 Jun 2015 13:50:53 +0100 Mel Gorman wrote: > An IPI is sent to flush remote TLBs when a page is unmapped that was > recently accessed by other CPUs. There are many circumstances where this > happens but the obvious one is kswapd reclaiming pages belonging to a > running process as kswapd and the task are likely running on separate CPUs. > > On small machines, this is not a significant problem but as machine > gets larger with more cores and more memory, the cost of these IPIs can > be high. This patch uses a structure similar in principle to a pagevec > to collect a list of PFNs and CPUs that require flushing. It then sends > one IPI per CPU that was mapping any of those pages to flush the list of > PFNs. A new TLB flush helper is required for this and one is added for > x86. Other architectures will need to decide if batching like this is both > safe and worth the memory overhead. Specifically the requirement is; > > If a clean page is unmapped and not immediately flushed, the > architecture must guarantee that a write to that page from a CPU > with a cached TLB entry will trap a page fault. > > This is essentially what the kernel already depends on but the window is > much larger with this patch applied and is worth highlighting. > > The impact of this patch depends on the workload as measuring any benefit > requires both mapped pages co-located on the LRU and memory pressure. The > case with the biggest impact is multiple processes reading mapped pages > taken from the vm-scalability test suite. The test case uses NR_CPU readers > of mapped files that consume 10*RAM. > > vmscale on a 4-node machine with 64G RAM and 48 CPUs > 4.1.0-rc6 4.1.0-rc6 > vanilla batchunmap-v5 > User 577.35 618.60 > System 5927.06 4195.03 > Elapsed 162.21 121.31 > > The workload completed 25% faster with 29% less CPU time. > > This is showing that the readers completed 25% with 30% less CPU time. From > vmstats, it is known that the vanilla kernel was interrupted roughly 900K > times per second during the steady phase of the test and the patched kernel > was interrupts 180K times per second. > > The impact is much lower on a small machine > > vmscale on a 1-node machine with 8G RAM and 1 CPU > 4.1.0-rc6 4.1.0-rc6 > vanilla batchunmap-v5 > User 59.14 58.86 > System 109.15 83.78 > Elapsed 27.32 23.14 > > It's still a noticeable improvement with vmstat showing interrupts went > from roughly 500K per second to 45K per second. Looks nice. > The patch will have no impact on workloads with no memory pressure or > have relatively few mapped pages. What benefit can we expect to see to any real-world userspace? > --- a/include/linux/init_task.h > +++ b/include/linux/init_task.h > @@ -175,6 +175,13 @@ extern struct task_group root_task_group; > # define INIT_NUMA_BALANCING(tsk) > #endif > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > + .tlb_ubc = NULL, > +#else > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) > +#endif > + > #ifdef CONFIG_KASAN > # define INIT_KASAN(tsk) \ > .kasan_depth = 1, > @@ -257,6 +264,7 @@ extern struct task_group root_task_group; > INIT_RT_MUTEXES(tsk) \ > INIT_VTIME(tsk) \ > INIT_NUMA_BALANCING(tsk) \ > + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > INIT_KASAN(tsk) \ > } We don't really need any of this - init_task starts up all-zero anyway. Maybe it's useful for documentation reasons (dubious), but I bet we've already missed fields. > > ... > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1289,6 +1289,16 @@ enum perf_event_task_context { > perf_nr_task_contexts, > }; > > +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ > +#define BATCH_TLBFLUSH_SIZE 32UL > + > +/* Track pages that require TLB flushes */ > +struct tlbflush_unmap_batch { > + struct cpumask cpumask; > + unsigned long nr_pages; > + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; Why are we storing pfn's rather than page*'s? I'm trying to get my head around what's actually in this structure. Each thread has one of these, lazily allocated . The cpumask field contains a mask of all the CPUs which have done . The careful reader will find mm_struct.cpu_vm_mask_var and will wonder why it didn't need documenting, sigh. Wanna fill in the blanks? As usual, understanding the data structure is the key to understanding the design, so it's worth a couple of paragraphs. With this knowledge, the reader may understand why try_to_unmap_flush() has preempt_disable() protection but set_tlb_ubc_flush_pending() doesn't need it! > +}; > + > struct task_struct { > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ > void *stack; > > ... > > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) > return address; > } > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +static void percpu_flush_tlb_batch_pages(void *data) > +{ > + struct tlbflush_unmap_batch *tlb_ubc = data; > + int i; Anally speaking, this should be unsigned long (in which case it shouldn't be called `i'!). Or make tlbflush_unmap_batch.nr_pages int. But int is signed, which is silly. Whatever ;) > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > + for (i = 0; i < tlb_ubc->nr_pages; i++) > + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); > +} > + > +/* > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is > + * important that if a PTE was dirty when it was unmapped that it's flushed s/important that/important / > + * before any IO is initiated on the page to prevent lost writes. Similarly, > + * it must be flushed before freeing to prevent data leakage. > + */ > +void try_to_unmap_flush(void) > +{ > + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; > + int cpu; > + > + if (!tlb_ubc || !tlb_ubc->nr_pages) > + return; > + > + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); > + > + preempt_disable(); > + cpu = smp_processor_id(); get_cpu() > + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) > + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); > + > + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { > + smp_call_function_many(&tlb_ubc->cpumask, > + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); > + } > + cpumask_clear(&tlb_ubc->cpumask); > + tlb_ubc->nr_pages = 0; > + preempt_enable(); put_cpu() > +} > > ... > > +/* > + * Returns true if the TLB flush should be deferred to the end of a batch of > + * unmap operations to reduce IPIs. > + */ > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > +{ > + bool should_defer = false; > + > + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) > + return false; > + > + /* If remote CPUs need to be flushed then defer batch the flush */ > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > + should_defer = true; > + put_cpu(); What did the get_cpu() protect? > + return should_defer; > +} > +#else > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, > + struct page *page) > +{ > +} > + > > ... > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +/* > + * Allocate the control structure for batch TLB flushing. An allocation > + * failure is harmless as the reclaimer will send IPIs where necessary. > + */ > +static void alloc_tlb_ubc(void) > +{ > + if (!current->tlb_ubc) > + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), > + GFP_KERNEL | __GFP_NOWARN); A GFP_KERNEL allocation from deep within page reclaim? Seems imprudent if only from a stack-usage POV. > +} > +#else > +static inline void alloc_tlb_ubc(void) > +{ > +} > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > ... > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id 0E9856B006C for ; Tue, 9 Jun 2015 04:47:47 -0400 (EDT) Received: by wgez8 with SMTP id z8so7455248wge.0 for ; Tue, 09 Jun 2015 01:47:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id vn9si10181562wjc.113.2015.06.09.01.47.44 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 01:47:45 -0700 (PDT) Date: Tue, 9 Jun 2015 09:47:39 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609084739.GQ26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150608174551.GA27558@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Mon, Jun 08, 2015 at 07:45:51PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > Changelog since V4 > > o Rebase to 4.1-rc6 > > > > Changelog since V3 > > o Drop batching of TLB flush from migration > > o Redo how larger batching is managed > > o Batch TLB flushes when writable entries exist > > > > When unmapping pages it is necessary to flush the TLB. If that page was > > accessed by another CPU then an IPI is used to flush the remote CPU. That > > is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. > > > > There already is a window between when a page is unmapped and when it is > > TLB flushed. This series simply increases the window so multiple pages > > can be flushed using a single IPI. This *should* be safe or the kernel is > > hosed already but I've cc'd the x86 maintainers and some of the Intel folk > > for comment. > > > > Patch 1 simply made the rest of the series easier to write as ftrace > > could identify all the senders of TLB flush IPIS. > > > > Patch 2 collects a list of PFNs and sends one IPI to flush them all > > > > Patch 3 tracks when there potentially are writable TLB entries that > > need to be batched differently > > > > The performance impact is documented in the changelogs but in the optimistic > > case on a 4-socket machine the full series reduces interrupts from 900K > > interrupts/second to 60K interrupts/second. > > Yeah, so I think batching writable flushes is useful I think, but I disagree with > one aspect of it: with the need to gather _pfns_ and batch them over to the remote > CPU. > It's PFN-based for three reasons. The first is because the old code is flushing on a per-page basis and the intent of the series was to reduce the number of IPIs that requires. Moving away from that has an unpredictable impact that depends on the workload and the exact CPU used. The second is that a struct page-based interface would require percpu_flush_tlb_batch_pages to do a page->pfn lookup in the IPI call which is more expensive than necessary. The final reason is that the TLB flush API given to architectures at the moment includes single page primitives and while it's not necessarily the best decision for x86, the same may not be true for other architectures if they decide to activate the batching. > As per my measurements the __flush_tlb_single() primitive (which you use in patch > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > pages and gets exponentially worse. It's probably done in microcode and its > performance is horrible. > > So have you explored the possibility to significantly simplify your patch-set by > only deferring the flushing, and doing a simple TLB flush on the remote CPU? Yes. At one point I looked at range flushing but it is not a good idea. The ranges that reach the end of the LRU are too large to be useful except in the ideal case of a workload that sequentially accesses memory. Flushing the full TLB has an unpredictable cost. Currently, the unmapping and flush is of inactive pages, some of which may not be in the TLB at all and the impact on the workload is limited to the IPI and flush cost. With a full flush we clear entries we know were recently accessed and may have to be looked up again and we do this every 32 mapped pages that are reclaimed. In the ideal case of a sequential mapped reader it would not matter as the entries are not needed so we would not see the cost at all. Other workloads will have to do a refill that was not necessary before this series. The cost of the refill will depend on the CPU and whether the lookup information is still in the CPU cache or not. That means measuring the full impact of your proposal is impossible as it depends heavily on the workload, the timing of its interaction with kswapd in particular, the state of the CPU cache and the cost of refills for the CPU. I agree with you in that it would be a simplier series and the actual flush would probably be faster but the downsides are too unpredictable for a series that primarily is about reducing the number of IPIs. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f176.google.com (mail-wi0-f176.google.com [209.85.212.176]) by kanga.kvack.org (Postfix) with ESMTP id 4D0966B0032 for ; Tue, 9 Jun 2015 06:32:38 -0400 (EDT) Received: by wifx6 with SMTP id x6so11654236wif.0 for ; Tue, 09 Jun 2015 03:32:37 -0700 (PDT) Received: from mail-wg0-x22a.google.com (mail-wg0-x22a.google.com. [2a00:1450:400c:c00::22a]) by mx.google.com with ESMTPS id es2si2429473wib.12.2015.06.09.03.32.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Jun 2015 03:32:36 -0700 (PDT) Received: by wgbgq6 with SMTP id gq6so9592462wgb.3 for ; Tue, 09 Jun 2015 03:32:35 -0700 (PDT) Date: Tue, 9 Jun 2015 12:32:31 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609103231.GA11026@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609084739.GQ26425@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Mel Gorman wrote: > > So have you explored the possibility to significantly simplify your patch-set > > by only deferring the flushing, and doing a simple TLB flush on the remote > > CPU? > > Yes. At one point I looked at range flushing but it is not a good idea. My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of user-space TLBs. > The ranges that reach the end of the LRU are too large to be useful except in > the ideal case of a workload that sequentially accesses memory. Flushing the > full TLB has an unpredictable cost. [...] Why would it have unpredictable cost? We flush the TLB on every process context switch. Yes, it's somewhat workload dependent, but the performance profile is so different anyway with batching that it has to be re-measured anyway. > With a full flush we clear entries we know were recently accessed and may have > to be looked up again and we do this every 32 mapped pages that are reclaimed. > In the ideal case of a sequential mapped reader it would not matter as the > entries are not needed so we would not see the cost at all. Other workloads will > have to do a refill that was not necessary before this series. The cost of the > refill will depend on the CPU and whether the lookup information is still in the > CPU cache or not. That means measuring the full impact of your proposal is > impossible as it depends heavily on the workload, the timing of its interaction > with kswapd in particular, the state of the CPU cache and the cost of refills > for the CPU. > > I agree with you in that it would be a simplier series and the actual flush > would probably be faster but the downsides are too unpredictable for a series > that primarily is about reducing the number of IPIs. Sorry, I don't buy this, at all. Please measure this, the code would become a lot simpler, as I'm not convinced that we need pfn (or struct page) or even range based flushing. I.e. please first implement the simplest remote batching variant, then complicate it if the numbers warrant it. Not the other way around. It's not like the VM code needs the extra complexity! Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f176.google.com (mail-wi0-f176.google.com [209.85.212.176]) by kanga.kvack.org (Postfix) with ESMTP id E92BF6B006E for ; Tue, 9 Jun 2015 07:07:39 -0400 (EDT) Received: by wiwd19 with SMTP id d19so14164018wiw.0 for ; Tue, 09 Jun 2015 04:07:39 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d7si2510273wij.93.2015.06.09.04.07.37 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 04:07:37 -0700 (PDT) Date: Tue, 9 Jun 2015 12:07:32 +0100 From: Mel Gorman Subject: Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Message-ID: <20150609110732.GR26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <1433767854-24408-3-git-send-email-mgorman@suse.de> <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML On Mon, Jun 08, 2015 at 03:38:13PM -0700, Andrew Morton wrote: > On Mon, 8 Jun 2015 13:50:53 +0100 Mel Gorman wrote: > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > recently accessed by other CPUs. There are many circumstances where this > > happens but the obvious one is kswapd reclaiming pages belonging to a > > running process as kswapd and the task are likely running on separate CPUs. > > > > On small machines, this is not a significant problem but as machine > > gets larger with more cores and more memory, the cost of these IPIs can > > be high. This patch uses a structure similar in principle to a pagevec > > to collect a list of PFNs and CPUs that require flushing. It then sends > > one IPI per CPU that was mapping any of those pages to flush the list of > > PFNs. A new TLB flush helper is required for this and one is added for > > x86. Other architectures will need to decide if batching like this is both > > safe and worth the memory overhead. Specifically the requirement is; > > > > If a clean page is unmapped and not immediately flushed, the > > architecture must guarantee that a write to that page from a CPU > > with a cached TLB entry will trap a page fault. > > > > This is essentially what the kernel already depends on but the window is > > much larger with this patch applied and is worth highlighting. > > > > The impact of this patch depends on the workload as measuring any benefit > > requires both mapped pages co-located on the LRU and memory pressure. The > > case with the biggest impact is multiple processes reading mapped pages > > taken from the vm-scalability test suite. The test case uses NR_CPU readers > > of mapped files that consume 10*RAM. > > > > vmscale on a 4-node machine with 64G RAM and 48 CPUs > > 4.1.0-rc6 4.1.0-rc6 > > vanilla batchunmap-v5 > > User 577.35 618.60 > > System 5927.06 4195.03 > > Elapsed 162.21 121.31 > > > > The workload completed 25% faster with 29% less CPU time. > > > > This is showing that the readers completed 25% with 30% less CPU time. From > > vmstats, it is known that the vanilla kernel was interrupted roughly 900K > > times per second during the steady phase of the test and the patched kernel > > was interrupts 180K times per second. > > > > The impact is much lower on a small machine > > > > vmscale on a 1-node machine with 8G RAM and 1 CPU > > 4.1.0-rc6 4.1.0-rc6 > > vanilla batchunmap-v5 > > User 59.14 58.86 > > System 109.15 83.78 > > Elapsed 27.32 23.14 > > > > It's still a noticeable improvement with vmstat showing interrupts went > > from roughly 500K per second to 45K per second. > > Looks nice. > > > The patch will have no impact on workloads with no memory pressure or > > have relatively few mapped pages. > > What benefit can we expect to see to any real-world userspace? > Only a small subset of workloads will see a benefit -- ones that mmap a lot of data with working sets larger than the size of a node. Some streaming media servers allegedly do this. Some numerical processing applications may hit this. Those that use glibc for large buffers use mmap and if the application is larger than a NUMA node, it'll need to be unmapped and flushed. Python/NumPY uses large maps for large buffers (based on the paper "Doubling the Performance of Python/Numpy with less than 100 SLOC"). Whether users of NumPY hit this issue or not depends on whether kswapd is active. Anecdotally, I'm aware from IRC of one user that is experimenting with a large HTTP cache and a load generator that spent a lot of time sending IPIs that was obvious from the profile. I asked weeks ago that they post the results here which they promised they would but didn't. Unfortunately, I don't know the persons real name to cc them. Rik might. Anecdotally, I also believe that Intel hit this internally with some internal workload but I'm basing this on idle comments at LSF/MM. However they were unwilling or unable to describe exactly what the test does and against what software. I know this is more vague than you'd like. By and large, I'm relying on the assumption that if we are reclaiming mapped pages from kswapd context then sending one IPI per page is stupid. > > --- a/include/linux/init_task.h > > +++ b/include/linux/init_task.h > > @@ -175,6 +175,13 @@ extern struct task_group root_task_group; > > # define INIT_NUMA_BALANCING(tsk) > > #endif > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > > + .tlb_ubc = NULL, > > +#else > > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) > > +#endif > > + > > #ifdef CONFIG_KASAN > > # define INIT_KASAN(tsk) \ > > .kasan_depth = 1, > > @@ -257,6 +264,7 @@ extern struct task_group root_task_group; > > INIT_RT_MUTEXES(tsk) \ > > INIT_VTIME(tsk) \ > > INIT_NUMA_BALANCING(tsk) \ > > + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > > INIT_KASAN(tsk) \ > > } > > We don't really need any of this - init_task starts up all-zero anyway. > Maybe it's useful for documentation reasons (dubious), but I bet we've > already missed fields. > True. I'll look into removing it and make sure there is no adverse impact. > > > > ... > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -1289,6 +1289,16 @@ enum perf_event_task_context { > > perf_nr_task_contexts, > > }; > > > > +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ > > +#define BATCH_TLBFLUSH_SIZE 32UL > > + > > +/* Track pages that require TLB flushes */ > > +struct tlbflush_unmap_batch { > > + struct cpumask cpumask; > > + unsigned long nr_pages; > > + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; > > Why are we storing pfn's rather than page*'s? > Because a page would require a page->pfn lookup within the IPI handler. That will work but it's unnecessarily wasteful. > I'm trying to get my head around what's actually in this structure. > > Each thread has one of these, lazily allocated . > The cpumask field contains a mask of all the CPUs which have done > . The careful reader will find mm_struct.cpu_vm_mask_var > and will wonder why it didn't need documenting, sigh. > > Wanna fill in the blanks? As usual, understanding the data structure > is the key to understanding the design, so it's worth a couple of > paragraphs. With this knowledge, the reader may understand why > try_to_unmap_flush() has preempt_disable() protection but > set_tlb_ubc_flush_pending() doesn't need it! > Is this any help? struct tlbflush_unmap_batch { /* * Each bit set is a CPU that potentially has a TLB entry for one of * the PFNs being flushed. See set_tlb_ubc_flush_pending(). */ struct cpumask cpumask; /* * The number and list of pfns to be flushed. PFNs are tracked instead * of struct pages to avoid multiple page->pfn lookups by each CPU that * receives an IPI in percpu_flush_tlb_batch_pages */ unsigned long nr_pages; unsigned long pfns[BATCH_TLBFLUSH_SIZE]; /* * If true then the PTE was dirty when unmapped. The entry must be * flushed before IO is initiated or a stale TLB entry potentially * allows an update without redirtying the page. */ bool writable; }; > > +}; > > + > > struct task_struct { > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ > > void *stack; > > > > ... > > > > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) > > return address; > > } > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +static void percpu_flush_tlb_batch_pages(void *data) > > +{ > > + struct tlbflush_unmap_batch *tlb_ubc = data; > > + int i; > > Anally speaking, this should be unsigned long (in which case it > shouldn't be called `i'!). Or make tlbflush_unmap_batch.nr_pages int. > But int is signed, which is silly. Whatever ;) > I can make it unsigned int which is a micro-optimisation for loop iterators anyway. > > > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > > + for (i = 0; i < tlb_ubc->nr_pages; i++) > > + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); > > +} > > + > > +/* > > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is > > + * important that if a PTE was dirty when it was unmapped that it's flushed > > s/important that/important / > Fixed. > > + * before any IO is initiated on the page to prevent lost writes. Similarly, > > + * it must be flushed before freeing to prevent data leakage. > > + */ > > +void try_to_unmap_flush(void) > > +{ > > + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; > > + int cpu; > > + > > + if (!tlb_ubc || !tlb_ubc->nr_pages) > > + return; > > + > > + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); > > + > > + preempt_disable(); > > + cpu = smp_processor_id(); > > get_cpu() > > > + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) > > + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); > > + > > + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { > > + smp_call_function_many(&tlb_ubc->cpumask, > > + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); > > + } > > + cpumask_clear(&tlb_ubc->cpumask); > > + tlb_ubc->nr_pages = 0; > > + preempt_enable(); > > put_cpu() > Gack. Fixed. > > +} > > > > ... > > > > +/* > > + * Returns true if the TLB flush should be deferred to the end of a batch of > > + * unmap operations to reduce IPIs. > > + */ > > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > > +{ > > + bool should_defer = false; > > + > > + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) > > + return false; > > + > > + /* If remote CPUs need to be flushed then defer batch the flush */ > > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > > + should_defer = true; > > + put_cpu(); > > What did the get_cpu() protect? > To safely lookup the current running CPU number. smp_processor_id() is potentially safe except in specific circumstances and I did not think raw_smp_processor_id() was justified. > > + return should_defer; > > +} > > +#else > > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, > > + struct page *page) > > +{ > > +} > > + > > > > ... > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +/* > > + * Allocate the control structure for batch TLB flushing. An allocation > > + * failure is harmless as the reclaimer will send IPIs where necessary. > > + */ > > +static void alloc_tlb_ubc(void) > > +{ > > + if (!current->tlb_ubc) > > + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), > > + GFP_KERNEL | __GFP_NOWARN); > > A GFP_KERNEL allocation from deep within page reclaim? Seems imprudent > if only from a stack-usage POV. > When we call it, we are in PF_MEMALLOC context either set in the page allocator from direct reclaim or because kswapd always sets it. This limits the stack depth. It would be comparable to the depth we reach during normal reclaim calling into shrink_page_list and everything below it so we should be safe. Granted, the dependency on PF_MEMALLOC is not obvious at all. > > +} > > +#else > > +static inline void alloc_tlb_ubc(void) > > +{ > > +} > > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > > > ... > > > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id 0DB396B006E for ; Tue, 9 Jun 2015 07:21:02 -0400 (EDT) Received: by wifx6 with SMTP id x6so12997307wif.0 for ; Tue, 09 Jun 2015 04:21:01 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id x19si10843615wjq.43.2015.06.09.04.21.00 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 04:21:00 -0700 (PDT) Date: Tue, 9 Jun 2015 12:20:55 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609112055.GS26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609103231.GA11026@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 12:32:31PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > So have you explored the possibility to significantly simplify your patch-set > > > by only deferring the flushing, and doing a simple TLB flush on the remote > > > CPU? > > > > Yes. At one point I looked at range flushing but it is not a good idea. > > My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of > user-space TLBs. > I'm aware of that. I had considered both range flushing and full flushing as alternatives to PFN tracking and settled on PFN tracking as the least risky change. > > The ranges that reach the end of the LRU are too large to be useful except in > > the ideal case of a workload that sequentially accesses memory. Flushing the > > full TLB has an unpredictable cost. [...] > > Why would it have unpredictable cost? Because we have no means of knowing how many active TLB entries are flushed, no way of knowing if it matters and we potentially do this every 32 (BATCH_TLBFLUSH_SIZE) pages that are reclaimed. > We flush the TLB on every process context > switch. Yes, it's somewhat workload dependent, but the performance profile is so > different anyway with batching that it has to be re-measured anyway. > With the per-page flush, there is a direct cost associated with the operation -- the IPI and the TLB flushes. This is easy to measure. With a full flush there is an indirect cost -- the TLB entries that have to be refilled after the full flush. It also works against any notion of using ASID or similar mechanisms that avoid full flushes on context switches. It will be very easy to show the benefit in the direct case. The indirect case is both unpredictable and impossible to measure the full impact in all cases. > > With a full flush we clear entries we know were recently accessed and may have > > to be looked up again and we do this every 32 mapped pages that are reclaimed. > > In the ideal case of a sequential mapped reader it would not matter as the > > entries are not needed so we would not see the cost at all. Other workloads will > > have to do a refill that was not necessary before this series. The cost of the > > refill will depend on the CPU and whether the lookup information is still in the > > CPU cache or not. That means measuring the full impact of your proposal is > > impossible as it depends heavily on the workload, the timing of its interaction > > with kswapd in particular, the state of the CPU cache and the cost of refills > > for the CPU. > > > > I agree with you in that it would be a simplier series and the actual flush > > would probably be faster but the downsides are too unpredictable for a series > > that primarily is about reducing the number of IPIs. > > Sorry, I don't buy this, at all. > > Please measure this, the code would become a lot simpler, as I'm not convinced > that we need pfn (or struct page) or even range based flushing. > The code will be simplier and the cost of reclaim will be lower and that is the direct case but shows nothing about the indirect cost. The mapped reader will benefit as it is not reusing the TLB entries and will look artifically very good. It'll be very difficult for even experienced users to determine that a slowdown during kswapd activity is due to increased TLB misses incurred by the full flush. > I.e. please first implement the simplest remote batching variant, then complicate > it if the numbers warrant it. Not the other way around. It's not like the VM code > needs the extra complexity! > The simplest remote batching variant is a much more drastic change from what we do today and an unpredictable one. If we were to take that direction, it goes against the notion of making incremental changes. Even if we ultimately ended up with your proposal, it would make sense to separte it from this series by at least one release for bisection purposes. That way we get; Current: Send one IPI per page to unmap, active TLB entries preserved This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved Your proposal: Send one IPI, flush everything, active TLB entries must refill -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id B00EB6B0032 for ; Tue, 9 Jun 2015 08:43:35 -0400 (EDT) Received: by wiwd19 with SMTP id d19so16898196wiw.0 for ; Tue, 09 Jun 2015 05:43:35 -0700 (PDT) Received: from mail-wi0-x22d.google.com (mail-wi0-x22d.google.com. [2a00:1450:400c:c05::22d]) by mx.google.com with ESMTPS id bb4si2920736wib.124.2015.06.09.05.43.33 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Jun 2015 05:43:34 -0700 (PDT) Received: by wibut5 with SMTP id ut5so16825295wib.1 for ; Tue, 09 Jun 2015 05:43:33 -0700 (PDT) Date: Tue, 9 Jun 2015 14:43:28 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609124328.GA23066@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609112055.GS26425@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Mel Gorman wrote: > > Sorry, I don't buy this, at all. > > > > Please measure this, the code would become a lot simpler, as I'm not convinced > > that we need pfn (or struct page) or even range based flushing. > > The code will be simplier and the cost of reclaim will be lower and that is the > direct case but shows nothing about the indirect cost. The mapped reader will > benefit as it is not reusing the TLB entries and will look artifically very > good. It'll be very difficult for even experienced users to determine that a > slowdown during kswapd activity is due to increased TLB misses incurred by the > full flush. If so then the converse is true just as much: if you were to introduce finegrained flushing today, you couldn't justify it because you claim it's very hard to measure! Really, in such cases we should IMHO fall back to the simplest approach, and iterate from there. I cited very real numbers about the direct costs of TLB flushes, and plausible speculation about why the indirect costs are low on the achitecture you are trying to modify here. I think since it is you who wants to introduce additional complexity into the x86 MM code the burden is on you to provide proof that the complexity of pfn (or struct page) tracking is worth it. > > I.e. please first implement the simplest remote batching variant, then > > complicate it if the numbers warrant it. Not the other way around. It's not > > like the VM code needs the extra complexity! > > The simplest remote batching variant is a much more drastic change from what we > do today and an unpredictable one. If we were to take that direction, it goes > against the notion of making incremental changes. Even if we ultimately ended up > with your proposal, it would make sense to separte it from this series by at > least one release for bisection purposes. That way we get; > > Current: Send one IPI per page to unmap, active TLB entries preserved > This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved > Your proposal: Send one IPI, flush everything, active TLB entries must refill Not quite, my take of it is: Current: Simplest method: send one IPI per page to unmap, active TLB entries preserved. Remote TLB flushing cost is so high that it probably moots any secondary effects of TLB preservation. This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, add complex tracking of pfn's with expensive flushing, active TLB entries preserved. Cost of the more complex flushing are probably higher than the refill cost, based on the numbers I gave. My proposal: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap that flushes everything. TLB entries not preserved but this is expected to be more than offset by the reduction in remote flushing costs and the simplicity of the flushing scheme. It can still be complicated to your proposed pfn tracking scheme, based on numbers. Btw., have you measured the full TLB flush variant as well? If so, mind sharing the numbers? Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f45.google.com (mail-wg0-f45.google.com [74.125.82.45]) by kanga.kvack.org (Postfix) with ESMTP id 6B2A96B0032 for ; Tue, 9 Jun 2015 09:05:44 -0400 (EDT) Received: by wgv5 with SMTP id 5so12928875wgv.1 for ; Tue, 09 Jun 2015 06:05:43 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id hr12si3035567wib.98.2015.06.09.06.05.41 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 06:05:42 -0700 (PDT) Date: Tue, 9 Jun 2015 14:05:36 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609130536.GT26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609124328.GA23066@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 02:43:28PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > Sorry, I don't buy this, at all. > > > > > > Please measure this, the code would become a lot simpler, as I'm not convinced > > > that we need pfn (or struct page) or even range based flushing. > > > > The code will be simplier and the cost of reclaim will be lower and that is the > > direct case but shows nothing about the indirect cost. The mapped reader will > > benefit as it is not reusing the TLB entries and will look artifically very > > good. It'll be very difficult for even experienced users to determine that a > > slowdown during kswapd activity is due to increased TLB misses incurred by the > > full flush. > > If so then the converse is true just as much: if you were to introduce finegrained > flushing today, you couldn't justify it because you claim it's very hard to > measure! > I'm claiming the *INDIRECT COST* is impossible to measure as part of this series because it depends on the workload and exact CPU used. The direct cost is measurable and can be quantified. > Really, in such cases we should IMHO fall back to the simplest approach, and > iterate from there. > > I cited very real numbers about the direct costs of TLB flushes, and plausible > speculation about why the indirect costs are low on the achitecture you are trying > to modify here. > > I think since it is you who wants to introduce additional complexity into the x86 > MM code the burden is on you to provide proof that the complexity of pfn (or > struct page) tracking is worth it. > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms and replacing it with something that is a lot more efficient that minimises the number of potential surprises. I'm stating that the benefit of PFN tracking is unknowable in the general case because it depends on the workload, timing and the exact CPU used so any example provided can be naked with a counter-example such as a trivial sequential reader that shows no benefit. The series as posted is approximately in line with current behaviour minimising the chances of surprise regressions from excessive TLB flush. You are actively blocking a measurable improvement and forcing it to be replaced with something whose full impact is unquantifiable. Any regressions in this area due to increased TLB misses could take several kernel releases as the issue will be so difficult to detect. I'm going to implement the approach you are forcing because there is an x86 part of the patch and you are the maintainer that could indefinitely NAK it. However, I'm extremely pissed about being forced to introduce these indirect unpredictable costs because I know the alternative is you dragging this out for weeks with no satisfactory conclusion in an argument that I cannot prove in the general case. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 6C62B6B0032 for ; Tue, 9 Jun 2015 11:34:37 -0400 (EDT) Received: by pabqy3 with SMTP id qy3so16160428pab.3 for ; Tue, 09 Jun 2015 08:34:37 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id nl4si9366778pbc.114.2015.06.09.08.34.36 for ; Tue, 09 Jun 2015 08:34:36 -0700 (PDT) Message-ID: <5577078B.2000503@intel.com> Date: Tue, 09 Jun 2015 08:34:35 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> In-Reply-To: <20150609124328.GA23066@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar , Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On 06/09/2015 05:43 AM, Ingo Molnar wrote: > I cited very real numbers about the direct costs of TLB flushes, and plausible > speculation about why the indirect costs are low on the achitecture you are trying > to modify here. We should be careful to extrapolate what the real-world cost of a TLB flush is from the cost of running a kernel function in a loop. Let's look at what got measured: > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > +static void fn_flush_tlb_one(void) > +{ > + unsigned long addr = (unsigned long)&tlb_flush_target; > + > + tlb_flush_target[0]++; > + __flush_tlb_one(addr); > +} So we've got an increment of a variable in kernel memory (which is almost surely in the L1), then we flush that memory location, and repeat the increment. I assume the increment is so that the __flush_tlb_one() has some "real" work to do and is not just flushing an address which is not in the TLB. This is almost certainly a departure from workloads like Mel is addressing where we (try to) flush pages used long ago that will hopefully *not* be in the TLB. But, that unfortunately means that we're measuring a TLB _miss_ here in addition to the flush. A TLB miss shouldn't be *that* expensive, right? The SDM says: "INVLPG also invalidates all entries in all paging-structure caches ... regardless of the linear addresses to which they correspond." Ugh, so the TLB refill has to also refill the paging structure caches. At least the page tables will be in the L1. Since "tlb_flush_target" is in kernel mapping, you might also be shooting down the TLB entry for kernel text, or who knows what else. The TLB entry might be 1G or 2M which might never be in the STLB (second-level TLB), which could have *VERY* different behavior than a 4k flush or a flush of an entry in the first-level TLB. I'm not sure that these loop-style tests are particularly valuable, but if we're going to do it, I think we should consider: 1. We need to separate the TLB fill portion from the flush and not measure any part of a fill along with the flush 2. We should measure flushing of ascending, adjacent virtual addresses mapped with 4k pages since that is the normal case. Perhaps vmalloc(16MB) or something. 3. We should flush a mix of virtual addresses that are in and out of the TLB. 4. To measure instruction (as opposed to instruction+software) overhead, use __flush_tlb_single(), not __flush_tlb_one() P.S. I think I'm responsible for it, but we should probably also move the count_vm_tlb_event() to outside the loop in flush_tlb_mm_range(). invlpg is not a "normal" instruction and could potentially increase the overhead of incrementing the counter. But, I guess the kernel mappings _should_ stay in the TLB over an invlpg and shouldn't pay any cost to be refilled in to the TLB despite the paging-structure caches going away. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 9EE7E6B0032 for ; Tue, 9 Jun 2015 12:49:16 -0400 (EDT) Received: by pacyx8 with SMTP id yx8so17318413pac.2 for ; Tue, 09 Jun 2015 09:49:16 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id ts9si9647622pab.90.2015.06.09.09.49.14 for ; Tue, 09 Jun 2015 09:49:15 -0700 (PDT) Message-ID: <55771909.2020005@intel.com> Date: Tue, 09 Jun 2015 09:49:13 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> In-Reply-To: <5577078B.2000503@intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar , Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On 06/09/2015 08:34 AM, Dave Hansen wrote: > 2. We should measure flushing of ascending, adjacent virtual addresses > mapped with 4k pages since that is the normal case. Perhaps > vmalloc(16MB) or something. Now that I think about this a bit more, we really have two different patterns here: 1. flush_tlb_mm_range() style, which is contiguous virtual address flushing from an unmap-style operation. 2. Mel's new try_to_unmap_one() style, which are going to be relatively random virtual addresses. So, the "ascending adjacent" case is still interesting, but it wouldn't cover Mel's new case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f177.google.com (mail-pd0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id 188EE6B0032 for ; Tue, 9 Jun 2015 17:14:54 -0400 (EDT) Received: by pdjn11 with SMTP id n11so22394211pdj.0 for ; Tue, 09 Jun 2015 14:14:53 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id bt1si10427476pbb.171.2015.06.09.14.14.52 for ; Tue, 09 Jun 2015 14:14:53 -0700 (PDT) Message-ID: <55775749.3090004@intel.com> Date: Tue, 09 Jun 2015 14:14:49 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> In-Reply-To: <55771909.2020005@intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar , Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner I did some of what I talked about earlier in the thread. I think the sfence (via mb()) is potentially unfair since it removes some of the CPU's ability to optimize things. For this kind of test, any ability that the CPU has to smear the overhead around is a bonus in practice and should be taken in to account for these tests. Here's the horribly hacked-together patch so you can see precisely what's going on: https://www.sr71.net/~dave/intel/measure-tlb-stuff.patch Here's a Haswell Xeon: > [ 0.222090] x86/fpu:######## MM instructions: ############################ > [ 0.222168] x86/fpu: Cost of: __flush_tlb() fn : 124 cycles avg: 125 > [ 0.222623] x86/fpu: Cost of: __flush_tlb_global() fn : 960 cycles avg: 968 > [ 0.222744] x86/fpu: Cost of: __flush_tlb_single() fn : 216 cycles avg: 216 > [ 0.222864] x86/fpu: Cost of: __flush_tlb_single() vmal fn : 216 cycles avg: 219 > [ 0.222987] x86/fpu: Cost of: __flush_tlb_one() OLD fn : 216 cycles avg: 216 > [ 0.223139] x86/fpu: Cost of: __flush_tlb_range() fn : 284 cycles avg: 287 > [ 0.223272] x86/fpu: Cost of: tlb miss fn : 0 cycles avg: 0 And a Westmere Xeon: > [ 1.057770] x86/fpu:######## MM instructions: ############################ > [ 1.065876] x86/fpu: Cost of: __flush_tlb() fn : 108 cycles avg: 109 > [ 1.075188] x86/fpu: Cost of: __flush_tlb_global() fn : 828 cycles avg: 829 > [ 1.084162] x86/fpu: Cost of: __flush_tlb_single() fn : 232 cycles avg: 237 > [ 1.093175] x86/fpu: Cost of: __flush_tlb_single() vmal fn : 240 cycles avg: 240 > [ 1.102214] x86/fpu: Cost of: __flush_tlb_one() OLD fn : 284 cycles avg: 286 > [ 1.111299] x86/fpu: Cost of: __flush_tlb_range() fn : 472 cycles avg: 478 > [ 1.120281] x86/fpu: Cost of: tlb miss fn : 0 cycles avg: 0 I was rather surprised how close the three __flush_tlb_single/one() variants were on Haswell. I've looked at a few other CPUs and this was the only one that acted like this. The 0 cycle TLB miss was also interesting. It goes back up to something reasonable if I put the mb()/mfence's back. I don't think this kind of thing is a realistic test unless we put mfence's around all of our TLB flushes in practice. :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f170.google.com (mail-ig0-f170.google.com [209.85.213.170]) by kanga.kvack.org (Postfix) with ESMTP id 2D6966B006E for ; Tue, 9 Jun 2015 17:54:02 -0400 (EDT) Received: by igbpi8 with SMTP id pi8so21833219igb.1 for ; Tue, 09 Jun 2015 14:54:02 -0700 (PDT) Received: from mail-ig0-x233.google.com (mail-ig0-x233.google.com. [2607:f8b0:4001:c05::233]) by mx.google.com with ESMTPS id nq6si2324610igb.16.2015.06.09.14.54.01 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Jun 2015 14:54:01 -0700 (PDT) Received: by igbzc4 with SMTP id zc4so21884929igb.0 for ; Tue, 09 Jun 2015 14:54:01 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <55775749.3090004@intel.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> Date: Tue, 9 Jun 2015 14:54:01 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > The 0 cycle TLB miss was also interesting. It goes back up to something > reasonable if I put the mb()/mfence's back. So I've said it before, and I'll say it again: Intel does really well on TLB fills. The reason is partly historical, with Win95 doing a ton of TLB invalidation (I think every single GDI call ended up invalidating the TLB, so under some important Windows benchmarks of the time, you literally had a TLB flush every 10k instructions!). But partly it is because people are wrong in thinking that TLB fills have to be slow. There's a lot of complete garbage RISC machines where the TLB fill took forever, because in the name of simplicity it would stop the pipeline and often be done in SW. The zero-cycle TLB fill is obviously a bit optimistic, but at the same time it's not completely insane. TLB fills can be prefetched, and the table walker can hit the cache, if you do them right. And Intel mostly does. So the normal full (non-global) TLB fill really is fairly cheap. It's been optimized for, and the TLB gets re-filled fairly efficiently. I suspect that it's really the case that doing more than just a couple of single-tlb flushes is a complete waste of time: the flushing takes longer than re-filling the TLB well. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id A1B756B0072 for ; Tue, 9 Jun 2015 18:32:10 -0400 (EDT) Received: by wiwd19 with SMTP id d19so30590533wiw.0 for ; Tue, 09 Jun 2015 15:32:10 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id t4si5672009wiz.31.2015.06.09.15.32.08 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 15:32:09 -0700 (PDT) Date: Tue, 9 Jun 2015 23:32:03 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609223203.GW26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > reasonable if I put the mb()/mfence's back. > > So I've said it before, and I'll say it again: Intel does really well > on TLB fills. > > The reason is partly historical, with Win95 doing a ton of TLB > invalidation (I think every single GDI call ended up invalidating the > TLB, so under some important Windows benchmarks of the time, you > literally had a TLB flush every 10k instructions!). > > But partly it is because people are wrong in thinking that TLB fills > have to be slow. There's a lot of complete garbage RISC machines where > the TLB fill took forever, because in the name of simplicity it would > stop the pipeline and often be done in SW. > > The zero-cycle TLB fill is obviously a bit optimistic, but at the same > time it's not completely insane. TLB fills can be prefetched, and the > table walker can hit the cache, if you do them right. And Intel mostly > does. > > So the normal full (non-global) TLB fill really is fairly cheap. It's > been optimized for, and the TLB gets re-filled fairly efficiently. I > suspect that it's really the case that doing more than just a couple > of single-tlb flushes is a complete waste of time: the flushing takes > longer than re-filling the TLB well. > I expect I'll do another revision of the series after 4.2-rc1 as it's way too close to 4.1's release. When that happens, I'll drop patch 4 and leave just the full non-global flush patch. In the event there is an architecture that really cares about the refill cost or we find that there is a corner case where the TLB refill hurts then patch 4 will be in the mail archives to consider for rebase and testing. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f176.google.com (mail-wi0-f176.google.com [209.85.212.176]) by kanga.kvack.org (Postfix) with ESMTP id 423326B0071 for ; Tue, 9 Jun 2015 18:35:16 -0400 (EDT) Received: by wiga1 with SMTP id a1so30053840wig.0 for ; Tue, 09 Jun 2015 15:35:15 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id db10si13918174wjb.46.2015.06.09.15.35.14 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 09 Jun 2015 15:35:15 -0700 (PDT) Date: Tue, 9 Jun 2015 23:35:09 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609223509.GX26425@suse.de> References: <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150609223203.GW26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609223203.GW26425@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 11:32:03PM +0100, Mel Gorman wrote: > On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > > reasonable if I put the mb()/mfence's back. > > > > So I've said it before, and I'll say it again: Intel does really well > > on TLB fills. > > > > The reason is partly historical, with Win95 doing a ton of TLB > > invalidation (I think every single GDI call ended up invalidating the > > TLB, so under some important Windows benchmarks of the time, you > > literally had a TLB flush every 10k instructions!). > > > > But partly it is because people are wrong in thinking that TLB fills > > have to be slow. There's a lot of complete garbage RISC machines where > > the TLB fill took forever, because in the name of simplicity it would > > stop the pipeline and often be done in SW. > > > > The zero-cycle TLB fill is obviously a bit optimistic, but at the same > > time it's not completely insane. TLB fills can be prefetched, and the > > table walker can hit the cache, if you do them right. And Intel mostly > > does. > > > > So the normal full (non-global) TLB fill really is fairly cheap. It's > > been optimized for, and the TLB gets re-filled fairly efficiently. I > > suspect that it's really the case that doing more than just a couple > > of single-tlb flushes is a complete waste of time: the flushing takes > > longer than re-filling the TLB well. > > > > I expect I'll do another revision of the series after 4.2-rc1 as it's way > too close to 4.1's release. When that happens, I'll drop patch 4 and leave > just the full non-global flush patch. In the event there is an architecture > that really cares about the refill cost or we find that there is a corner > case where the TLB refill hurts then patch 4 will be in the mail archives > to consider for rebase and testing. > To be clear I meant patch 4 from v6 of the series will be dropped. In that series, patch 2 does a full non-global flush and patch 4 does the per-pfn tracking with multiple single page frame flushes. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id 1B2926B006C for ; Wed, 10 Jun 2015 04:51:48 -0400 (EDT) Received: by wigg3 with SMTP id g3so40853833wig.1 for ; Wed, 10 Jun 2015 01:51:47 -0700 (PDT) Received: from mail-wi0-x235.google.com (mail-wi0-x235.google.com. [2a00:1450:400c:c05::235]) by mx.google.com with ESMTPS id gs9si8323087wib.31.2015.06.10.01.51.46 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 01:51:46 -0700 (PDT) Received: by wiga1 with SMTP id a1so40791030wig.0 for ; Wed, 10 Jun 2015 01:51:46 -0700 (PDT) Date: Wed, 10 Jun 2015 10:51:41 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610085141.GA25704@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609130536.GT26425@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Mel Gorman wrote: > > I think since it is you who wants to introduce additional complexity into the > > x86 MM code the burden is on you to provide proof that the complexity of pfn > > (or struct page) tracking is worth it. > > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms > and replacing it with something that is a lot more efficient that minimises the > number of potential surprises. I'm stating that the benefit of PFN tracking is > unknowable in the general case because it depends on the workload, timing and > the exact CPU used so any example provided can be naked with a counter-example > such as a trivial sequential reader that shows no benefit. The series as posted > is approximately in line with current behaviour minimising the chances of > surprise regressions from excessive TLB flush. > > You are actively blocking a measurable improvement and forcing it to be replaced > with something whose full impact is unquantifiable. Any regressions in this area > due to increased TLB misses could take several kernel releases as the issue will > be so difficult to detect. > > I'm going to implement the approach you are forcing because there is an x86 part > of the patch and you are the maintainer that could indefinitely NAK it. However, > I'm extremely pissed about being forced to introduce these indirect > unpredictable costs because I know the alternative is you dragging this out for > weeks with no satisfactory conclusion in an argument that I cannot prove in the > general case. Stop this crap. I made a really clear and unambiguous chain of arguments: - I'm unconvinced about the benefits of INVLPG in general, and your patches adds a whole new bunch of them. I cited measurements and went out on a limb to explain my position, backed with numbers and logic. It's admittedly still a speculative position and I might be wrong, but I think it's well grounded position that you cannot just brush aside. - I suggested that you split this approach into steps that first does the simpler approach that will give us at least 95% of the benefits, then the more complex one on top of it. Your false claim that I'm blocking a clear improvement is pure demagogy! - I very clearly claimed that I am more than willing to be convinced by numbers. It's not _that_ hard to construct a memory trashing workload with a TLB-efficient iteration that uses say 80% of the TLB cache, to measure the worst-case overhead of full flushes. I'm really sick of this partly deceptive, partly passive-aggressive discussion style that seems to frequently permeate VM discussions and which made sched/numa such a huge PITA in the past... And I think the numbers in the v6 series you submitted today support my position, so you owe me an apology I think ... Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id 0A1B56B0072 for ; Wed, 10 Jun 2015 05:08:20 -0400 (EDT) Received: by wibdq8 with SMTP id dq8so40339004wib.1 for ; Wed, 10 Jun 2015 02:08:19 -0700 (PDT) Received: from mail-wg0-x22d.google.com (mail-wg0-x22d.google.com. [2a00:1450:400c:c00::22d]) by mx.google.com with ESMTPS id gd5si16603298wjb.115.2015.06.10.02.08.18 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 02:08:18 -0700 (PDT) Received: by wgez8 with SMTP id z8so30624447wge.0 for ; Wed, 10 Jun 2015 02:08:18 -0700 (PDT) Date: Wed, 10 Jun 2015 11:08:13 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610090813.GA30359@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150610085141.GA25704@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Ingo Molnar wrote: > Stop this crap. > > I made a really clear and unambiguous chain of arguments: > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > a whole new bunch of them. [...] ... and note that your claim that 'we were doing them before, this is just an equivalent transformation' is utter bullsh*t technically: what we were doing previously was a hideously expensive IPI combined with an INVLPG. The behavior was dominated by the huge overhead of the remote flushing IPI, which does not prove or disprove either your or my opinion! Preserving that old INVLPG logic without measuring its benefits _again_ would be cargo cult programming. So I think this should be measured, and I don't mind worst-case TLB trashing measurements, which would be relatively straightforward to construct and the results should be unambiguous. The batching limit (which you set to 32) should then be tuned by comparing it to a working full-flushing batching logic, not by comparing it to the previous single IPI per single flush approach! ... and if the benefits of a complex algorithm are not measurable and if there are doubts about the cost/benefit tradeoff then frankly it should not exist in the kernel in the first place. It's not like the Linux TLB flushing code is too boring due to overwhelming simplicity. and yes, it's my job as a maintainer to request measurements justifying complexity and your ad hominem attacks against me are disgusting - you should know better. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id 455A26B0071 for ; Wed, 10 Jun 2015 05:19:20 -0400 (EDT) Received: by wigg3 with SMTP id g3so41614782wig.1 for ; Wed, 10 Jun 2015 02:19:19 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id ml5si8436853wic.74.2015.06.10.02.19.17 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Jun 2015 02:19:18 -0700 (PDT) Date: Wed, 10 Jun 2015 10:19:13 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610091913.GC26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150610085141.GA25704@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 10:51:41AM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > I think since it is you who wants to introduce additional complexity into the > > > x86 MM code the burden is on you to provide proof that the complexity of pfn > > > (or struct page) tracking is worth it. > > > > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms > > and replacing it with something that is a lot more efficient that minimises the > > number of potential surprises. I'm stating that the benefit of PFN tracking is > > unknowable in the general case because it depends on the workload, timing and > > the exact CPU used so any example provided can be naked with a counter-example > > such as a trivial sequential reader that shows no benefit. The series as posted > > is approximately in line with current behaviour minimising the chances of > > surprise regressions from excessive TLB flush. > > > > You are actively blocking a measurable improvement and forcing it to be replaced > > with something whose full impact is unquantifiable. Any regressions in this area > > due to increased TLB misses could take several kernel releases as the issue will > > be so difficult to detect. > > > > I'm going to implement the approach you are forcing because there is an x86 part > > of the patch and you are the maintainer that could indefinitely NAK it. However, > > I'm extremely pissed about being forced to introduce these indirect > > unpredictable costs because I know the alternative is you dragging this out for > > weeks with no satisfactory conclusion in an argument that I cannot prove in the > > general case. > > Stop this crap. > > I made a really clear and unambiguous chain of arguments: > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > a whole new bunch of them. I cited measurements and went out on a limb to > explain my position, backed with numbers and logic. It's admittedly still a > speculative position and I might be wrong, but I think it's well grounded > position that you cannot just brush aside. > And I explained my concerns with the use of a full flush and the difficulty of measuring its impact in the general case. I also explained why I thought starting with PFN tracking was an incremental approach. The argument looped so I bailed. > - I suggested that you split this approach into steps that first does the simpler > approach that will give us at least 95% of the benefits, then the more complex > one on top of it. Your false claim that I'm blocking a clear improvement is > pure demagogy! > The splitting was already done, released and I followed up saying that I'll be dropping patch 4 in the final merge request. In the event we find in a few kernel releases time that it was required then there will be a realistic bug report to use as a reference. > - I very clearly claimed that I am more than willing to be convinced by numbers. > It's not _that_ hard to construct a memory trashing workload with a > TLB-efficient iteration that uses say 80% of the TLB cache, to measure the > worst-case overhead of full flushes. > And what I had said was that in the general case that it will not show us any proof. Any other workload will always behave differently and two critical components are the exact timing versus kswapd running and the CPU used. Even if it's demonstrated for a single workload on one CPU then we would still be faced with the same problem and dropping patch 4. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 093776B0032 for ; Wed, 10 Jun 2015 06:15:39 -0400 (EDT) Received: by wigg3 with SMTP id g3so43139898wig.1 for ; Wed, 10 Jun 2015 03:15:38 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id m1si16913293wja.170.2015.06.10.03.15.34 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Jun 2015 03:15:35 -0700 (PDT) Date: Wed, 10 Jun 2015 11:15:29 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610101529.GE26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> <20150610090813.GA30359@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150610090813.GA30359@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 11:08:13AM +0200, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > Stop this crap. > > > > I made a really clear and unambiguous chain of arguments: > > > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > > a whole new bunch of them. [...] > > ... and note that your claim that 'we were doing them before, this is just an > equivalent transformation' is utter bullsh*t technically: what we were doing > previously was a hideously expensive IPI combined with an INVLPG. > And replacing it with an INVLPG without excessive IPI transmission is changing one major variable. Going straight to a full TLB flush is changing two major variables. I thought the refill cost was high, parially based on the estimate of 22,000 cycles in https://lkml.org/lkml/2014/7/31/825. I've been told in these discussions that I'm wrong and the cost is not high. As it'll always be variable, we can never be sure which is why I do not see a value to building a complex test around it that will be invalidated the instant we use a different CPU. When/if a workload shows up that really cares about those refill costs then there will be a stable test case to work from. > The behavior was dominated by the huge overhead of the remote flushing IPI, which > does not prove or disprove either your or my opinion! > > Preserving that old INVLPG logic without measuring its benefits _again_ would be > cargo cult programming. > > So I think this should be measured, and I don't mind worst-case TLB trashing > measurements, which would be relatively straightforward to construct and the > results should be unambiguous. > > The batching limit (which you set to 32) should then be tuned by comparing it to a > working full-flushing batching logic, not by comparing it to the previous single > IPI per single flush approach! > We can decrease it easily but increasing it means we also have to change SWAP_CLUSTER_MAX because otherwise enough pages are not unmapped for flushes and it is a requirement that we flush before freeing the pages. That changes another complex variable because at the very least, it alters LRU lock hold times. > ... and if the benefits of a complex algorithm are not measurable and if there are > doubts about the cost/benefit tradeoff then frankly it should not exist in the > kernel in the first place. It's not like the Linux TLB flushing code is too boring > due to overwhelming simplicity. > > and yes, it's my job as a maintainer to request measurements justifying complexity > and your ad hominem attacks against me are disgusting - you should know better. > It was not intended as an ad hominem attack and my apologies for that. I wanted to express my frustration that a series that adjusted one variable with known benefit will be rejected for a series that adjusts two major variables instead with the second variable being very sensitive to workload and CPU. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id 116136B0032 for ; Wed, 10 Jun 2015 09:13:57 -0400 (EDT) Received: by wgez8 with SMTP id z8so35486793wge.0 for ; Wed, 10 Jun 2015 06:13:56 -0700 (PDT) Received: from one.firstfloor.org (one.firstfloor.org. [193.170.194.197]) by mx.google.com with ESMTPS id y1si17977658wjw.91.2015.06.10.06.13.54 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 06:13:55 -0700 (PDT) Date: Wed, 10 Jun 2015 15:13:54 +0200 From: Andi Kleen Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610131354.GO19417@two.firstfloor.org> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > reasonable if I put the mb()/mfence's back. > > So I've said it before, and I'll say it again: Intel does really well > on TLB fills. Assuming the page tables are cache-hot... And hot here does not mean L3 cache, but higher. But a memory intensive workload can easily violate that. That's why I'm dubious of all these micro benchmarks. They won't be clearing caches. They generate unrealistic conditions in the CPU pipeline and overestimate the cost of the flushes. The only good way to measure TLB costs is macro benchmarks. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f182.google.com (mail-ig0-f182.google.com [209.85.213.182]) by kanga.kvack.org (Postfix) with ESMTP id B5E156B006C for ; Wed, 10 Jun 2015 12:17:16 -0400 (EDT) Received: by igbhj9 with SMTP id hj9so39322592igb.1 for ; Wed, 10 Jun 2015 09:17:16 -0700 (PDT) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com. [2607:f8b0:4001:c05::232]) by mx.google.com with ESMTPS id ba5si9320335icc.39.2015.06.10.09.17.15 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 09:17:16 -0700 (PDT) Received: by igblz2 with SMTP id lz2so37243194igb.1 for ; Wed, 10 Jun 2015 09:17:15 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150610131354.GO19417@two.firstfloor.org> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 09:17:15 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andi Kleen Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen wrote: > > Assuming the page tables are cache-hot... And hot here does not mean > L3 cache, but higher. But a memory intensive workload can easily > violate that. In practice, no. You'll spend all your time on the actual real data cache misses, the TLB misses won't be any more noticeable. And if your access patters are even *remoptely* cache-friendly (ie _not_ spending all your time just waiting for regular data cache misses), then a radix-tree-like page table like Intel will have much better locality in the page tables than in the actual data. So again, the TLB misses won't be your big problem. There may be pathological cases where you just look at one word per page, but let's face it, we don't optimize for pathological or unrealistic cases. And the thing is, you need to look at the costs. Single-page invalidation taking hundreds of cycles? Yeah, we definitely need to take the downside of trying to be clever into account. If the invalidation was really cheap, the rules might change. As it is, I really don't think there is any question about this. That's particularly true when the single-page invalidation approach has lots of *software* overhead too - not just the complexity, but even "obvious" costs feeding the list of pages to be invalidated across CPU's. Think about it - there are cache misses there too, and because we do those across CPU's those cache misses are *mandatory*. So trying to avoid a few TLB misses by forcing mandatory cache misses and extra complexity, and by doing lots of 200+ cycle operations? Really? In what universe does that sound like a good idea? Quite frankly, I can pretty much *guarantee* that you didn't actually think about any real numbers, you've just been taught that fairy-tale of "TLB misses are expensive". As if TLB entries were somehow sacred. If somebody can show real numbers on a real workload, that's one thing. But in the absence of those real numbers, we should go for the simple and straightforward code, and not try to make up "but but but it could be expensive" issues. Trying to avoid IPI's by batching them up sounds like a very obvious win. I absolutely believe that is worth it. I just don't believe it is worth it trying to pass a lot of page detail data around. The number of individual pages flushed should be *very* controlled. Here's a suggestion: the absolute maximum number of TLB entries we flush should be something we can describe in one single cacheline. Strive to minimize the number of cachelines we have to transfer for the IPI, instead of trying to minimize the number of TLB entries. So maybe soemthing along the lines of "one word of flags (all or nothing, or a range), or up to seven indivual addresses to be flushed". I personally think even seven invlpg's is likely too much, but with the "single cacheline" at least there's another basis for the limit too. So anyway, I like the patch series. I just think that the final patch - the one that actually saves the addreses, and limits things to BATCH_TLBFLUSH_SIZE, should be limited. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ie0-f172.google.com (mail-ie0-f172.google.com [209.85.223.172]) by kanga.kvack.org (Postfix) with ESMTP id BA2E76B006C for ; Wed, 10 Jun 2015 12:42:35 -0400 (EDT) Received: by iesa3 with SMTP id a3so38352981ies.2 for ; Wed, 10 Jun 2015 09:42:35 -0700 (PDT) Received: from mail-ig0-x22f.google.com (mail-ig0-x22f.google.com. [2607:f8b0:4001:c05::22f]) by mx.google.com with ESMTPS id b1si9381847ico.3.2015.06.10.09.42.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 09:42:35 -0700 (PDT) Received: by igbpi8 with SMTP id pi8so40008731igb.1 for ; Wed, 10 Jun 2015 09:42:35 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 09:42:34 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andi Kleen Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds wrote: > > So anyway, I like the patch series. I just think that the final patch > - the one that actually saves the addreses, and limits things to > BATCH_TLBFLUSH_SIZE, should be limited. Oh, and another thing: Mel, can you please make that "struct tlbflush_unmap_batch" be just part of "struct task_struct" rather than a pointer? If you are worried about the cpumask size, you could use cpumask_var_t cpumask; and alloc_cpumask_var(..) ... free_cpumask_var(..) for that. That way, sane configurations never have the allocation cost. (Of course, sad to say, sane configurations are probably few and far between. At least Fedora seems to ship with a kernel where NR_CPU's is 1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU cycles that is..) Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 715B06B0032 for ; Wed, 10 Jun 2015 13:07:07 -0400 (EDT) Received: by wibut5 with SMTP id ut5so54987737wib.1 for ; Wed, 10 Jun 2015 10:07:06 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id o19si19183605wjr.59.2015.06.10.10.07.05 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Jun 2015 10:07:05 -0700 (PDT) Date: Wed, 10 Jun 2015 18:07:00 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610170700.GG26425@suse.de> References: <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 09:17:15AM -0700, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen wrote: > > > > Assuming the page tables are cache-hot... And hot here does not mean > > L3 cache, but higher. But a memory intensive workload can easily > > violate that. > > In practice, no. > > You'll spend all your time on the actual real data cache misses, the > TLB misses won't be any more noticeable. > > And if your access patters are even *remoptely* cache-friendly (ie > _not_ spending all your time just waiting for regular data cache > misses), then a radix-tree-like page table like Intel will have much > better locality in the page tables than in the actual data. So again, > the TLB misses won't be your big problem. > > There may be pathological cases where you just look at one word per > page, but let's face it, we don't optimize for pathological or > unrealistic cases. > It's concerns like this that have me avoiding any micro-benchmarking approach that tried to measure the indirect costs of refills. No matter what the microbenchmark does, there will be other cases that render it irrelevant. > And the thing is, you need to look at the costs. Single-page > invalidation taking hundreds of cycles? Yeah, we definitely need to > take the downside of trying to be clever into account. > > If the invalidation was really cheap, the rules might change. As it > is, I really don't think there is any question about this. > > That's particularly true when the single-page invalidation approach > has lots of *software* overhead too - not just the complexity, but > even "obvious" costs feeding the list of pages to be invalidated > across CPU's. Think about it - there are cache misses there too, and > because we do those across CPU's those cache misses are *mandatory*. > > So trying to avoid a few TLB misses by forcing mandatory cache misses > and extra complexity, and by doing lots of 200+ cycle operations? > Really? In what universe does that sound like a good idea? > > Quite frankly, I can pretty much *guarantee* that you didn't actually > think about any real numbers, you've just been taught that fairy-tale > of "TLB misses are expensive". As if TLB entries were somehow sacred. > Everyone has been taught that one. Papers I've read from the last two years on TLB implementations or page reclaim management bring this up as a supporting point for whatever they are proposing. It was partially why I kept PFN tracking and also to put much of the cost on the reclaimer and minimise interference on the recipient of the IPI. I still think it was a rational concern but will assume that refills are cheaper than smart invalidations until it can be proven otherwise. > If somebody can show real numbers on a real workload, that's one > thing. The last adjustments made today to the series are at http://git.kernel.org/cgit/linux/kernel/git/mel/linux-balancenuma.git/log/?h=mm-vmscan-lessipi-v7r5 I'll redo it on top of 4.2-rc1 whenever that happens so gets a full round in linux-next. Patch 4 can be revisited if a real workload is found that is not deliberately pathological running on a CPU that matters. The forward port of patch 4 for testing will be trivial. It also separated out the dynamic allocation of the structure so that it can be excluded if deemed to be an unnecessary complication. > So anyway, I like the patch series. I just think that the final patch > - the one that actually saves the addreses, and limits things to > BATCH_TLBFLUSH_SIZE, should be limited. > I see your logic but if it's limited then we send more IPIs and it's all crappy tradeoffs. If a real workload complains, it'll be far easier to work with. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 98F2C6B0032 for ; Wed, 10 Jun 2015 13:25:04 -0400 (EDT) Received: by wigg3 with SMTP id g3so55776252wig.1 for ; Wed, 10 Jun 2015 10:25:04 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id df4si10988810wib.111.2015.06.10.10.25.02 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Jun 2015 10:25:02 -0700 (PDT) Date: Wed, 10 Jun 2015 18:24:57 +0100 From: Mel Gorman Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610172457.GH26425@suse.de> References: <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 09:42:34AM -0700, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds > wrote: > > > > So anyway, I like the patch series. I just think that the final patch > > - the one that actually saves the addreses, and limits things to > > BATCH_TLBFLUSH_SIZE, should be limited. > > Oh, and another thing: > > Mel, can you please make that "struct tlbflush_unmap_batch" be just > part of "struct task_struct" rather than a pointer? > Yes, that was done earlier today based on Ingo's review so that the allocation could be dealt with as a separate path at the end of the series. > If you are worried about the cpumask size, you could use > > cpumask_var_t cpumask; > > and > > alloc_cpumask_var(..) > ... > free_cpumask_var(..) > > for that. > > That way, sane configurations never have the allocation cost. > Ok, good point. Patch 3 in my git tree ("mm: Dynamically allocate TLB batch unmap control structure") does not do this but I'll look into doing it before the release based on 4.2-rc1. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f175.google.com (mail-ig0-f175.google.com [209.85.213.175]) by kanga.kvack.org (Postfix) with ESMTP id 716BD6B0032 for ; Wed, 10 Jun 2015 13:31:02 -0400 (EDT) Received: by igblz2 with SMTP id lz2so38773872igb.1 for ; Wed, 10 Jun 2015 10:31:02 -0700 (PDT) Received: from mail-ie0-x231.google.com (mail-ie0-x231.google.com. [2607:f8b0:4001:c03::231]) by mx.google.com with ESMTPS id cy5si4379799igc.53.2015.06.10.10.31.01 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 10:31:01 -0700 (PDT) Received: by iebgx4 with SMTP id gx4so39237984ieb.0 for ; Wed, 10 Jun 2015 10:31:01 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150610172457.GH26425@suse.de> References: <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> <20150610172457.GH26425@suse.de> Date: Wed, 10 Jun 2015 10:31:01 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 10:24 AM, Mel Gorman wrote: > > Yes, that was done earlier today based on Ingo's review so that the > allocation could be dealt with as a separate path at the end of the series. Ahh, ok, never mind then. > Ok, good point. Patch 3 in my git tree ("mm: Dynamically allocate TLB > batch unmap control structure") does not do this but I'll look into doing > it before the release based on 4.2-rc1. I'm not sure how size-sensitive this is. The 'struct task_struct' is pretty big already, and if somebody builds a MAXSMP kernel, I really don't think they worry too much about wasting a few bytes for each process. Clearly they either are insane, or they actually *have* a big machine, in which point the allocation is not going to be wasteful and they'll likely trigger this code all the time anyway, so trying to be any more dynamic about it is probably not worth it. So my "maybe you could use 'cpumask_var_t' here" suggestion isn't necessarily even worth it. Just statically allocating it is probably perfectly fine. Of course, again the counter-example for that may well be how distros seem to make thousand-cpu configurations the default. That does seem insane to me. But I guess the pain of multiple kernel configs is just too much. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ie0-f175.google.com (mail-ie0-f175.google.com [209.85.223.175]) by kanga.kvack.org (Postfix) with ESMTP id 4FE4A6B006C for ; Wed, 10 Jun 2015 14:08:15 -0400 (EDT) Received: by iebmu5 with SMTP id mu5so39899607ieb.1 for ; Wed, 10 Jun 2015 11:08:15 -0700 (PDT) Received: from mail-ig0-x231.google.com (mail-ig0-x231.google.com. [2607:f8b0:4001:c05::231]) by mx.google.com with ESMTPS id e7si4451277igl.62.2015.06.10.11.08.14 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Jun 2015 11:08:14 -0700 (PDT) Received: by igbpi8 with SMTP id pi8so41495185igb.0 for ; Wed, 10 Jun 2015 11:08:14 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 14:08:14 -0400 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Josh Boyer Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner On Wed, Jun 10, 2015 at 12:42 PM, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds > wrote: >> >> So anyway, I like the patch series. I just think that the final patch >> - the one that actually saves the addreses, and limits things to >> BATCH_TLBFLUSH_SIZE, should be limited. > > Oh, and another thing: > > Mel, can you please make that "struct tlbflush_unmap_batch" be just > part of "struct task_struct" rather than a pointer? > > If you are worried about the cpumask size, you could use > > cpumask_var_t cpumask; > > and > > alloc_cpumask_var(..) > ... > free_cpumask_var(..) > > for that. > > That way, sane configurations never have the allocation cost. > > (Of course, sad to say, sane configurations are probably few and far > between. At least Fedora seems to ship with a kernel where NR_CPU's is > 1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU > cycles that is..) The insane part being NR_CPUS = 1024? Or that to have said number requires cpumask being dynamically allocated to avoid stack overflow? (Or both I guess). josh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f47.google.com (mail-wg0-f47.google.com [74.125.82.47]) by kanga.kvack.org (Postfix) with ESMTP id E14826B0038 for ; Thu, 11 Jun 2015 11:26:06 -0400 (EDT) Received: by wgme6 with SMTP id e6so7288115wgm.2 for ; Thu, 11 Jun 2015 08:26:06 -0700 (PDT) Received: from mail-wi0-x234.google.com (mail-wi0-x234.google.com. [2a00:1450:400c:c05::234]) by mx.google.com with ESMTPS id el1si2479882wib.120.2015.06.11.08.26.04 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 11 Jun 2015 08:26:05 -0700 (PDT) Received: by wibdq8 with SMTP id dq8so12384799wib.1 for ; Thu, 11 Jun 2015 08:26:04 -0700 (PDT) Date: Thu, 11 Jun 2015 17:26:00 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150611152559.GA15509@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> <20150610090813.GA30359@gmail.com> <20150610101529.GE26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150610101529.GE26425@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner * Mel Gorman wrote: > > > I made a really clear and unambiguous chain of arguments: > > > > > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > > > a whole new bunch of them. [...] > > > > ... and note that your claim that 'we were doing them before, this is just an > > equivalent transformation' is utter bullsh*t technically: what we were doing > > previously was a hideously expensive IPI combined with an INVLPG. > > And replacing it with an INVLPG without excessive IPI transmission is changing > one major variable. Going straight to a full TLB flush is changing two major > variables. [...] But this argument employs the fallacy that the transition to 'batching with PFN tracking' is not a major variable in itself. In reality it is a major variable: it adds extra complexity, such as the cross CPU data flow (the pfn tracking), and it also changes the distribution of the flushes and related caching patterns. > > [...] > > > > The batching limit (which you set to 32) should then be tuned by comparing it > > to a working full-flushing batching logic, not by comparing it to the previous > > single IPI per single flush approach! > > We can decrease it easily but increasing it means we also have to change > SWAP_CLUSTER_MAX because otherwise enough pages are not unmapped for flushes and > it is a requirement that we flush before freeing the pages. That changes another > complex variable because at the very least, it alters LRU lock hold times. ('should then be' implied 'as a separate patch/series', obviously.) I.e. all I wanted to observe is that I think the series did not explore the performance impact of the batching limit, because it was too focused on the INVLPG approach which has an internal API limit of 33. Now that the TLB flushing side is essentially limit-less, a future enhancement would be to further increase batching. My suspicion is that say doubling SWAP_CLUSTER_MAX would possibly further reduce the IPI rate, at a negligible cost. But this observation does not impact the current series in any case. > > ... and if the benefits of a complex algorithm are not measurable and if there > > are doubts about the cost/benefit tradeoff then frankly it should not exist in > > the kernel in the first place. It's not like the Linux TLB flushing code is > > too boring due to overwhelming simplicity. > > > > and yes, it's my job as a maintainer to request measurements justifying > > complexity and your ad hominem attacks against me are disgusting - you should > > know better. > > It was not intended as an ad hominem attack and my apologies for that. I wanted > to express my frustration that a series that adjusted one variable with known > benefit will be rejected for a series that adjusts two major variables instead > with the second variable being very sensitive to workload and CPU. ... but that's not what it did: it adjusted multiple complex variables already, with a questionable rationale for more complexity. And my argument was and continues to be: start with the simplest variant and iterate from there. Which you seem to have adapted in your latest series, so my concerns are addressed: Acked-by: Ingo Molnar Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 828E76B0032 for ; Sun, 21 Jun 2015 16:23:05 -0400 (EDT) Received: by wicgi11 with SMTP id gi11so58148660wic.0 for ; Sun, 21 Jun 2015 13:23:04 -0700 (PDT) Received: from johanna1.rokki.sonera.fi (mta-out1.inet.fi. [62.71.2.229]) by mx.google.com with ESMTP id wl7si31361811wjc.206.2015.06.21.13.23.03 for ; Sun, 21 Jun 2015 13:23:03 -0700 (PDT) Date: Sun, 21 Jun 2015 23:22:31 +0300 From: "Kirill A. Shutemov" Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150621202231.GB6766@node.dhcp.inet.fi> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5577078B.2000503@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Ingo Molnar Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote: > On 06/09/2015 05:43 AM, Ingo Molnar wrote: > > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > > +static void fn_flush_tlb_one(void) > > +{ > > + unsigned long addr = (unsigned long)&tlb_flush_target; > > + > > + tlb_flush_target[0]++; > > + __flush_tlb_one(addr); > > +} > > So we've got an increment of a variable in kernel memory (which is > almost surely in the L1), then we flush that memory location, and repeat > the increment. BTW, Ingo, have you disabled direct mapping of kernel memory with 2M/1G pages for the test? I'm just thinking if there is chance that the test shooting out 1G tlb entry. In this case we're measure wrong thing. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id CC25B6B0038 for ; Thu, 25 Jun 2015 07:48:27 -0400 (EDT) Received: by wibdq8 with SMTP id dq8so73315190wib.1 for ; Thu, 25 Jun 2015 04:48:27 -0700 (PDT) Received: from mail-wg0-x235.google.com (mail-wg0-x235.google.com. [2a00:1450:400c:c00::235]) by mx.google.com with ESMTPS id bn1si8193697wib.38.2015.06.25.04.48.24 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 25 Jun 2015 04:48:25 -0700 (PDT) Received: by wgck11 with SMTP id k11so60397659wgc.0 for ; Thu, 25 Jun 2015 04:48:24 -0700 (PDT) Date: Thu, 25 Jun 2015 13:48:20 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150625114819.GA20478@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150621202231.GB6766@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: Dave Hansen , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov * Kirill A. Shutemov wrote: > On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote: > > On 06/09/2015 05:43 AM, Ingo Molnar wrote: > > > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > > > +static void fn_flush_tlb_one(void) > > > +{ > > > + unsigned long addr = (unsigned long)&tlb_flush_target; > > > + > > > + tlb_flush_target[0]++; > > > + __flush_tlb_one(addr); > > > +} > > > > So we've got an increment of a variable in kernel memory (which is > > almost surely in the L1), then we flush that memory location, and repeat > > the increment. > > BTW, Ingo, have you disabled direct mapping of kernel memory with 2M/1G > pages for the test? No, the kernel was using gbpages. > I'm just thinking if there is chance that the test shooting out 1G tlb entry. In > this case we're measure wrong thing. Yeah, you are right, it's slightly wrong, but still gave us the right numbers to work with. I've updated the benchmarks with 4K flushes as well. Changes to the previous measurement: - I've refined the testing methodology. (It now looks at the true distribution of delays, not just the minimum or an average: this is especially important for cache-cold numbers were perturbations (such as irq or SMI events) can generate outliers in both directions.) - Added tests to attempt to measure the TLB miss costs, I've split the INVLPG measurements into over a dozen separate variants, which should help measure the overall impact of TLB flushes and should help quantify the 'hidden' cost cost associated with TLB flushes as well. Here are the numbers: Old 64-bit CPU (AMD) using 2MB/4K pages: [ 8.122002] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ 12.269003] x86/bench:######## MM methods: ####################################### [ 12.369003] x86/bench: __flush_tlb() fn : 127 / 273 [ 12.469003] x86/bench: __flush_tlb_global() fn : 589 / 1501 [ 12.568003] x86/bench: access 2M page fn : 0 / 440 [ 12.668003] x86/bench: invlpg 2M page fn : 93 / 512 [ 12.768003] x86/bench: access+invlpg 2M page fn : 107 / 512 [ 12.868003] x86/bench: invlpg+access 2M page fn : 94 / 535 [ 12.972003] x86/bench: flush+access 2M page fn : 113 / 1060 [ 13.070003] x86/bench: access 1x 4K page fn : 0 / 5 [ 13.166003] x86/bench: access 2x 4K page fn : 1 / 150 [ 13.264003] x86/bench: access 3x 4K page fn : 1 / 331 [ 13.360003] x86/bench: access 4x 4K page fn : 1 / 360 [ 13.463003] x86/bench: invlpg+access 1x 4K page fn : 95 / 247 [ 13.566003] x86/bench: invlpg+access 2x 4K page fn : 188 / 543 [ 13.668003] x86/bench: invlpg+access 3x 4K page fn : 281 / 836 [ 13.774003] x86/bench: invlpg+access 4x 4K page fn : 374 / 1165 [ 13.878003] x86/bench: flush+access 1x 4K page fn : 114 / 275 [ 13.981003] x86/bench: flush+access 2x 4K page fn : 115 / 480 [ 14.083003] x86/bench: flush+access 3x 4K page fn : 115 / 671 [ 14.183003] x86/bench: flush+access 4x 4K page fn : 116 / 670 [ 14.286003] x86/bench: flush 4K page fn : 108 / 247 [ 14.385003] x86/bench: access+invlpg 4K page fn : 94 / 287 [ 14.502003] x86/bench: __flush_tlb_range() fn : 280 / 5173 Newer 64-bit CPU (Intel) using gbpages/4K pages: [ ] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ ] x86/bench:######## MM methods: ####################################### [ 124.362251] x86/bench: __flush_tlb() fn : 152 / 168 [ 125.287283] x86/bench: __flush_tlb_global() fn : 936 / 1312 [ 126.227641] x86/bench: access GB page fn : 12 / 304 [ 127.152369] x86/bench: invlpg GB page fn : 280 / 584 [ 128.084376] x86/bench: access+invlpg GB page fn : 280 / 592 [ 129.018430] x86/bench: invlpg+access GB page fn : 280 / 596 [ 129.939640] x86/bench: flush+access GB page fn : 152 / 396 [ 130.849827] x86/bench: access 1x 4K page fn : 0 / 0 [ 131.796836] x86/bench: access 2x 4K page fn : 12 / 432 [ 132.738849] x86/bench: access 3x 4K page fn : 16 / 736 [ 133.661315] x86/bench: access 4x 4K page fn : 12 / 736 [ 134.575238] x86/bench: invlpg+access 1x 4K page fn : 280 / 356 [ 135.506638] x86/bench: invlpg+access 2x 4K page fn : 500 / 1152 [ 136.432645] x86/bench: invlpg+access 3x 4K page fn : 728 / 1920 [ 137.355536] x86/bench: invlpg+access 4x 4K page fn : 956 / 3084 [ 138.296174] x86/bench: flush+access 1x 4K page fn : 148 / 280 [ 139.229264] x86/bench: flush+access 2x 4K page fn : 152 / 728 [ 140.165103] x86/bench: flush+access 3x 4K page fn : 136 / 736 [ 141.088691] x86/bench: flush+access 4x 4K page fn : 152 / 820 [ 141.994946] x86/bench: flush 4K page fn : 280 / 328 [ 142.902499] x86/bench: access+invlpg 4K page fn : 280 / 352 [ 143.814408] x86/bench: __flush_tlb_range() fn : 424 / 2100 Notes: - 'access' means an ACCESS_ONCE() load. - 'flush' means a CR3 based full flush - 'invlpg' means an __flush_tlb_one() INVLPG flush - the 4K pages are vmalloc()-ed pages. - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the first byte in each - PGE is turned off in the CR4 for these measuremnts, to make sure the vmalloc()-ed page(s) get flushed by the CR3 method as well. - there's a bit of a jitter in the numbers, for example the 'flush+access 3x 4K' number is lower - but the jitter is in the 10 cycles range max As you can see (assuming the numbers are correct), the 'full' CR3 based flushing is superior to any __flush_tlb_one() method: [ 134.575238] x86/bench: invlpg+access 1x 4K page fn : 280 / 356 [ 135.506638] x86/bench: invlpg+access 2x 4K page fn : 500 / 1152 [ 136.432645] x86/bench: invlpg+access 3x 4K page fn : 728 / 1920 [ 137.355536] x86/bench: invlpg+access 4x 4K page fn : 956 / 3084 [ 138.296174] x86/bench: flush+access 1x 4K page fn : 148 / 280 [ 139.229264] x86/bench: flush+access 2x 4K page fn : 152 / 728 [ 140.165103] x86/bench: flush+access 3x 4K page fn : 136 / 736 [ 141.088691] x86/bench: flush+access 4x 4K page fn : 152 / 820 In the cache-hot case: with the CR3 based method the _full_ cost of flushing (i.e. the cost of the flush plus the cost of the access) is roughly constant. With INVLPG it's both higher, and increases linearly. In the cache-cold case: the costs are increasing linearly, but it's much higher in the INVLPG case. So in fact I'd argue that, somewhat counter-intuitively, we should consider doing a full flush even for single-page flushes, for example for page faults... I'll do more measurements (and will publish the latest patches) to increase confidence in the numbers. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f179.google.com (mail-ig0-f179.google.com [209.85.213.179]) by kanga.kvack.org (Postfix) with ESMTP id CA6A46B0038 for ; Thu, 25 Jun 2015 14:36:26 -0400 (EDT) Received: by igin14 with SMTP id n14so61653618igi.1 for ; Thu, 25 Jun 2015 11:36:26 -0700 (PDT) Received: from mail-ig0-x229.google.com (mail-ig0-x229.google.com. [2607:f8b0:4001:c05::229]) by mx.google.com with ESMTPS id yv5si25920546icb.22.2015.06.25.11.36.26 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 25 Jun 2015 11:36:26 -0700 (PDT) Received: by igblr2 with SMTP id lr2so61521874igb.0 for ; Thu, 25 Jun 2015 11:36:26 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150625114819.GA20478@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> Date: Thu, 25 Jun 2015 11:36:25 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: multipart/alternative; boundary=001a113feefca6105105195be6b2 Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Thomas Gleixner , linux-mm , Borislav Petkov , Hugh Dickins , Andrew Morton , Minchan Kim , Rik van Riel , "Kirill A. Shutemov" , H Peter Anvin , Andi Kleen , Peter Zijlstra , Dave Hansen , Mel Gorman , LKML --001a113feefca6105105195be6b2 Content-Type: text/plain; charset=UTF-8 On Jun 25, 2015 04:48, "Ingo Molnar" wrote: > > - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the > first byte in each So that test is a bit unfair. From previous timing of Intel TLB fills, I can tell you that Intel is particularly good at doing adjacent entries. That's independent of the fact that page tables have very good locality (if they are the radix tree type - the hashed page tables that ppc uses are shit). So when filling adjacent entries, you take the cache misses for the page tables only once, but even aside from that, Intel send to do particularly well at the "next page" TLB fill case Now, I think that's a reasonably common case, and I'm not saying that it's unfair to compare for that reason, but it does highlight the good case for TLB walking. So I would suggest you highlight the bad case too: use invlpg to invalidate *one* TLB entry, and then walk four non-adjacent entries. And compare *that* to the full TLB flush. Now, I happen to still believe in the full flush, but let's not pick benchmarks that might not show the advantages of the finer granularity. Linus --001a113feefca6105105195be6b2 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


On Jun 25, 2015 04:48, "Ingo Molnar" <mingo@kernel.org> wrote:
>
> =C2=A0- 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages ar= e accessed, the
> =C2=A0 =C2=A0first byte in each

So that test is a bit unfair. From previous timing of Intel = TLB fills, I can tell you that Intel is particularly good at doing adjacent= entries.

That's independent of the fact that page tables have ver= y good locality (if they are the radix tree type - the hashed page tables t= hat ppc uses are shit). So when filling adjacent entries, you take the cach= e misses for the page tables only once, but even aside from that, Intel sen= d to do particularly well at the "next page" TLB fill case

Now, I think that's a reasonably common case, and I'= m not saying that it's unfair to compare for that reason, but it does h= ighlight the good case for TLB walking.

So I would suggest you highlight the bad case too: use invlp= g to invalidate *one* TLB entry, and then walk four non-adjacent entries. A= nd compare *that* to the full TLB flush.

Now, I happen to still believe in the full flush, but let= 9;s not pick benchmarks that might not show the advantages of the finer gra= nularity.

=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Linus

--001a113feefca6105105195be6b2-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f177.google.com (mail-pd0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id 958796B0038 for ; Thu, 25 Jun 2015 14:46:39 -0400 (EDT) Received: by pdjn11 with SMTP id n11so58767573pdj.0 for ; Thu, 25 Jun 2015 11:46:39 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id m3si46443210pdr.71.2015.06.25.11.46.38 for ; Thu, 25 Jun 2015 11:46:38 -0700 (PDT) Message-ID: <558C4C8E.5010107@intel.com> Date: Thu, 25 Jun 2015 11:46:38 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> In-Reply-To: <20150625114819.GA20478@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar , "Kirill A. Shutemov" Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov On 06/25/2015 04:48 AM, Ingo Molnar wrote: > I've updated the benchmarks with 4K flushes as well. Changes to the previous > measurement: Did you push these out somewhere? The tip tmp.fpu branch hasn't seen any updates. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id B13176B0038 for ; Thu, 25 Jun 2015 15:15:18 -0400 (EDT) Received: by wicnd19 with SMTP id nd19so26709616wic.1 for ; Thu, 25 Jun 2015 12:15:18 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 18si54313938wjs.134.2015.06.25.12.15.16 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 25 Jun 2015 12:15:17 -0700 (PDT) Message-ID: <558C5342.9020702@suse.cz> Date: Thu, 25 Jun 2015 21:15:14 +0200 From: Vlastimil Babka MIME-Version: 1.0 Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds , Ingo Molnar Cc: Thomas Gleixner , linux-mm , Borislav Petkov , Hugh Dickins , Andrew Morton , Minchan Kim , Rik van Riel , "Kirill A. Shutemov" , H Peter Anvin , Andi Kleen , Peter Zijlstra , Dave Hansen , Mel Gorman , LKML On 25.6.2015 20:36, Linus Torvalds wrote: > > On Jun 25, 2015 04:48, "Ingo Molnar" > wrote: >> >> - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the >> first byte in each > > So that test is a bit unfair. From previous timing of Intel TLB fills, I can > tell you that Intel is particularly good at doing adjacent entries. > > That's independent of the fact that page tables have very good locality (if they > are the radix tree type - the hashed page tables that ppc uses are shit). So > when filling adjacent entries, you take the cache misses for the page tables > only once, but even aside from that, Intel send to do particularly well at the > "next page" TLB fill case AFAIK that's because they also cache partial translations, so if the first 3 levels are the same (as they mostly are for the "next page" scenario) it will only have to look at the last level of pages tables. AMD does that too. > Now, I think that's a reasonably common case, and I'm not saying that it's > unfair to compare for that reason, but it does highlight the good case for TLB > walking. > > So I would suggest you highlight the bad case too: use invlpg to invalidate > *one* TLB entry, and then walk four non-adjacent entries. And compare *that* to > the full TLB flush. > > Now, I happen to still believe in the full flush, but let's not pick benchmarks > that might not show the advantages of the finer granularity. > > Linus > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f174.google.com (mail-ig0-f174.google.com [209.85.213.174]) by kanga.kvack.org (Postfix) with ESMTP id 103396B0038 for ; Thu, 25 Jun 2015 18:04:31 -0400 (EDT) Received: by igin14 with SMTP id n14so1753360igi.1 for ; Thu, 25 Jun 2015 15:04:30 -0700 (PDT) Received: from mail-ie0-x22a.google.com (mail-ie0-x22a.google.com. [2607:f8b0:4001:c03::22a]) by mx.google.com with ESMTPS id p20si5461485igs.34.2015.06.25.15.04.30 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 25 Jun 2015 15:04:30 -0700 (PDT) Received: by iecvh10 with SMTP id vh10so64045867iec.3 for ; Thu, 25 Jun 2015 15:04:30 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <558C5342.9020702@suse.cz> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> <558C5342.9020702@suse.cz> Date: Thu, 25 Jun 2015 15:04:30 -0700 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Ingo Molnar , Thomas Gleixner , linux-mm , Borislav Petkov , Hugh Dickins , Andrew Morton , Minchan Kim , Rik van Riel , "Kirill A. Shutemov" , H Peter Anvin , Andi Kleen , Peter Zijlstra , Dave Hansen , Mel Gorman , LKML On Thu, Jun 25, 2015 at 12:15 PM, Vlastimil Babka wrote: >> [..] even aside from that, Intel seems to do particularly well at the >> "next page" TLB fill case > > AFAIK that's because they also cache partial translations, so if the first 3 > levels are the same (as they mostly are for the "next page" scenario) it will > only have to look at the last level of pages tables. AMD does that too. In the tests I did long ago, Intel seemed to do even more than that. Now, it's admittedly somewhat hard to do a good job at TLB fill cost measurement with any aggressively out-of-order core (lots of asynchronous and potentially speculative stuff going on), so without any real knowledge about what is going on it's kind of hard to say. But I had an old TLB benchmark (long lost now, so I can't re-run it to check on this) that implied that when Intel brings in a new TLB entry, they brought it in 32 bytes at a time, and would then look up the adjacent entries in that aligned half-cache-line in just a single cycle. (This was back in the P6/P4/Core Duo kinds of days - we're talking 10-15 years ago. The tests I did mapped the constant zero-page in a big area, so that the effects of the D$ would be minimized, and I'd basically *only* see the TLB miss costs. I also did things like measure the costs of the micro-fault that happens when marking a page table entry entry dirty for the first time etc, which was absolutely horrendously slow on a P4). Now, it could have been other effects too - things like multiple outstanding TLB misses in parallel due to prefetching TLB entries could possibly cause similar patterns, although I distinctly remember that effect of aligned 8-entry entries. So the simplest explanation *seemed* to be a very special single-entry 32-byte "L0 cache" for TLB fills (and 32 bytes would probably match the L1 fetch width into the core). That's in *addition* to the fact that the upper level entries are cached in the TLB. As mentioned earlier, from my Transmeta days I have some insight into what old pre-NT Windows used to do. The GDI protection traversal seemed to flush the whole TLB for every GDI kernel entry, and some of the graphics benchmarks (remember: this was back in the days when unaccelerated VGA graphics was still a big deal) would literally blow the TLB away every 5-10k instructions. So it actually makes sense that Intel and AMD spent a *lot* of effort on making TLB fills go fast, because those GDI benchmarks were important back then. It was all the basic old 2D window handling and font drawing etc benchmarking. None of the RISC vendors ever cared, they had absolutely crap hardware, and instead encouraged their SW partners (particularly databases) to change the software to use large-pages and any trick possible to avoid TLB misses. Their compilers would schedule loads early and stores late, because the whole memory subsystem was a toy. TLB misses would break the whole pipeline etc. Truly crap hardware, and people still look up to it. Sad. In the Windows world, that wasn't an option: when Microsoft said "jump", Intel and AMD both said "how high?". As a result, x86 is a *lot* more well-rounded than any RISC I have ever seen, because Intel/AMD were simply forced to run any crap software, rather than having software writers optimize for what they were good at. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id DDC946B0038 for ; Fri, 26 Jun 2015 05:09:03 -0400 (EDT) Received: by wiga1 with SMTP id a1so11789470wig.0 for ; Fri, 26 Jun 2015 02:09:03 -0700 (PDT) Received: from mail-wi0-x22f.google.com (mail-wi0-x22f.google.com. [2a00:1450:400c:c05::22f]) by mx.google.com with ESMTPS id f18si57340501wjz.182.2015.06.26.02.09.02 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 26 Jun 2015 02:09:02 -0700 (PDT) Received: by wicnd19 with SMTP id nd19so11758705wic.1 for ; Fri, 26 Jun 2015 02:09:02 -0700 (PDT) Date: Fri, 26 Jun 2015 11:08:56 +0200 From: Ingo Molnar Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150626090856.GA31657@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> <558C4C8E.5010107@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <558C4C8E.5010107@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: "Kirill A. Shutemov" , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov * Dave Hansen wrote: > On 06/25/2015 04:48 AM, Ingo Molnar wrote: > > > I've updated the benchmarks with 4K flushes as well. Changes to the previous > > measurement: > > Did you push these out somewhere? The tip tmp.fpu branch hasn't seen any > updates. Not yet, I still don't trust the numbers. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752839AbbFHMvJ (ORCPT ); Mon, 8 Jun 2015 08:51:09 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46107 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750762AbbFHMu7 (ORCPT ); Mon, 8 Jun 2015 08:50:59 -0400 From: Mel Gorman To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman Subject: [PATCH 0/3] TLB flush multiple pages per IPI v5 Date: Mon, 8 Jun 2015 13:50:51 +0100 Message-Id: <1433767854-24408-1-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Changelog since V4 o Rebase to 4.1-rc6 Changelog since V3 o Drop batching of TLB flush from migration o Redo how larger batching is managed o Batch TLB flushes when writable entries exist When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series simply increases the window so multiple pages can be flushed using a single IPI. This *should* be safe or the kernel is hosed already but I've cc'd the x86 maintainers and some of the Intel folk for comment. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIS. Patch 2 collects a list of PFNs and sends one IPI to flush them all Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/mm/tlb.c | 1 + include/linux/init_task.h | 8 +++ include/linux/mm_types.h | 1 + include/linux/rmap.h | 3 + include/linux/sched.h | 15 +++++ include/trace/events/tlb.h | 3 +- init/Kconfig | 8 +++ kernel/fork.c | 5 ++ kernel/sched/core.c | 3 + mm/internal.h | 15 +++++ mm/rmap.c | 119 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 30 +++++++++- 14 files changed, 210 insertions(+), 4 deletions(-) -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752987AbbFHMvS (ORCPT ); Mon, 8 Jun 2015 08:51:18 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46106 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751582AbbFHMu7 (ORCPT ); Mon, 8 Jun 2015 08:50:59 -0400 From: Mel Gorman To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman Subject: [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Date: Mon, 8 Jun 2015 13:50:52 +0100 Message-Id: <1433767854-24408-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Reviewed-by: Dave Hansen --- arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/trace/events/tlb.h | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3250f2371aea..2da824c1c140 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, info.flush_end = end; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start); if (is_uv_system()) { unsigned int cpu; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8d37e26a1007..86ad9f902042 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -534,6 +534,7 @@ enum tlb_flush_reason { TLB_REMOTE_SHOOTDOWN, TLB_LOCAL_SHOOTDOWN, TLB_LOCAL_MM_SHOOTDOWN, + TLB_REMOTE_SEND_IPI, NR_TLB_FLUSH_REASONS, }; diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h index 4250f364a6ca..bc8815f45f3b 100644 --- a/include/trace/events/tlb.h +++ b/include/trace/events/tlb.h @@ -11,7 +11,8 @@ EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \ EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \ EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \ - EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) + EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \ + EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" ) /* * First define the enums in TLB_FLUSH_REASON to be exported to userspace -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753119AbbFHMvh (ORCPT ); Mon, 8 Jun 2015 08:51:37 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46120 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752100AbbFHMvA (ORCPT ); Mon, 8 Jun 2015 08:51:00 -0400 From: Mel Gorman To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman Subject: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Date: Mon, 8 Jun 2015 13:50:53 +0100 Message-Id: <1433767854-24408-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org An IPI is sent to flush remote TLBs when a page is unmapped that was recently accessed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a structure similar in principle to a pagevec to collect a list of PFNs and CPUs that require flushing. It then sends one IPI per CPU that was mapping any of those pages to flush the list of PFNs. A new TLB flush helper is required for this and one is added for x86. Other architectures will need to decide if batching like this is both safe and worth the memory overhead. Specifically the requirement is; If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that page from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. vmscale on a 4-node machine with 64G RAM and 48 CPUs 4.1.0-rc6 4.1.0-rc6 vanilla batchunmap-v5 User 577.35 618.60 System 5927.06 4195.03 Elapsed 162.21 121.31 The workload completed 25% faster with 29% less CPU time. This is showing that the readers completed 25% with 30% less CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is much lower on a small machine vmscale on a 1-node machine with 8G RAM and 1 CPU 4.1.0-rc6 4.1.0-rc6 vanilla batchunmap-v5 User 59.14 58.86 System 109.15 83.78 Elapsed 27.32 23.14 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. Signed-off-by: Mel Gorman --- arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + include/linux/init_task.h | 8 +++ include/linux/rmap.h | 3 ++ include/linux/sched.h | 14 ++++++ init/Kconfig | 8 +++ kernel/fork.c | 5 ++ kernel/sched/core.c | 3 ++ mm/internal.h | 11 +++++ mm/rmap.c | 105 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 23 ++++++++- 11 files changed, 181 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 226d5696e1d1..4e8bd86735af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -31,6 +31,7 @@ config X86 select ARCH_MIGHT_HAVE_PC_SERIO select HAVE_AOUT if X86_32 select HAVE_UNSTABLE_SCHED_CLOCK + select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 select ARCH_SUPPORTS_INT128 if X86_64 select HAVE_IDE diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index cd791948b286..10c197a649f5 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr) * and page-granular flushes are available only on i486 and up. */ +#define flush_local_tlb_addr(addr) __flush_tlb_single(addr) + #ifndef CONFIG_SMP /* "_up" is for UniProcessor. diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 696d22312b31..0771937b47e1 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -175,6 +175,13 @@ extern struct task_group root_task_group; # define INIT_NUMA_BALANCING(tsk) #endif +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ + .tlb_ubc = NULL, +#else +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) +#endif + #ifdef CONFIG_KASAN # define INIT_KASAN(tsk) \ .kasan_depth = 1, @@ -257,6 +264,7 @@ extern struct task_group root_task_group; INIT_RT_MUTEXES(tsk) \ INIT_VTIME(tsk) \ INIT_NUMA_BALANCING(tsk) \ + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ INIT_KASAN(tsk) \ } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c89c53a113a8..29446aeef36e 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -89,6 +89,9 @@ enum ttu_flags { TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ + TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible + * and caller guarantees they will + * do a final flush if necessary */ }; #ifdef CONFIG_MMU diff --git a/include/linux/sched.h b/include/linux/sched.h index 26a2e6122734..57ff61b16565 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1289,6 +1289,16 @@ enum perf_event_task_context { perf_nr_task_contexts, }; +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ +#define BATCH_TLBFLUSH_SIZE 32UL + +/* Track pages that require TLB flushes */ +struct tlbflush_unmap_batch { + struct cpumask cpumask; + unsigned long nr_pages; + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ void *stack; @@ -1648,6 +1658,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + struct tlbflush_unmap_batch *tlb_ubc; +#endif + struct rcu_head rcu; /* diff --git a/init/Kconfig b/init/Kconfig index dc24dec60232..1ff5d17e518a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -904,6 +904,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING bool # +# For architectures that have a local TLB flush for a PFN without knowledge +# of the VMA. The architecture must provide guarantees on what happens if +# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c +# near the check for should_defer_flush. +config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + bool + +# # For architectures that know their GCC __int128 support is sound # config ARCH_SUPPORTS_INT128 diff --git a/kernel/fork.c b/kernel/fork.c index 03c1eaaa6ef5..8ea9e8730ada 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk) delayacct_tsk_free(tsk); put_signal_struct(tsk->signal); +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + kfree(tsk->tlb_ubc); + tsk->tlb_ubc = NULL; +#endif + if (!profile_handoff_task(tsk)) free_task(tsk); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 123673291ffb..936ce13aeb97 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->numa_group = NULL; #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH + p->tlb_ubc = NULL; +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ } #ifdef CONFIG_NUMA_BALANCING diff --git a/mm/internal.h b/mm/internal.h index a25e359a4039..8cbb68ccc731 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, #define ALLOC_CMA 0x80 /* allow allocations from CMA areas */ #define ALLOC_FAIR 0x100 /* fair zone allocation */ +enum ttu_flags; +struct tlbflush_unmap_batch; + +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +void try_to_unmap_flush(void); +#else +static inline void try_to_unmap_flush(void) +{ +} + +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ #endif /* __MM_INTERNAL_H */ diff --git a/mm/rmap.c b/mm/rmap.c index 24dd3f9fee27..a8dbba62398a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -60,6 +60,8 @@ #include +#include + #include "internal.h" static struct kmem_cache *anon_vma_cachep; @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +static void percpu_flush_tlb_batch_pages(void *data) +{ + struct tlbflush_unmap_batch *tlb_ubc = data; + int i; + + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); + for (i = 0; i < tlb_ubc->nr_pages; i++) + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); +} + +/* + * Flush TLB entries for recently unmapped pages from remote CPUs. It is + * important that if a PTE was dirty when it was unmapped that it's flushed + * before any IO is initiated on the page to prevent lost writes. Similarly, + * it must be flushed before freeing to prevent data leakage. + */ +void try_to_unmap_flush(void) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + int cpu; + + if (!tlb_ubc || !tlb_ubc->nr_pages) + return; + + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); + + preempt_disable(); + cpu = smp_processor_id(); + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); + + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { + smp_call_function_many(&tlb_ubc->cpumask, + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); + } + cpumask_clear(&tlb_ubc->cpumask); + tlb_ubc->nr_pages = 0; + preempt_enable(); +} + +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, + struct page *page) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + + cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm)); + tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page); + tlb_ubc->nr_pages++; + + if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE) + try_to_unmap_flush(); +} + +/* + * Returns true if the TLB flush should be deferred to the end of a batch of + * unmap operations to reduce IPIs. + */ +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) +{ + bool should_defer = false; + + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) + return false; + + /* If remote CPUs need to be flushed then defer batch the flush */ + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) + should_defer = true; + put_cpu(); + + return should_defer; +} +#else +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, + struct page *page) +{ +} + +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) +{ + return false; +} +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ + /* * At what user virtual address is page expected in vma? * Caller should check the page is actually part of the vma. @@ -1213,7 +1299,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + if (should_defer_flush(mm, flags)) { + /* + * We clear the PTE but do not flush so potentially a remote + * CPU could still be writing to the page. If the entry was + * previously clean then the architecture must guarantee that + * a clear->dirty transition on a cached TLB entry is written + * through and traps if the PTE is unmapped. + */ + pteval = ptep_get_and_clear(mm, address, pte); + + /* Potentially writable TLBs must be flushed before IO */ + if (pte_dirty(pteval)) + flush_tlb_page(vma, address); + else + set_tlb_ubc_flush_pending(mm, page); + } else { + pteval = ptep_clear_flush(vma, address, pte); + } /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8eadd71bac..5121742ccb87 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, * processes. Try to unmap it here. */ if (page_mapped(page) && mapping) { - switch (try_to_unmap(page, ttu_flags)) { + switch (try_to_unmap(page, + ttu_flags|TTU_BATCH_FLUSH)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: @@ -1175,6 +1176,7 @@ keep: } mem_cgroup_uncharge_list(&free_pages); + try_to_unmap_flush(); free_hot_cold_page_list(&free_pages, true); list_splice(&ret_pages, page_list); @@ -2118,6 +2120,23 @@ out: } } +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH +/* + * Allocate the control structure for batch TLB flushing. An allocation + * failure is harmless as the reclaimer will send IPIs where necessary. + */ +static void alloc_tlb_ubc(void) +{ + if (!current->tlb_ubc) + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), + GFP_KERNEL | __GFP_NOWARN); +} +#else +static inline void alloc_tlb_ubc(void) +{ +} +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ + /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ @@ -2152,6 +2171,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness, scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() && sc->priority == DEF_PRIORITY); + alloc_tlb_ubc(); + blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753054AbbFHMvX (ORCPT ); Mon, 8 Jun 2015 08:51:23 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46134 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752331AbbFHMvB (ORCPT ); Mon, 8 Jun 2015 08:51:01 -0400 From: Mel Gorman To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML , Mel Gorman Subject: [PATCH 3/3] mm: Defer flush of writable TLB entries Date: Mon, 8 Jun 2015 13:50:54 +0100 Message-Id: <1433767854-24408-4-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If a PTE is unmapped and it's dirty then it was writable recently. Due to deferred TLB flushing, it's best to assume a writable TLB cache entry exists. With that assumption, the TLB must be flushed before any IO can start or the page is freed to avoid lost writes or data corruption. This patch defers flushing of potentially writable TLBs as long as possible. Signed-off-by: Mel Gorman --- include/linux/sched.h | 1 + mm/internal.h | 4 ++++ mm/rmap.c | 28 +++++++++++++++++++++------- mm/vmscan.c | 7 ++++++- 4 files changed, 32 insertions(+), 8 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 57ff61b16565..827d9b123bd5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1296,6 +1296,7 @@ enum perf_event_task_context { struct tlbflush_unmap_batch { struct cpumask cpumask; unsigned long nr_pages; + bool writable; unsigned long pfns[BATCH_TLBFLUSH_SIZE]; }; diff --git a/mm/internal.h b/mm/internal.h index 8cbb68ccc731..ecf47a01420d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -438,10 +438,14 @@ struct tlbflush_unmap_batch; #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH void try_to_unmap_flush(void); +void try_to_unmap_flush_dirty(void); #else static inline void try_to_unmap_flush(void) { } +static inline void try_to_unmap_flush_dirty(void) +{ +} #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ #endif /* __MM_INTERNAL_H */ diff --git a/mm/rmap.c b/mm/rmap.c index a8dbba62398a..3c8ebacfe6ef 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -621,11 +621,21 @@ void try_to_unmap_flush(void) } cpumask_clear(&tlb_ubc->cpumask); tlb_ubc->nr_pages = 0; + tlb_ubc->writable = false; preempt_enable(); } +/* Flush iff there are potentially writable TLB entries that can race with IO */ +void try_to_unmap_flush_dirty(void) +{ + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; + + if (tlb_ubc && tlb_ubc->writable) + try_to_unmap_flush(); +} + static void set_tlb_ubc_flush_pending(struct mm_struct *mm, - struct page *page) + struct page *page, bool writable) { struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; @@ -633,6 +643,14 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page); tlb_ubc->nr_pages++; + /* + * If the PTE was dirty then it's best to assume it's writable. The + * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush() + * before the page any IO is initiated. + */ + if (writable) + tlb_ubc->writable = true; + if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE) try_to_unmap_flush(); } @@ -657,7 +675,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) } #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, - struct page *page) + struct page *page, bool writable) { } @@ -1309,11 +1327,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, */ pteval = ptep_get_and_clear(mm, address, pte); - /* Potentially writable TLBs must be flushed before IO */ - if (pte_dirty(pteval)) - flush_tlb_page(vma, address); - else - set_tlb_ubc_flush_pending(mm, page); + set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval)); } else { pteval = ptep_clear_flush(vma, address, pte); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 5121742ccb87..0055224c52d4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1065,7 +1065,12 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (!sc->may_writepage) goto keep_locked; - /* Page is dirty, try to write it out here */ + /* + * Page is dirty. Flush the TLB if a writable entry + * potentially exists to avoid CPU writes after IO + * starts and then write it out here + */ + try_to_unmap_flush_dirty(); switch (pageout(page, mapping, sc)) { case PAGE_KEEP: goto keep_locked; -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753263AbbFHRqD (ORCPT ); Mon, 8 Jun 2015 13:46:03 -0400 Received: from mail-wg0-f47.google.com ([74.125.82.47]:33992 "EHLO mail-wg0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751627AbbFHRp4 (ORCPT ); Mon, 8 Jun 2015 13:45:56 -0400 Date: Mon, 8 Jun 2015 19:45:51 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608174551.GA27558@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1433767854-24408-1-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Mel Gorman wrote: > Changelog since V4 > o Rebase to 4.1-rc6 > > Changelog since V3 > o Drop batching of TLB flush from migration > o Redo how larger batching is managed > o Batch TLB flushes when writable entries exist > > When unmapping pages it is necessary to flush the TLB. If that page was > accessed by another CPU then an IPI is used to flush the remote CPU. That > is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. > > There already is a window between when a page is unmapped and when it is > TLB flushed. This series simply increases the window so multiple pages > can be flushed using a single IPI. This *should* be safe or the kernel is > hosed already but I've cc'd the x86 maintainers and some of the Intel folk > for comment. > > Patch 1 simply made the rest of the series easier to write as ftrace > could identify all the senders of TLB flush IPIS. > > Patch 2 collects a list of PFNs and sends one IPI to flush them all > > Patch 3 tracks when there potentially are writable TLB entries that > need to be batched differently > > The performance impact is documented in the changelogs but in the optimistic > case on a 4-socket machine the full series reduces interrupts from 900K > interrupts/second to 60K interrupts/second. Yeah, so I think batching writable flushes is useful I think, but I disagree with one aspect of it: with the need to gather _pfns_ and batch them over to the remote CPU. As per my measurements the __flush_tlb_single() primitive (which you use in patch #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 pages and gets exponentially worse. It's probably done in microcode and its performance is horrible. So have you explored the possibility to significantly simplify your patch-set by only deferring the flushing, and doing a simple TLB flush on the remote CPU? As per your measurements there must be tons and tons of flushes of lots of pages, the pfn tracking simply does not make sense. That way there's no memory overhead and no complex tracking of pfns - we'd basically track a simple deferred-flush bit instead. We'd still have the benefits of batching the IPIs, which is the main win. I strongly suspect that your numbers will become even better with such a variant. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752988AbbFHSV5 (ORCPT ); Mon, 8 Jun 2015 14:21:57 -0400 Received: from mga11.intel.com ([192.55.52.93]:4260 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751530AbbFHSVt (ORCPT ); Mon, 8 Jun 2015 14:21:49 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,574,1427785200"; d="scan'208";a="743131474" Message-ID: <5575DD33.3000400@intel.com> Date: Mon, 08 Jun 2015 11:21:39 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , Mel Gorman CC: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> In-Reply-To: <20150608174551.GA27558@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/08/2015 10:45 AM, Ingo Molnar wrote: > As per my measurements the __flush_tlb_single() primitive (which you use in patch > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > pages and gets exponentially worse. It's probably done in microcode and its > performance is horrible. I discussed this a bit in commit a5102476a2. I'd be curious what numbers you came up with. But, don't we have to take in to account the cost of refilling the TLB in addition to the cost of emptying it? The TLB size is historically increasing on a per-core basis, so isn't this refill cost only going to get worse? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932162AbbFHTwt (ORCPT ); Mon, 8 Jun 2015 15:52:49 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:37091 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751459AbbFHTwm (ORCPT ); Mon, 8 Jun 2015 15:52:42 -0400 Date: Mon, 8 Jun 2015 21:52:37 +0200 From: Ingo Molnar To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608195237.GA15429@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5575DD33.3000400@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Dave Hansen wrote: > On 06/08/2015 10:45 AM, Ingo Molnar wrote: > > As per my measurements the __flush_tlb_single() primitive (which you use in patch > > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > > pages and gets exponentially worse. It's probably done in microcode and its > > performance is horrible. > > I discussed this a bit in commit a5102476a2. I'd be curious what > numbers you came up with. ... which for those of us who don't have sha1's cached in their brain is: a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)") ;-) So what I measured agrees generally with the comment you added in the commit: + * Each single flush is about 100 ns, so this caps the maximum overhead at + * _about_ 3,000 ns. Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity! A CR3 driven TLB flush takes less time than a single INVLPG (!): [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles it's true that a full flush has hidden costs not measured above, because it has knock-on effects (because it drops non-global TLB entries), but it's not _that_ bad due to: - there almost always being a L1 or L2 cache miss when a TLB miss occurs, which latency can be overlaid - global bit being held for kernel entries - user-space with high memory pressure trashing through TLBs typically ... and especially with caches and Intel's historically phenomenally low TLB refill latency it's difficult to measure the effects of local TLB refills, let alone measure it in any macro benchmark. Cross-CPU flushes are expensive, absolutely no argument about that - my suggestion here is to keep the batching but simplify it: because I strongly suspect that the biggest win is the batching, not the pfn queueing. We might even win a bit more performance due to the simplification. > But, don't we have to take in to account the cost of refilling the TLB in > addition to the cost of emptying it? The TLB size is historically increasing on > a per-core basis, so isn't this refill cost only going to get worse? Only if TLB refill latency sucks - but Intel's is very good and AMD's is pretty good as well. Also, usually if you miss the TLB you miss the cache line as well (you definitely miss the L1 cache, and TLB caches are sized to hold a fair chunk of your L2 cache), and the CPU can overlap the two latencies. So while it might sound counter-intuitive, a full TLB flush might be faster than trying to do software based TLB cache management ... INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as clear-cut as it might look. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932345AbbFHUDV (ORCPT ); Mon, 8 Jun 2015 16:03:21 -0400 Received: from mail-wi0-f178.google.com ([209.85.212.178]:38422 "EHLO mail-wi0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752584AbbFHUDN (ORCPT ); Mon, 8 Jun 2015 16:03:13 -0400 Date: Mon, 8 Jun 2015 22:03:08 +0200 From: Ingo Molnar To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608200308.GA16978@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150608195237.GA15429@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Ingo Molnar wrote: > So what I measured agrees generally with the comment you added in the commit: > > + * Each single flush is about 100 ns, so this caps the maximum overhead at > + * _about_ 3,000 ns. > > Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity! > > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles > > it's true that a full flush has hidden costs not measured above, because it has > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > bad due to: > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > which latency can be overlaid > > - global bit being held for kernel entries > > - user-space with high memory pressure trashing through TLBs typically I also have cache-cold numbers from another (Intel) system: [ 0.176473] x86/bench:########################################################################## [ 0.185656] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ 1.234448] x86/bench: Cost of: null : 35 / 73 cycles [ ........] [ 27.930451] x86/bench:######## MM instructions: ###################################### [ 28.979251] x86/bench: Cost of: __flush_tlb() fn : 251 / 366 cycles [ 30.028795] x86/bench: Cost of: __flush_tlb_global() fn : 746 / 1795 cycles [ 31.077862] x86/bench: Cost of: __flush_tlb_one() fn : 237 / 883 cycles [ 32.127371] x86/bench: Cost of: __flush_tlb_range() fn : 312 / 1603 cycles [ 35.254202] x86/bench: Cost of: wbinvd() insn : 2491761 / 2491922 cycles Note how the numbers are even worse in the cache-cold case: the algorithmic complexity of __flush_tlb_range() versus __flush_tlb() makes it run slower (because we miss the I$), while the TLB cache-preservation argument is probably weaker, because when we are cache cold then TLB refill latency probably matters less (as it can be overlapped). So __flush_tlb_range() is software trying to beat hardware, and that's almost always a bad idea on x86. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752457AbbFHVIG (ORCPT ); Mon, 8 Jun 2015 17:08:06 -0400 Received: from mga11.intel.com ([192.55.52.93]:21863 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751888AbbFHVH7 (ORCPT ); Mon, 8 Jun 2015 17:07:59 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,575,1427785200"; d="scan'208";a="707518474" Message-ID: <5576042E.9030001@intel.com> Date: Mon, 08 Jun 2015 14:07:58 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar CC: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> In-Reply-To: <20150608195237.GA15429@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/08/2015 12:52 PM, Ingo Molnar wrote: > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles How was that measured, btw? Are these instructions running in a loop? Does __flush_tlb_one() include the tracepoint? (From the commit I referenced) This was (probably) using a different method than you did, but "FULL" below is __flush_tlb() while "1" is __flush_tlb_one(). The "cycles" includes some overhead from the tracing: > FULL: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 > 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 So it looks like we've got some discrepancy, either from the test methodology or the CPU. All of the code and my methodology are in the commit. Could you share yours? > it's true that a full flush has hidden costs not measured above, because it has > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > bad due to: > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > which latency can be overlaid > > - global bit being held for kernel entries > > - user-space with high memory pressure trashing through TLBs typically > > ... and especially with caches and Intel's historically phenomenally low TLB > refill latency it's difficult to measure the effects of local TLB refills, let > alone measure it in any macro benchmark. All that you're saying there is that you need to consider how TLB misses act in _practice_ and not just measure worst-case or theoretical TLB miss cost. I completely agree with that. ... > INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as > clear-cut as it might look. It's clear as mud! I'd be very interested to see any numbers for how this affects real workloads. I've been unable to find anything that was measurably affected by invlpg vs a full flush. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932504AbbFHVvD (ORCPT ); Mon, 8 Jun 2015 17:51:03 -0400 Received: from mail-wi0-f179.google.com ([209.85.212.179]:37866 "EHLO mail-wi0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932167AbbFHVvA (ORCPT ); Mon, 8 Jun 2015 17:51:00 -0400 Date: Mon, 8 Jun 2015 23:50:54 +0200 From: Ingo Molnar To: Dave Hansen Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150608215054.GB30566@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <5575DD33.3000400@intel.com> <20150608195237.GA15429@gmail.com> <5576042E.9030001@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5576042E.9030001@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Dave Hansen wrote: > On 06/08/2015 12:52 PM, Ingo Molnar wrote: > > A CR3 driven TLB flush takes less time than a single INVLPG (!): > > > > [ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles > > [ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles > > [ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles > > How was that measured, btw? Are these instructions running in a loop? Yes - see the x86 benchmarking patch in the big FPU submission for an earlier version. > Does __flush_tlb_one() include the tracepoint? No tracing overhead. > (From the commit I referenced) This was (probably) using a different method than > you did, but "FULL" below is __flush_tlb() while "1" is __flush_tlb_one(). The > "cycles" includes some overhead from the tracing: > > > FULL: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 > > 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 > > So it looks like we've got some discrepancy, either from the test methodology or > the CPU. All of the code and my methodology are in the commit. Could you share > yours? Yes, you can reproduce it by applying this patch from the FPU series: Subject: [PATCH 207/208] x86/fpu: Add FPU performance measurement subsystem (you were Cc:-ed to it, so it should be in your inbox.) I've got a more advanced version meanwhile, will post it in the next couple of days or so. > > it's true that a full flush has hidden costs not measured above, because it has > > knock-on effects (because it drops non-global TLB entries), but it's not _that_ > > bad due to: > > > > - there almost always being a L1 or L2 cache miss when a TLB miss occurs, > > which latency can be overlaid > > > > - global bit being held for kernel entries > > > > - user-space with high memory pressure trashing through TLBs typically > > > > ... and especially with caches and Intel's historically phenomenally low TLB > > refill latency it's difficult to measure the effects of local TLB refills, let > > alone measure it in any macro benchmark. > > All that you're saying there is that you need to consider how TLB misses act in > _practice_ and not just measure worst-case or theoretical TLB miss cost. I > completely agree with that. So I'm saying considerably more than that: I consider it likely that a full TLB flush is not nearly as costly as assumed, for the three reasons outlined above. It might even be a performance win in Mel's benchmark - although possibly not measurable within measurement noise levels. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932581AbbFHWiW (ORCPT ); Mon, 8 Jun 2015 18:38:22 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:58935 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752956AbbFHWiP (ORCPT ); Mon, 8 Jun 2015 18:38:15 -0400 Date: Mon, 8 Jun 2015 15:38:13 -0700 From: Andrew Morton To: Mel Gorman Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML Subject: Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Message-Id: <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> In-Reply-To: <1433767854-24408-3-git-send-email-mgorman@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <1433767854-24408-3-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 8 Jun 2015 13:50:53 +0100 Mel Gorman wrote: > An IPI is sent to flush remote TLBs when a page is unmapped that was > recently accessed by other CPUs. There are many circumstances where this > happens but the obvious one is kswapd reclaiming pages belonging to a > running process as kswapd and the task are likely running on separate CPUs. > > On small machines, this is not a significant problem but as machine > gets larger with more cores and more memory, the cost of these IPIs can > be high. This patch uses a structure similar in principle to a pagevec > to collect a list of PFNs and CPUs that require flushing. It then sends > one IPI per CPU that was mapping any of those pages to flush the list of > PFNs. A new TLB flush helper is required for this and one is added for > x86. Other architectures will need to decide if batching like this is both > safe and worth the memory overhead. Specifically the requirement is; > > If a clean page is unmapped and not immediately flushed, the > architecture must guarantee that a write to that page from a CPU > with a cached TLB entry will trap a page fault. > > This is essentially what the kernel already depends on but the window is > much larger with this patch applied and is worth highlighting. > > The impact of this patch depends on the workload as measuring any benefit > requires both mapped pages co-located on the LRU and memory pressure. The > case with the biggest impact is multiple processes reading mapped pages > taken from the vm-scalability test suite. The test case uses NR_CPU readers > of mapped files that consume 10*RAM. > > vmscale on a 4-node machine with 64G RAM and 48 CPUs > 4.1.0-rc6 4.1.0-rc6 > vanilla batchunmap-v5 > User 577.35 618.60 > System 5927.06 4195.03 > Elapsed 162.21 121.31 > > The workload completed 25% faster with 29% less CPU time. > > This is showing that the readers completed 25% with 30% less CPU time. From > vmstats, it is known that the vanilla kernel was interrupted roughly 900K > times per second during the steady phase of the test and the patched kernel > was interrupts 180K times per second. > > The impact is much lower on a small machine > > vmscale on a 1-node machine with 8G RAM and 1 CPU > 4.1.0-rc6 4.1.0-rc6 > vanilla batchunmap-v5 > User 59.14 58.86 > System 109.15 83.78 > Elapsed 27.32 23.14 > > It's still a noticeable improvement with vmstat showing interrupts went > from roughly 500K per second to 45K per second. Looks nice. > The patch will have no impact on workloads with no memory pressure or > have relatively few mapped pages. What benefit can we expect to see to any real-world userspace? > --- a/include/linux/init_task.h > +++ b/include/linux/init_task.h > @@ -175,6 +175,13 @@ extern struct task_group root_task_group; > # define INIT_NUMA_BALANCING(tsk) > #endif > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > + .tlb_ubc = NULL, > +#else > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) > +#endif > + > #ifdef CONFIG_KASAN > # define INIT_KASAN(tsk) \ > .kasan_depth = 1, > @@ -257,6 +264,7 @@ extern struct task_group root_task_group; > INIT_RT_MUTEXES(tsk) \ > INIT_VTIME(tsk) \ > INIT_NUMA_BALANCING(tsk) \ > + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > INIT_KASAN(tsk) \ > } We don't really need any of this - init_task starts up all-zero anyway. Maybe it's useful for documentation reasons (dubious), but I bet we've already missed fields. > > ... > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1289,6 +1289,16 @@ enum perf_event_task_context { > perf_nr_task_contexts, > }; > > +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ > +#define BATCH_TLBFLUSH_SIZE 32UL > + > +/* Track pages that require TLB flushes */ > +struct tlbflush_unmap_batch { > + struct cpumask cpumask; > + unsigned long nr_pages; > + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; Why are we storing pfn's rather than page*'s? I'm trying to get my head around what's actually in this structure. Each thread has one of these, lazily allocated . The cpumask field contains a mask of all the CPUs which have done . The careful reader will find mm_struct.cpu_vm_mask_var and will wonder why it didn't need documenting, sigh. Wanna fill in the blanks? As usual, understanding the data structure is the key to understanding the design, so it's worth a couple of paragraphs. With this knowledge, the reader may understand why try_to_unmap_flush() has preempt_disable() protection but set_tlb_ubc_flush_pending() doesn't need it! > +}; > + > struct task_struct { > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ > void *stack; > > ... > > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) > return address; > } > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +static void percpu_flush_tlb_batch_pages(void *data) > +{ > + struct tlbflush_unmap_batch *tlb_ubc = data; > + int i; Anally speaking, this should be unsigned long (in which case it shouldn't be called `i'!). Or make tlbflush_unmap_batch.nr_pages int. But int is signed, which is silly. Whatever ;) > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > + for (i = 0; i < tlb_ubc->nr_pages; i++) > + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); > +} > + > +/* > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is > + * important that if a PTE was dirty when it was unmapped that it's flushed s/important that/important / > + * before any IO is initiated on the page to prevent lost writes. Similarly, > + * it must be flushed before freeing to prevent data leakage. > + */ > +void try_to_unmap_flush(void) > +{ > + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; > + int cpu; > + > + if (!tlb_ubc || !tlb_ubc->nr_pages) > + return; > + > + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); > + > + preempt_disable(); > + cpu = smp_processor_id(); get_cpu() > + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) > + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); > + > + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { > + smp_call_function_many(&tlb_ubc->cpumask, > + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); > + } > + cpumask_clear(&tlb_ubc->cpumask); > + tlb_ubc->nr_pages = 0; > + preempt_enable(); put_cpu() > +} > > ... > > +/* > + * Returns true if the TLB flush should be deferred to the end of a batch of > + * unmap operations to reduce IPIs. > + */ > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > +{ > + bool should_defer = false; > + > + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) > + return false; > + > + /* If remote CPUs need to be flushed then defer batch the flush */ > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > + should_defer = true; > + put_cpu(); What did the get_cpu() protect? > + return should_defer; > +} > +#else > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, > + struct page *page) > +{ > +} > + > > ... > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > +/* > + * Allocate the control structure for batch TLB flushing. An allocation > + * failure is harmless as the reclaimer will send IPIs where necessary. > + */ > +static void alloc_tlb_ubc(void) > +{ > + if (!current->tlb_ubc) > + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), > + GFP_KERNEL | __GFP_NOWARN); A GFP_KERNEL allocation from deep within page reclaim? Seems imprudent if only from a stack-usage POV. > +} > +#else > +static inline void alloc_tlb_ubc(void) > +{ > +} > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > ... > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932897AbbFIIr6 (ORCPT ); Tue, 9 Jun 2015 04:47:58 -0400 Received: from cantor2.suse.de ([195.135.220.15]:36819 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932858AbbFIIrq (ORCPT ); Tue, 9 Jun 2015 04:47:46 -0400 Date: Tue, 9 Jun 2015 09:47:39 +0100 From: Mel Gorman To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609084739.GQ26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150608174551.GA27558@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 08, 2015 at 07:45:51PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > Changelog since V4 > > o Rebase to 4.1-rc6 > > > > Changelog since V3 > > o Drop batching of TLB flush from migration > > o Redo how larger batching is managed > > o Batch TLB flushes when writable entries exist > > > > When unmapping pages it is necessary to flush the TLB. If that page was > > accessed by another CPU then an IPI is used to flush the remote CPU. That > > is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. > > > > There already is a window between when a page is unmapped and when it is > > TLB flushed. This series simply increases the window so multiple pages > > can be flushed using a single IPI. This *should* be safe or the kernel is > > hosed already but I've cc'd the x86 maintainers and some of the Intel folk > > for comment. > > > > Patch 1 simply made the rest of the series easier to write as ftrace > > could identify all the senders of TLB flush IPIS. > > > > Patch 2 collects a list of PFNs and sends one IPI to flush them all > > > > Patch 3 tracks when there potentially are writable TLB entries that > > need to be batched differently > > > > The performance impact is documented in the changelogs but in the optimistic > > case on a 4-socket machine the full series reduces interrupts from 900K > > interrupts/second to 60K interrupts/second. > > Yeah, so I think batching writable flushes is useful I think, but I disagree with > one aspect of it: with the need to gather _pfns_ and batch them over to the remote > CPU. > It's PFN-based for three reasons. The first is because the old code is flushing on a per-page basis and the intent of the series was to reduce the number of IPIs that requires. Moving away from that has an unpredictable impact that depends on the workload and the exact CPU used. The second is that a struct page-based interface would require percpu_flush_tlb_batch_pages to do a page->pfn lookup in the IPI call which is more expensive than necessary. The final reason is that the TLB flush API given to architectures at the moment includes single page primitives and while it's not necessarily the best decision for x86, the same may not be true for other architectures if they decide to activate the batching. > As per my measurements the __flush_tlb_single() primitive (which you use in patch > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for a 2 > pages and gets exponentially worse. It's probably done in microcode and its > performance is horrible. > > So have you explored the possibility to significantly simplify your patch-set by > only deferring the flushing, and doing a simple TLB flush on the remote CPU? Yes. At one point I looked at range flushing but it is not a good idea. The ranges that reach the end of the LRU are too large to be useful except in the ideal case of a workload that sequentially accesses memory. Flushing the full TLB has an unpredictable cost. Currently, the unmapping and flush is of inactive pages, some of which may not be in the TLB at all and the impact on the workload is limited to the IPI and flush cost. With a full flush we clear entries we know were recently accessed and may have to be looked up again and we do this every 32 mapped pages that are reclaimed. In the ideal case of a sequential mapped reader it would not matter as the entries are not needed so we would not see the cost at all. Other workloads will have to do a refill that was not necessary before this series. The cost of the refill will depend on the CPU and whether the lookup information is still in the CPU cache or not. That means measuring the full impact of your proposal is impossible as it depends heavily on the workload, the timing of its interaction with kswapd in particular, the state of the CPU cache and the cost of refills for the CPU. I agree with you in that it would be a simplier series and the actual flush would probably be faster but the downsides are too unpredictable for a series that primarily is about reducing the number of IPIs. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932385AbbFIKco (ORCPT ); Tue, 9 Jun 2015 06:32:44 -0400 Received: from mail-wi0-f179.google.com ([209.85.212.179]:37604 "EHLO mail-wi0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751344AbbFIKcg (ORCPT ); Tue, 9 Jun 2015 06:32:36 -0400 Date: Tue, 9 Jun 2015 12:32:31 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609103231.GA11026@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609084739.GQ26425@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Mel Gorman wrote: > > So have you explored the possibility to significantly simplify your patch-set > > by only deferring the flushing, and doing a simple TLB flush on the remote > > CPU? > > Yes. At one point I looked at range flushing but it is not a good idea. My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of user-space TLBs. > The ranges that reach the end of the LRU are too large to be useful except in > the ideal case of a workload that sequentially accesses memory. Flushing the > full TLB has an unpredictable cost. [...] Why would it have unpredictable cost? We flush the TLB on every process context switch. Yes, it's somewhat workload dependent, but the performance profile is so different anyway with batching that it has to be re-measured anyway. > With a full flush we clear entries we know were recently accessed and may have > to be looked up again and we do this every 32 mapped pages that are reclaimed. > In the ideal case of a sequential mapped reader it would not matter as the > entries are not needed so we would not see the cost at all. Other workloads will > have to do a refill that was not necessary before this series. The cost of the > refill will depend on the CPU and whether the lookup information is still in the > CPU cache or not. That means measuring the full impact of your proposal is > impossible as it depends heavily on the workload, the timing of its interaction > with kswapd in particular, the state of the CPU cache and the cost of refills > for the CPU. > > I agree with you in that it would be a simplier series and the actual flush > would probably be faster but the downsides are too unpredictable for a series > that primarily is about reducing the number of IPIs. Sorry, I don't buy this, at all. Please measure this, the code would become a lot simpler, as I'm not convinced that we need pfn (or struct page) or even range based flushing. I.e. please first implement the simplest remote batching variant, then complicate it if the numbers warrant it. Not the other way around. It's not like the VM code needs the extra complexity! Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932355AbbFILHz (ORCPT ); Tue, 9 Jun 2015 07:07:55 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47858 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933156AbbFILHj (ORCPT ); Tue, 9 Jun 2015 07:07:39 -0400 Date: Tue, 9 Jun 2015 12:07:32 +0100 From: Mel Gorman To: Andrew Morton Cc: Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Ingo Molnar , Linux-MM , LKML Subject: Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Message-ID: <20150609110732.GR26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <1433767854-24408-3-git-send-email-mgorman@suse.de> <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150608153813.dbadaceffe316d09ee9f2446@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 08, 2015 at 03:38:13PM -0700, Andrew Morton wrote: > On Mon, 8 Jun 2015 13:50:53 +0100 Mel Gorman wrote: > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > recently accessed by other CPUs. There are many circumstances where this > > happens but the obvious one is kswapd reclaiming pages belonging to a > > running process as kswapd and the task are likely running on separate CPUs. > > > > On small machines, this is not a significant problem but as machine > > gets larger with more cores and more memory, the cost of these IPIs can > > be high. This patch uses a structure similar in principle to a pagevec > > to collect a list of PFNs and CPUs that require flushing. It then sends > > one IPI per CPU that was mapping any of those pages to flush the list of > > PFNs. A new TLB flush helper is required for this and one is added for > > x86. Other architectures will need to decide if batching like this is both > > safe and worth the memory overhead. Specifically the requirement is; > > > > If a clean page is unmapped and not immediately flushed, the > > architecture must guarantee that a write to that page from a CPU > > with a cached TLB entry will trap a page fault. > > > > This is essentially what the kernel already depends on but the window is > > much larger with this patch applied and is worth highlighting. > > > > The impact of this patch depends on the workload as measuring any benefit > > requires both mapped pages co-located on the LRU and memory pressure. The > > case with the biggest impact is multiple processes reading mapped pages > > taken from the vm-scalability test suite. The test case uses NR_CPU readers > > of mapped files that consume 10*RAM. > > > > vmscale on a 4-node machine with 64G RAM and 48 CPUs > > 4.1.0-rc6 4.1.0-rc6 > > vanilla batchunmap-v5 > > User 577.35 618.60 > > System 5927.06 4195.03 > > Elapsed 162.21 121.31 > > > > The workload completed 25% faster with 29% less CPU time. > > > > This is showing that the readers completed 25% with 30% less CPU time. From > > vmstats, it is known that the vanilla kernel was interrupted roughly 900K > > times per second during the steady phase of the test and the patched kernel > > was interrupts 180K times per second. > > > > The impact is much lower on a small machine > > > > vmscale on a 1-node machine with 8G RAM and 1 CPU > > 4.1.0-rc6 4.1.0-rc6 > > vanilla batchunmap-v5 > > User 59.14 58.86 > > System 109.15 83.78 > > Elapsed 27.32 23.14 > > > > It's still a noticeable improvement with vmstat showing interrupts went > > from roughly 500K per second to 45K per second. > > Looks nice. > > > The patch will have no impact on workloads with no memory pressure or > > have relatively few mapped pages. > > What benefit can we expect to see to any real-world userspace? > Only a small subset of workloads will see a benefit -- ones that mmap a lot of data with working sets larger than the size of a node. Some streaming media servers allegedly do this. Some numerical processing applications may hit this. Those that use glibc for large buffers use mmap and if the application is larger than a NUMA node, it'll need to be unmapped and flushed. Python/NumPY uses large maps for large buffers (based on the paper "Doubling the Performance of Python/Numpy with less than 100 SLOC"). Whether users of NumPY hit this issue or not depends on whether kswapd is active. Anecdotally, I'm aware from IRC of one user that is experimenting with a large HTTP cache and a load generator that spent a lot of time sending IPIs that was obvious from the profile. I asked weeks ago that they post the results here which they promised they would but didn't. Unfortunately, I don't know the persons real name to cc them. Rik might. Anecdotally, I also believe that Intel hit this internally with some internal workload but I'm basing this on idle comments at LSF/MM. However they were unwilling or unable to describe exactly what the test does and against what software. I know this is more vague than you'd like. By and large, I'm relying on the assumption that if we are reclaiming mapped pages from kswapd context then sending one IPI per page is stupid. > > --- a/include/linux/init_task.h > > +++ b/include/linux/init_task.h > > @@ -175,6 +175,13 @@ extern struct task_group root_task_group; > > # define INIT_NUMA_BALANCING(tsk) > > #endif > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > > + .tlb_ubc = NULL, > > +#else > > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) > > +#endif > > + > > #ifdef CONFIG_KASAN > > # define INIT_KASAN(tsk) \ > > .kasan_depth = 1, > > @@ -257,6 +264,7 @@ extern struct task_group root_task_group; > > INIT_RT_MUTEXES(tsk) \ > > INIT_VTIME(tsk) \ > > INIT_NUMA_BALANCING(tsk) \ > > + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \ > > INIT_KASAN(tsk) \ > > } > > We don't really need any of this - init_task starts up all-zero anyway. > Maybe it's useful for documentation reasons (dubious), but I bet we've > already missed fields. > True. I'll look into removing it and make sure there is no adverse impact. > > > > ... > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -1289,6 +1289,16 @@ enum perf_event_task_context { > > perf_nr_task_contexts, > > }; > > > > +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ > > +#define BATCH_TLBFLUSH_SIZE 32UL > > + > > +/* Track pages that require TLB flushes */ > > +struct tlbflush_unmap_batch { > > + struct cpumask cpumask; > > + unsigned long nr_pages; > > + unsigned long pfns[BATCH_TLBFLUSH_SIZE]; > > Why are we storing pfn's rather than page*'s? > Because a page would require a page->pfn lookup within the IPI handler. That will work but it's unnecessarily wasteful. > I'm trying to get my head around what's actually in this structure. > > Each thread has one of these, lazily allocated . > The cpumask field contains a mask of all the CPUs which have done > . The careful reader will find mm_struct.cpu_vm_mask_var > and will wonder why it didn't need documenting, sigh. > > Wanna fill in the blanks? As usual, understanding the data structure > is the key to understanding the design, so it's worth a couple of > paragraphs. With this knowledge, the reader may understand why > try_to_unmap_flush() has preempt_disable() protection but > set_tlb_ubc_flush_pending() doesn't need it! > Is this any help? struct tlbflush_unmap_batch { /* * Each bit set is a CPU that potentially has a TLB entry for one of * the PFNs being flushed. See set_tlb_ubc_flush_pending(). */ struct cpumask cpumask; /* * The number and list of pfns to be flushed. PFNs are tracked instead * of struct pages to avoid multiple page->pfn lookups by each CPU that * receives an IPI in percpu_flush_tlb_batch_pages */ unsigned long nr_pages; unsigned long pfns[BATCH_TLBFLUSH_SIZE]; /* * If true then the PTE was dirty when unmapped. The entry must be * flushed before IO is initiated or a stale TLB entry potentially * allows an update without redirtying the page. */ bool writable; }; > > +}; > > + > > struct task_struct { > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ > > void *stack; > > > > ... > > > > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma) > > return address; > > } > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +static void percpu_flush_tlb_batch_pages(void *data) > > +{ > > + struct tlbflush_unmap_batch *tlb_ubc = data; > > + int i; > > Anally speaking, this should be unsigned long (in which case it > shouldn't be called `i'!). Or make tlbflush_unmap_batch.nr_pages int. > But int is signed, which is silly. Whatever ;) > I can make it unsigned int which is a micro-optimisation for loop iterators anyway. > > > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > > + for (i = 0; i < tlb_ubc->nr_pages; i++) > > + flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT); > > +} > > + > > +/* > > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is > > + * important that if a PTE was dirty when it was unmapped that it's flushed > > s/important that/important / > Fixed. > > + * before any IO is initiated on the page to prevent lost writes. Similarly, > > + * it must be flushed before freeing to prevent data leakage. > > + */ > > +void try_to_unmap_flush(void) > > +{ > > + struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc; > > + int cpu; > > + > > + if (!tlb_ubc || !tlb_ubc->nr_pages) > > + return; > > + > > + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages); > > + > > + preempt_disable(); > > + cpu = smp_processor_id(); > > get_cpu() > > > + if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) > > + percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask); > > + > > + if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) { > > + smp_call_function_many(&tlb_ubc->cpumask, > > + percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true); > > + } > > + cpumask_clear(&tlb_ubc->cpumask); > > + tlb_ubc->nr_pages = 0; > > + preempt_enable(); > > put_cpu() > Gack. Fixed. > > +} > > > > ... > > > > +/* > > + * Returns true if the TLB flush should be deferred to the end of a batch of > > + * unmap operations to reduce IPIs. > > + */ > > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > > +{ > > + bool should_defer = false; > > + > > + if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH)) > > + return false; > > + > > + /* If remote CPUs need to be flushed then defer batch the flush */ > > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > > + should_defer = true; > > + put_cpu(); > > What did the get_cpu() protect? > To safely lookup the current running CPU number. smp_processor_id() is potentially safe except in specific circumstances and I did not think raw_smp_processor_id() was justified. > > + return should_defer; > > +} > > +#else > > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, > > + struct page *page) > > +{ > > +} > > + > > > > ... > > > > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > +/* > > + * Allocate the control structure for batch TLB flushing. An allocation > > + * failure is harmless as the reclaimer will send IPIs where necessary. > > + */ > > +static void alloc_tlb_ubc(void) > > +{ > > + if (!current->tlb_ubc) > > + current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch), > > + GFP_KERNEL | __GFP_NOWARN); > > A GFP_KERNEL allocation from deep within page reclaim? Seems imprudent > if only from a stack-usage POV. > When we call it, we are in PF_MEMALLOC context either set in the page allocator from direct reclaim or because kswapd always sets it. This limits the stack depth. It would be comparable to the depth we reach during normal reclaim calling into shrink_page_list and everything below it so we should be safe. Granted, the dependency on PF_MEMALLOC is not obvious at all. > > +} > > +#else > > +static inline void alloc_tlb_ubc(void) > > +{ > > +} > > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > > > ... > > > -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753721AbbFILVF (ORCPT ); Tue, 9 Jun 2015 07:21:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:48939 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753351AbbFILVB (ORCPT ); Tue, 9 Jun 2015 07:21:01 -0400 Date: Tue, 9 Jun 2015 12:20:55 +0100 From: Mel Gorman To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609112055.GS26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609103231.GA11026@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 12:32:31PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > So have you explored the possibility to significantly simplify your patch-set > > > by only deferring the flushing, and doing a simple TLB flush on the remote > > > CPU? > > > > Yes. At one point I looked at range flushing but it is not a good idea. > > My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of > user-space TLBs. > I'm aware of that. I had considered both range flushing and full flushing as alternatives to PFN tracking and settled on PFN tracking as the least risky change. > > The ranges that reach the end of the LRU are too large to be useful except in > > the ideal case of a workload that sequentially accesses memory. Flushing the > > full TLB has an unpredictable cost. [...] > > Why would it have unpredictable cost? Because we have no means of knowing how many active TLB entries are flushed, no way of knowing if it matters and we potentially do this every 32 (BATCH_TLBFLUSH_SIZE) pages that are reclaimed. > We flush the TLB on every process context > switch. Yes, it's somewhat workload dependent, but the performance profile is so > different anyway with batching that it has to be re-measured anyway. > With the per-page flush, there is a direct cost associated with the operation -- the IPI and the TLB flushes. This is easy to measure. With a full flush there is an indirect cost -- the TLB entries that have to be refilled after the full flush. It also works against any notion of using ASID or similar mechanisms that avoid full flushes on context switches. It will be very easy to show the benefit in the direct case. The indirect case is both unpredictable and impossible to measure the full impact in all cases. > > With a full flush we clear entries we know were recently accessed and may have > > to be looked up again and we do this every 32 mapped pages that are reclaimed. > > In the ideal case of a sequential mapped reader it would not matter as the > > entries are not needed so we would not see the cost at all. Other workloads will > > have to do a refill that was not necessary before this series. The cost of the > > refill will depend on the CPU and whether the lookup information is still in the > > CPU cache or not. That means measuring the full impact of your proposal is > > impossible as it depends heavily on the workload, the timing of its interaction > > with kswapd in particular, the state of the CPU cache and the cost of refills > > for the CPU. > > > > I agree with you in that it would be a simplier series and the actual flush > > would probably be faster but the downsides are too unpredictable for a series > > that primarily is about reducing the number of IPIs. > > Sorry, I don't buy this, at all. > > Please measure this, the code would become a lot simpler, as I'm not convinced > that we need pfn (or struct page) or even range based flushing. > The code will be simplier and the cost of reclaim will be lower and that is the direct case but shows nothing about the indirect cost. The mapped reader will benefit as it is not reusing the TLB entries and will look artifically very good. It'll be very difficult for even experienced users to determine that a slowdown during kswapd activity is due to increased TLB misses incurred by the full flush. > I.e. please first implement the simplest remote batching variant, then complicate > it if the numbers warrant it. Not the other way around. It's not like the VM code > needs the extra complexity! > The simplest remote batching variant is a much more drastic change from what we do today and an unpredictable one. If we were to take that direction, it goes against the notion of making incremental changes. Even if we ultimately ended up with your proposal, it would make sense to separte it from this series by at least one release for bisection purposes. That way we get; Current: Send one IPI per page to unmap, active TLB entries preserved This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved Your proposal: Send one IPI, flush everything, active TLB entries must refill -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753974AbbFIMnl (ORCPT ); Tue, 9 Jun 2015 08:43:41 -0400 Received: from mail-wi0-f176.google.com ([209.85.212.176]:34918 "EHLO mail-wi0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751744AbbFIMne (ORCPT ); Tue, 9 Jun 2015 08:43:34 -0400 Date: Tue, 9 Jun 2015 14:43:28 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609124328.GA23066@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609112055.GS26425@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Mel Gorman wrote: > > Sorry, I don't buy this, at all. > > > > Please measure this, the code would become a lot simpler, as I'm not convinced > > that we need pfn (or struct page) or even range based flushing. > > The code will be simplier and the cost of reclaim will be lower and that is the > direct case but shows nothing about the indirect cost. The mapped reader will > benefit as it is not reusing the TLB entries and will look artifically very > good. It'll be very difficult for even experienced users to determine that a > slowdown during kswapd activity is due to increased TLB misses incurred by the > full flush. If so then the converse is true just as much: if you were to introduce finegrained flushing today, you couldn't justify it because you claim it's very hard to measure! Really, in such cases we should IMHO fall back to the simplest approach, and iterate from there. I cited very real numbers about the direct costs of TLB flushes, and plausible speculation about why the indirect costs are low on the achitecture you are trying to modify here. I think since it is you who wants to introduce additional complexity into the x86 MM code the burden is on you to provide proof that the complexity of pfn (or struct page) tracking is worth it. > > I.e. please first implement the simplest remote batching variant, then > > complicate it if the numbers warrant it. Not the other way around. It's not > > like the VM code needs the extra complexity! > > The simplest remote batching variant is a much more drastic change from what we > do today and an unpredictable one. If we were to take that direction, it goes > against the notion of making incremental changes. Even if we ultimately ended up > with your proposal, it would make sense to separte it from this series by at > least one release for bisection purposes. That way we get; > > Current: Send one IPI per page to unmap, active TLB entries preserved > This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved > Your proposal: Send one IPI, flush everything, active TLB entries must refill Not quite, my take of it is: Current: Simplest method: send one IPI per page to unmap, active TLB entries preserved. Remote TLB flushing cost is so high that it probably moots any secondary effects of TLB preservation. This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, add complex tracking of pfn's with expensive flushing, active TLB entries preserved. Cost of the more complex flushing are probably higher than the refill cost, based on the numbers I gave. My proposal: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap that flushes everything. TLB entries not preserved but this is expected to be more than offset by the reduction in remote flushing costs and the simplicity of the flushing scheme. It can still be complicated to your proposed pfn tracking scheme, based on numbers. Btw., have you measured the full TLB flush variant as well? If so, mind sharing the numbers? Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932909AbbFINFu (ORCPT ); Tue, 9 Jun 2015 09:05:50 -0400 Received: from cantor2.suse.de ([195.135.220.15]:57209 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753290AbbFINFn (ORCPT ); Tue, 9 Jun 2015 09:05:43 -0400 Date: Tue, 9 Jun 2015 14:05:36 +0100 From: Mel Gorman To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609130536.GT26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609124328.GA23066@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 02:43:28PM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > Sorry, I don't buy this, at all. > > > > > > Please measure this, the code would become a lot simpler, as I'm not convinced > > > that we need pfn (or struct page) or even range based flushing. > > > > The code will be simplier and the cost of reclaim will be lower and that is the > > direct case but shows nothing about the indirect cost. The mapped reader will > > benefit as it is not reusing the TLB entries and will look artifically very > > good. It'll be very difficult for even experienced users to determine that a > > slowdown during kswapd activity is due to increased TLB misses incurred by the > > full flush. > > If so then the converse is true just as much: if you were to introduce finegrained > flushing today, you couldn't justify it because you claim it's very hard to > measure! > I'm claiming the *INDIRECT COST* is impossible to measure as part of this series because it depends on the workload and exact CPU used. The direct cost is measurable and can be quantified. > Really, in such cases we should IMHO fall back to the simplest approach, and > iterate from there. > > I cited very real numbers about the direct costs of TLB flushes, and plausible > speculation about why the indirect costs are low on the achitecture you are trying > to modify here. > > I think since it is you who wants to introduce additional complexity into the x86 > MM code the burden is on you to provide proof that the complexity of pfn (or > struct page) tracking is worth it. > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms and replacing it with something that is a lot more efficient that minimises the number of potential surprises. I'm stating that the benefit of PFN tracking is unknowable in the general case because it depends on the workload, timing and the exact CPU used so any example provided can be naked with a counter-example such as a trivial sequential reader that shows no benefit. The series as posted is approximately in line with current behaviour minimising the chances of surprise regressions from excessive TLB flush. You are actively blocking a measurable improvement and forcing it to be replaced with something whose full impact is unquantifiable. Any regressions in this area due to increased TLB misses could take several kernel releases as the issue will be so difficult to detect. I'm going to implement the approach you are forcing because there is an x86 part of the patch and you are the maintainer that could indefinitely NAK it. However, I'm extremely pissed about being forced to introduce these indirect unpredictable costs because I know the alternative is you dragging this out for weeks with no satisfactory conclusion in an argument that I cannot prove in the general case. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932364AbbFIQHQ (ORCPT ); Tue, 9 Jun 2015 12:07:16 -0400 Received: from mga01.intel.com ([192.55.52.88]:34950 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751480AbbFIQHL (ORCPT ); Tue, 9 Jun 2015 12:07:11 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,581,1427785200"; d="scan'208";a="743689981" Message-ID: <5577078B.2000503@intel.com> Date: Tue, 09 Jun 2015 08:34:35 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , Mel Gorman CC: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> In-Reply-To: <20150609124328.GA23066@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/09/2015 05:43 AM, Ingo Molnar wrote: > I cited very real numbers about the direct costs of TLB flushes, and plausible > speculation about why the indirect costs are low on the achitecture you are trying > to modify here. We should be careful to extrapolate what the real-world cost of a TLB flush is from the cost of running a kernel function in a loop. Let's look at what got measured: > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > +static void fn_flush_tlb_one(void) > +{ > + unsigned long addr = (unsigned long)&tlb_flush_target; > + > + tlb_flush_target[0]++; > + __flush_tlb_one(addr); > +} So we've got an increment of a variable in kernel memory (which is almost surely in the L1), then we flush that memory location, and repeat the increment. I assume the increment is so that the __flush_tlb_one() has some "real" work to do and is not just flushing an address which is not in the TLB. This is almost certainly a departure from workloads like Mel is addressing where we (try to) flush pages used long ago that will hopefully *not* be in the TLB. But, that unfortunately means that we're measuring a TLB _miss_ here in addition to the flush. A TLB miss shouldn't be *that* expensive, right? The SDM says: "INVLPG also invalidates all entries in all paging-structure caches ... regardless of the linear addresses to which they correspond." Ugh, so the TLB refill has to also refill the paging structure caches. At least the page tables will be in the L1. Since "tlb_flush_target" is in kernel mapping, you might also be shooting down the TLB entry for kernel text, or who knows what else. The TLB entry might be 1G or 2M which might never be in the STLB (second-level TLB), which could have *VERY* different behavior than a 4k flush or a flush of an entry in the first-level TLB. I'm not sure that these loop-style tests are particularly valuable, but if we're going to do it, I think we should consider: 1. We need to separate the TLB fill portion from the flush and not measure any part of a fill along with the flush 2. We should measure flushing of ascending, adjacent virtual addresses mapped with 4k pages since that is the normal case. Perhaps vmalloc(16MB) or something. 3. We should flush a mix of virtual addresses that are in and out of the TLB. 4. To measure instruction (as opposed to instruction+software) overhead, use __flush_tlb_single(), not __flush_tlb_one() P.S. I think I'm responsible for it, but we should probably also move the count_vm_tlb_event() to outside the loop in flush_tlb_mm_range(). invlpg is not a "normal" instruction and could potentially increase the overhead of incrementing the counter. But, I guess the kernel mappings _should_ stay in the TLB over an invlpg and shouldn't pay any cost to be refilled in to the TLB despite the paging-structure caches going away. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752895AbbFIQtg (ORCPT ); Tue, 9 Jun 2015 12:49:36 -0400 Received: from mga03.intel.com ([134.134.136.65]:48818 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932708AbbFIQtO (ORCPT ); Tue, 9 Jun 2015 12:49:14 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,582,1427785200"; d="scan'208";a="708027242" Message-ID: <55771909.2020005@intel.com> Date: Tue, 09 Jun 2015 09:49:13 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , Mel Gorman CC: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> In-Reply-To: <5577078B.2000503@intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/09/2015 08:34 AM, Dave Hansen wrote: > 2. We should measure flushing of ascending, adjacent virtual addresses > mapped with 4k pages since that is the normal case. Perhaps > vmalloc(16MB) or something. Now that I think about this a bit more, we really have two different patterns here: 1. flush_tlb_mm_range() style, which is contiguous virtual address flushing from an unmap-style operation. 2. Mel's new try_to_unmap_one() style, which are going to be relatively random virtual addresses. So, the "ascending adjacent" case is still interesting, but it wouldn't cover Mel's new case. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932663AbbFIVPH (ORCPT ); Tue, 9 Jun 2015 17:15:07 -0400 Received: from mga14.intel.com ([192.55.52.115]:48589 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753697AbbFIVO7 (ORCPT ); Tue, 9 Jun 2015 17:14:59 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,583,1427785200"; d="scan'208";a="584975152" Message-ID: <55775749.3090004@intel.com> Date: Tue, 09 Jun 2015 14:14:49 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , Mel Gorman CC: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> In-Reply-To: <55771909.2020005@intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I did some of what I talked about earlier in the thread. I think the sfence (via mb()) is potentially unfair since it removes some of the CPU's ability to optimize things. For this kind of test, any ability that the CPU has to smear the overhead around is a bonus in practice and should be taken in to account for these tests. Here's the horribly hacked-together patch so you can see precisely what's going on: https://www.sr71.net/~dave/intel/measure-tlb-stuff.patch Here's a Haswell Xeon: > [ 0.222090] x86/fpu:######## MM instructions: ############################ > [ 0.222168] x86/fpu: Cost of: __flush_tlb() fn : 124 cycles avg: 125 > [ 0.222623] x86/fpu: Cost of: __flush_tlb_global() fn : 960 cycles avg: 968 > [ 0.222744] x86/fpu: Cost of: __flush_tlb_single() fn : 216 cycles avg: 216 > [ 0.222864] x86/fpu: Cost of: __flush_tlb_single() vmal fn : 216 cycles avg: 219 > [ 0.222987] x86/fpu: Cost of: __flush_tlb_one() OLD fn : 216 cycles avg: 216 > [ 0.223139] x86/fpu: Cost of: __flush_tlb_range() fn : 284 cycles avg: 287 > [ 0.223272] x86/fpu: Cost of: tlb miss fn : 0 cycles avg: 0 And a Westmere Xeon: > [ 1.057770] x86/fpu:######## MM instructions: ############################ > [ 1.065876] x86/fpu: Cost of: __flush_tlb() fn : 108 cycles avg: 109 > [ 1.075188] x86/fpu: Cost of: __flush_tlb_global() fn : 828 cycles avg: 829 > [ 1.084162] x86/fpu: Cost of: __flush_tlb_single() fn : 232 cycles avg: 237 > [ 1.093175] x86/fpu: Cost of: __flush_tlb_single() vmal fn : 240 cycles avg: 240 > [ 1.102214] x86/fpu: Cost of: __flush_tlb_one() OLD fn : 284 cycles avg: 286 > [ 1.111299] x86/fpu: Cost of: __flush_tlb_range() fn : 472 cycles avg: 478 > [ 1.120281] x86/fpu: Cost of: tlb miss fn : 0 cycles avg: 0 I was rather surprised how close the three __flush_tlb_single/one() variants were on Haswell. I've looked at a few other CPUs and this was the only one that acted like this. The 0 cycle TLB miss was also interesting. It goes back up to something reasonable if I put the mb()/mfence's back. I don't think this kind of thing is a realistic test unless we put mfence's around all of our TLB flushes in practice. :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754045AbbFIVyK (ORCPT ); Tue, 9 Jun 2015 17:54:10 -0400 Received: from mail-ie0-f177.google.com ([209.85.223.177]:36428 "EHLO mail-ie0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753859AbbFIVyC (ORCPT ); Tue, 9 Jun 2015 17:54:02 -0400 MIME-Version: 1.0 In-Reply-To: <55775749.3090004@intel.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> Date: Tue, 9 Jun 2015 14:54:01 -0700 X-Google-Sender-Auth: vLpijtzi0SoCbf0CqJtPnwTTV3w Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds To: Dave Hansen Cc: Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > The 0 cycle TLB miss was also interesting. It goes back up to something > reasonable if I put the mb()/mfence's back. So I've said it before, and I'll say it again: Intel does really well on TLB fills. The reason is partly historical, with Win95 doing a ton of TLB invalidation (I think every single GDI call ended up invalidating the TLB, so under some important Windows benchmarks of the time, you literally had a TLB flush every 10k instructions!). But partly it is because people are wrong in thinking that TLB fills have to be slow. There's a lot of complete garbage RISC machines where the TLB fill took forever, because in the name of simplicity it would stop the pipeline and often be done in SW. The zero-cycle TLB fill is obviously a bit optimistic, but at the same time it's not completely insane. TLB fills can be prefetched, and the table walker can hit the cache, if you do them right. And Intel mostly does. So the normal full (non-global) TLB fill really is fairly cheap. It's been optimized for, and the TLB gets re-filled fairly efficiently. I suspect that it's really the case that doing more than just a couple of single-tlb flushes is a complete waste of time: the flushing takes longer than re-filling the TLB well. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932294AbbFIWcP (ORCPT ); Tue, 9 Jun 2015 18:32:15 -0400 Received: from cantor2.suse.de ([195.135.220.15]:35380 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752039AbbFIWcK (ORCPT ); Tue, 9 Jun 2015 18:32:10 -0400 Date: Tue, 9 Jun 2015 23:32:03 +0100 From: Mel Gorman To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609223203.GW26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > reasonable if I put the mb()/mfence's back. > > So I've said it before, and I'll say it again: Intel does really well > on TLB fills. > > The reason is partly historical, with Win95 doing a ton of TLB > invalidation (I think every single GDI call ended up invalidating the > TLB, so under some important Windows benchmarks of the time, you > literally had a TLB flush every 10k instructions!). > > But partly it is because people are wrong in thinking that TLB fills > have to be slow. There's a lot of complete garbage RISC machines where > the TLB fill took forever, because in the name of simplicity it would > stop the pipeline and often be done in SW. > > The zero-cycle TLB fill is obviously a bit optimistic, but at the same > time it's not completely insane. TLB fills can be prefetched, and the > table walker can hit the cache, if you do them right. And Intel mostly > does. > > So the normal full (non-global) TLB fill really is fairly cheap. It's > been optimized for, and the TLB gets re-filled fairly efficiently. I > suspect that it's really the case that doing more than just a couple > of single-tlb flushes is a complete waste of time: the flushing takes > longer than re-filling the TLB well. > I expect I'll do another revision of the series after 4.2-rc1 as it's way too close to 4.1's release. When that happens, I'll drop patch 4 and leave just the full non-global flush patch. In the event there is an architecture that really cares about the refill cost or we find that there is a corner case where the TLB refill hurts then patch 4 will be in the mail archives to consider for rebase and testing. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753896AbbFIWfU (ORCPT ); Tue, 9 Jun 2015 18:35:20 -0400 Received: from cantor2.suse.de ([195.135.220.15]:35461 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752669AbbFIWfP (ORCPT ); Tue, 9 Jun 2015 18:35:15 -0400 Date: Tue, 9 Jun 2015 23:35:09 +0100 From: Mel Gorman To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150609223509.GX26425@suse.de> References: <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150609223203.GW26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150609223203.GW26425@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 11:32:03PM +0100, Mel Gorman wrote: > On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > > reasonable if I put the mb()/mfence's back. > > > > So I've said it before, and I'll say it again: Intel does really well > > on TLB fills. > > > > The reason is partly historical, with Win95 doing a ton of TLB > > invalidation (I think every single GDI call ended up invalidating the > > TLB, so under some important Windows benchmarks of the time, you > > literally had a TLB flush every 10k instructions!). > > > > But partly it is because people are wrong in thinking that TLB fills > > have to be slow. There's a lot of complete garbage RISC machines where > > the TLB fill took forever, because in the name of simplicity it would > > stop the pipeline and often be done in SW. > > > > The zero-cycle TLB fill is obviously a bit optimistic, but at the same > > time it's not completely insane. TLB fills can be prefetched, and the > > table walker can hit the cache, if you do them right. And Intel mostly > > does. > > > > So the normal full (non-global) TLB fill really is fairly cheap. It's > > been optimized for, and the TLB gets re-filled fairly efficiently. I > > suspect that it's really the case that doing more than just a couple > > of single-tlb flushes is a complete waste of time: the flushing takes > > longer than re-filling the TLB well. > > > > I expect I'll do another revision of the series after 4.2-rc1 as it's way > too close to 4.1's release. When that happens, I'll drop patch 4 and leave > just the full non-global flush patch. In the event there is an architecture > that really cares about the refill cost or we find that there is a corner > case where the TLB refill hurts then patch 4 will be in the mail archives > to consider for rebase and testing. > To be clear I meant patch 4 from v6 of the series will be dropped. In that series, patch 2 does a full non-global flush and patch 4 does the per-pfn tracking with multiple single page frame flushes. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933348AbbFJIwA (ORCPT ); Wed, 10 Jun 2015 04:52:00 -0400 Received: from mail-wi0-f173.google.com ([209.85.212.173]:38743 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751401AbbFJIvr (ORCPT ); Wed, 10 Jun 2015 04:51:47 -0400 Date: Wed, 10 Jun 2015 10:51:41 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610085141.GA25704@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150609130536.GT26425@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Mel Gorman wrote: > > I think since it is you who wants to introduce additional complexity into the > > x86 MM code the burden is on you to provide proof that the complexity of pfn > > (or struct page) tracking is worth it. > > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms > and replacing it with something that is a lot more efficient that minimises the > number of potential surprises. I'm stating that the benefit of PFN tracking is > unknowable in the general case because it depends on the workload, timing and > the exact CPU used so any example provided can be naked with a counter-example > such as a trivial sequential reader that shows no benefit. The series as posted > is approximately in line with current behaviour minimising the chances of > surprise regressions from excessive TLB flush. > > You are actively blocking a measurable improvement and forcing it to be replaced > with something whose full impact is unquantifiable. Any regressions in this area > due to increased TLB misses could take several kernel releases as the issue will > be so difficult to detect. > > I'm going to implement the approach you are forcing because there is an x86 part > of the patch and you are the maintainer that could indefinitely NAK it. However, > I'm extremely pissed about being forced to introduce these indirect > unpredictable costs because I know the alternative is you dragging this out for > weeks with no satisfactory conclusion in an argument that I cannot prove in the > general case. Stop this crap. I made a really clear and unambiguous chain of arguments: - I'm unconvinced about the benefits of INVLPG in general, and your patches adds a whole new bunch of them. I cited measurements and went out on a limb to explain my position, backed with numbers and logic. It's admittedly still a speculative position and I might be wrong, but I think it's well grounded position that you cannot just brush aside. - I suggested that you split this approach into steps that first does the simpler approach that will give us at least 95% of the benefits, then the more complex one on top of it. Your false claim that I'm blocking a clear improvement is pure demagogy! - I very clearly claimed that I am more than willing to be convinced by numbers. It's not _that_ hard to construct a memory trashing workload with a TLB-efficient iteration that uses say 80% of the TLB cache, to measure the worst-case overhead of full flushes. I'm really sick of this partly deceptive, partly passive-aggressive discussion style that seems to frequently permeate VM discussions and which made sched/numa such a huge PITA in the past... And I think the numbers in the v6 series you submitted today support my position, so you owe me an apology I think ... Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933461AbbFJJI0 (ORCPT ); Wed, 10 Jun 2015 05:08:26 -0400 Received: from mail-wi0-f173.google.com ([209.85.212.173]:34722 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933359AbbFJJIS (ORCPT ); Wed, 10 Jun 2015 05:08:18 -0400 Date: Wed, 10 Jun 2015 11:08:13 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610090813.GA30359@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150610085141.GA25704@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Ingo Molnar wrote: > Stop this crap. > > I made a really clear and unambiguous chain of arguments: > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > a whole new bunch of them. [...] ... and note that your claim that 'we were doing them before, this is just an equivalent transformation' is utter bullsh*t technically: what we were doing previously was a hideously expensive IPI combined with an INVLPG. The behavior was dominated by the huge overhead of the remote flushing IPI, which does not prove or disprove either your or my opinion! Preserving that old INVLPG logic without measuring its benefits _again_ would be cargo cult programming. So I think this should be measured, and I don't mind worst-case TLB trashing measurements, which would be relatively straightforward to construct and the results should be unambiguous. The batching limit (which you set to 32) should then be tuned by comparing it to a working full-flushing batching logic, not by comparing it to the previous single IPI per single flush approach! ... and if the benefits of a complex algorithm are not measurable and if there are doubts about the cost/benefit tradeoff then frankly it should not exist in the kernel in the first place. It's not like the Linux TLB flushing code is too boring due to overwhelming simplicity. and yes, it's my job as a maintainer to request measurements justifying complexity and your ad hominem attacks against me are disgusting - you should know better. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933525AbbFJJUk (ORCPT ); Wed, 10 Jun 2015 05:20:40 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39773 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933391AbbFJJTT (ORCPT ); Wed, 10 Jun 2015 05:19:19 -0400 Date: Wed, 10 Jun 2015 10:19:13 +0100 From: Mel Gorman To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610091913.GC26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150610085141.GA25704@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 10:51:41AM +0200, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > > I think since it is you who wants to introduce additional complexity into the > > > x86 MM code the burden is on you to provide proof that the complexity of pfn > > > (or struct page) tracking is worth it. > > > > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms > > and replacing it with something that is a lot more efficient that minimises the > > number of potential surprises. I'm stating that the benefit of PFN tracking is > > unknowable in the general case because it depends on the workload, timing and > > the exact CPU used so any example provided can be naked with a counter-example > > such as a trivial sequential reader that shows no benefit. The series as posted > > is approximately in line with current behaviour minimising the chances of > > surprise regressions from excessive TLB flush. > > > > You are actively blocking a measurable improvement and forcing it to be replaced > > with something whose full impact is unquantifiable. Any regressions in this area > > due to increased TLB misses could take several kernel releases as the issue will > > be so difficult to detect. > > > > I'm going to implement the approach you are forcing because there is an x86 part > > of the patch and you are the maintainer that could indefinitely NAK it. However, > > I'm extremely pissed about being forced to introduce these indirect > > unpredictable costs because I know the alternative is you dragging this out for > > weeks with no satisfactory conclusion in an argument that I cannot prove in the > > general case. > > Stop this crap. > > I made a really clear and unambiguous chain of arguments: > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > a whole new bunch of them. I cited measurements and went out on a limb to > explain my position, backed with numbers and logic. It's admittedly still a > speculative position and I might be wrong, but I think it's well grounded > position that you cannot just brush aside. > And I explained my concerns with the use of a full flush and the difficulty of measuring its impact in the general case. I also explained why I thought starting with PFN tracking was an incremental approach. The argument looped so I bailed. > - I suggested that you split this approach into steps that first does the simpler > approach that will give us at least 95% of the benefits, then the more complex > one on top of it. Your false claim that I'm blocking a clear improvement is > pure demagogy! > The splitting was already done, released and I followed up saying that I'll be dropping patch 4 in the final merge request. In the event we find in a few kernel releases time that it was required then there will be a realistic bug report to use as a reference. > - I very clearly claimed that I am more than willing to be convinced by numbers. > It's not _that_ hard to construct a memory trashing workload with a > TLB-efficient iteration that uses say 80% of the TLB cache, to measure the > worst-case overhead of full flushes. > And what I had said was that in the general case that it will not show us any proof. Any other workload will always behave differently and two critical components are the exact timing versus kswapd running and the CPU used. Even if it's demonstrated for a single workload on one CPU then we would still be faced with the same problem and dropping patch 4. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933262AbbFJKPn (ORCPT ); Wed, 10 Jun 2015 06:15:43 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47380 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754057AbbFJKPf (ORCPT ); Wed, 10 Jun 2015 06:15:35 -0400 Date: Wed, 10 Jun 2015 11:15:29 +0100 From: Mel Gorman To: Ingo Molnar Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610101529.GE26425@suse.de> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> <20150610090813.GA30359@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150610090813.GA30359@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 11:08:13AM +0200, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > Stop this crap. > > > > I made a really clear and unambiguous chain of arguments: > > > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > > a whole new bunch of them. [...] > > ... and note that your claim that 'we were doing them before, this is just an > equivalent transformation' is utter bullsh*t technically: what we were doing > previously was a hideously expensive IPI combined with an INVLPG. > And replacing it with an INVLPG without excessive IPI transmission is changing one major variable. Going straight to a full TLB flush is changing two major variables. I thought the refill cost was high, parially based on the estimate of 22,000 cycles in https://lkml.org/lkml/2014/7/31/825. I've been told in these discussions that I'm wrong and the cost is not high. As it'll always be variable, we can never be sure which is why I do not see a value to building a complex test around it that will be invalidated the instant we use a different CPU. When/if a workload shows up that really cares about those refill costs then there will be a stable test case to work from. > The behavior was dominated by the huge overhead of the remote flushing IPI, which > does not prove or disprove either your or my opinion! > > Preserving that old INVLPG logic without measuring its benefits _again_ would be > cargo cult programming. > > So I think this should be measured, and I don't mind worst-case TLB trashing > measurements, which would be relatively straightforward to construct and the > results should be unambiguous. > > The batching limit (which you set to 32) should then be tuned by comparing it to a > working full-flushing batching logic, not by comparing it to the previous single > IPI per single flush approach! > We can decrease it easily but increasing it means we also have to change SWAP_CLUSTER_MAX because otherwise enough pages are not unmapped for flushes and it is a requirement that we flush before freeing the pages. That changes another complex variable because at the very least, it alters LRU lock hold times. > ... and if the benefits of a complex algorithm are not measurable and if there are > doubts about the cost/benefit tradeoff then frankly it should not exist in the > kernel in the first place. It's not like the Linux TLB flushing code is too boring > due to overwhelming simplicity. > > and yes, it's my job as a maintainer to request measurements justifying complexity > and your ad hominem attacks against me are disgusting - you should know better. > It was not intended as an ad hominem attack and my apologies for that. I wanted to express my frustration that a series that adjusted one variable with known benefit will be rejected for a series that adjusts two major variables instead with the second variable being very sensitive to workload and CPU. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933357AbbFJNOC (ORCPT ); Wed, 10 Jun 2015 09:14:02 -0400 Received: from one.firstfloor.org ([193.170.194.197]:39679 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754116AbbFJNNz (ORCPT ); Wed, 10 Jun 2015 09:13:55 -0400 Date: Wed, 10 Jun 2015 15:13:54 +0200 From: Andi Kleen To: Linus Torvalds Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610131354.GO19417@two.firstfloor.org> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote: > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen wrote: > > > > The 0 cycle TLB miss was also interesting. It goes back up to something > > reasonable if I put the mb()/mfence's back. > > So I've said it before, and I'll say it again: Intel does really well > on TLB fills. Assuming the page tables are cache-hot... And hot here does not mean L3 cache, but higher. But a memory intensive workload can easily violate that. That's why I'm dubious of all these micro benchmarks. They won't be clearing caches. They generate unrealistic conditions in the CPU pipeline and overestimate the cost of the flushes. The only good way to measure TLB costs is macro benchmarks. -Andi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933649AbbFJQR1 (ORCPT ); Wed, 10 Jun 2015 12:17:27 -0400 Received: from mail-ig0-f179.google.com ([209.85.213.179]:32857 "EHLO mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933090AbbFJQRQ (ORCPT ); Wed, 10 Jun 2015 12:17:16 -0400 MIME-Version: 1.0 In-Reply-To: <20150610131354.GO19417@two.firstfloor.org> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 09:17:15 -0700 X-Google-Sender-Auth: rE89t5OHmKtok3fo52j1pga3adw Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds To: Andi Kleen Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen wrote: > > Assuming the page tables are cache-hot... And hot here does not mean > L3 cache, but higher. But a memory intensive workload can easily > violate that. In practice, no. You'll spend all your time on the actual real data cache misses, the TLB misses won't be any more noticeable. And if your access patters are even *remoptely* cache-friendly (ie _not_ spending all your time just waiting for regular data cache misses), then a radix-tree-like page table like Intel will have much better locality in the page tables than in the actual data. So again, the TLB misses won't be your big problem. There may be pathological cases where you just look at one word per page, but let's face it, we don't optimize for pathological or unrealistic cases. And the thing is, you need to look at the costs. Single-page invalidation taking hundreds of cycles? Yeah, we definitely need to take the downside of trying to be clever into account. If the invalidation was really cheap, the rules might change. As it is, I really don't think there is any question about this. That's particularly true when the single-page invalidation approach has lots of *software* overhead too - not just the complexity, but even "obvious" costs feeding the list of pages to be invalidated across CPU's. Think about it - there are cache misses there too, and because we do those across CPU's those cache misses are *mandatory*. So trying to avoid a few TLB misses by forcing mandatory cache misses and extra complexity, and by doing lots of 200+ cycle operations? Really? In what universe does that sound like a good idea? Quite frankly, I can pretty much *guarantee* that you didn't actually think about any real numbers, you've just been taught that fairy-tale of "TLB misses are expensive". As if TLB entries were somehow sacred. If somebody can show real numbers on a real workload, that's one thing. But in the absence of those real numbers, we should go for the simple and straightforward code, and not try to make up "but but but it could be expensive" issues. Trying to avoid IPI's by batching them up sounds like a very obvious win. I absolutely believe that is worth it. I just don't believe it is worth it trying to pass a lot of page detail data around. The number of individual pages flushed should be *very* controlled. Here's a suggestion: the absolute maximum number of TLB entries we flush should be something we can describe in one single cacheline. Strive to minimize the number of cachelines we have to transfer for the IPI, instead of trying to minimize the number of TLB entries. So maybe soemthing along the lines of "one word of flags (all or nothing, or a range), or up to seven indivual addresses to be flushed". I personally think even seven invlpg's is likely too much, but with the "single cacheline" at least there's another basis for the limit too. So anyway, I like the patch series. I just think that the final patch - the one that actually saves the addreses, and limits things to BATCH_TLBFLUSH_SIZE, should be limited. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933528AbbFJQmp (ORCPT ); Wed, 10 Jun 2015 12:42:45 -0400 Received: from mail-ie0-f181.google.com ([209.85.223.181]:35643 "EHLO mail-ie0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754181AbbFJQmf (ORCPT ); Wed, 10 Jun 2015 12:42:35 -0400 MIME-Version: 1.0 In-Reply-To: References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 09:42:34 -0700 X-Google-Sender-Auth: k28vDeBSUvL16boc8u3KyqP7GhY Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds To: Andi Kleen Cc: Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds wrote: > > So anyway, I like the patch series. I just think that the final patch > - the one that actually saves the addreses, and limits things to > BATCH_TLBFLUSH_SIZE, should be limited. Oh, and another thing: Mel, can you please make that "struct tlbflush_unmap_batch" be just part of "struct task_struct" rather than a pointer? If you are worried about the cpumask size, you could use cpumask_var_t cpumask; and alloc_cpumask_var(..) ... free_cpumask_var(..) for that. That way, sane configurations never have the allocation cost. (Of course, sad to say, sane configurations are probably few and far between. At least Fedora seems to ship with a kernel where NR_CPU's is 1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU cycles that is..) Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934081AbbFJRHN (ORCPT ); Wed, 10 Jun 2015 13:07:13 -0400 Received: from cantor2.suse.de ([195.135.220.15]:60253 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754284AbbFJRHG (ORCPT ); Wed, 10 Jun 2015 13:07:06 -0400 Date: Wed, 10 Jun 2015 18:07:00 +0100 From: Mel Gorman To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610170700.GG26425@suse.de> References: <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 09:17:15AM -0700, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen wrote: > > > > Assuming the page tables are cache-hot... And hot here does not mean > > L3 cache, but higher. But a memory intensive workload can easily > > violate that. > > In practice, no. > > You'll spend all your time on the actual real data cache misses, the > TLB misses won't be any more noticeable. > > And if your access patters are even *remoptely* cache-friendly (ie > _not_ spending all your time just waiting for regular data cache > misses), then a radix-tree-like page table like Intel will have much > better locality in the page tables than in the actual data. So again, > the TLB misses won't be your big problem. > > There may be pathological cases where you just look at one word per > page, but let's face it, we don't optimize for pathological or > unrealistic cases. > It's concerns like this that have me avoiding any micro-benchmarking approach that tried to measure the indirect costs of refills. No matter what the microbenchmark does, there will be other cases that render it irrelevant. > And the thing is, you need to look at the costs. Single-page > invalidation taking hundreds of cycles? Yeah, we definitely need to > take the downside of trying to be clever into account. > > If the invalidation was really cheap, the rules might change. As it > is, I really don't think there is any question about this. > > That's particularly true when the single-page invalidation approach > has lots of *software* overhead too - not just the complexity, but > even "obvious" costs feeding the list of pages to be invalidated > across CPU's. Think about it - there are cache misses there too, and > because we do those across CPU's those cache misses are *mandatory*. > > So trying to avoid a few TLB misses by forcing mandatory cache misses > and extra complexity, and by doing lots of 200+ cycle operations? > Really? In what universe does that sound like a good idea? > > Quite frankly, I can pretty much *guarantee* that you didn't actually > think about any real numbers, you've just been taught that fairy-tale > of "TLB misses are expensive". As if TLB entries were somehow sacred. > Everyone has been taught that one. Papers I've read from the last two years on TLB implementations or page reclaim management bring this up as a supporting point for whatever they are proposing. It was partially why I kept PFN tracking and also to put much of the cost on the reclaimer and minimise interference on the recipient of the IPI. I still think it was a rational concern but will assume that refills are cheaper than smart invalidations until it can be proven otherwise. > If somebody can show real numbers on a real workload, that's one > thing. The last adjustments made today to the series are at http://git.kernel.org/cgit/linux/kernel/git/mel/linux-balancenuma.git/log/?h=mm-vmscan-lessipi-v7r5 I'll redo it on top of 4.2-rc1 whenever that happens so gets a full round in linux-next. Patch 4 can be revisited if a real workload is found that is not deliberately pathological running on a CPU that matters. The forward port of patch 4 for testing will be trivial. It also separated out the dynamic allocation of the structure so that it can be excluded if deemed to be an unnecessary complication. > So anyway, I like the patch series. I just think that the final patch > - the one that actually saves the addreses, and limits things to > BATCH_TLBFLUSH_SIZE, should be limited. > I see your logic but if it's limited then we send more IPIs and it's all crappy tradeoffs. If a real workload complains, it'll be far easier to work with. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964985AbbFJRZF (ORCPT ); Wed, 10 Jun 2015 13:25:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:32865 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933829AbbFJRZD (ORCPT ); Wed, 10 Jun 2015 13:25:03 -0400 Date: Wed, 10 Jun 2015 18:24:57 +0100 From: Mel Gorman To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150610172457.GH26425@suse.de> References: <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 09:42:34AM -0700, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds > wrote: > > > > So anyway, I like the patch series. I just think that the final patch > > - the one that actually saves the addreses, and limits things to > > BATCH_TLBFLUSH_SIZE, should be limited. > > Oh, and another thing: > > Mel, can you please make that "struct tlbflush_unmap_batch" be just > part of "struct task_struct" rather than a pointer? > Yes, that was done earlier today based on Ingo's review so that the allocation could be dealt with as a separate path at the end of the series. > If you are worried about the cpumask size, you could use > > cpumask_var_t cpumask; > > and > > alloc_cpumask_var(..) > ... > free_cpumask_var(..) > > for that. > > That way, sane configurations never have the allocation cost. > Ok, good point. Patch 3 in my git tree ("mm: Dynamically allocate TLB batch unmap control structure") does not do this but I'll look into doing it before the release based on 4.2-rc1. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754543AbbFJRbI (ORCPT ); Wed, 10 Jun 2015 13:31:08 -0400 Received: from mail-ie0-f181.google.com ([209.85.223.181]:34259 "EHLO mail-ie0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754374AbbFJRbD (ORCPT ); Wed, 10 Jun 2015 13:31:03 -0400 MIME-Version: 1.0 In-Reply-To: <20150610172457.GH26425@suse.de> References: <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> <20150610172457.GH26425@suse.de> Date: Wed, 10 Jun 2015 10:31:01 -0700 X-Google-Sender-Auth: dGCo7j293gX9Jf1rtH65TVWuGe0 Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds To: Mel Gorman Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 10:24 AM, Mel Gorman wrote: > > Yes, that was done earlier today based on Ingo's review so that the > allocation could be dealt with as a separate path at the end of the series. Ahh, ok, never mind then. > Ok, good point. Patch 3 in my git tree ("mm: Dynamically allocate TLB > batch unmap control structure") does not do this but I'll look into doing > it before the release based on 4.2-rc1. I'm not sure how size-sensitive this is. The 'struct task_struct' is pretty big already, and if somebody builds a MAXSMP kernel, I really don't think they worry too much about wasting a few bytes for each process. Clearly they either are insane, or they actually *have* a big machine, in which point the allocation is not going to be wasteful and they'll likely trigger this code all the time anyway, so trying to be any more dynamic about it is probably not worth it. So my "maybe you could use 'cpumask_var_t' here" suggestion isn't necessarily even worth it. Just statically allocating it is probably perfectly fine. Of course, again the counter-example for that may well be how distros seem to make thousand-cpu configurations the default. That does seem insane to me. But I guess the pain of multiple kernel configs is just too much. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754309AbbFJSIW (ORCPT ); Wed, 10 Jun 2015 14:08:22 -0400 Received: from mail-ie0-f172.google.com ([209.85.223.172]:34878 "EHLO mail-ie0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751445AbbFJSIP (ORCPT ); Wed, 10 Jun 2015 14:08:15 -0400 MIME-Version: 1.0 In-Reply-To: References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <55771909.2020005@intel.com> <55775749.3090004@intel.com> <20150610131354.GO19417@two.firstfloor.org> Date: Wed, 10 Jun 2015 14:08:14 -0400 X-Google-Sender-Auth: WLoaWaAsGhwD1xbhFSANtrHkMZg Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Josh Boyer To: Linus Torvalds Cc: Andi Kleen , Dave Hansen , Ingo Molnar , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , H Peter Anvin , Linux-MM , LKML , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 10, 2015 at 12:42 PM, Linus Torvalds wrote: > On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds > wrote: >> >> So anyway, I like the patch series. I just think that the final patch >> - the one that actually saves the addreses, and limits things to >> BATCH_TLBFLUSH_SIZE, should be limited. > > Oh, and another thing: > > Mel, can you please make that "struct tlbflush_unmap_batch" be just > part of "struct task_struct" rather than a pointer? > > If you are worried about the cpumask size, you could use > > cpumask_var_t cpumask; > > and > > alloc_cpumask_var(..) > ... > free_cpumask_var(..) > > for that. > > That way, sane configurations never have the allocation cost. > > (Of course, sad to say, sane configurations are probably few and far > between. At least Fedora seems to ship with a kernel where NR_CPU's is > 1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU > cycles that is..) The insane part being NR_CPUS = 1024? Or that to have said number requires cpumask being dynamically allocated to avoid stack overflow? (Or both I guess). josh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752878AbbFKP0J (ORCPT ); Thu, 11 Jun 2015 11:26:09 -0400 Received: from mail-wi0-f182.google.com ([209.85.212.182]:34786 "EHLO mail-wi0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752722AbbFKP0F (ORCPT ); Thu, 11 Jun 2015 11:26:05 -0400 Date: Thu, 11 Jun 2015 17:26:00 +0200 From: Ingo Molnar To: Mel Gorman Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Dave Hansen , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150611152559.GA15509@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <20150609130536.GT26425@suse.de> <20150610085141.GA25704@gmail.com> <20150610090813.GA30359@gmail.com> <20150610101529.GE26425@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150610101529.GE26425@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Mel Gorman wrote: > > > I made a really clear and unambiguous chain of arguments: > > > > > > - I'm unconvinced about the benefits of INVLPG in general, and your patches adds > > > a whole new bunch of them. [...] > > > > ... and note that your claim that 'we were doing them before, this is just an > > equivalent transformation' is utter bullsh*t technically: what we were doing > > previously was a hideously expensive IPI combined with an INVLPG. > > And replacing it with an INVLPG without excessive IPI transmission is changing > one major variable. Going straight to a full TLB flush is changing two major > variables. [...] But this argument employs the fallacy that the transition to 'batching with PFN tracking' is not a major variable in itself. In reality it is a major variable: it adds extra complexity, such as the cross CPU data flow (the pfn tracking), and it also changes the distribution of the flushes and related caching patterns. > > [...] > > > > The batching limit (which you set to 32) should then be tuned by comparing it > > to a working full-flushing batching logic, not by comparing it to the previous > > single IPI per single flush approach! > > We can decrease it easily but increasing it means we also have to change > SWAP_CLUSTER_MAX because otherwise enough pages are not unmapped for flushes and > it is a requirement that we flush before freeing the pages. That changes another > complex variable because at the very least, it alters LRU lock hold times. ('should then be' implied 'as a separate patch/series', obviously.) I.e. all I wanted to observe is that I think the series did not explore the performance impact of the batching limit, because it was too focused on the INVLPG approach which has an internal API limit of 33. Now that the TLB flushing side is essentially limit-less, a future enhancement would be to further increase batching. My suspicion is that say doubling SWAP_CLUSTER_MAX would possibly further reduce the IPI rate, at a negligible cost. But this observation does not impact the current series in any case. > > ... and if the benefits of a complex algorithm are not measurable and if there > > are doubts about the cost/benefit tradeoff then frankly it should not exist in > > the kernel in the first place. It's not like the Linux TLB flushing code is > > too boring due to overwhelming simplicity. > > > > and yes, it's my job as a maintainer to request measurements justifying > > complexity and your ad hominem attacks against me are disgusting - you should > > know better. > > It was not intended as an ad hominem attack and my apologies for that. I wanted > to express my frustration that a series that adjusted one variable with known > benefit will be rejected for a series that adjusts two major variables instead > with the second variable being very sensitive to workload and CPU. ... but that's not what it did: it adjusted multiple complex variables already, with a questionable rationale for more complexity. And my argument was and continues to be: start with the simplest variant and iterate from there. Which you seem to have adapted in your latest series, so my concerns are addressed: Acked-by: Ingo Molnar Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753560AbbFUUXX (ORCPT ); Sun, 21 Jun 2015 16:23:23 -0400 Received: from mta-out1.inet.fi ([62.71.2.229]:42322 "EHLO johanna1.rokki.sonera.fi" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752277AbbFUUXP (ORCPT ); Sun, 21 Jun 2015 16:23:15 -0400 RazorGate-KAS: Rate: 5 RazorGate-KAS: {RECEIVED: dynamic ip detected} RazorGate-KAS: Envelope from: RazorGate-KAS: Version: 5.5.3 RazorGate-KAS: LuaCore: 80 2014-11-10_18-01-23 260f8afb9361da3c7edfd3a8e3a4ca908191ad29 RazorGate-KAS: Method: none RazorGate-KAS: Lua profiles 69136 [Nov 12 2014] RazorGate-KAS: Status: not_detected Date: Sun, 21 Jun 2015 23:22:31 +0300 From: "Kirill A. Shutemov" To: Dave Hansen , Ingo Molnar Cc: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150621202231.GB6766@node.dhcp.inet.fi> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5577078B.2000503@intel.com> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote: > On 06/09/2015 05:43 AM, Ingo Molnar wrote: > > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > > +static void fn_flush_tlb_one(void) > > +{ > > + unsigned long addr = (unsigned long)&tlb_flush_target; > > + > > + tlb_flush_target[0]++; > > + __flush_tlb_one(addr); > > +} > > So we've got an increment of a variable in kernel memory (which is > almost surely in the L1), then we flush that memory location, and repeat > the increment. BTW, Ingo, have you disabled direct mapping of kernel memory with 2M/1G pages for the test? I'm just thinking if there is chance that the test shooting out 1G tlb entry. In this case we're measure wrong thing. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751247AbbFYLsd (ORCPT ); Thu, 25 Jun 2015 07:48:33 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:35605 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751019AbbFYLs0 (ORCPT ); Thu, 25 Jun 2015 07:48:26 -0400 Date: Thu, 25 Jun 2015 13:48:20 +0200 From: Ingo Molnar To: "Kirill A. Shutemov" Cc: Dave Hansen , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150625114819.GA20478@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150621202231.GB6766@node.dhcp.inet.fi> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Kirill A. Shutemov wrote: > On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote: > > On 06/09/2015 05:43 AM, Ingo Molnar wrote: > > > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096); > > > +static void fn_flush_tlb_one(void) > > > +{ > > > + unsigned long addr = (unsigned long)&tlb_flush_target; > > > + > > > + tlb_flush_target[0]++; > > > + __flush_tlb_one(addr); > > > +} > > > > So we've got an increment of a variable in kernel memory (which is > > almost surely in the L1), then we flush that memory location, and repeat > > the increment. > > BTW, Ingo, have you disabled direct mapping of kernel memory with 2M/1G > pages for the test? No, the kernel was using gbpages. > I'm just thinking if there is chance that the test shooting out 1G tlb entry. In > this case we're measure wrong thing. Yeah, you are right, it's slightly wrong, but still gave us the right numbers to work with. I've updated the benchmarks with 4K flushes as well. Changes to the previous measurement: - I've refined the testing methodology. (It now looks at the true distribution of delays, not just the minimum or an average: this is especially important for cache-cold numbers were perturbations (such as irq or SMI events) can generate outliers in both directions.) - Added tests to attempt to measure the TLB miss costs, I've split the INVLPG measurements into over a dozen separate variants, which should help measure the overall impact of TLB flushes and should help quantify the 'hidden' cost cost associated with TLB flushes as well. Here are the numbers: Old 64-bit CPU (AMD) using 2MB/4K pages: [ 8.122002] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ 12.269003] x86/bench:######## MM methods: ####################################### [ 12.369003] x86/bench: __flush_tlb() fn : 127 / 273 [ 12.469003] x86/bench: __flush_tlb_global() fn : 589 / 1501 [ 12.568003] x86/bench: access 2M page fn : 0 / 440 [ 12.668003] x86/bench: invlpg 2M page fn : 93 / 512 [ 12.768003] x86/bench: access+invlpg 2M page fn : 107 / 512 [ 12.868003] x86/bench: invlpg+access 2M page fn : 94 / 535 [ 12.972003] x86/bench: flush+access 2M page fn : 113 / 1060 [ 13.070003] x86/bench: access 1x 4K page fn : 0 / 5 [ 13.166003] x86/bench: access 2x 4K page fn : 1 / 150 [ 13.264003] x86/bench: access 3x 4K page fn : 1 / 331 [ 13.360003] x86/bench: access 4x 4K page fn : 1 / 360 [ 13.463003] x86/bench: invlpg+access 1x 4K page fn : 95 / 247 [ 13.566003] x86/bench: invlpg+access 2x 4K page fn : 188 / 543 [ 13.668003] x86/bench: invlpg+access 3x 4K page fn : 281 / 836 [ 13.774003] x86/bench: invlpg+access 4x 4K page fn : 374 / 1165 [ 13.878003] x86/bench: flush+access 1x 4K page fn : 114 / 275 [ 13.981003] x86/bench: flush+access 2x 4K page fn : 115 / 480 [ 14.083003] x86/bench: flush+access 3x 4K page fn : 115 / 671 [ 14.183003] x86/bench: flush+access 4x 4K page fn : 116 / 670 [ 14.286003] x86/bench: flush 4K page fn : 108 / 247 [ 14.385003] x86/bench: access+invlpg 4K page fn : 94 / 287 [ 14.502003] x86/bench: __flush_tlb_range() fn : 280 / 5173 Newer 64-bit CPU (Intel) using gbpages/4K pages: [ ] x86/bench: Running x86 benchmarks: cache- hot / cold cycles [ ] x86/bench:######## MM methods: ####################################### [ 124.362251] x86/bench: __flush_tlb() fn : 152 / 168 [ 125.287283] x86/bench: __flush_tlb_global() fn : 936 / 1312 [ 126.227641] x86/bench: access GB page fn : 12 / 304 [ 127.152369] x86/bench: invlpg GB page fn : 280 / 584 [ 128.084376] x86/bench: access+invlpg GB page fn : 280 / 592 [ 129.018430] x86/bench: invlpg+access GB page fn : 280 / 596 [ 129.939640] x86/bench: flush+access GB page fn : 152 / 396 [ 130.849827] x86/bench: access 1x 4K page fn : 0 / 0 [ 131.796836] x86/bench: access 2x 4K page fn : 12 / 432 [ 132.738849] x86/bench: access 3x 4K page fn : 16 / 736 [ 133.661315] x86/bench: access 4x 4K page fn : 12 / 736 [ 134.575238] x86/bench: invlpg+access 1x 4K page fn : 280 / 356 [ 135.506638] x86/bench: invlpg+access 2x 4K page fn : 500 / 1152 [ 136.432645] x86/bench: invlpg+access 3x 4K page fn : 728 / 1920 [ 137.355536] x86/bench: invlpg+access 4x 4K page fn : 956 / 3084 [ 138.296174] x86/bench: flush+access 1x 4K page fn : 148 / 280 [ 139.229264] x86/bench: flush+access 2x 4K page fn : 152 / 728 [ 140.165103] x86/bench: flush+access 3x 4K page fn : 136 / 736 [ 141.088691] x86/bench: flush+access 4x 4K page fn : 152 / 820 [ 141.994946] x86/bench: flush 4K page fn : 280 / 328 [ 142.902499] x86/bench: access+invlpg 4K page fn : 280 / 352 [ 143.814408] x86/bench: __flush_tlb_range() fn : 424 / 2100 Notes: - 'access' means an ACCESS_ONCE() load. - 'flush' means a CR3 based full flush - 'invlpg' means an __flush_tlb_one() INVLPG flush - the 4K pages are vmalloc()-ed pages. - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the first byte in each - PGE is turned off in the CR4 for these measuremnts, to make sure the vmalloc()-ed page(s) get flushed by the CR3 method as well. - there's a bit of a jitter in the numbers, for example the 'flush+access 3x 4K' number is lower - but the jitter is in the 10 cycles range max As you can see (assuming the numbers are correct), the 'full' CR3 based flushing is superior to any __flush_tlb_one() method: [ 134.575238] x86/bench: invlpg+access 1x 4K page fn : 280 / 356 [ 135.506638] x86/bench: invlpg+access 2x 4K page fn : 500 / 1152 [ 136.432645] x86/bench: invlpg+access 3x 4K page fn : 728 / 1920 [ 137.355536] x86/bench: invlpg+access 4x 4K page fn : 956 / 3084 [ 138.296174] x86/bench: flush+access 1x 4K page fn : 148 / 280 [ 139.229264] x86/bench: flush+access 2x 4K page fn : 152 / 728 [ 140.165103] x86/bench: flush+access 3x 4K page fn : 136 / 736 [ 141.088691] x86/bench: flush+access 4x 4K page fn : 152 / 820 In the cache-hot case: with the CR3 based method the _full_ cost of flushing (i.e. the cost of the flush plus the cost of the access) is roughly constant. With INVLPG it's both higher, and increases linearly. In the cache-cold case: the costs are increasing linearly, but it's much higher in the INVLPG case. So in fact I'd argue that, somewhat counter-intuitively, we should consider doing a full flush even for single-page flushes, for example for page faults... I'll do more measurements (and will publish the latest patches) to increase confidence in the numbers. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752402AbbFYSqq (ORCPT ); Thu, 25 Jun 2015 14:46:46 -0400 Received: from mga02.intel.com ([134.134.136.20]:63461 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751502AbbFYSqj (ORCPT ); Thu, 25 Jun 2015 14:46:39 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,679,1427785200"; d="scan'208";a="594798129" Message-ID: <558C4C8E.5010107@intel.com> Date: Thu, 25 Jun 2015 11:46:38 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , "Kirill A. Shutemov" CC: Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> In-Reply-To: <20150625114819.GA20478@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/25/2015 04:48 AM, Ingo Molnar wrote: > I've updated the benchmarks with 4K flushes as well. Changes to the previous > measurement: Did you push these out somewhere? The tip tmp.fpu branch hasn't seen any updates. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752004AbbFYTPV (ORCPT ); Thu, 25 Jun 2015 15:15:21 -0400 Received: from cantor2.suse.de ([195.135.220.15]:41357 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751379AbbFYTPR (ORCPT ); Thu, 25 Jun 2015 15:15:17 -0400 Message-ID: <558C5342.9020702@suse.cz> Date: Thu, 25 Jun 2015 21:15:14 +0200 From: Vlastimil Babka User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Linus Torvalds , Ingo Molnar CC: Thomas Gleixner , linux-mm , Borislav Petkov , Hugh Dickins , Andrew Morton , Minchan Kim , Rik van Riel , "Kirill A. Shutemov" , H Peter Anvin , Andi Kleen , Peter Zijlstra , Dave Hansen , Mel Gorman , LKML Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 25.6.2015 20:36, Linus Torvalds wrote: > > On Jun 25, 2015 04:48, "Ingo Molnar" > wrote: >> >> - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the >> first byte in each > > So that test is a bit unfair. From previous timing of Intel TLB fills, I can > tell you that Intel is particularly good at doing adjacent entries. > > That's independent of the fact that page tables have very good locality (if they > are the radix tree type - the hashed page tables that ppc uses are shit). So > when filling adjacent entries, you take the cache misses for the page tables > only once, but even aside from that, Intel send to do particularly well at the > "next page" TLB fill case AFAIK that's because they also cache partial translations, so if the first 3 levels are the same (as they mostly are for the "next page" scenario) it will only have to look at the last level of pages tables. AMD does that too. > Now, I think that's a reasonably common case, and I'm not saying that it's > unfair to compare for that reason, but it does highlight the good case for TLB > walking. > > So I would suggest you highlight the bad case too: use invlpg to invalidate > *one* TLB entry, and then walk four non-adjacent entries. And compare *that* to > the full TLB flush. > > Now, I happen to still believe in the full flush, but let's not pick benchmarks > that might not show the advantages of the finer granularity. > > Linus > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751817AbbFYWEj (ORCPT ); Thu, 25 Jun 2015 18:04:39 -0400 Received: from mail-ie0-f179.google.com ([209.85.223.179]:35443 "EHLO mail-ie0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751201AbbFYWEb (ORCPT ); Thu, 25 Jun 2015 18:04:31 -0400 MIME-Version: 1.0 In-Reply-To: <558C5342.9020702@suse.cz> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> <558C5342.9020702@suse.cz> Date: Thu, 25 Jun 2015 15:04:30 -0700 X-Google-Sender-Auth: 94oLMzQVKdy6OZY9thJk-b6feeU Message-ID: Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 From: Linus Torvalds To: Vlastimil Babka Cc: Ingo Molnar , Thomas Gleixner , linux-mm , Borislav Petkov , Hugh Dickins , Andrew Morton , Minchan Kim , Rik van Riel , "Kirill A. Shutemov" , H Peter Anvin , Andi Kleen , Peter Zijlstra , Dave Hansen , Mel Gorman , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 25, 2015 at 12:15 PM, Vlastimil Babka wrote: >> [..] even aside from that, Intel seems to do particularly well at the >> "next page" TLB fill case > > AFAIK that's because they also cache partial translations, so if the first 3 > levels are the same (as they mostly are for the "next page" scenario) it will > only have to look at the last level of pages tables. AMD does that too. In the tests I did long ago, Intel seemed to do even more than that. Now, it's admittedly somewhat hard to do a good job at TLB fill cost measurement with any aggressively out-of-order core (lots of asynchronous and potentially speculative stuff going on), so without any real knowledge about what is going on it's kind of hard to say. But I had an old TLB benchmark (long lost now, so I can't re-run it to check on this) that implied that when Intel brings in a new TLB entry, they brought it in 32 bytes at a time, and would then look up the adjacent entries in that aligned half-cache-line in just a single cycle. (This was back in the P6/P4/Core Duo kinds of days - we're talking 10-15 years ago. The tests I did mapped the constant zero-page in a big area, so that the effects of the D$ would be minimized, and I'd basically *only* see the TLB miss costs. I also did things like measure the costs of the micro-fault that happens when marking a page table entry entry dirty for the first time etc, which was absolutely horrendously slow on a P4). Now, it could have been other effects too - things like multiple outstanding TLB misses in parallel due to prefetching TLB entries could possibly cause similar patterns, although I distinctly remember that effect of aligned 8-entry entries. So the simplest explanation *seemed* to be a very special single-entry 32-byte "L0 cache" for TLB fills (and 32 bytes would probably match the L1 fetch width into the core). That's in *addition* to the fact that the upper level entries are cached in the TLB. As mentioned earlier, from my Transmeta days I have some insight into what old pre-NT Windows used to do. The GDI protection traversal seemed to flush the whole TLB for every GDI kernel entry, and some of the graphics benchmarks (remember: this was back in the days when unaccelerated VGA graphics was still a big deal) would literally blow the TLB away every 5-10k instructions. So it actually makes sense that Intel and AMD spent a *lot* of effort on making TLB fills go fast, because those GDI benchmarks were important back then. It was all the basic old 2D window handling and font drawing etc benchmarking. None of the RISC vendors ever cared, they had absolutely crap hardware, and instead encouraged their SW partners (particularly databases) to change the software to use large-pages and any trick possible to avoid TLB misses. Their compilers would schedule loads early and stores late, because the whole memory subsystem was a toy. TLB misses would break the whole pipeline etc. Truly crap hardware, and people still look up to it. Sad. In the Windows world, that wasn't an option: when Microsoft said "jump", Intel and AMD both said "how high?". As a result, x86 is a *lot* more well-rounded than any RISC I have ever seen, because Intel/AMD were simply forced to run any crap software, rather than having software writers optimize for what they were good at. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752298AbbFZJJP (ORCPT ); Fri, 26 Jun 2015 05:09:15 -0400 Received: from mail-wi0-f170.google.com ([209.85.212.170]:36743 "EHLO mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751979AbbFZJJD (ORCPT ); Fri, 26 Jun 2015 05:09:03 -0400 Date: Fri, 26 Jun 2015 11:08:56 +0200 From: Ingo Molnar To: Dave Hansen Cc: "Kirill A. Shutemov" , Mel Gorman , Andrew Morton , Rik van Riel , Hugh Dickins , Minchan Kim , Andi Kleen , H Peter Anvin , Linux-MM , LKML , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Borislav Petkov Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5 Message-ID: <20150626090856.GA31657@gmail.com> References: <1433767854-24408-1-git-send-email-mgorman@suse.de> <20150608174551.GA27558@gmail.com> <20150609084739.GQ26425@suse.de> <20150609103231.GA11026@gmail.com> <20150609112055.GS26425@suse.de> <20150609124328.GA23066@gmail.com> <5577078B.2000503@intel.com> <20150621202231.GB6766@node.dhcp.inet.fi> <20150625114819.GA20478@gmail.com> <558C4C8E.5010107@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <558C4C8E.5010107@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Dave Hansen wrote: > On 06/25/2015 04:48 AM, Ingo Molnar wrote: > > > I've updated the benchmarks with 4K flushes as well. Changes to the previous > > measurement: > > Did you push these out somewhere? The tip tmp.fpu branch hasn't seen any > updates. Not yet, I still don't trust the numbers. Thanks, Ingo