From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id 9F7B36B0035 for ; Mon, 21 Apr 2014 14:24:20 -0400 (EDT)
Received: by mail-pa0-f48.google.com with SMTP id hz1so3965281pad.7 for ; Mon, 21 Apr 2014 11:24:20 -0700 (PDT)
Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id qf5si21263346pac.375.2014.04.21.11.24.19 for ; Mon, 21 Apr 2014 11:24:19 -0700 (PDT)
Subject: [PATCH 0/6] x86: rework tlb range flushing code
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:18 -0700
Message-Id: <20140421182418.81CF7519@viggo.jf.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen

Changes from v2:
 * Added a brief comment above the ceiling tunable
 * Updated the documentation to mention large pages and say
   "individual flush" instead of invlpg in most cases.

Reposting with an instrumentation patch, and a few minor tweaks.
I'd love some more eyeballs on this, but I think it's ready for
-mm.

I've run this through a variety of systems in the LKP harness,
as well as running it on my desktop for a few days.  I have yet
to see any performance regressions (or gains) show up.
Without the last (instrumentation/debugging) patch:

 arch/x86/include/asm/mmu_context.h |    6 ++
 arch/x86/include/asm/processor.h   |    1
 arch/x86/kernel/cpu/amd.c          |    7 --
 arch/x86/kernel/cpu/common.c       |   13 -----
 arch/x86/kernel/cpu/intel.c        |   26 ----------
 arch/x86/mm/tlb.c                  |   91 +++++++++++++++----------------------
 include/linux/mm_types.h           |   10 ++++
 mm/Makefile                        |    2
 8 files changed, 58 insertions(+), 98 deletions(-)

--

I originally went to look at this because I realized that newer
CPUs were not present in the intel_tlb_flushall_shift_set() code.
I went to try to figure out where to stick newer CPUs (do we
consider them more like SandyBridge or IvyBridge?), and was not
able to repeat the original experiments.

Instead, this set does:
 1. Rework the code a bit to ready it for tracepoints
 2. Add tracepoints
 3. Add a new tunable and set it to a sane value

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id A68E16B0037 for ; Mon, 21 Apr 2014 14:24:21 -0400 (EDT)
Received: by mail-pa0-f54.google.com with SMTP id lf10so4006332pab.13 for ; Mon, 21 Apr 2014 11:24:21 -0700 (PDT)
Received: from mga02.intel.com (mga02.intel.com.
 [134.134.136.20]) by mx.google.com with ESMTP id qf5si21263346pac.375.2014.04.21.11.24.20 for ; Mon, 21 Apr 2014 11:24:20 -0700 (PDT)
Subject: [PATCH 1/6] x86: mm: clean up tlb flushing code
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:20 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182420.307A0C57@viggo.jf.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com

From: Dave Hansen

The

	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels.  This eliminates
one of the copy-n-paste versions.  It also gives us a unified
exit point for each path through this function.  We need this in
a minute for our tracepoint.
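The shape of the change can be sketched in plain userspace C.  This
is only a model under stated assumptions: model_flush_range(),
struct flush_range and the three boolean parameters are hypothetical
stand-ins for the early-exit checks in flush_tlb_mm_range(), and
nothing here touches a real TLB.  Every early exit jumps to the same
label, and one flag decides whether the other CPUs are given the
caller's range or a full flush.

```c
#include <assert.h>
#include <stdbool.h>

#define TLB_FLUSH_ALL (~0UL)	/* all-ones sentinel for "flush everything" */

/* Hypothetical: the range that would be handed to flush_tlb_others(). */
struct flush_range {
	unsigned long start;
	unsigned long end;
};

struct flush_range model_flush_range(bool mm_is_active, bool has_user_mm,
				     bool want_full_flush,
				     unsigned long start, unsigned long end)
{
	struct flush_range others;
	bool need_flush_others_all = true;

	/* Each early exit takes the same, single way out. */
	if (!mm_is_active)
		goto out;
	if (!has_user_mm)
		goto out;
	if (want_full_flush)
		goto out;

	/* Only the page-by-page path keeps the caller's range. */
	need_flush_others_all = false;
out:
	if (need_flush_others_all) {
		start = 0UL;
		end = TLB_FLUSH_ALL;
	}
	others.start = start;
	others.end = end;
	return others;
}
```

With this shape there is exactly one place where the remote range is
chosen, which is also the natural spot to hang a tracepoint later.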
Signed-off-by: Dave Hansen --- b/arch/x86/mm/tlb.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-04-21 11:10:34.431818610 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.435818791 -0700 @@ -161,23 +161,24 @@ void flush_tlb_current_task(void) void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { + int need_flush_others_all = 1; unsigned long addr; unsigned act_entries, tlb_entries = 0; unsigned long nr_base_pages; preempt_disable(); if (current->active_mm != mm) - goto flush_all; + goto out; if (!current->mm) { leave_mm(smp_processor_id()); - goto flush_all; + goto out; } if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 || vmflag & VM_HUGETLB) { local_flush_tlb(); - goto flush_all; + goto out; } /* In modern CPU, last level tlb used for both data/ins */ @@ -196,22 +197,20 @@ void flush_tlb_mm_range(struct mm_struct count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { + need_flush_others_all = 0; /* flush range by one by one 'invlpg' */ for (addr = start; addr < end; addr += PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE); __flush_tlb_single(addr); } - - if (cpumask_any_but(mm_cpumask(mm), - smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, end); - preempt_enable(); - return; } - -flush_all: +out: + if (need_flush_others_all) { + start = 0UL; + end = TLB_FLUSH_ALL; + } if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + flush_tlb_others(mm_cpumask(mm), mm, start, end); preempt_enable(); } _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f180.google.com (mail-pd0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id 177E26B003A for ; Mon, 21 Apr 2014 14:24:27 -0400 (EDT)
Received: by mail-pd0-f180.google.com with SMTP id v10so3943003pde.39 for ; Mon, 21 Apr 2014 11:24:26 -0700 (PDT)
Received: from mga03.intel.com (mga03.intel.com. [143.182.124.21]) by mx.google.com with ESMTP id s9si7674813pbj.274.2014.04.21.11.24.24 for ; Mon, 21 Apr 2014 11:24:24 -0700 (PDT)
Subject: [PATCH 3/6] x86: mm: fix missed global TLB flush stat
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:22 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182422.DE5E728F@viggo.jf.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com

From: Dave Hansen

If we take the

	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
		local_flush_tlb();
		goto out;
	}

path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL.  This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.
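A rough userspace model of the accounting may help when reviewing.
It is a sketch, not the kernel code: struct flush_stats and
model_local_flush() are hypothetical names, the two counters stand
in for NR_TLB_LOCAL_FLUSH_ALL and NR_TLB_LOCAL_FLUSH_ONE, the
VM_HUGETLB value is just a placeholder bit, and the ceiling is passed
as a parameter instead of being read from
tlb_single_page_flush_ceiling.

```c
#include <assert.h>

#define PAGE_SHIFT	12			/* 4KB base pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define TLB_FLUSH_ALL	(~0UL)			/* all-ones sentinel */
#define VM_HUGETLB	0x00400000UL		/* placeholder flag bit for this model */

struct flush_stats {
	unsigned long local_flush_all;	/* models NR_TLB_LOCAL_FLUSH_ALL */
	unsigned long local_flush_one;	/* models NR_TLB_LOCAL_FLUSH_ONE */
};

/* Returns the page count that was acted on, or TLB_FLUSH_ALL. */
unsigned long model_local_flush(struct flush_stats *stats,
				unsigned long start, unsigned long end,
				unsigned long vmflag, unsigned long ceiling)
{
	/* do a global flush by default */
	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
	unsigned long addr;

	if (end != TLB_FLUSH_ALL && !(vmflag & VM_HUGETLB))
		base_pages_to_flush = (end - start) >> PAGE_SHIFT;

	if (base_pages_to_flush > ceiling) {
		/* every full flush, whatever its cause, is counted here */
		base_pages_to_flush = TLB_FLUSH_ALL;
		stats->local_flush_all++;
	} else {
		for (addr = start; addr < end; addr += PAGE_SIZE)
			stats->local_flush_one++;
	}
	return base_pages_to_flush;
}
```

Because TLB_FLUSH_ALL is the largest unsigned long, the "explicit
full flush" and hugetlb cases fall into the same branch as an
oversized range, so none of them can skip the counter.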
Signed-off-by: Dave Hansen --- b/arch/x86/mm/tlb.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff -puN arch/x86/mm/tlb.c~fix-missed-global-flush-stat arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~fix-missed-global-flush-stat 2014-04-21 11:10:35.176852256 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.190852888 -0700 @@ -172,8 +172,9 @@ unsigned long tlb_single_page_flush_ceil void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { - int need_flush_others_all = 1; unsigned long addr; + /* do a global flush by default */ + unsigned long base_pages_to_flush = TLB_FLUSH_ALL; preempt_disable(); if (current->active_mm != mm) @@ -184,16 +185,14 @@ void flush_tlb_mm_range(struct mm_struct goto out; } - if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { - local_flush_tlb(); - goto out; - } + if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) + base_pages_to_flush = (end - start) >> PAGE_SHIFT; - if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { + if (base_pages_to_flush > tlb_single_page_flush_ceiling) { + base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { - need_flush_others_all = 0; /* flush range by one by one 'invlpg' */ for (addr = start; addr < end; addr += PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE); @@ -201,7 +200,7 @@ void flush_tlb_mm_range(struct mm_struct } } out: - if (need_flush_others_all) { + if (base_pages_to_flush == TLB_FLUSH_ALL) { start = 0UL; end = TLB_FLUSH_ALL; } _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f51.google.com (mail-pb0-f51.google.com [209.85.160.51]) by kanga.kvack.org (Postfix) with ESMTP id 6271A6B003A for ; Mon, 21 Apr 2014 14:24:28 -0400 (EDT)
Received: by mail-pb0-f51.google.com with SMTP id uo5so3964893pbc.24 for ; Mon, 21 Apr 2014 11:24:23 -0700 (PDT)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id se7si21291423pbb.139.2014.04.21.11.24.22 for ; Mon, 21 Apr 2014 11:24:22 -0700 (PDT)
Subject: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:21 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182421.DFAAD16A@viggo.jf.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com

From: Dave Hansen

I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy.  It uses mm->total_vm to judge the
   task's footprint in the TLB.  It should certainly be using
   some measure of RSS, *NOT* ->total_vm since only resident
   memory can populate the TLB.
2. Haswell and several other CPUs are missing from the
   intel_tlb_flushall_shift_set() function.  Thus, it has been
   demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
	[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
	[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
	[ 0.037444] tlb_flushall_shift: 6
   Which leads it to never use invlpg.
4.
The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches) 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit. Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch. Signed-off-by: Dave Hansen --- b/arch/x86/include/asm/processor.h | 1 b/arch/x86/kernel/cpu/amd.c | 7 -- b/arch/x86/kernel/cpu/common.c | 13 ----- b/arch/x86/kernel/cpu/intel.c | 26 ---------- b/arch/x86/mm/tlb.c | 91 ++++++------------------------------- 5 files changed, 19 insertions(+), 119 deletions(-) diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h --- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.813835861 -0700 +++ b/arch/x86/include/asm/processor.h 2014-04-21 11:10:34.823836313 -0700 @@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I extern u16 __read_mostly tlb_lld_2m[NR_INFO]; extern u16 __read_mostly tlb_lld_4m[NR_INFO]; extern u16 __read_mostly tlb_lld_1g[NR_INFO]; -extern s8 __read_mostly tlb_flushall_shift; /* * CPU type and hardware bug flags. Kept separately for each CPU. 
diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c --- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.814835907 -0700 +++ b/arch/x86/kernel/cpu/amd.c 2014-04-21 11:10:34.824836358 -0700 @@ -741,11 +741,6 @@ static unsigned int amd_size_cache(struc } #endif -static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c) -{ - tlb_flushall_shift = 6; -} - static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c) { u32 ebx, eax, ecx, edx; @@ -793,8 +788,6 @@ static void cpu_detect_tlb_amd(struct cp tlb_lli_2m[ENTRIES] = eax & mask; tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1; - - cpu_set_tlb_flushall_shift(c); } static const struct cpu_dev amd_cpu_dev = { diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c --- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.816835998 -0700 +++ b/arch/x86/kernel/cpu/common.c 2014-04-21 11:10:34.825836403 -0700 @@ -479,26 +479,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO]; u16 __read_mostly tlb_lld_4m[NR_INFO]; u16 __read_mostly tlb_lld_1g[NR_INFO]; -/* - * tlb_flushall_shift shows the balance point in replacing cr3 write - * with multiple 'invlpg'. It will do this replacement when - * flush_tlb_lines <= active_lines/2^tlb_flushall_shift. - * If tlb_flushall_shift is -1, means the replacement will be disabled. 
- */ -s8 __read_mostly tlb_flushall_shift = -1; - void cpu_detect_tlb(struct cpuinfo_x86 *c) { if (this_cpu->c_detect_tlb) this_cpu->c_detect_tlb(c); printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" - "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n" - "tlb_flushall_shift: %d\n", + "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n", tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES], tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES], tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES], - tlb_lld_1g[ENTRIES], tlb_flushall_shift); + tlb_lld_1g[ENTRIES]); } void detect_ht(struct cpuinfo_x86 *c) diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c --- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.818836088 -0700 +++ b/arch/x86/kernel/cpu/intel.c 2014-04-21 11:10:34.825836403 -0700 @@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsig } } -static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c) -{ - switch ((c->x86 << 8) + c->x86_model) { - case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */ - case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */ - case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */ - case 0x61d: /* six-core 45 nm xeon "Dunnington" */ - tlb_flushall_shift = -1; - break; - case 0x63a: /* Ivybridge */ - tlb_flushall_shift = 2; - break; - case 0x61a: /* 45 nm nehalem, "Bloomfield" */ - case 0x61e: /* 45 nm nehalem, "Lynnfield" */ - case 0x625: /* 32 nm nehalem, "Clarkdale" */ - case 0x62c: /* 32 nm nehalem, "Gulftown" */ - case 0x62e: /* 45 nm nehalem-ex, "Beckton" */ - case 0x62f: /* 32 nm Xeon E7 */ - case 0x62a: /* SandyBridge */ - case 0x62d: /* SandyBridge, "Romely-EP" */ - default: - tlb_flushall_shift = 6; - } -} - static void intel_detect_tlb(struct cpuinfo_x86 *c) { int i, j, n; @@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpui for (j = 1 ; j < 16 ; 
j++) intel_tlb_lookup(desc[j]); } - intel_tlb_flushall_shift_set(c); } static const struct cpu_dev intel_cpu_dev = { diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.820836178 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.826836449 -0700 @@ -158,13 +158,22 @@ void flush_tlb_current_task(void) preempt_enable(); } +/* + * See Documentation/x86/tlb.txt for details. We choose 33 + * because it is large enough to cover the vast majority (at + * least 95%) of allocations, and is small enough that we are + * confident it will not cause too much overhead. Each single + * flush is about 100 cycles, so this caps the maximum overhead + * at _about_ 3,000 cycles. + */ +/* in units of pages */ +unsigned long tlb_single_page_flush_ceiling = 1; + void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { int need_flush_others_all = 1; unsigned long addr; - unsigned act_entries, tlb_entries = 0; - unsigned long nr_base_pages; preempt_disable(); if (current->active_mm != mm) @@ -175,25 +184,12 @@ void flush_tlb_mm_range(struct mm_struct goto out; } - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 - || vmflag & VM_HUGETLB) { + if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } - /* In modern CPU, last level tlb used for both data/ins */ - if (vmflag & VM_EXEC) - tlb_entries = tlb_lli_4k[ENTRIES]; - else - tlb_entries = tlb_lld_4k[ENTRIES]; - - /* Assume all of TLB entries was occupied by this task */ - act_entries = tlb_entries >> tlb_flushall_shift; - act_entries = mm->total_vm > act_entries ? 
act_entries : mm->total_vm; - nr_base_pages = (end - start) >> PAGE_SHIFT; - - /* tlb_flushall_shift is on balance point, details in commit log */ - if (nr_base_pages > act_entries) { + if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { @@ -259,68 +255,15 @@ static void do_kernel_range_flush(void * void flush_tlb_kernel_range(unsigned long start, unsigned long end) { - unsigned act_entries; - struct flush_tlb_info info; - - /* In modern CPU, last level tlb used for both data/ins */ - act_entries = tlb_lld_4k[ENTRIES]; /* Balance as user space task's flush, a bit conservative */ - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 || - (end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) - + if (end == TLB_FLUSH_ALL || + (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { on_each_cpu(do_flush_tlb_all, NULL, 1); - else { + } else { + struct flush_tlb_info info; info.flush_start = start; info.flush_end = end; on_each_cpu(do_kernel_range_flush, &info, 1); } } - -#ifdef CONFIG_DEBUG_TLBFLUSH -static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, - size_t count, loff_t *ppos) -{ - char buf[32]; - unsigned int len; - - len = sprintf(buf, "%hd\n", tlb_flushall_shift); - return simple_read_from_buffer(user_buf, count, ppos, buf, len); -} - -static ssize_t tlbflush_write_file(struct file *file, - const char __user *user_buf, size_t count, loff_t *ppos) -{ - char buf[32]; - ssize_t len; - s8 shift; - - len = min(count, sizeof(buf) - 1); - if (copy_from_user(buf, user_buf, len)) - return -EFAULT; - - buf[len] = '\0'; - if (kstrtos8(buf, 0, &shift)) - return -EINVAL; - - if (shift < -1 || shift >= BITS_PER_LONG) - return -EINVAL; - - tlb_flushall_shift = shift; - return count; -} - -static const struct file_operations fops_tlbflush = { - .read = tlbflush_read_file, - .write = tlbflush_write_file, - .llseek = default_llseek, -}; - -static int __init 
create_tlb_flushall_shift(void) -{ - debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR, - arch_debugfs_dir, NULL, &fops_tlbflush); - return 0; -} -late_initcall(create_tlb_flushall_shift); -#endif _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f47.google.com (mail-pb0-f47.google.com [209.85.160.47]) by kanga.kvack.org (Postfix) with ESMTP id 21EC26B003B for ; Mon, 21 Apr 2014 14:24:30 -0400 (EDT) Received: by mail-pb0-f47.google.com with SMTP id up15so3999993pbc.34 for ; Mon, 21 Apr 2014 11:24:29 -0700 (PDT) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTP id l4si9114175pav.36.2014.04.21.11.24.26 for ; Mon, 21 Apr 2014 11:24:27 -0700 (PDT) Subject: [PATCH 4/6] x86: mm: trace tlb flushes From: Dave Hansen Date: Mon, 21 Apr 2014 11:24:25 -0700 References: <20140421182418.81CF7519@viggo.jf.intel.com> In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com> Message-Id: <20140421182425.93E696A3@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen We don't have any good way to figure out what kinds of flushes are being attempted. Right now, we can try to use the vm counters, but those only tell us what we actually did with the hardware (one-by-one vs full) and don't tell us what was actually _requested_. This allows us to select out "interesting" TLB flushes that we might want to optimize (like the ranged ones) and ignore the ones that we have very little control over (the ones at context switch). 
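To make the "select out" step concrete, here is a small userspace
sketch that filters the text produced by the tracepoint.  It is an
illustration under assumptions: parse_tlb_flush() and
is_ranged_request() are hypothetical helpers, the input is assumed to
be only the payload that follows the event name (in the
"pages: ... reason: ..." format used by TP_printk() in this patch),
and a full flush is assumed to print as -1 because TLB_FLUSH_ALL is
all-ones and the format is "%ld".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Same order as the enum tlb_flush_reason added by this patch. */
enum tlb_flush_reason {
	TLB_FLUSH_ON_TASK_SWITCH,
	TLB_REMOTE_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN_DONE,
	TLB_LOCAL_MM_SHOOTDOWN,
	TLB_LOCAL_MM_SHOOTDOWN_DONE,
	NR_TLB_FLUSH_REASONS,
};

/* Hypothetical helper: pull both numbers out of one event payload. */
bool parse_tlb_flush(const char *payload, long *pages, int *reason)
{
	return sscanf(payload, "pages: %ld reason: %d", pages, reason) == 2;
}

/* Hypothetical filter: a ranged request that did not ask for everything. */
bool is_ranged_request(long pages, int reason)
{
	/* a full flush shows up as -1 when printed with "%ld" */
	return reason == TLB_LOCAL_MM_SHOOTDOWN && pages != -1;
}
```

Feeding it "pages: 1 reason: 4 (TLB_LOCAL_MM_SHOOTDOWN)" selects the
event, while a context-switch event such as
"pages: -1 reason: 0 (TLB_FLUSH_ON_TASK_SWITCH)" is ignored.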
Also, since we have a pair of tracepoint calls in flush_tlb_mm_range(), we can time the deltas between them to make sure that we got the "invlpg vs. global flush" balance correct in practice. Signed-off-by: Dave Hansen --- b/arch/x86/include/asm/mmu_context.h | 6 +++++ b/arch/x86/mm/tlb.c | 12 +++++++++-- b/include/linux/mm_types.h | 10 +++++++++ b/include/trace/events/tlb.h | 37 +++++++++++++++++++++++++++++++++++ b/mm/Makefile | 2 - b/mm/trace_tlb.c | 12 +++++++++++ 6 files changed, 76 insertions(+), 3 deletions(-) diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h --- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-04-21 11:10:35.519867746 -0700 +++ b/arch/x86/include/asm/mmu_context.h 2014-04-21 11:10:35.527868108 -0700 @@ -3,6 +3,10 @@ #include #include +#include + +#include + #include #include #include @@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s /* Re-load page tables */ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); /* Stop flush ipis for the previous mm */ cpumask_clear_cpu(cpu, mm_cpumask(prev)); @@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s * to make sure to use no freed page tables. 
*/ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); load_LDT_nolock(&next->context); } } diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~tlb-trace-flushes 2014-04-21 11:10:35.520867791 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.528868153 -0700 @@ -14,6 +14,8 @@ #include #include +#include + DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { &init_mm, 0, }; @@ -49,6 +51,7 @@ void leave_mm(int cpu) if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); load_cr3(swapper_pg_dir); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); } } EXPORT_SYMBOL_GPL(leave_mm); @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info) count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) { - if (f->flush_end == TLB_FLUSH_ALL) + if (f->flush_end == TLB_FLUSH_ALL) { local_flush_tlb(); - else if (!f->flush_end) + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL); + } else if (!f->flush_end) __flush_tlb_single(f->flush_start); else { unsigned long addr; @@ -152,7 +156,9 @@ void flush_tlb_current_task(void) preempt_disable(); count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); local_flush_tlb(); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL); if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); preempt_enable(); @@ -188,6 +194,7 @@ void flush_tlb_mm_range(struct mm_struct if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) base_pages_to_flush = (end - start) >> PAGE_SHIFT; + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush); if (base_pages_to_flush > tlb_single_page_flush_ceiling) { base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); @@ -199,6 +206,7 @@ void flush_tlb_mm_range(struct mm_struct __flush_tlb_single(addr); 
} } + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush); out: if (base_pages_to_flush == TLB_FLUSH_ALL) { start = 0UL; diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h --- a/include/linux/mm_types.h~tlb-trace-flushes 2014-04-21 11:10:35.522867881 -0700 +++ b/include/linux/mm_types.h 2014-04-21 11:10:35.529868198 -0700 @@ -510,4 +510,14 @@ static inline void clear_tlb_flush_pendi } #endif +enum tlb_flush_reason { + TLB_FLUSH_ON_TASK_SWITCH, + TLB_REMOTE_SHOOTDOWN, + TLB_LOCAL_SHOOTDOWN, + TLB_LOCAL_SHOOTDOWN_DONE, + TLB_LOCAL_MM_SHOOTDOWN, + TLB_LOCAL_MM_SHOOTDOWN_DONE, + NR_TLB_FLUSH_REASONS, +}; + #endif /* _LINUX_MM_TYPES_H */ diff -puN /dev/null include/trace/events/tlb.h --- /dev/null 2014-04-10 11:28:14.066815724 -0700 +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 @@ -0,0 +1,37 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM tlb + +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_TLB_H + +#include +#include + +extern const char * const tlb_flush_reason_desc[]; + +TRACE_EVENT(tlb_flush, + + TP_PROTO(int reason, unsigned long pages), + TP_ARGS(reason, pages), + + TP_STRUCT__entry( + __field( int, reason) + __field(unsigned long, pages) + ), + + TP_fast_assign( + __entry->reason = reason; + __entry->pages = pages; + ), + + TP_printk("pages: %ld reason: %d (%s)", + __entry->pages, + __entry->reason, + tlb_flush_reason_desc[__entry->reason]) +); + +#endif /* _TRACE_TLB_H */ + +/* This part must be outside protection */ +#include + diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile --- a/mm/Makefile~tlb-trace-flushes 2014-04-21 11:10:35.524867971 -0700 +++ b/mm/Makefile 2014-04-21 11:10:35.530868243 -0700 @@ -5,7 +5,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ - vmalloc.o pagewalk.o pgtable-generic.o + vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o ifdef 
CONFIG_CROSS_MEMORY_ATTACH mmu-$(CONFIG_MMU) += process_vm_access.o diff -puN /dev/null mm/trace_tlb.c --- /dev/null 2014-04-10 11:28:14.066815724 -0700 +++ b/mm/trace_tlb.c 2014-04-21 11:10:35.530868243 -0700 @@ -0,0 +1,12 @@ +#define CREATE_TRACE_POINTS +#include + +const char * const tlb_flush_reason_desc[] = { + __stringify(TLB_FLUSH_ON_TASK_SWITCH), + __stringify(TLB_REMOTE_SHOOTDOWN), + __stringify(TLB_LOCAL_SHOOTDOWN), + __stringify(TLB_LOCAL_SHOOTDOWN_DONE), + __stringify(TLB_LOCAL_MM_SHOOTDOWN), + __stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE), +}; + _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f54.google.com (mail-pb0-f54.google.com [209.85.160.54]) by kanga.kvack.org (Postfix) with ESMTP id B050D6B003D for ; Mon, 21 Apr 2014 14:24:30 -0400 (EDT) Received: by mail-pb0-f54.google.com with SMTP id ma3so4008070pbc.13 for ; Mon, 21 Apr 2014 11:24:30 -0700 (PDT) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTP id a8si21262989pbs.242.2014.04.21.11.24.29 for ; Mon, 21 Apr 2014 11:24:29 -0700 (PDT) Subject: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush From: Dave Hansen Date: Mon, 21 Apr 2014 11:24:26 -0700 References: <20140421182418.81CF7519@viggo.jf.intel.com> In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com> Message-Id: <20140421182426.D6DD1E8F@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Most of the logic here is in the documentation file. Please take a look at it. 
I know we've come full-circle here back to a tunable, but this new one is *WAY* simpler. I challenge anyone to describe in one sentence how the old one worked. Here's the way the new one works: If we are flushing more pages than the ceiling, we use the full flush, otherwise we use per-page flushes. Signed-off-by: Dave Hansen --- b/Documentation/x86/tlb.txt | 72 ++++++++++++++++++++++++++++++++++++++++++++ b/arch/x86/mm/tlb.c | 46 ++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+) diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush 2014-04-21 11:10:35.901884997 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.905885179 -0700 @@ -274,3 +274,49 @@ void flush_tlb_kernel_range(unsigned lon on_each_cpu(do_kernel_range_flush, &info, 1); } } + +static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + char buf[32]; + unsigned int len; + + len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling); + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static ssize_t tlbflush_write_file(struct file *file, + const char __user *user_buf, size_t count, loff_t *ppos) +{ + char buf[32]; + ssize_t len; + int ceiling; + + len = min(count, sizeof(buf) - 1); + if (copy_from_user(buf, user_buf, len)) + return -EFAULT; + + buf[len] = '\0'; + if (kstrtoint(buf, 0, &ceiling)) + return -EINVAL; + + if (ceiling < 0) + return -EINVAL; + + tlb_single_page_flush_ceiling = ceiling; + return count; +} + +static const struct file_operations fops_tlbflush = { + .read = tlbflush_read_file, + .write = tlbflush_write_file, + .llseek = default_llseek, +}; + +static int __init create_tlb_single_page_flush_ceiling(void) +{ + debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, + arch_debugfs_dir, NULL, &fops_tlbflush); + return 0; +} +late_initcall(create_tlb_single_page_flush_ceiling); diff -puN 
/dev/null Documentation/x86/tlb.txt
--- /dev/null	2014-04-10 11:28:14.066815724 -0700
+++ b/Documentation/x86/tlb.txt	2014-04-21 11:10:35.924886036 -0700
@@ -0,0 +1,72 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the individual flushes will have ended up being wasted
+    work.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive an individual flush looks.  Data and
+    instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles.
If you believe that individual invalidatoins being +called too often, you can lower the tunable: + + /sys/debug/kernel/x86/tlb_single_page_flush_ceiling + +This will cause us to do the global flush for more cases. +Lowering it to 0 will disable the use of the individual flushes. +Setting it to 1 is a very conservative setting and it should +never need to be 0 under normal circumstances. + +Despite the fact that a single individual flush on x86 is +guaranteed to flush a full 2MB, hugetlbfs always uses the full +flushes. THP is treated exactly the same as normal memory. + +You might see invlpg inside of flush_tlb_mm_range() show up in +profiles, or you can use the trace_tlb_flush() tracepoints. to +determine how long the flush operations are taking. + +Essentially, you are balancing the cycles you spend doing invlpg +with the cycles that you spend refilling the TLB later. + +You can measure how expensive TLB refills are by using +performance counters and 'perf stat', like this: + +perf stat -e + cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, + cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, + cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, + cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, + cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, + cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ + +That works on an IvyBridge-era CPU (i5-3320M). Different CPUs +may have differently-named counters, but they should at least +be there in some form. You can use pmu-tools 'ocperf list' +(https://github.com/andikleen/pmu-tools) to find the right +counters for a given CPU. + _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f47.google.com (mail-pb0-f47.google.com [209.85.160.47])
	by kanga.kvack.org (Postfix) with ESMTP id B78A18298E
	for ; Mon, 21 Apr 2014 14:24:35 -0400 (EDT)
Received: by mail-pb0-f47.google.com with SMTP id up15so4000086pbc.34
	for ; Mon, 21 Apr 2014 11:24:35 -0700 (PDT)
Received: from mga03.intel.com (mga03.intel.com. [143.182.124.21])
	by mx.google.com with ESMTP id s8si21238820pas.426.2014.04.21.11.24.34
	for ; Mon, 21 Apr 2014 11:24:34 -0700 (PDT)
Subject: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33)
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:28 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182428.FC2104C1@viggo.jf.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com

From: Dave Hansen

This has been run through Intel's LKP tests across a wide range
of modern systems and workloads and it wasn't shown to make a
measurable performance difference, positive or negative.

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.
It breaks down like this:

      size  percent  percent<=
         V        V          V
   GLOBAL:    2.20%      2.20% avg cycles:  2283
        1:   56.92%     59.12% avg cycles:  1276
        2:   13.78%     72.90% avg cycles:  1505
        3:    8.26%     81.16% avg cycles:  1880
        4:    7.41%     88.58% avg cycles:  2447
        5:    1.73%     90.31% avg cycles:  2358
        6:    1.32%     91.63% avg cycles:  2563
        7:    1.14%     92.77% avg cycles:  2862
        8:    0.62%     93.39% avg cycles:  3542
        9:    0.08%     93.47% avg cycles:  3289
       10:    0.43%     93.90% avg cycles:  3570
       11:    0.20%     94.10% avg cycles:  3767
       12:    0.08%     94.18% avg cycles:  3996
       13:    0.03%     94.20% avg cycles:  4077
       14:    0.02%     94.23% avg cycles:  4836
       15:    0.04%     94.26% avg cycles:  5699
       16:    0.06%     94.32% avg cycles:  5041
       17:    0.57%     94.89% avg cycles:  5473
       18:    0.02%     94.91% avg cycles:  5396
       19:    0.03%     94.95% avg cycles:  5296
       20:    0.02%     94.96% avg cycles:  6749
       21:    0.18%     95.14% avg cycles:  6225
       22:    0.01%     95.15% avg cycles:  6393
       23:    0.01%     95.16% avg cycles:  6861
       24:    0.12%     95.28% avg cycles:  6912
       25:    0.05%     95.32% avg cycles:  7190
       26:    0.01%     95.33% avg cycles:  7793
       27:    0.01%     95.34% avg cycles:  7833
       28:    0.01%     95.35% avg cycles:  8253
       29:    0.08%     95.42% avg cycles:  8024
       30:    0.03%     95.45% avg cycles:  9670
       31:    0.01%     95.46% avg cycles:  8949
       32:    0.01%     95.46% avg cycles:  9350
       33:    3.11%     98.57% avg cycles:  8534
       34:    0.02%     98.60% avg cycles: 10977
       35:    0.02%     98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter
much in practice.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not by
   much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages).
7. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16.

I used the performance counters on this hardware (IvyBridge
i5-3320M) to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss
is 13.0ns (~34 cycles).  At those rates, refilling the 512-entry
dTLB takes 22,000 cycles.  On a SandyBridge system with more cores
and larger caches, those are dtlb=13.4ns and itlb=9.5ns.
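
The per-miss figures quoted above can be rechecked from the raw
counter values.  What follows is only a minimal Python sketch, not
part of the patch.  The 2.6 GHz clock used for the
cycles-to-nanoseconds conversion is an assumption (the message
does not state the clock speed), and all variable names are made
up for illustration.

```python
# Counter values copied from the ocperf output for the IvyBridge i5-3320M.
counters = {
    "dtlb_load_misses_walk_duration": 7_720_030_970,
    "dtlb_load_misses_walk_completed": 169_856_353,
    "dtlb_store_misses_walk_duration": 708_832_859,
    "dtlb_store_misses_walk_completed": 19_346_823,
    "itlb_misses_walk_duration": 2_779_687_402,
    "itlb_misses_walk_completed": 82_241_148,
}

# ASSUMPTION: clock frequency in GHz; not given in the message.
CPU_GHZ = 2.6
# dTLB size quoted in the message.
DTLB_ENTRIES = 512


def cycles_per_miss(prefix):
    """Sum walk cycles and completed walks for counters starting with prefix."""
    cycles = sum(v for k, v in counters.items()
                 if k.startswith(prefix) and k.endswith("walk_duration"))
    misses = sum(v for k, v in counters.items()
                 if k.startswith(prefix) and k.endswith("walk_completed"))
    return cycles / misses


# Loads and stores are combined for the dTLB, as the awk one-liner does.
dtlb_cycles_per_miss = cycles_per_miss("dtlb_")
itlb_cycles_per_miss = cycles_per_miss("itlb_")

# cycles / (cycles per nanosecond) = nanoseconds
dtlb_ns_per_miss = dtlb_cycles_per_miss / CPU_GHZ
itlb_ns_per_miss = itlb_cycles_per_miss / CPU_GHZ

# Cost of refilling every dTLB entry if each refill is a full miss.
dtlb_refill_cycles = DTLB_ENTRIES * dtlb_cycles_per_miss

print(f"dtlb: {dtlb_cycles_per_miss:.1f} cycles/miss, {dtlb_ns_per_miss:.1f} ns/miss")
print(f"itlb: {itlb_cycles_per_miss:.1f} cycles/miss, {itlb_ns_per_miss:.1f} ns/miss")
print(f"refilling {DTLB_ENTRIES} dTLB entries: ~{dtlb_refill_cycles:.0f} cycles")
```

With the assumed clock, this lands on the same ~45 and ~34 cycle
figures and puts a full dTLB refill a little above 22,000 cycles.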

cat perf.stat.txt | perl -pe 's/,//g' |
	awk '/itlb_misses_walk_duration/ { icyc+=$1 }
	     /itlb_misses_walk_completed/ { imiss+=$1 }
	     /dtlb_.*_walk_duration/ { dcyc+=$1 }
	     /dtlb_.*.*completed/ { dmiss+=$1 }
	     END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are:

	itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill
are about 100ns.  Being generous, that is over by a factor of 6 on
the refill side, although it is fairly close on the cost of an
invlpg.  An increase of a single invlpg operation seems to
lengthen the flush range operation by about 200 cycles.  Here is
one example of the data collected for flushing 10 and 11 pages
(full data are below):

    10:  0.43%  93.90% avg cycles:   3570 cycles/page:  357 samples: 4714
    11:  0.20%  94.10% avg cycles:   3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set
to 150 pages.
Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:  2.20%   2.20% avg cycles:   2283 cycles/page: xxxx samples: 23960
     1: 56.92%  59.12% avg cycles:   1276 cycles/page: 1276 samples: 620895
     2: 13.78%  72.90% avg cycles:   1505 cycles/page:  752 samples: 150335
     3:  8.26%  81.16% avg cycles:   1880 cycles/page:  626 samples: 90131
     4:  7.41%  88.58% avg cycles:   2447 cycles/page:  611 samples: 80877
     5:  1.73%  90.31% avg cycles:   2358 cycles/page:  471 samples: 18885
     6:  1.32%  91.63% avg cycles:   2563 cycles/page:  427 samples: 14397
     7:  1.14%  92.77% avg cycles:   2862 cycles/page:  408 samples: 12441
     8:  0.62%  93.39% avg cycles:   3542 cycles/page:  442 samples: 6721
     9:  0.08%  93.47% avg cycles:   3289 cycles/page:  365 samples: 917
    10:  0.43%  93.90% avg cycles:   3570 cycles/page:  357 samples: 4714
    11:  0.20%  94.10% avg cycles:   3767 cycles/page:  342 samples: 2145
    12:  0.08%  94.18% avg cycles:   3996 cycles/page:  333 samples: 864
    13:  0.03%  94.20% avg cycles:   4077 cycles/page:  313 samples: 289
    14:  0.02%  94.23% avg cycles:   4836 cycles/page:  345 samples: 236
    15:  0.04%  94.26% avg cycles:   5699 cycles/page:  379 samples: 390
    16:  0.06%  94.32% avg cycles:   5041 cycles/page:  315 samples: 643
    17:  0.57%  94.89% avg cycles:   5473 cycles/page:  321 samples: 6229
    18:  0.02%  94.91% avg cycles:   5396 cycles/page:  299 samples: 224
    19:  0.03%  94.95% avg cycles:   5296 cycles/page:  278 samples: 367
    20:  0.02%  94.96% avg cycles:   6749 cycles/page:  337 samples: 185
    21:  0.18%  95.14% avg cycles:   6225 cycles/page:  296 samples: 1964
    22:  0.01%  95.15% avg cycles:   6393 cycles/page:  290 samples: 83
    23:  0.01%  95.16% avg cycles:   6861 cycles/page:  298 samples: 61
    24:  0.12%  95.28% avg cycles:   6912 cycles/page:  288 samples: 1307
    25:  0.05%  95.32% avg cycles:   7190 cycles/page:  287 samples: 533
    26:  0.01%  95.33% avg cycles:   7793 cycles/page:  299 samples: 94
    27:  0.01%  95.34% avg cycles:   7833 cycles/page:  290 samples: 66
    28:  0.01%  95.35% avg cycles:   8253 cycles/page:  294 samples: 73
    29:  0.08%  95.42% avg cycles:   8024 cycles/page:  276 samples: 846
    30:  0.03%  95.45% avg cycles:   9670 cycles/page:  322 samples: 296
    31:  0.01%  95.46% avg cycles:   8949 cycles/page:  288 samples: 79
    32:  0.01%  95.46% avg cycles:   9350 cycles/page:  292 samples: 60
    33:  3.11%  98.57% avg cycles:   8534 cycles/page:  258 samples: 33936
    34:  0.02%  98.60% avg cycles:  10977 cycles/page:  322 samples: 268
    35:  0.02%  98.62% avg cycles:  11400 cycles/page:  325 samples: 177
    36:  0.01%  98.63% avg cycles:  11504 cycles/page:  319 samples: 161
    37:  0.02%  98.65% avg cycles:  11596 cycles/page:  313 samples: 182
    38:  0.02%  98.66% avg cycles:  11850 cycles/page:  311 samples: 195
    39:  0.01%  98.68% avg cycles:  12158 cycles/page:  311 samples: 128
    40:  0.01%  98.68% avg cycles:  11626 cycles/page:  290 samples: 78
    41:  0.04%  98.73% avg cycles:  11435 cycles/page:  278 samples: 477
    42:  0.01%  98.73% avg cycles:  12571 cycles/page:  299 samples: 74
    43:  0.01%  98.74% avg cycles:  12562 cycles/page:  292 samples: 78
    44:  0.01%  98.75% avg cycles:  12991 cycles/page:  295 samples: 108
    45:  0.01%  98.76% avg cycles:  13169 cycles/page:  292 samples: 78
    46:  0.02%  98.78% avg cycles:  12891 cycles/page:  280 samples: 261
    47:  0.01%  98.79% avg cycles:  13099 cycles/page:  278 samples: 67
    48:  0.01%  98.80% avg cycles:  13851 cycles/page:  288 samples: 77
    49:  0.01%  98.80% avg cycles:  13749 cycles/page:  280 samples: 66
    50:  0.01%  98.81% avg cycles:  13949 cycles/page:  278 samples: 73
    52:  0.00%  98.82% avg cycles:  14243 cycles/page:  273 samples: 52
    54:  0.01%  98.83% avg cycles:  15312 cycles/page:  283 samples: 87
    55:  0.01%  98.84% avg cycles:  15197 cycles/page:  276 samples: 109
    56:  0.02%  98.86% avg cycles:  15234 cycles/page:  272 samples: 208
    57:  0.00%  98.86% avg cycles:  14888 cycles/page:  261 samples: 53
    58:  0.01%  98.87% avg cycles:  15037 cycles/page:  259 samples: 59
    59:  0.01%  98.87% avg cycles:  15752 cycles/page:  266 samples: 63
    62:  0.00%  98.89% avg cycles:  16222 cycles/page:  261 samples: 54
    64:  0.02%  98.91% avg cycles:  17179 cycles/page:  268 samples: 248
    65:  0.12%  99.03% avg cycles:  18762 cycles/page:  288 samples: 1324
    85:  0.00%  99.10% avg cycles:  21649 cycles/page:  254 samples: 50
   127:  0.01%  99.18% avg cycles:  32397 cycles/page:  255 samples: 75
   128:  0.13%  99.31% avg cycles:  31711 cycles/page:  247 samples: 1466
   129:  0.18%  99.49% avg cycles:  33017 cycles/page:  255 samples: 1927
   181:  0.33%  99.84% avg cycles:   2489 cycles/page:   13 samples: 3547
   256:  0.05%  99.91% avg cycles:   2305 cycles/page:    9 samples: 550
   512:  0.03%  99.95% avg cycles:   2133 cycles/page:    4 samples: 304
  1512:  0.01%  99.99% avg cycles:   3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel
compile for a SandyBridge system.  It's better than IvyBridge, but
probably due to the larger caches since this was one of the 'X'
extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' |
	awk '/itlb_misses_walk_duration/ { icyc+=$1 }
	     /itlb_misses_walk_completed/ { imiss+=$1 }
	     /dtlb_.*_walk_duration/ { dcyc+=$1 }
	     /dtlb_.*.*completed/ { dmiss+=$1 }
	     END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716

Signed-off-by: Dave Hansen
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-04-21 09:58:50.012268370 -0700
+++ b/arch/x86/mm/tlb.c	2014-04-21 09:58:50.016268551 -0700
@@ -173,7 +173,7 @@ void flush_tlb_current_task(void)
  * at _about_ 3,000 cycles.
*/ /* in units of pages */ -unsigned long tlb_single_page_flush_ceiling = 1; +unsigned long tlb_single_page_flush_ceiling = 33; void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f45.google.com (mail-ee0-f45.google.com [74.125.83.45]) by kanga.kvack.org (Postfix) with ESMTP id 974FB6B0044 for ; Tue, 22 Apr 2014 12:54:13 -0400 (EDT) Received: by mail-ee0-f45.google.com with SMTP id d17so4788001eek.18 for ; Tue, 22 Apr 2014 09:54:13 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id u5si60585979een.113.2014.04.22.09.54.10 for ; Tue, 22 Apr 2014 09:54:11 -0700 (PDT) Message-ID: <53569EA4.2000308@redhat.com> Date: Tue, 22 Apr 2014 12:53:56 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 1/6] x86: mm: clean up tlb flushing code References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182420.307A0C57@viggo.jf.intel.com> In-Reply-To: <20140421182420.307A0C57@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > The > > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) > > line of code is not exactly the easiest to audit, especially when > it ends up at two different indentation levels. This eliminates > one of the the copy-n-paste versions. 
It also gives us a unified > exit point for each path through this function. We need this in > a minute for our tracepoint. > > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49]) by kanga.kvack.org (Postfix) with ESMTP id 801F36B004D for ; Tue, 22 Apr 2014 12:54:57 -0400 (EDT) Received: by mail-ee0-f49.google.com with SMTP id c41so4816898eek.8 for ; Tue, 22 Apr 2014 09:54:56 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id n7si60548824eeu.169.2014.04.22.09.54.54 for ; Tue, 22 Apr 2014 09:54:55 -0700 (PDT) Message-ID: <53569ED3.2080206@redhat.com> Date: Tue, 22 Apr 2014 12:54:43 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> In-Reply-To: <20140421182421.DFAAD16A@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > I think the flush_tlb_mm_range() code that tries to tune the > flush sizes based on the CPU needs to get ripped out for > several reasons: > > 1. It is obviously buggy. It uses mm->total_vm to judge the > task's footprint in the TLB. 
It should certainly be using > some measure of RSS, *NOT* ->total_vm since only resident > memory can populate the TLB. > 2. Haswell, and several other CPUs are missing from the > intel_tlb_flushall_shift_set() function. Thus, it has been > demonstrated to bitrot quickly in practice. > 3. It is plain wrong in my vm: > [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] tlb_flushall_shift: 6 > Which leads to it to never use invlpg. > 4. The assumptions about TLB refill costs are wrong: > http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com > (more on this in later patches) > 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 > I believe the sample times were too short. Running the > benchmark in a loop yields times that vary quite a bit. > > Note that this leaves us with a static ceiling of 1 page. This > is a conservative, dumb setting, and will be revised in a later > patch. > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f177.google.com (mail-we0-f177.google.com [74.125.82.177]) by kanga.kvack.org (Postfix) with ESMTP id B39926B0055 for ; Tue, 22 Apr 2014 13:15:50 -0400 (EDT) Received: by mail-we0-f177.google.com with SMTP id u57so5106616wes.8 for ; Tue, 22 Apr 2014 10:15:50 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id b1si5682389wiz.37.2014.04.22.10.15.48 for ; Tue, 22 Apr 2014 10:15:49 -0700 (PDT) Message-ID: <5356A3B6.30006@redhat.com> Date: Tue, 22 Apr 2014 13:15:34 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 3/6] x86: mm: fix missed global TLB flush stat References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182422.DE5E728F@viggo.jf.intel.com> In-Reply-To: <20140421182422.DE5E728F@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > If we take the > > if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { > local_flush_tlb(); > goto out; > } > > path out of flush_tlb_mm_range(), we will have flushed the tlb, > but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the > way out of the function so that we always take a single path when > doing a full tlb flush. > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f51.google.com (mail-wg0-f51.google.com [74.125.82.51]) by kanga.kvack.org (Postfix) with ESMTP id 36B646B0069 for ; Tue, 22 Apr 2014 17:19:57 -0400 (EDT) Received: by mail-wg0-f51.google.com with SMTP id k14so49389wgh.10 for ; Tue, 22 Apr 2014 14:19:56 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id h13si14480354wjr.107.2014.04.22.14.19.55 for ; Tue, 22 Apr 2014 14:19:55 -0700 (PDT) Message-ID: <5356DCEF.3050506@redhat.com> Date: Tue, 22 Apr 2014 17:19:43 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> In-Reply-To: <20140421182425.93E696A3@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > We don't have any good way to figure out what kinds of flushes > are being attempted. Right now, we can try to use the vm > counters, but those only tell us what we actually did with the > hardware (one-by-one vs full) and don't tell us what was actually > _requested_. > > This allows us to select out "interesting" TLB flushes that we > might want to optimize (like the ranged ones) and ignore the ones > that we have very little control over (the ones at context > switch). > > Also, since we have a pair of tracepoint calls in > flush_tlb_mm_range(), we can time the deltas between them to make > sure that we got the "invlpg vs. global flush" balance correct in > practice. > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f180.google.com (mail-we0-f180.google.com [74.125.82.180]) by kanga.kvack.org (Postfix) with ESMTP id 5D26D6B005A for ; Tue, 22 Apr 2014 17:32:06 -0400 (EDT) Received: by mail-we0-f180.google.com with SMTP id k48so59794wev.11 for ; Tue, 22 Apr 2014 14:32:05 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id n17si24735wiv.20.2014.04.22.14.32.03 for ; Tue, 22 Apr 2014 14:32:04 -0700 (PDT) Message-ID: <5356DFC8.1060601@redhat.com> Date: Tue, 22 Apr 2014 17:31:52 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> In-Reply-To: <20140421182426.D6DD1E8F@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > Most of the logic here is in the documentation file. Please take > a look at it. > > I know we've come full-circle here back to a tunable, but this > new one is *WAY* simpler. I challenge anyone to describe in one > sentence how the old one worked. Here's the way the new one > works: > > If we are flushing more pages than the ceiling, we use > the full flush, otherwise we use per-page flushes. > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id 3890D6B0070 for ; Tue, 22 Apr 2014 17:34:08 -0400 (EDT) Received: by mail-ee0-f47.google.com with SMTP id b15so135908eek.34 for ; Tue, 22 Apr 2014 14:34:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id o46si100102eem.9.2014.04.22.14.34.05 for ; Tue, 22 Apr 2014 14:34:06 -0700 (PDT) Message-ID: <5356E041.3060709@redhat.com> Date: Tue, 22 Apr 2014 17:33:53 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33) References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182428.FC2104C1@viggo.jf.intel.com> In-Reply-To: <20140421182428.FC2104C1@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > This has been run through Intel's LKP tests across a wide range > of modern sytems and workloads and it wasn't shown to make a > measurable performance difference positive or negative. > > Now that we have some shiny new tracepoints, we can actually > figure out what the heck is going on. > > During a kernel compile, 60% of the flush_tlb_mm_range() calls > are for a single page. It breaks down like this: > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51]) by kanga.kvack.org (Postfix) with ESMTP id 27C326B0035 for ; Thu, 24 Apr 2014 04:33:10 -0400 (EDT) Received: by mail-ee0-f51.google.com with SMTP id c13so1546533eek.38 for ; Thu, 24 Apr 2014 01:33:09 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id y6si7259924eep.227.2014.04.24.01.33.07 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 01:33:08 -0700 (PDT) Date: Thu, 24 Apr 2014 09:33:04 +0100 From: Mel Gorman Subject: Re: [PATCH 1/6] x86: mm: clean up tlb flushing code Message-ID: <20140424083304.GP23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182420.307A0C57@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182420.307A0C57@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Mon, Apr 21, 2014 at 11:24:20AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > The > > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) > > line of code is not exactly the easiest to audit, especially when > it ends up at two different indentation levels. This eliminates > one of the the copy-n-paste versions. It also gives us a unified > exit point for each path through this function. We need this in > a minute for our tracepoint. 
> > > Signed-off-by: Dave Hansen > --- > > b/arch/x86/mm/tlb.c | 23 +++++++++++------------ > 1 file changed, 11 insertions(+), 12 deletions(-) > > diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c > --- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-04-21 11:10:34.431818610 -0700 > +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.435818791 -0700 > @@ -161,23 +161,24 @@ void flush_tlb_current_task(void) > void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, > unsigned long end, unsigned long vmflag) > { > + int need_flush_others_all = 1; > unsigned long addr; > unsigned act_entries, tlb_entries = 0; > unsigned long nr_base_pages; > Could make that bool but otherwise Acked-by: Mel Gorman -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f50.google.com (mail-ee0-f50.google.com [74.125.83.50]) by kanga.kvack.org (Postfix) with ESMTP id 0F7176B0035 for ; Thu, 24 Apr 2014 04:45:57 -0400 (EDT) Received: by mail-ee0-f50.google.com with SMTP id c13so1584190eek.9 for ; Thu, 24 Apr 2014 01:45:57 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id s46si7320760eeg.135.2014.04.24.01.45.55 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 01:45:56 -0700 (PDT) Date: Thu, 24 Apr 2014 09:45:52 +0100 From: Mel Gorman Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Message-ID: <20140424084552.GQ23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182421.DFAAD16A@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Mon, Apr 21, 2014 at 11:24:21AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > I think the flush_tlb_mm_range() code that tries to tune the > flush sizes based on the CPU needs to get ripped out for > several reasons: > > 1. It is obviously buggy. It uses mm->total_vm to judge the > task's footprint in the TLB. It should certainly be using > some measure of RSS, *NOT* ->total_vm since only resident > memory can populate the TLB. Agreed. Even an RSS check is dodgy considering that it is not a reliable indication of recent reference activity and how many relevant TLB entries there may be for the task. > 2. Haswell, and several other CPUs are missing from the > intel_tlb_flushall_shift_set() function. Thus, it has been > demonstrated to bitrot quickly in practice. I also worried that the methodology used to set that shift on different CPUs was different. > 3. It is plain wrong in my vm: > [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] tlb_flushall_shift: 6 > Which leads to it to never use invlpg. 
> 4. The assumptions about TLB refill costs are wrong: > http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com > (more on this in later patches) > 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 > I believe the sample times were too short. Running the > benchmark in a loop yields times that vary quite a bit. > FWIW, when I last visited this topic I had to modify the test case extensively and even then it was not driven by flush ranges measured from "real" workloads. > Note that this leaves us with a static ceiling of 1 page. This > is a conservative, dumb setting, and will be revised in a later > patch. > > Signed-off-by: Dave Hansen > --- > > b/arch/x86/include/asm/processor.h | 1 > b/arch/x86/kernel/cpu/amd.c | 7 -- > b/arch/x86/kernel/cpu/common.c | 13 ----- > b/arch/x86/kernel/cpu/intel.c | 26 ---------- > b/arch/x86/mm/tlb.c | 91 ++++++------------------------------- > 5 files changed, 19 insertions(+), 119 deletions(-) > > diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h > --- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.813835861 -0700 > +++ b/arch/x86/include/asm/processor.h 2014-04-21 11:10:34.823836313 -0700 > @@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I > extern u16 __read_mostly tlb_lld_2m[NR_INFO]; > extern u16 __read_mostly tlb_lld_4m[NR_INFO]; > extern u16 __read_mostly tlb_lld_1g[NR_INFO]; > -extern s8 __read_mostly tlb_flushall_shift; > > /* > * CPU type and hardware bug flags. Kept separately for each CPU. 
> diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c > --- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.814835907 -0700 > +++ b/arch/x86/kernel/cpu/amd.c 2014-04-21 11:10:34.824836358 -0700 > @@ -741,11 +741,6 @@ static unsigned int amd_size_cache(struc > } > #endif > > -static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c) > -{ > - tlb_flushall_shift = 6; > -} > - > static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c) > { > u32 ebx, eax, ecx, edx; > @@ -793,8 +788,6 @@ static void cpu_detect_tlb_amd(struct cp > tlb_lli_2m[ENTRIES] = eax & mask; > > tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1; > - > - cpu_set_tlb_flushall_shift(c); > } > > static const struct cpu_dev amd_cpu_dev = { > diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c > --- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.816835998 -0700 > +++ b/arch/x86/kernel/cpu/common.c 2014-04-21 11:10:34.825836403 -0700 > @@ -479,26 +479,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO]; > u16 __read_mostly tlb_lld_4m[NR_INFO]; > u16 __read_mostly tlb_lld_1g[NR_INFO]; > > -/* > - * tlb_flushall_shift shows the balance point in replacing cr3 write > - * with multiple 'invlpg'. It will do this replacement when > - * flush_tlb_lines <= active_lines/2^tlb_flushall_shift. > - * If tlb_flushall_shift is -1, means the replacement will be disabled. 
> - */ > -s8 __read_mostly tlb_flushall_shift = -1; > - > void cpu_detect_tlb(struct cpuinfo_x86 *c) > { > if (this_cpu->c_detect_tlb) > this_cpu->c_detect_tlb(c); > > printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" > - "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n" > - "tlb_flushall_shift: %d\n", > + "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n", > tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES], > tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES], > tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES], > - tlb_lld_1g[ENTRIES], tlb_flushall_shift); > + tlb_lld_1g[ENTRIES]); > } > > void detect_ht(struct cpuinfo_x86 *c) > diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c > --- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.818836088 -0700 > +++ b/arch/x86/kernel/cpu/intel.c 2014-04-21 11:10:34.825836403 -0700 > @@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsig > } > } > > -static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c) > -{ > - switch ((c->x86 << 8) + c->x86_model) { > - case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */ > - case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */ > - case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */ > - case 0x61d: /* six-core 45 nm xeon "Dunnington" */ > - tlb_flushall_shift = -1; > - break; > - case 0x63a: /* Ivybridge */ > - tlb_flushall_shift = 2; > - break; > - case 0x61a: /* 45 nm nehalem, "Bloomfield" */ > - case 0x61e: /* 45 nm nehalem, "Lynnfield" */ > - case 0x625: /* 32 nm nehalem, "Clarkdale" */ > - case 0x62c: /* 32 nm nehalem, "Gulftown" */ > - case 0x62e: /* 45 nm nehalem-ex, "Beckton" */ > - case 0x62f: /* 32 nm Xeon E7 */ > - case 0x62a: /* SandyBridge */ > - case 0x62d: /* SandyBridge, "Romely-EP" */ > - default: > - tlb_flushall_shift = 6; > - } > -} > - > static void intel_detect_tlb(struct 
cpuinfo_x86 *c)
>  {
>  	int i, j, n;
> @@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpui
>  		for (j = 1 ; j < 16 ; j++)
>  			intel_tlb_lookup(desc[j]);
>  	}
> -	intel_tlb_flushall_shift_set(c);
>  }
>
>  static const struct cpu_dev intel_cpu_dev = {
> diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c
> --- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing	2014-04-21 11:10:34.820836178 -0700
> +++ b/arch/x86/mm/tlb.c	2014-04-21 11:10:34.826836449 -0700
> @@ -158,13 +158,22 @@ void flush_tlb_current_task(void)
>  	preempt_enable();
>  }
>
> +/*
> + * See Documentation/x86/tlb.txt for details.  We choose 33
> + * because it is large enough to cover the vast majority (at
> + * least 95%) of allocations, and is small enough that we are
> + * confident it will not cause too much overhead.  Each single
> + * flush is about 100 cycles, so this caps the maximum overhead
> + * at _about_ 3,000 cycles.
> + */
> +/* in units of pages */
> +unsigned long tlb_single_page_flush_ceiling = 1;
> +

This comment is premature. The documentation file does not exist yet and
33 means nothing yet. Out of curiosity though, how confident are you
that a TLB flush is generally 100 cycles across different generations
and manufacturers of CPUs? I'm not suggesting you change it or auto-tune
it, am just curious.
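
For readers following the thread, the ceiling test and the overhead estimate in the quoted comment can be sketched in user space. This is an illustration only, not kernel code: the helper names use_full_flush() and worst_case_single_flush_cycles() and the SKETCH_PAGE_SHIFT constant are made up for this sketch, 4 KiB base pages are assumed, and the 100-cycle figure is the rough estimate from the comment under discussion rather than a measured constant.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/*
 * Mirrors the shape of the new test in flush_tlb_mm_range(): a range
 * larger than the ceiling (given in pages) gets a full flush, anything
 * else is flushed one page at a time.
 */
static bool use_full_flush(unsigned long start, unsigned long end,
			   unsigned long ceiling_pages)
{
	return (end - start) > (ceiling_pages << SKETCH_PAGE_SHIFT);
}

/*
 * Upper bound on the cycles spent on individual flushes for one range,
 * given an assumed per-flush cost.
 */
static unsigned long worst_case_single_flush_cycles(unsigned long ceiling_pages,
						     unsigned long cycles_per_flush)
{
	return ceiling_pages * cycles_per_flush;
}
```

With a ceiling of 33 and the assumed 100 cycles per flush, worst_case_single_flush_cycles(33, 100) gives 3300, which is the "about 3,000 cycles" in the comment. With the ceiling of 1 that this patch actually sets, any range of two or more pages takes the full-flush path.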
>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  				unsigned long end, unsigned long vmflag)
>  {
>  	int need_flush_others_all = 1;
>  	unsigned long addr;
> -	unsigned act_entries, tlb_entries = 0;
> -	unsigned long nr_base_pages;
>
>  	preempt_disable();
>  	if (current->active_mm != mm)
> @@ -175,25 +184,12 @@ void flush_tlb_mm_range(struct mm_struct
>  		goto out;
>  	}
>
> -	if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
> -					|| vmflag & VM_HUGETLB) {
> +	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
>  		local_flush_tlb();
>  		goto out;
>  	}
>
> -	/* In modern CPU, last level tlb used for both data/ins */
> -	if (vmflag & VM_EXEC)
> -		tlb_entries = tlb_lli_4k[ENTRIES];
> -	else
> -		tlb_entries = tlb_lld_4k[ENTRIES];
> -
> -	/* Assume all of TLB entries was occupied by this task */
> -	act_entries = tlb_entries >> tlb_flushall_shift;
> -	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
> -	nr_base_pages = (end - start) >> PAGE_SHIFT;
> -
> -	/* tlb_flushall_shift is on balance point, details in commit log */
> -	if (nr_base_pages > act_entries) {
> +	if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
>  		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>  		local_flush_tlb();
>  	} else {

We lose the different tuning based on whether the flush is for instructions
or data. However, I cannot think of a good reason for keeping it as I
expect that flushes of instructions are relatively rare. The benefit, if
any, will be marginal. Still, if you do another revision it would be
nice to call this out in the changelog.

Otherwise

Acked-by: Mel Gorman

--
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51]) by kanga.kvack.org (Postfix) with ESMTP id C7FAD6B0035 for ; Thu, 24 Apr 2014 04:49:28 -0400 (EDT) Received: by mail-ee0-f51.google.com with SMTP id c13so1552374eek.24 for ; Thu, 24 Apr 2014 01:49:28 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d5si7286872eei.358.2014.04.24.01.49.26 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 01:49:27 -0700 (PDT) Date: Thu, 24 Apr 2014 09:49:23 +0100 From: Mel Gorman Subject: Re: [PATCH 3/6] x86: mm: fix missed global TLB flush stat Message-ID: <20140424084922.GR23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182422.DE5E728F@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182422.DE5E728F@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Mon, Apr 21, 2014 at 11:24:22AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > If we take the > > if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { > local_flush_tlb(); > goto out; > } > > path out of flush_tlb_mm_range(), we will have flushed the tlb, > but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the > way out of the function so that we always take a single path when > doing a full tlb flush. > > Signed-off-by: Dave Hansen Acked-by: Mel Gorman -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id 8EFEE6B0035 for ; Thu, 24 Apr 2014 06:14:26 -0400 (EDT) Received: by mail-ee0-f42.google.com with SMTP id d17so1675496eek.29 for ; Thu, 24 Apr 2014 03:14:25 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id u49si7601602eef.352.2014.04.24.03.14.24 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 03:14:24 -0700 (PDT) Date: Thu, 24 Apr 2014 11:14:20 +0100 From: Mel Gorman Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes Message-ID: <20140424101419.GS23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182425.93E696A3@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Mon, Apr 21, 2014 at 11:24:25AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > We don't have any good way to figure out what kinds of flushes > are being attempted. Right now, we can try to use the vm > counters, but those only tell us what we actually did with the > hardware (one-by-one vs full) and don't tell us what was actually > _requested_. > And when enabled they are a penalty even for those that don't care. > This allows us to select out "interesting" TLB flushes that we > might want to optimize (like the ranged ones) and ignore the ones > that we have very little control over (the ones at context > switch). 
> > Also, since we have a pair of tracepoint calls in > flush_tlb_mm_range(), we can time the deltas between them to make > sure that we got the "invlpg vs. global flush" balance correct in > practice. > > Signed-off-by: Dave Hansen > --- > > b/arch/x86/include/asm/mmu_context.h | 6 +++++ > b/arch/x86/mm/tlb.c | 12 +++++++++-- > b/include/linux/mm_types.h | 10 +++++++++ > b/include/trace/events/tlb.h | 37 +++++++++++++++++++++++++++++++++++ > b/mm/Makefile | 2 - > b/mm/trace_tlb.c | 12 +++++++++++ > 6 files changed, 76 insertions(+), 3 deletions(-) > > diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h > --- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-04-21 11:10:35.519867746 -0700 > +++ b/arch/x86/include/asm/mmu_context.h 2014-04-21 11:10:35.527868108 -0700 > @@ -3,6 +3,10 @@ > > #include > #include > +#include > + > +#include > + > #include > #include > #include > @@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s > > /* Re-load page tables */ > load_cr3(next->pgd); > + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > > /* Stop flush ipis for the previous mm */ > cpumask_clear_cpu(cpu, mm_cpumask(prev)); > @@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s > * to make sure to use no freed page tables. 
>  		 */
>  		load_cr3(next->pgd);
> +		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>  		load_LDT_nolock(&next->context);
>  	}
>  }
> diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c
> --- a/arch/x86/mm/tlb.c~tlb-trace-flushes	2014-04-21 11:10:35.520867791 -0700
> +++ b/arch/x86/mm/tlb.c	2014-04-21 11:10:35.528868153 -0700
> @@ -14,6 +14,8 @@
>  #include
>  #include
>
> +#include
> +
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
>  			= { &init_mm, 0, };
>
> @@ -49,6 +51,7 @@ void leave_mm(int cpu)
>  	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
>  		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
>  		load_cr3(swapper_pg_dir);
> +		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(leave_mm);
> @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info)
>
>  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
>  	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
> -		if (f->flush_end == TLB_FLUSH_ALL)
> +		if (f->flush_end == TLB_FLUSH_ALL) {
>  			local_flush_tlb();
> -		else if (!f->flush_end)
> +			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
> +		} else if (!f->flush_end)
>  			__flush_tlb_single(f->flush_start);
>  		else {
>  			unsigned long addr;

Why is only the TLB_FLUSH_ALL case traced here and not the single flush
or range of flushes? __native_flush_tlb_single() doesn't have a trace
point so I worry we are missing visibility on this part in particular:

	while (addr < f->flush_end) {
		__flush_tlb_single(addr);
		addr += PAGE_SIZE;
	}

> @@ -152,7 +156,9 @@ void flush_tlb_current_task(void)
>  	preempt_disable();
>
>  	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>  	local_flush_tlb();
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL);
>  	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
>  		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
>  	preempt_enable();

Are the two tracepoints really useful?
Are they fine enough to measure the cost of the TLB flush? It misses the
refill obviously but not much we can do there.

> @@ -188,6 +194,7 @@ void flush_tlb_mm_range(struct mm_struct
>  	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
>  		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
>
> +	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
>  	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
>  		base_pages_to_flush = TLB_FLUSH_ALL;
>  		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> @@ -199,6 +206,7 @@ void flush_tlb_mm_range(struct mm_struct
>  			__flush_tlb_single(addr);
>  		}
>  	}
> +	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush);
>  out:
>  	if (base_pages_to_flush == TLB_FLUSH_ALL) {
>  		start = 0UL;
> diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h
> --- a/include/linux/mm_types.h~tlb-trace-flushes	2014-04-21 11:10:35.522867881 -0700
> +++ b/include/linux/mm_types.h	2014-04-21 11:10:35.529868198 -0700
> @@ -510,4 +510,14 @@ static inline void clear_tlb_flush_pendi
>  }
>  #endif
>
> +enum tlb_flush_reason {
> +	TLB_FLUSH_ON_TASK_SWITCH,
> +	TLB_REMOTE_SHOOTDOWN,
> +	TLB_LOCAL_SHOOTDOWN,
> +	TLB_LOCAL_SHOOTDOWN_DONE,
> +	TLB_LOCAL_MM_SHOOTDOWN,
> +	TLB_LOCAL_MM_SHOOTDOWN_DONE,
> +	NR_TLB_FLUSH_REASONS,
> +};
> +

Bonus points if you use the string formatting similar to the reason field
in events/writeback.h. You do something like that already but there are
already helpers for use with __print_symbolic so you do not need to roll
your own version. It should reduce the need to add trace_tlb.c if you
include the header in something like memory.c instead.
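
As a rough illustration of the suggestion above, the value-to-name lookup that __print_symbolic performs over a list of { value, "name" } pairs can be mimicked in user space. This is a sketch under stated assumptions, not the kernel helper: the struct sym_entry type, the tlb_reason_name() function and the short label strings are hypothetical, and unlike the real helper (which prints the raw value when nothing matches) this version simply returns "unknown".

```c
#include <stddef.h>

/* Same enumerators as the quoted patch. */
enum tlb_flush_reason {
	TLB_FLUSH_ON_TASK_SWITCH,
	TLB_REMOTE_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN_DONE,
	TLB_LOCAL_MM_SHOOTDOWN,
	TLB_LOCAL_MM_SHOOTDOWN_DONE,
	NR_TLB_FLUSH_REASONS,
};

/* Hypothetical pair type standing in for the { value, "name" } list. */
struct sym_entry {
	int value;
	const char *name;
};

/* Label strings are made up for this sketch. */
static const struct sym_entry tlb_reason_syms[] = {
	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },
	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },
	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },
	{ TLB_LOCAL_SHOOTDOWN_DONE,	"local shootdown done" },
	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },
	{ TLB_LOCAL_MM_SHOOTDOWN_DONE,	"local mm shootdown done" },
};

/* Linear search for the first matching value, as a symbolic printer would do. */
static const char *tlb_reason_name(int reason)
{
	size_t i;

	for (i = 0; i < sizeof(tlb_reason_syms) / sizeof(tlb_reason_syms[0]); i++)
		if (tlb_reason_syms[i].value == reason)
			return tlb_reason_syms[i].name;
	return "unknown";
}
```

Keeping the pairs next to the event definition is what makes a separate exported string array, and hence a file like trace_tlb.c, unnecessary.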
> #endif /* _LINUX_MM_TYPES_H */ > diff -puN /dev/null include/trace/events/tlb.h > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 > @@ -0,0 +1,37 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM tlb > + > +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_TLB_H > + > +#include > +#include > + > +extern const char * const tlb_flush_reason_desc[]; > + > +TRACE_EVENT(tlb_flush, > + > + TP_PROTO(int reason, unsigned long pages), > + TP_ARGS(reason, pages), > + > + TP_STRUCT__entry( > + __field( int, reason) > + __field(unsigned long, pages) > + ), > + > + TP_fast_assign( > + __entry->reason = reason; > + __entry->pages = pages; > + ), > + > + TP_printk("pages: %ld reason: %d (%s)", > + __entry->pages, > + __entry->reason, > + tlb_flush_reason_desc[__entry->reason]) > +); > + I would also suggest you match the output formatting with writeback.h which would look like pages:%lu reason:%s The raw format should still have the integer while the string formatting would have something human readable. 
Instead > +#endif /* _TRACE_TLB_H */ > + > +/* This part must be outside protection */ > +#include > + > diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile > --- a/mm/Makefile~tlb-trace-flushes 2014-04-21 11:10:35.524867971 -0700 > +++ b/mm/Makefile 2014-04-21 11:10:35.530868243 -0700 > @@ -5,7 +5,7 @@ > mmu-y := nommu.o > mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ > mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ > - vmalloc.o pagewalk.o pgtable-generic.o > + vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o > > ifdef CONFIG_CROSS_MEMORY_ATTACH > mmu-$(CONFIG_MMU) += process_vm_access.o > diff -puN /dev/null mm/trace_tlb.c > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/mm/trace_tlb.c 2014-04-21 11:10:35.530868243 -0700 > @@ -0,0 +1,12 @@ > +#define CREATE_TRACE_POINTS > +#include > + > +const char * const tlb_flush_reason_desc[] = { > + __stringify(TLB_FLUSH_ON_TASK_SWITCH), > + __stringify(TLB_REMOTE_SHOOTDOWN), > + __stringify(TLB_LOCAL_SHOOTDOWN), > + __stringify(TLB_LOCAL_SHOOTDOWN_DONE), > + __stringify(TLB_LOCAL_MM_SHOOTDOWN), > + __stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE), > +}; > + > _ -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id 3F3A76B0037 for ; Thu, 24 Apr 2014 06:37:33 -0400 (EDT) Received: by mail-ee0-f47.google.com with SMTP id b15so1681975eek.6 for ; Thu, 24 Apr 2014 03:37:32 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id r9si7692764eew.348.2014.04.24.03.37.31 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 03:37:31 -0700 (PDT) Date: Thu, 24 Apr 2014 11:37:27 +0100 From: Mel Gorman Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush Message-ID: <20140424103727.GT23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182426.D6DD1E8F@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > Most of the logic here is in the documentation file. Please take > a look at it. > > I know we've come full-circle here back to a tunable, but this > new one is *WAY* simpler. I challenge anyone to describe in one > sentence how the old one worked. Challenge accepted. Based on the characteristics of the CPU and a given process, something semi-random will happen at flush time which may or may not benefit the workload. > Here's the way the new one > works: > > If we are flushing more pages than the ceiling, we use > the full flush, otherwise we use per-page flushes. 
> > Signed-off-by: Dave Hansen > --- > > b/Documentation/x86/tlb.txt | 72 ++++++++++++++++++++++++++++++++++++++++++++ > b/arch/x86/mm/tlb.c | 46 ++++++++++++++++++++++++++++ > 2 files changed, 118 insertions(+) > > diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c > --- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush 2014-04-21 11:10:35.901884997 -0700 > +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.905885179 -0700 > @@ -274,3 +274,49 @@ void flush_tlb_kernel_range(unsigned lon > on_each_cpu(do_kernel_range_flush, &info, 1); > } > } > + > +static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, > + size_t count, loff_t *ppos) > +{ > + char buf[32]; > + unsigned int len; > + > + len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling); > + return simple_read_from_buffer(user_buf, count, ppos, buf, len); > +} > + > +static ssize_t tlbflush_write_file(struct file *file, > + const char __user *user_buf, size_t count, loff_t *ppos) > +{ > + char buf[32]; > + ssize_t len; > + int ceiling; > + > + len = min(count, sizeof(buf) - 1); > + if (copy_from_user(buf, user_buf, len)) > + return -EFAULT; > + > + buf[len] = '\0'; > + if (kstrtoint(buf, 0, &ceiling)) > + return -EINVAL; > + > + if (ceiling < 0) > + return -EINVAL; > + > + tlb_single_page_flush_ceiling = ceiling; > + return count; > +} > + > +static const struct file_operations fops_tlbflush = { > + .read = tlbflush_read_file, > + .write = tlbflush_write_file, > + .llseek = default_llseek, > +}; > + > +static int __init create_tlb_single_page_flush_ceiling(void) > +{ > + debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, > + arch_debugfs_dir, NULL, &fops_tlbflush); > + return 0; > +} > +late_initcall(create_tlb_single_page_flush_ceiling); > diff -puN /dev/null Documentation/x86/tlb.txt > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/Documentation/x86/tlb.txt 2014-04-21 11:10:35.924886036 -0700 > @@ -0,0 +1,72 @@ > 
+nWhen the kernel unmaps or modified the attributes of a range of > +memory, it has two choices: s/nWhen/When > + 1. Flush the entire TLB with a two-instruction sequence. This is > + a quick operation, but it causes collateral damage: TLB entries > + from areas other than the one we are trying to flush will be > + destroyed and must be refilled later, at some cost. > + 2. Use the invlpg instruction to invalidate a single page at a > + time. This could potentialy cost many more instructions, but > + it is a much more precise operation, causing no collateral > + damage to other TLB entries. > + It's not stated that there is no range flush instruction for x86 but anyone who cares about this area should know that. > +Which method to do depends on a few things: > + 1. The size of the flush being performed. A flush of the entire > + address space is obviously better performed by flushing the > + entire TLB than doing 2^48/PAGE_SIZE individual flushes. > + 2. The contents of the TLB. If the TLB is empty, then there will > + be no collateral damage caused by doing the global flush, and > + all of the individual flush will have ended up being wasted > + work. > + 3. The size of the TLB. The larger the TLB, the more collateral > + damage we do with a full flush. So, the larger the TLB, the > + more attrative an individual flush looks. Data and > + instructions have separate TLBs, as do different page sizes. > + 4. The microarchitecture. The TLB has become a multi-level > + cache on modern CPUs, and the global flushes have become more > + expensive relative to single-page flushes. > + > +There is obviously no way the kernel can know all these things, > +especially the contents of the TLB during a given flush. The > +sizes of the flush will vary greatly depending on the workload as > +well. There is essentially no "right" point to choose. 
> + > +You may be doing too many individual invalidations if you see the > +invlpg instruction (or instructions _near_ it) show up high in > +profiles. If you believe that individual invalidatoins being > +called too often, you can lower the tunable: > + s/invalidatoins/invalidations/ > + /sys/debug/kernel/x86/tlb_single_page_flush_ceiling > + You do not describe how to use the tracepoints but again anyone investigating this area should know how to do it already so *shrugs*. Rolling a systemtap script to display the information would be a short job. > +This will cause us to do the global flush for more cases. > +Lowering it to 0 will disable the use of the individual flushes. > +Setting it to 1 is a very conservative setting and it should > +never need to be 0 under normal circumstances. > + > +Despite the fact that a single individual flush on x86 is > +guaranteed to flush a full 2MB, hugetlbfs always uses the full > +flushes. THP is treated exactly the same as normal memory. > + You are the second person that told me this and I felt the manual was unclear on this subject. I was told that it might be a documentation bug but because this discussion was in a bar I completely failed to follow up on it. Specifically this part in 4.10.2.3 caused me problems when I last looked at the area. If the paging structures specify a translation using a page larger than 4 KBytes, some processors may choose to cache multiple smaller-page TLB entries for that translation. Each such TLB entry would be associated with a page number corresponding to the smaller page size (e.g., bits 47:12 of a linear address with IA-32e paging), even though part of that page number (e.g., bits 20:12) are part of the offset with respect to the page specified by the paging structures. 
The upper bits of the physical address in such a TLB entry are derived from the physical address in the PDE used to create the translation, while the lower bits come from the linear address of the access for which the translation is created. There is no way for software to be aware that multiple translations for smaller pages have been used for a large page. If software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses changes, the TLBs may subsequently contain multiple translations for the address range (one for each page size). A reference to a linear address in the address range may use any of these translations. Which translation is used may vary from one execution to another, and the choice may be implementation-specific. This was ambiguous to me because of "some processors may choose to cache multiple smaller-page TLB entries for that translation". The second paragraph appears to partially contradict that but I could not see an architectural guarantee that flushing a page address within a huge page entry was guaranteed to flush all entries. I understand that there are definite problems around the time of splitting/collapsing a large page where care has to be taken that old TLB entries are not present but that's a different case. > +You might see invlpg inside of flush_tlb_mm_range() show up in > +profiles, or you can use the trace_tlb_flush() tracepoints. to > +determine how long the flush operations are taking. > + > +Essentially, you are balancing the cycles you spend doing invlpg > +with the cycles that you spend refilling the TLB later. 
> + > +You can measure how expensive TLB refills are by using > +performance counters and 'perf stat', like this: > + > +perf stat -e > + cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, > + cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, > + cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, > + cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, > + cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, > + cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ > + > +That works on an IvyBridge-era CPU (i5-3320M). Different CPUs > +may have differently-named counters, but they should at least > +be there in some form. You can use pmu-tools 'ocperf list' > +(https://github.com/andikleen/pmu-tools) to find the right > +counters for a given CPU. > + -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49]) by kanga.kvack.org (Postfix) with ESMTP id EF0616B0035 for ; Thu, 24 Apr 2014 06:46:58 -0400 (EDT) Received: by mail-ee0-f49.google.com with SMTP id c41so1689705eek.8 for ; Thu, 24 Apr 2014 03:46:58 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id m49si7753041eeo.221.2014.04.24.03.46.56 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 03:46:57 -0700 (PDT) Date: Thu, 24 Apr 2014 11:46:53 +0100 From: Mel Gorman Subject: Re: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33) Message-ID: <20140424104147.GU23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182428.FC2104C1@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182428.FC2104C1@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Mon, Apr 21, 2014 at 11:24:28AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > This has been run through Intel's LKP tests across a wide range > of modern sytems and workloads and it wasn't shown to make a > measurable performance difference positive or negative. > > Now that we have some shiny new tracepoints, we can actually > figure out what the heck is going on. > Good stuff. This is the type of thing I should have done the last time to set the parameters for the tlbflush microbench. Nice one out of you! > During a kernel compile, 60% of the flush_tlb_mm_range() calls > are for a single page. 
It breaks down like this: > > size percent percent<= > V V V > GLOBAL: 2.20% 2.20% avg cycles: 2283 > 1: 56.92% 59.12% avg cycles: 1276 > 2: 13.78% 72.90% avg cycles: 1505 > 3: 8.26% 81.16% avg cycles: 1880 > 4: 7.41% 88.58% avg cycles: 2447 > 5: 1.73% 90.31% avg cycles: 2358 > 6: 1.32% 91.63% avg cycles: 2563 > 7: 1.14% 92.77% avg cycles: 2862 > 8: 0.62% 93.39% avg cycles: 3542 > 9: 0.08% 93.47% avg cycles: 3289 > 10: 0.43% 93.90% avg cycles: 3570 > 11: 0.20% 94.10% avg cycles: 3767 > 12: 0.08% 94.18% avg cycles: 3996 > 13: 0.03% 94.20% avg cycles: 4077 > 14: 0.02% 94.23% avg cycles: 4836 > 15: 0.04% 94.26% avg cycles: 5699 > 16: 0.06% 94.32% avg cycles: 5041 > 17: 0.57% 94.89% avg cycles: 5473 > 18: 0.02% 94.91% avg cycles: 5396 > 19: 0.03% 94.95% avg cycles: 5296 > 20: 0.02% 94.96% avg cycles: 6749 > 21: 0.18% 95.14% avg cycles: 6225 > 22: 0.01% 95.15% avg cycles: 6393 > 23: 0.01% 95.16% avg cycles: 6861 > 24: 0.12% 95.28% avg cycles: 6912 > 25: 0.05% 95.32% avg cycles: 7190 > 26: 0.01% 95.33% avg cycles: 7793 > 27: 0.01% 95.34% avg cycles: 7833 > 28: 0.01% 95.35% avg cycles: 8253 > 29: 0.08% 95.42% avg cycles: 8024 > 30: 0.03% 95.45% avg cycles: 9670 > 31: 0.01% 95.46% avg cycles: 8949 > 32: 0.01% 95.46% avg cycles: 9350 > 33: 3.11% 98.57% avg cycles: 8534 > 34: 0.02% 98.60% avg cycles: 10977 > 35: 0.02% 98.62% avg cycles: 11400 > > We get in to dimishing returns pretty quickly. On pre-IvyBridge > CPUs, we used to set the limit at 8 pages, and it was set at 128 > on IvyBrige. That 128 number looks pretty silly considering that > less than 0.5% of the flushes are that large. > > The previous code tried to size this number based on the size of > the TLB. Good idea, but it's error-prone, needs maintenance > (which it didn't get up to now), and probably would not matter in > practice much. > > Settting it to 33 means that we cover the mallopt > M_TRIM_THRESHOLD, which is the most universally common size to do > flushes. 
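
A small worked calculation may help relate the 33-page row in the table to M_TRIM_THRESHOLD. This is a sketch under assumptions that the patch description does not spell out: it takes the glibc default trim threshold of 128 KiB and 4 KiB base pages, and the names SKETCH_TRIM_THRESHOLD, SKETCH_PAGE_SIZE and pages_spanned() are invented for the illustration.

```c
/* Assumptions: glibc default M_TRIM_THRESHOLD (128 KiB), 4 KiB pages. */
#define SKETCH_TRIM_THRESHOLD	(128UL * 1024)
#define SKETCH_PAGE_SIZE	4096UL

/* Number of base pages needed to cover a range of the given size. */
static unsigned long pages_spanned(unsigned long bytes)
{
	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Exactly 128 KiB covers 32 pages, while anything just above the threshold already needs 33, so a ceiling of 33 keeps such ranges on the individual-flush path.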
> A kernel compile is hardly a representative workload but I accept the logic of tuning it based on current settings for M_TRIM_THRESHOLD and the tools are there to do a more detailed analysis if tlb flush times for people are identified as being a problem. Acked-by: Mel Gorman -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f173.google.com (mail-pd0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id 7CC206B0035 for ; Thu, 24 Apr 2014 12:58:15 -0400 (EDT) Received: by mail-pd0-f173.google.com with SMTP id p10so1779104pdj.4 for ; Thu, 24 Apr 2014 09:58:15 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id zm10si3032525pbc.404.2014.04.24.09.58.12 for ; Thu, 24 Apr 2014 09:58:13 -0700 (PDT) Message-ID: <535942A3.3020800@sr71.net> Date: Thu, 24 Apr 2014 09:58:11 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> In-Reply-To: <20140424084552.GQ23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/24/2014 01:45 AM, Mel Gorman wrote: >> +/* >> + * See Documentation/x86/tlb.txt for details. 
We choose 33
>> + * because it is large enough to cover the vast majority (at
>> + * least 95%) of allocations, and is small enough that we are
>> + * confident it will not cause too much overhead.  Each single
>> + * flush is about 100 cycles, so this caps the maximum overhead
>> + * at _about_ 3,000 cycles.
>> + */
>> +/* in units of pages */
>> +unsigned long tlb_single_page_flush_ceiling = 1;
>> +
>
> This comment is premature. The documentation file does not exist yet and
> 33 means nothing yet. Out of curiosity though, how confident are you
> that a TLB flush is generally 100 cycles across different generations
> and manufacturers of CPUs? I'm not suggesting you change it or auto-tune
> it, am just curious.

Yeah, the comment belongs in the later patch where I set it to 33.

I looked at this on the last few generations of Intel CPUs.  "100
cycles" was a very general statement, and not precise at all.  My laptop
averages out to 113 cycles overall, but the flushes of 25 pages averaged
96 cycles/page while the flushes of 2 averaged 219/page.

Those cycles include some cost from the instrumentation as well.

I did not test on other CPU manufacturers, but this should be pretty
easy to reproduce.  I'm happy to help folks re-run it on other hardware.

I also believe with the modalias stuff we've got in sysfs for the CPU
objects we can do this in the future with udev rules instead of
hard-coding it in the kernel.

>> -       /* In modern CPU, last level tlb used for both data/ins */
>> -       if (vmflag & VM_EXEC)
>> -               tlb_entries = tlb_lli_4k[ENTRIES];
>> -       else
>> -               tlb_entries = tlb_lld_4k[ENTRIES];
>> -
>> -       /* Assume all of TLB entries was occupied by this task */
>> -       act_entries = tlb_entries >> tlb_flushall_shift;
>> -       act_entries = mm->total_vm > act_entries ?
act_entries : mm->total_vm; >> - nr_base_pages = (end - start) >> PAGE_SHIFT; >> - >> - /* tlb_flushall_shift is on balance point, details in commit log */ >> - if (nr_base_pages > act_entries) { >> + if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { >> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> local_flush_tlb(); >> } else { > > We lose the different tuning based on whether the flush is for instructions > or data. However, I cannot think of a good reason for keeping it as I > expect that flushes of instructions is relatively rare. The benefit, if > any, will be marginal. Still, if you do another revision it would be > nice to call this out in the changelog. Will do. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id AA60B6B0036 for ; Thu, 24 Apr 2014 13:25:58 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa1so2155707pad.28 for ; Thu, 24 Apr 2014 10:25:58 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id hi3si3090723pac.82.2014.04.24.10.25.54 for ; Thu, 24 Apr 2014 10:25:54 -0700 (PDT) Message-ID: <53594920.8030203@sr71.net> Date: Thu, 24 Apr 2014 10:25:52 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> In-Reply-To: <20140424103727.GT23991@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin"

On 04/24/2014 03:37 AM, Mel Gorman wrote:
> On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote:
>> +This will cause us to do the global flush for more cases.
>> +Lowering it to 0 will disable the use of the individual flushes.
>> +Setting it to 1 is a very conservative setting and it should
>> +never need to be 0 under normal circumstances.
>> +
>> +Despite the fact that a single individual flush on x86 is
>> +guaranteed to flush a full 2MB, hugetlbfs always uses the full
>> +flushes.  THP is treated exactly the same as normal memory.
>> +
>
> You are the second person that told me this and I felt the manual was
> unclear on this subject. I was told that it might be a documentation bug
> but because this discussion was in a bar I completely failed to follow up
> on it. Specifically this part in 4.10.2.3 caused me problems when I last
> looked at the area.

My understanding comes from "4.10.4.2 Recommended Invalidation":

    • If software modifies a paging-structure entry that identifies
      the final page frame for a page number (either a PTE or a
      paging-structure entry in which the PS flag is 1), it should
      execute INVLPG for any linear address with a page number whose
      translation uses that PTE. 2

and especially the footnote:

    2. One execution of INVLPG is sufficient even for a page with
       size greater than 4 KBytes.

I do agree that it's ambiguous at best.  I'll go see if anybody cares to
update that bit.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f169.google.com (mail-we0-f169.google.com [74.125.82.169]) by kanga.kvack.org (Postfix) with ESMTP id BF7166B0035 for ; Thu, 24 Apr 2014 13:56:26 -0400 (EDT) Received: by mail-we0-f169.google.com with SMTP id u56so1401704wes.28 for ; Thu, 24 Apr 2014 10:56:26 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id r9si224056wia.53.2014.04.24.10.56.24 for ; Thu, 24 Apr 2014 10:56:25 -0700 (PDT) Message-ID: <53594FB3.9050505@redhat.com> Date: Thu, 24 Apr 2014 13:53:55 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53594920.8030203@sr71.net> In-Reply-To: <53594920.8030203@sr71.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" On 04/24/2014 01:25 PM, Dave Hansen wrote: > On 04/24/2014 03:37 AM, Mel Gorman wrote: >> On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote: >>> +This will cause us to do the global flush for more cases. >>> +Lowering it to 0 will disable the use of the individual flushes. >>> +Setting it to 1 is a very conservative setting and it should >>> +never need to be 0 under normal circumstances. >>> + >>> +Despite the fact that a single individual flush on x86 is >>> +guaranteed to flush a full 2MB, hugetlbfs always uses the full >>> +flushes. THP is treated exactly the same as normal memory. 
>>> +
>>
>> You are the second person that told me this and I felt the manual was
>> unclear on this subject. I was told that it might be a documentation bug
>> but because this discussion was in a bar I completely failed to follow up
>> on it. Specifically this part in 4.10.2.3 caused me problems when I last
>> looked at the area.
>
>
> My understanding comes from "4.10.4.2 Recommended Invalidation":
>
>     • If software modifies a paging-structure entry that identifies
>       the final page frame for a page number (either a PTE or a
>       paging-structure entry in which the PS flag is 1), it should
>       execute INVLPG for any linear address with a page number whose
>       translation uses that PTE. 2
>
> and especially the footnote:
>
>     2. One execution of INVLPG is sufficient even for a page with
>        size greater than 4 KBytes.
>
> I do agree that it's ambiguous at best.  I'll go see if anybody cares to
> update that bit.

I suspect that IF the TLB actually uses a 2MB entry for the
translation, a single INVLPG will work.

However, the CPU is free to cache the translations for a 2MB
region with a bunch of 4kB entries, if it wanted to, so in
the end we have no guarantee that an INVLPG will actually do
the right thing...

The same is definitely true for 1GB vs 2MB entries, with
some CPUs being capable of parsing page tables with 1GB
entries, but having no TLB entries for 1GB translations.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id 7200C6B0035 for ; Thu, 24 Apr 2014 14:00:37 -0400 (EDT)
Received: by mail-ee0-f46.google.com with SMTP id t10so2119122eei.19 for ; Thu, 24 Apr 2014 11:00:36 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id y6si9422183eep.137.2014.04.24.11.00.35 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Apr 2014 11:00:35 -0700 (PDT) Date: Thu, 24 Apr 2014 19:00:30 +0100 From: Mel Gorman Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Message-ID: <20140424180030.GX23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> <535942A3.3020800@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <535942A3.3020800@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On Thu, Apr 24, 2014 at 09:58:11AM -0700, Dave Hansen wrote: > On 04/24/2014 01:45 AM, Mel Gorman wrote: > >> +/* > >> + * See Documentation/x86/tlb.txt for details. We choose 33 > >> + * because it is large enough to cover the vast majority (at > >> + * least 95%) of allocations, and is small enough that we are > >> + * confident it will not cause too much overhead. Each single > >> + * flush is about 100 cycles, so this caps the maximum overhead > >> + * at _about_ 3,000 cycles. > >> + */ > >> +/* in units of pages */ > >> +unsigned long tlb_single_page_flush_ceiling = 1; > >> + > > > > This comment is premature. The documentation file does not exist yet and > > 33 means nothing yet. Out of curiousity though, how confident are you > > that a TLB flush is generally 100 cycles across different generations > > and manufacturers of CPUs? I'm not suggesting you change it or auto-tune > > it, am just curious. > > Yeah, the comment belongs in the later patch where I set it to 33. > > I looked at this on the last few generations of Intel CPUs. 
"100 > cycles" was a very general statement, and not precise at all. My laptop > averages out to 113 cycles overall, but the flushes of 25 pages averaged > 96 cycles/page while the flushes of 2 averaged 219/page. > > Those cycles include some costs of from the instrumentation as well. > > I did not test on other CPU manufacturers, but this should be pretty > easy to reproduce. I'm happy to help folks re-run it on other hardware. > > I also believe with the modalias stuff we've got in sysfs for the CPU > objects we can do this in the future with udev rules instead of > hard-coding it in the kernel. > You convinced me. Regardless of whether you move the comment or update the changelog; Acked-by: Mel Gorman -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id B0DE26B0035 for ; Thu, 24 Apr 2014 16:42:23 -0400 (EDT) Received: by mail-pa0-f51.google.com with SMTP id fb1so1691392pad.10 for ; Thu, 24 Apr 2014 13:42:23 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. 
[198.145.64.142]) by mx.google.com with ESMTP id iw1si3375058pbb.24.2014.04.24.13.42.20 for ; Thu, 24 Apr 2014 13:42:20 -0700 (PDT)
Message-ID: <5359772A.8070108@sr71.net>
Date: Thu, 24 Apr 2014 13:42:18 -0700
From: Dave Hansen
MIME-Version: 1.0
Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes
References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> <20140424101419.GS23991@suse.de>
In-Reply-To: <20140424101419.GS23991@suse.de>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com

On 04/24/2014 03:14 AM, Mel Gorman wrote:
> On Mon, Apr 21, 2014 at 11:24:25AM -0700, Dave Hansen wrote:
>> @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info)
>>
>>         count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
>>         if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
>> -               if (f->flush_end == TLB_FLUSH_ALL)
>> +               if (f->flush_end == TLB_FLUSH_ALL) {
>>                         local_flush_tlb();
>> -               else if (!f->flush_end)
>> +                       trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
>> +               } else if (!f->flush_end)
>>                         __flush_tlb_single(f->flush_start);
>>                 else {
>>                         unsigned long addr;
>
> Why is only the TLB_FLUSH_ALL case traced here and not the single flush
> or range of flushes? __native_flush_tlb_single() doesn't have a trace
> point so I worry we are missing visibility on this part in particular.
>
>       while (addr < f->flush_end) {
>               __flush_tlb_single(addr);
>               addr += PAGE_SIZE;
>       }

You're right, I missed that bit.  I've corrected it in a later version of
the patch.
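
For illustration only, and not the later version of the patch: a minimal user-space sketch of what tracing all three branches of the quoted hunk could look like. The struct names, the counters and model_trace() are hypothetical stand-ins for the kernel's shootdown request, flush primitives and trace_tlb_flush(); PAGE_SIZE and the TLB_FLUSH_ALL sentinel are assumed values. The range branch emits one event carrying the page count rather than one event per page.

```c
#include <assert.h>

#define PAGE_SIZE      4096UL   /* assumption: 4 KiB base pages */
#define TLB_FLUSH_ALL  (~0UL)   /* assumed sentinel for "flush everything" */

/* Hypothetical stand-in for the shootdown request in the quoted hunk. */
struct flush_info {
	unsigned long flush_start;
	unsigned long flush_end;
};

/* Counters standing in for the flush primitives and the tracepoint. */
struct flush_stats {
	unsigned long full_flushes;    /* models local_flush_tlb() */
	unsigned long single_flushes;  /* models __flush_tlb_single() */
	unsigned long traced_events;   /* models trace_tlb_flush() calls */
	unsigned long traced_pages;    /* page count of the last event */
};

static void model_trace(struct flush_stats *s, unsigned long pages)
{
	s->traced_events++;
	s->traced_pages = pages;
}

static void model_flush_tlb_func(const struct flush_info *f,
				 struct flush_stats *s)
{
	if (f->flush_end == TLB_FLUSH_ALL) {
		s->full_flushes++;
		model_trace(s, TLB_FLUSH_ALL);
	} else if (!f->flush_end) {
		/* flush_end == 0 means a single page at flush_start */
		s->single_flushes++;
		model_trace(s, 1);
	} else {
		unsigned long addr = f->flush_start;
		unsigned long nr_pages = 0;

		assert(f->flush_end >= f->flush_start);
		while (addr < f->flush_end) {
			s->single_flushes++;
			addr += PAGE_SIZE;
			nr_pages++;
		}
		model_trace(s, nr_pages);  /* one event per range */
	}
}
```

Emitting a single event per range is a design choice of this sketch: further down in this message the tracepoint is said to cost about 3x a 1-page flush, so a per-page event inside the loop would dominate what is being measured.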
>> @@ -152,7 +156,9 @@ void flush_tlb_current_task(void) >> preempt_disable(); >> >> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); >> local_flush_tlb(); >> + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL); >> if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) >> flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); >> preempt_enable(); > > Are the two tracepoints really useful? Are they fine enough to measure > the cost of the TLB flush? It misses the refill obviously but not much > we can do there. It's fine enough, but I did realize over time that the cost of the tracepoint is about 3x the cost of a 1-page tlb flush itself, so these are unusable for detailed measurements. I'll remove it for now. >> #endif /* _LINUX_MM_TYPES_H */ >> diff -puN /dev/null include/trace/events/tlb.h >> --- /dev/null 2014-04-10 11:28:14.066815724 -0700 >> +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 >> @@ -0,0 +1,37 @@ >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM tlb >> + >> +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_TLB_H >> + >> +#include >> +#include >> + >> +extern const char * const tlb_flush_reason_desc[]; >> + >> +TRACE_EVENT(tlb_flush, >> + >> + TP_PROTO(int reason, unsigned long pages), >> + TP_ARGS(reason, pages), >> + >> + TP_STRUCT__entry( >> + __field( int, reason) >> + __field(unsigned long, pages) >> + ), >> + >> + TP_fast_assign( >> + __entry->reason = reason; >> + __entry->pages = pages; >> + ), >> + >> + TP_printk("pages: %ld reason: %d (%s)", >> + __entry->pages, >> + __entry->reason, >> + tlb_flush_reason_desc[__entry->reason]) >> +); >> + > > I would also suggest you match the output formatting with writeback.h > which would look like > > pages:%lu reason:%s > > The raw format should still have the integer while the string formatting > would have something human readable. I can do that. 
The only bummer with the human-readable strings is turning them back in to something that the filters can take. I think I'll just do: + TP_printk("pages:%ld reason:%s (%d)", + __entry->pages, + __print_symbolic(__entry->reason, TLB_FLUSH_REASON), + __entry->reason) +); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f179.google.com (mail-pd0-f179.google.com [209.85.192.179]) by kanga.kvack.org (Postfix) with ESMTP id DAF576B0035 for ; Thu, 24 Apr 2014 18:03:59 -0400 (EDT) Received: by mail-pd0-f179.google.com with SMTP id g10so2386522pdj.38 for ; Thu, 24 Apr 2014 15:03:59 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id rj9si933455pbc.246.2014.04.24.15.03.55 for ; Thu, 24 Apr 2014 15:03:55 -0700 (PDT) Message-ID: <53598A48.2090909@sr71.net> Date: Thu, 24 Apr 2014 15:03:52 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53594920.8030203@sr71.net> <53594FB3.9050505@redhat.com> In-Reply-To: <53594FB3.9050505@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel , Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" On 04/24/2014 10:53 AM, Rik van Riel wrote: >> I do agree that it's ambiguous at best. I'll go see if anybody cares to >> update that bit. 
> > I suspect that IF the TLB actually uses a 2MB entry for the > translation, a single INVLPG will work. > > However, the CPU is free to cache the translations for a 2MB > region with a bunch of 4kB entries, if it wanted to, so in > the end we have no guarantee that an INVLPG will actually do > the right thing... > > The same is definitely true for 1GB vs 2MB entries, with > some CPUs being capable of parsing page tables with 1GB > entries, but having no TLB entries for 1GB translations. I believe we _do_ have such a guarantee. There's another bit in the SDM that someone pointed out to me in a footnote in "4.10.4.1": 1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page (see Section 4.10.2.3), the instruction invalidates all of them. While that's not in the easiest-to-find place in the documents, it looks pretty clear. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id 26C9F6B0035 for ; Fri, 25 Apr 2014 17:40:05 -0400 (EDT) Received: by mail-pd0-f169.google.com with SMTP id y13so2642340pdi.14 for ; Fri, 25 Apr 2014 14:40:04 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id hp1si5617809pad.303.2014.04.25.14.39.59 for ; Fri, 25 Apr 2014 14:39:59 -0700 (PDT) Message-ID: <535AD62D.20509@sr71.net> Date: Fri, 25 Apr 2014 14:39:57 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> In-Reply-To: <20140424084552.GQ23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com On 04/24/2014 01:45 AM, Mel Gorman wrote: >> > +/* >> > + * See Documentation/x86/tlb.txt for details. We choose 33 >> > + * because it is large enough to cover the vast majority (at >> > + * least 95%) of allocations, and is small enough that we are >> > + * confident it will not cause too much overhead. Each single >> > + * flush is about 100 cycles, so this caps the maximum overhead >> > + * at _about_ 3,000 cycles. >> > + */ >> > +/* in units of pages */ >> > +unsigned long tlb_single_page_flush_ceiling = 1; >> > + > This comment is premature. The documentation file does not exist yet and > 33 means nothing yet. 
Out of curiosity though, how confident are you
> that a TLB flush is generally 100 cycles across different generations
> and manufacturers of CPUs? I'm not suggesting you change it or auto-tune
> it, am just curious.

First of all, I changed the units here at some point, and I screwed up
the comments.  I meant 100 nanoseconds, *not* cycles.

For the sake of completeness, here are the data on a Westmere CPU.  I'm
not _quite_ sure why the <=5 pages cases are so slow per-page compared
to when we're flushing larger numbers of pages.  (I also only printed
out the flush sizes with >100 samples):

The overall average was 151ns, and for 6 pages and up it was 107ns.

  1  1560658  279861777 avg/page: 179
  2   179981   85329139 avg/page: 237
  3    99797  146972011 avg/page: 490
  4   161470  133072233 avg/page: 206
  5    44150   42142670 avg/page: 190
  6    17364   12063833 avg/page: 115
  7    12325    9899412 avg/page: 114
  8     4202    3838077 avg/page: 114
  9      811     990320 avg/page: 135
 10     4448    4955283 avg/page: 111
 11    69051   86723229 avg/page: 114
 12      465     642204 avg/page: 115
 13      157     226814 avg/page: 111
 16      781    1741461 avg/page: 139
 17     1506    2778201 avg/page: 108
 18      110     211216 avg/page: 106
 19    13322   27941893 avg/page: 110
 21     1828    4092988 avg/page: 106
 24     1566    4057605 avg/page: 107
 25      246     646463 avg/page: 105
 29      411    1275101 avg/page: 106
 33     3191   11775818 avg/page: 111
 52     3096   17297873 avg/page: 107
 65     2244   15349445 avg/page: 105
129     2278   33246120 avg/page: 113
240    12181  305529055 avg/page: 104

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f172.google.com (mail-pd0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id 9098F6B0036 for ; Mon, 7 Jul 2014 13:43:43 -0400 (EDT) Received: by mail-pd0-f172.google.com with SMTP id w10so5753845pde.3 for ; Mon, 07 Jul 2014 10:43:43 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id bc15si5094522pdb.17.2014.07.07.10.43.38 for ; Mon, 07 Jul 2014 10:43:39 -0700 (PDT) Message-ID: <53BADC49.6000600@sr71.net> Date: Mon, 07 Jul 2014 10:43:37 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> In-Reply-To: <20140424103727.GT23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" On 04/24/2014 03:37 AM, Mel Gorman wrote: >> +Despite the fact that a single individual flush on x86 is >> > +guaranteed to flush a full 2MB, hugetlbfs always uses the full >> > +flushes. THP is treated exactly the same as normal memory. >> > + > You are the second person that told me this and I felt the manual was > unclear on this subject. I was told that it might be a documentation bug > but because this discussion was in a bar I completely failed to follow up > on it. For the record... There's a new version of the Intel SDM out, and it contains some clarifications. 
They're the easiest to find in this document which highlights the deltas from the last version: > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developers-manual.pdf The documentation for invlpg itself has a new footnote, and there's also a little bit of new text in section "4.10.2.3 Details of TLB Use". The footnotes say: If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page (see Section 4.10.2.3), the instruction (invlpg) invalidates all of them I hope that clears up some of the ambiguity over invlpg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id 24E436B0031 for ; Mon, 7 Jul 2014 20:43:47 -0400 (EDT) Received: by mail-pa0-f49.google.com with SMTP id lj1so6271741pab.22 for ; Mon, 07 Jul 2014 17:43:46 -0700 (PDT) Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by mx.google.com with ESMTPS id h13si5418969pdl.300.2014.07.07.17.43.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 07 Jul 2014 17:43:45 -0700 (PDT) Received: by mail-pd0-f169.google.com with SMTP id g10so6248194pdj.0 for ; Mon, 07 Jul 2014 17:43:45 -0700 (PDT) Message-ID: <53BB3EBC.8050005@linaro.org> Date: Tue, 08 Jul 2014 08:43:40 +0800 From: Alex Shi MIME-Version: 1.0 Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53BADC49.6000600@sr71.net> In-Reply-To: <53BADC49.6000600@sr71.net> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 
7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dave Hansen , Mel Gorman
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, dave.hansen@linux.intel.com, "H. Peter Anvin"

On 07/08/2014 01:43 AM, Dave Hansen wrote:
> On 04/24/2014 03:37 AM, Mel Gorman wrote:
>>> +Despite the fact that a single individual flush on x86 is
>>>> +guaranteed to flush a full 2MB, hugetlbfs always uses the full
>>>> +flushes.  THP is treated exactly the same as normal memory.
>>>> +
>> You are the second person that told me this and I felt the manual was
>> unclear on this subject. I was told that it might be a documentation bug
>> but because this discussion was in a bar I completely failed to follow up
>> on it.
>
> For the record... There's a new version of the Intel SDM out, and it
> contains some clarifications.  They're the easiest to find in this
> document which highlights the deltas from the last version:
>
>> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developers-manual.pdf
>
> The documentation for invlpg itself has a new footnote, and there's also
> a little bit of new text in section "4.10.2.3 Details of TLB Use".
>
> The footnotes say:
>
>     If the paging structures map the linear address using a page
>     larger than 4 KBytes and there are multiple TLB entries for
>     that page (see Section 4.10.2.3), the instruction (invlpg)
>     invalidates all of them
>
> I hope that clears up some of the ambiguity over invlpg.
>

Uh, AFAICT, invlpg on a large page has no clear effect on data
retrieval on any Intel CPU up to IvyBridge. I have not tested later CPUs.

--
Thanks
Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753809AbaDUSYZ (ORCPT ); Mon, 21 Apr 2014 14:24:25 -0400
Received: from mga09.intel.com ([134.134.136.24]:24441 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753525AbaDUSYW (ORCPT ); Mon, 21 Apr 2014 14:24:22 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="524893907"
Subject: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:21 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182421.DFAAD16A@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Dave Hansen

I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy.  It uses mm->total_vm to judge the
   task's footprint in the TLB.  It should certainly be using
   some measure of RSS, *NOT* ->total_vm since only resident
   memory can populate the TLB.
2. Haswell and several other CPUs are missing from the
   intel_tlb_flushall_shift_set() function.  Thus, it has been
   demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
	[    0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] tlb_flushall_shift: 6
   Which leads it to never use invlpg.
4.
The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches) 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit. Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch. Signed-off-by: Dave Hansen --- b/arch/x86/include/asm/processor.h | 1 b/arch/x86/kernel/cpu/amd.c | 7 -- b/arch/x86/kernel/cpu/common.c | 13 ----- b/arch/x86/kernel/cpu/intel.c | 26 ---------- b/arch/x86/mm/tlb.c | 91 ++++++------------------------------- 5 files changed, 19 insertions(+), 119 deletions(-) diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h --- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.813835861 -0700 +++ b/arch/x86/include/asm/processor.h 2014-04-21 11:10:34.823836313 -0700 @@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I extern u16 __read_mostly tlb_lld_2m[NR_INFO]; extern u16 __read_mostly tlb_lld_4m[NR_INFO]; extern u16 __read_mostly tlb_lld_1g[NR_INFO]; -extern s8 __read_mostly tlb_flushall_shift; /* * CPU type and hardware bug flags. Kept separately for each CPU. 
diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c --- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.814835907 -0700 +++ b/arch/x86/kernel/cpu/amd.c 2014-04-21 11:10:34.824836358 -0700 @@ -741,11 +741,6 @@ static unsigned int amd_size_cache(struc } #endif -static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c) -{ - tlb_flushall_shift = 6; -} - static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c) { u32 ebx, eax, ecx, edx; @@ -793,8 +788,6 @@ static void cpu_detect_tlb_amd(struct cp tlb_lli_2m[ENTRIES] = eax & mask; tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1; - - cpu_set_tlb_flushall_shift(c); } static const struct cpu_dev amd_cpu_dev = { diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c --- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.816835998 -0700 +++ b/arch/x86/kernel/cpu/common.c 2014-04-21 11:10:34.825836403 -0700 @@ -479,26 +479,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO]; u16 __read_mostly tlb_lld_4m[NR_INFO]; u16 __read_mostly tlb_lld_1g[NR_INFO]; -/* - * tlb_flushall_shift shows the balance point in replacing cr3 write - * with multiple 'invlpg'. It will do this replacement when - * flush_tlb_lines <= active_lines/2^tlb_flushall_shift. - * If tlb_flushall_shift is -1, means the replacement will be disabled. 
- */ -s8 __read_mostly tlb_flushall_shift = -1; - void cpu_detect_tlb(struct cpuinfo_x86 *c) { if (this_cpu->c_detect_tlb) this_cpu->c_detect_tlb(c); printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" - "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n" - "tlb_flushall_shift: %d\n", + "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n", tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES], tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES], tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES], - tlb_lld_1g[ENTRIES], tlb_flushall_shift); + tlb_lld_1g[ENTRIES]); } void detect_ht(struct cpuinfo_x86 *c) diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c --- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.818836088 -0700 +++ b/arch/x86/kernel/cpu/intel.c 2014-04-21 11:10:34.825836403 -0700 @@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsig } } -static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c) -{ - switch ((c->x86 << 8) + c->x86_model) { - case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */ - case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */ - case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */ - case 0x61d: /* six-core 45 nm xeon "Dunnington" */ - tlb_flushall_shift = -1; - break; - case 0x63a: /* Ivybridge */ - tlb_flushall_shift = 2; - break; - case 0x61a: /* 45 nm nehalem, "Bloomfield" */ - case 0x61e: /* 45 nm nehalem, "Lynnfield" */ - case 0x625: /* 32 nm nehalem, "Clarkdale" */ - case 0x62c: /* 32 nm nehalem, "Gulftown" */ - case 0x62e: /* 45 nm nehalem-ex, "Beckton" */ - case 0x62f: /* 32 nm Xeon E7 */ - case 0x62a: /* SandyBridge */ - case 0x62d: /* SandyBridge, "Romely-EP" */ - default: - tlb_flushall_shift = 6; - } -} - static void intel_detect_tlb(struct cpuinfo_x86 *c) { int i, j, n; @@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpui for (j = 1 ; j < 16 ; 
j++) intel_tlb_lookup(desc[j]); } - intel_tlb_flushall_shift_set(c); } static const struct cpu_dev intel_cpu_dev = { diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.820836178 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.826836449 -0700 @@ -158,13 +158,22 @@ void flush_tlb_current_task(void) preempt_enable(); } +/* + * See Documentation/x86/tlb.txt for details. We choose 33 + * because it is large enough to cover the vast majority (at + * least 95%) of allocations, and is small enough that we are + * confident it will not cause too much overhead. Each single + * flush is about 100 cycles, so this caps the maximum overhead + * at _about_ 3,000 cycles. + */ +/* in units of pages */ +unsigned long tlb_single_page_flush_ceiling = 1; + void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { int need_flush_others_all = 1; unsigned long addr; - unsigned act_entries, tlb_entries = 0; - unsigned long nr_base_pages; preempt_disable(); if (current->active_mm != mm) @@ -175,25 +184,12 @@ void flush_tlb_mm_range(struct mm_struct goto out; } - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 - || vmflag & VM_HUGETLB) { + if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } - /* In modern CPU, last level tlb used for both data/ins */ - if (vmflag & VM_EXEC) - tlb_entries = tlb_lli_4k[ENTRIES]; - else - tlb_entries = tlb_lld_4k[ENTRIES]; - - /* Assume all of TLB entries was occupied by this task */ - act_entries = tlb_entries >> tlb_flushall_shift; - act_entries = mm->total_vm > act_entries ? 
act_entries : mm->total_vm; - nr_base_pages = (end - start) >> PAGE_SHIFT; - - /* tlb_flushall_shift is on balance point, details in commit log */ - if (nr_base_pages > act_entries) { + if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { @@ -259,68 +255,15 @@ static void do_kernel_range_flush(void * void flush_tlb_kernel_range(unsigned long start, unsigned long end) { - unsigned act_entries; - struct flush_tlb_info info; - - /* In modern CPU, last level tlb used for both data/ins */ - act_entries = tlb_lld_4k[ENTRIES]; /* Balance as user space task's flush, a bit conservative */ - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 || - (end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) - + if (end == TLB_FLUSH_ALL || + (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { on_each_cpu(do_flush_tlb_all, NULL, 1); - else { + } else { + struct flush_tlb_info info; info.flush_start = start; info.flush_end = end; on_each_cpu(do_kernel_range_flush, &info, 1); } } - -#ifdef CONFIG_DEBUG_TLBFLUSH -static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, - size_t count, loff_t *ppos) -{ - char buf[32]; - unsigned int len; - - len = sprintf(buf, "%hd\n", tlb_flushall_shift); - return simple_read_from_buffer(user_buf, count, ppos, buf, len); -} - -static ssize_t tlbflush_write_file(struct file *file, - const char __user *user_buf, size_t count, loff_t *ppos) -{ - char buf[32]; - ssize_t len; - s8 shift; - - len = min(count, sizeof(buf) - 1); - if (copy_from_user(buf, user_buf, len)) - return -EFAULT; - - buf[len] = '\0'; - if (kstrtos8(buf, 0, &shift)) - return -EINVAL; - - if (shift < -1 || shift >= BITS_PER_LONG) - return -EINVAL; - - tlb_flushall_shift = shift; - return count; -} - -static const struct file_operations fops_tlbflush = { - .read = tlbflush_read_file, - .write = tlbflush_write_file, - .llseek = default_llseek, -}; - -static int __init 
create_tlb_flushall_shift(void)
-{
-	debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR,
-		arch_debugfs_dir, NULL, &fops_tlbflush);
-	return 0;
-}
-late_initcall(create_tlb_flushall_shift);
-#endif
_

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754208AbaDUSYk (ORCPT ); Mon, 21 Apr 2014 14:24:40 -0400
Received: from mga03.intel.com ([143.182.124.21]:46611 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754027AbaDUSYe (ORCPT ); Mon, 21 Apr 2014 14:24:34 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="421645878"
Subject: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33)
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:28 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182428.FC2104C1@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Dave Hansen

This has been run through Intel's LKP tests across a wide range
of modern systems and workloads and it wasn't shown to make a
measurable performance difference positive or negative.

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.
It breaks down like this:

       size  percent  percent<=
          V        V        V
    GLOBAL:    2.20%    2.20% avg cycles:  2283
         1:   56.92%   59.12% avg cycles:  1276
         2:   13.78%   72.90% avg cycles:  1505
         3:    8.26%   81.16% avg cycles:  1880
         4:    7.41%   88.58% avg cycles:  2447
         5:    1.73%   90.31% avg cycles:  2358
         6:    1.32%   91.63% avg cycles:  2563
         7:    1.14%   92.77% avg cycles:  2862
         8:    0.62%   93.39% avg cycles:  3542
         9:    0.08%   93.47% avg cycles:  3289
        10:    0.43%   93.90% avg cycles:  3570
        11:    0.20%   94.10% avg cycles:  3767
        12:    0.08%   94.18% avg cycles:  3996
        13:    0.03%   94.20% avg cycles:  4077
        14:    0.02%   94.23% avg cycles:  4836
        15:    0.04%   94.26% avg cycles:  5699
        16:    0.06%   94.32% avg cycles:  5041
        17:    0.57%   94.89% avg cycles:  5473
        18:    0.02%   94.91% avg cycles:  5396
        19:    0.03%   94.95% avg cycles:  5296
        20:    0.02%   94.96% avg cycles:  6749
        21:    0.18%   95.14% avg cycles:  6225
        22:    0.01%   95.15% avg cycles:  6393
        23:    0.01%   95.16% avg cycles:  6861
        24:    0.12%   95.28% avg cycles:  6912
        25:    0.05%   95.32% avg cycles:  7190
        26:    0.01%   95.33% avg cycles:  7793
        27:    0.01%   95.34% avg cycles:  7833
        28:    0.01%   95.35% avg cycles:  8253
        29:    0.08%   95.42% avg cycles:  8024
        30:    0.03%   95.45% avg cycles:  9670
        31:    0.01%   95.46% avg cycles:  8949
        32:    0.01%   95.46% avg cycles:  9350
        33:    3.11%   98.57% avg cycles:  8534
        34:    0.02%   98.60% avg cycles: 10977
        35:    0.02%   98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.
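
The per-miss figures above come from dividing the walk-duration
cycle counts by the number of completed walks.  A minimal sketch of
that arithmetic in plain userspace C, not taken from the patch: the
helper names are made up for illustration, and the clock rate used to
turn cycles into nanoseconds is an assumption, since this mail does
not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average page-walk cost, in cycles per completed walk. */
static double sketch_cycles_per_miss(uint64_t walk_cycles,
				     uint64_t walks_completed)
{
	assert(walks_completed != 0);
	return (double)walk_cycles / (double)walks_completed;
}

/* Hypothetical helper: cycles to nanoseconds for an ASSUMED clock in GHz. */
static double sketch_cycles_to_ns(double cycles, double assumed_ghz)
{
	assert(assumed_ghz > 0.0);
	return cycles / assumed_ghz;
}

/* Cost of refilling a TLB of 'entries' entries, assuming one miss per entry. */
static double sketch_refill_cycles(unsigned int entries, double cycles_per_miss)
{
	return (double)entries * cycles_per_miss;
}
```

Feeding in the IvyBridge dtlb load and store totals quoted above
(7,720,030,970 + 708,832,859 cycles over 169,856,353 + 19,346,823
walks) gives roughly 44.5 cycles per miss, and the itlb counters give
roughly 33.8, in line with the ~45 and ~34 figures.  Multiplying the
dtlb figure by 512 entries lands a little under 23,000 cycles, the
same ballpark as the 22,000 quoted.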
cat perf.stat.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are:

	itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns.  Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:    0.43%   93.90% avg cycles:  3570 cycles/page:  357 samples:   4714
    11:    0.20%   94.10% avg cycles:  3767 cycles/page:  342 samples:   2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages.
Only data points with >=50 of samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:    2.20%    2.20% avg cycles:  2283 cycles/page: xxxx samples:  23960
     1:   56.92%   59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:   13.78%   72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:    8.26%   81.16% avg cycles:  1880 cycles/page:  626 samples:  90131
     4:    7.41%   88.58% avg cycles:  2447 cycles/page:  611 samples:  80877
     5:    1.73%   90.31% avg cycles:  2358 cycles/page:  471 samples:  18885
     6:    1.32%   91.63% avg cycles:  2563 cycles/page:  427 samples:  14397
     7:    1.14%   92.77% avg cycles:  2862 cycles/page:  408 samples:  12441
     8:    0.62%   93.39% avg cycles:  3542 cycles/page:  442 samples:   6721
     9:    0.08%   93.47% avg cycles:  3289 cycles/page:  365 samples:    917
    10:    0.43%   93.90% avg cycles:  3570 cycles/page:  357 samples:   4714
    11:    0.20%   94.10% avg cycles:  3767 cycles/page:  342 samples:   2145
    12:    0.08%   94.18% avg cycles:  3996 cycles/page:  333 samples:    864
    13:    0.03%   94.20% avg cycles:  4077 cycles/page:  313 samples:    289
    14:    0.02%   94.23% avg cycles:  4836 cycles/page:  345 samples:    236
    15:    0.04%   94.26% avg cycles:  5699 cycles/page:  379 samples:    390
    16:    0.06%   94.32% avg cycles:  5041 cycles/page:  315 samples:    643
    17:    0.57%   94.89% avg cycles:  5473 cycles/page:  321 samples:   6229
    18:    0.02%   94.91% avg cycles:  5396 cycles/page:  299 samples:    224
    19:    0.03%   94.95% avg cycles:  5296 cycles/page:  278 samples:    367
    20:    0.02%   94.96% avg cycles:  6749 cycles/page:  337 samples:    185
    21:    0.18%   95.14% avg cycles:  6225 cycles/page:  296 samples:   1964
    22:    0.01%   95.15% avg cycles:  6393 cycles/page:  290 samples:     83
    23:    0.01%   95.16% avg cycles:  6861 cycles/page:  298 samples:     61
    24:    0.12%   95.28% avg cycles:  6912 cycles/page:  288 samples:   1307
    25:    0.05%   95.32% avg cycles:  7190 cycles/page:  287 samples:    533
    26:    0.01%   95.33% avg cycles:  7793 cycles/page:  299 samples:     94
    27:    0.01%   95.34% avg cycles:  7833 cycles/page:  290 samples:     66
    28:    0.01%   95.35% avg cycles:  8253 cycles/page:  294 samples:     73
    29:    0.08%   95.42% avg cycles:  8024 cycles/page:  276 samples:    846
    30:    0.03%   95.45% avg cycles:  9670 cycles/page:  322 samples:    296
    31:    0.01%   95.46% avg cycles:  8949 cycles/page:  288 samples:     79
    32:    0.01%   95.46% avg cycles:  9350 cycles/page:  292 samples:     60
    33:    3.11%   98.57% avg cycles:  8534 cycles/page:  258 samples:  33936
    34:    0.02%   98.60% avg cycles: 10977 cycles/page:  322 samples:    268
    35:    0.02%   98.62% avg cycles: 11400 cycles/page:  325 samples:    177
    36:    0.01%   98.63% avg cycles: 11504 cycles/page:  319 samples:    161
    37:    0.02%   98.65% avg cycles: 11596 cycles/page:  313 samples:    182
    38:    0.02%   98.66% avg cycles: 11850 cycles/page:  311 samples:    195
    39:    0.01%   98.68% avg cycles: 12158 cycles/page:  311 samples:    128
    40:    0.01%   98.68% avg cycles: 11626 cycles/page:  290 samples:     78
    41:    0.04%   98.73% avg cycles: 11435 cycles/page:  278 samples:    477
    42:    0.01%   98.73% avg cycles: 12571 cycles/page:  299 samples:     74
    43:    0.01%   98.74% avg cycles: 12562 cycles/page:  292 samples:     78
    44:    0.01%   98.75% avg cycles: 12991 cycles/page:  295 samples:    108
    45:    0.01%   98.76% avg cycles: 13169 cycles/page:  292 samples:     78
    46:    0.02%   98.78% avg cycles: 12891 cycles/page:  280 samples:    261
    47:    0.01%   98.79% avg cycles: 13099 cycles/page:  278 samples:     67
    48:    0.01%   98.80% avg cycles: 13851 cycles/page:  288 samples:     77
    49:    0.01%   98.80% avg cycles: 13749 cycles/page:  280 samples:     66
    50:    0.01%   98.81% avg cycles: 13949 cycles/page:  278 samples:     73
    52:    0.00%   98.82% avg cycles: 14243 cycles/page:  273 samples:     52
    54:    0.01%   98.83% avg cycles: 15312 cycles/page:  283 samples:     87
    55:    0.01%   98.84% avg cycles: 15197 cycles/page:  276 samples:    109
    56:    0.02%   98.86% avg cycles: 15234 cycles/page:  272 samples:    208
    57:    0.00%   98.86% avg cycles: 14888 cycles/page:  261 samples:     53
    58:    0.01%   98.87% avg cycles: 15037 cycles/page:  259 samples:     59
    59:    0.01%   98.87% avg cycles: 15752 cycles/page:  266 samples:     63
    62:    0.00%   98.89% avg cycles: 16222 cycles/page:  261 samples:     54
    64:    0.02%   98.91% avg cycles: 17179 cycles/page:  268 samples:    248
    65:    0.12%   99.03% avg cycles: 18762 cycles/page:  288 samples:   1324
    85:    0.00%   99.10% avg cycles: 21649 cycles/page:  254 samples:     50
   127:    0.01%   99.18% avg cycles: 32397 cycles/page:  255 samples:     75
   128:    0.13%   99.31% avg cycles: 31711 cycles/page:  247 samples:   1466
   129:    0.18%   99.49% avg cycles: 33017 cycles/page:  255 samples:   1927
   181:    0.33%   99.84% avg cycles:  2489 cycles/page:   13 samples:   3547
   256:    0.05%   99.91% avg cycles:  2305 cycles/page:    9 samples:    550
   512:    0.03%   99.95% avg cycles:  2133 cycles/page:    4 samples:    304
  1512:    0.01%   99.99% avg cycles:  3038 cycles/page:    2 samples:     65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716

Signed-off-by: Dave Hansen
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-04-21 09:58:50.012268370 -0700
+++ b/arch/x86/mm/tlb.c	2014-04-21 09:58:50.016268551 -0700
@@ -173,7 +173,7 @@ void flush_tlb_current_task(void)
  * at _about_ 3,000 cycles.
 */
 /* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
_

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754171AbaDUSYh (ORCPT ); Mon, 21 Apr 2014 14:24:37 -0400
Received: from mga11.intel.com ([192.55.52.93]:19544 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753831AbaDUSY3 (ORCPT ); Mon, 21 Apr 2014 14:24:29 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="516811246"
Subject: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:26 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182426.D6DD1E8F@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Dave Hansen

Most of the logic here is in the documentation file.  Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler.  I challenge anyone to describe in one
sentence how the old one worked.  Here's the way the new one
works:

	If we are flushing more pages than the ceiling, we use
	the full flush, otherwise we use per-page flushes.
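
The one-sentence rule above can be sketched as a standalone userspace
C function.  This is only an illustration, not the kernel code: the
SKETCH_* names and the function are hypothetical, a 4KB base page is
assumed, and the VM_HUGETLB special case handled by
flush_tlb_mm_range() is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4KB base pages */
#define SKETCH_FLUSH_ALL	(~0UL)	/* stand-in for the kernel's TLB_FLUSH_ALL */

/*
 * Returns true when the whole TLB should be flushed, false when the
 * range should be walked with one individual flush per page.
 */
static bool sketch_use_full_flush(unsigned long start, unsigned long end,
				  unsigned long ceiling_pages)
{
	unsigned long nr_pages;

	/* A request for "everything" always takes the full flush. */
	if (end == SKETCH_FLUSH_ALL)
		return true;

	assert(end >= start);
	nr_pages = (end - start) >> SKETCH_PAGE_SHIFT;

	/* More pages than the ceiling: full flush. Otherwise: per-page. */
	return nr_pages > ceiling_pages;
}
```

With a ceiling of 33, a 33-page range is still flushed page by page
and a 34-page range falls over to the full flush.  A ceiling of 0
sends every ranged flush of at least one page to the full flush.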
Signed-off-by: Dave Hansen
---

 b/Documentation/x86/tlb.txt |   72 ++++++++++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c         |   46 ++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush	2014-04-21 11:10:35.901884997 -0700
+++ b/arch/x86/mm/tlb.c	2014-04-21 11:10:35.905885179 -0700
@@ -274,3 +274,49 @@ void flush_tlb_kernel_range(unsigned lon
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	int ceiling;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &ceiling))
+		return -EINVAL;
+
+	if (ceiling < 0)
+		return -EINVAL;
+
+	tlb_single_page_flush_ceiling = ceiling;
+	return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+	.read = tlbflush_read_file,
+	.write = tlbflush_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlbflush);
+	return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
diff -puN /dev/null Documentation/x86/tlb.txt
--- /dev/null	2014-04-10 11:28:14.066815724 -0700
+++ b/Documentation/x86/tlb.txt	2014-04-21 11:10:35.924886036 -0700
@@ -0,0 +1,72 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the individual flushes will have ended up being wasted
+    work.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive an individual flush looks.  Data and
+    instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles.  If you believe that individual invalidations are being
+called too often, you can lower the tunable:
+
+	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of the individual flushes.
+Setting it to 1 is a very conservative setting and it should
+never need to be 0 under normal circumstances.
+
+Despite the fact that a single individual flush on x86 is
+guaranteed to flush a full 2MB, hugetlbfs always uses the full
+flushes.  THP is treated exactly the same as normal memory.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
_

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753931AbaDUSYe (ORCPT ); Mon, 21 Apr 2014 14:24:34 -0400
Received: from mga11.intel.com ([192.55.52.93]:19544 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753525AbaDUSY0 (ORCPT ); Mon, 21 Apr 2014 14:24:26 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="524407533"
Subject: [PATCH 4/6] x86: mm: trace tlb flushes
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:25 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182425.93E696A3@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Dave Hansen

We don't have any good way to figure out what kinds of flushes
are being attempted.  Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.

This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).

Also, since we have a pair of tracepoint calls in
flush_tlb_mm_range(), we can time the deltas between them to make
sure that we got the "invlpg vs. global flush" balance correct in
practice.
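
Timing the deltas between a start event and its matching _DONE event
amounts to bucketing the differences by flush size and averaging.  A
rough userspace C sketch of that bookkeeping follows.  It is not the
Perl script used for this series; the struct, the function names and
the 64-page bucket limit are all made up for illustration.

```c
#include <stdint.h>

#define SKETCH_MAX_PAGES 64	/* arbitrary bucket limit for this sketch */

/* Per-flush-size totals, indexed by the number of pages flushed. */
struct sketch_flush_stats {
	uint64_t samples[SKETCH_MAX_PAGES + 1];
	uint64_t cycles[SKETCH_MAX_PAGES + 1];
};

/* Record one flush from the timestamps of its start and _DONE events. */
static void sketch_record(struct sketch_flush_stats *s, unsigned int pages,
			  uint64_t t_start, uint64_t t_done)
{
	/* Ignore sizes outside the table and pairs that are out of order. */
	if (pages > SKETCH_MAX_PAGES || t_done < t_start)
		return;
	s->samples[pages]++;
	s->cycles[pages] += t_done - t_start;
}

/* Average cycles for flushes of 'pages' pages, or 0 with no samples. */
static uint64_t sketch_avg_cycles(const struct sketch_flush_stats *s,
				  unsigned int pages)
{
	if (pages > SKETCH_MAX_PAGES || s->samples[pages] == 0)
		return 0;
	return s->cycles[pages] / s->samples[pages];
}
```

Any constant overhead added by the tracepoints themselves ends up in
every delta, so averages computed this way are better suited to
comparing flush sizes against each other than to reading off an
absolute cost.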
Signed-off-by: Dave Hansen --- b/arch/x86/include/asm/mmu_context.h | 6 +++++ b/arch/x86/mm/tlb.c | 12 +++++++++-- b/include/linux/mm_types.h | 10 +++++++++ b/include/trace/events/tlb.h | 37 +++++++++++++++++++++++++++++++++++ b/mm/Makefile | 2 - b/mm/trace_tlb.c | 12 +++++++++++ 6 files changed, 76 insertions(+), 3 deletions(-) diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h --- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-04-21 11:10:35.519867746 -0700 +++ b/arch/x86/include/asm/mmu_context.h 2014-04-21 11:10:35.527868108 -0700 @@ -3,6 +3,10 @@ #include #include +#include + +#include + #include #include #include @@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s /* Re-load page tables */ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); /* Stop flush ipis for the previous mm */ cpumask_clear_cpu(cpu, mm_cpumask(prev)); @@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s * to make sure to use no freed page tables. 
*/ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); load_LDT_nolock(&next->context); } } diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~tlb-trace-flushes 2014-04-21 11:10:35.520867791 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.528868153 -0700 @@ -14,6 +14,8 @@ #include #include +#include + DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { &init_mm, 0, }; @@ -49,6 +51,7 @@ void leave_mm(int cpu) if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); load_cr3(swapper_pg_dir); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); } } EXPORT_SYMBOL_GPL(leave_mm); @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info) count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) { - if (f->flush_end == TLB_FLUSH_ALL) + if (f->flush_end == TLB_FLUSH_ALL) { local_flush_tlb(); - else if (!f->flush_end) + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL); + } else if (!f->flush_end) __flush_tlb_single(f->flush_start); else { unsigned long addr; @@ -152,7 +156,9 @@ void flush_tlb_current_task(void) preempt_disable(); count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); local_flush_tlb(); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL); if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); preempt_enable(); @@ -188,6 +194,7 @@ void flush_tlb_mm_range(struct mm_struct if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) base_pages_to_flush = (end - start) >> PAGE_SHIFT; + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush); if (base_pages_to_flush > tlb_single_page_flush_ceiling) { base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); @@ -199,6 +206,7 @@ void flush_tlb_mm_range(struct mm_struct __flush_tlb_single(addr); 
} } + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush); out: if (base_pages_to_flush == TLB_FLUSH_ALL) { start = 0UL; diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h --- a/include/linux/mm_types.h~tlb-trace-flushes 2014-04-21 11:10:35.522867881 -0700 +++ b/include/linux/mm_types.h 2014-04-21 11:10:35.529868198 -0700 @@ -510,4 +510,14 @@ static inline void clear_tlb_flush_pendi } #endif +enum tlb_flush_reason { + TLB_FLUSH_ON_TASK_SWITCH, + TLB_REMOTE_SHOOTDOWN, + TLB_LOCAL_SHOOTDOWN, + TLB_LOCAL_SHOOTDOWN_DONE, + TLB_LOCAL_MM_SHOOTDOWN, + TLB_LOCAL_MM_SHOOTDOWN_DONE, + NR_TLB_FLUSH_REASONS, +}; + #endif /* _LINUX_MM_TYPES_H */ diff -puN /dev/null include/trace/events/tlb.h --- /dev/null 2014-04-10 11:28:14.066815724 -0700 +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 @@ -0,0 +1,37 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM tlb + +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_TLB_H + +#include +#include + +extern const char * const tlb_flush_reason_desc[]; + +TRACE_EVENT(tlb_flush, + + TP_PROTO(int reason, unsigned long pages), + TP_ARGS(reason, pages), + + TP_STRUCT__entry( + __field( int, reason) + __field(unsigned long, pages) + ), + + TP_fast_assign( + __entry->reason = reason; + __entry->pages = pages; + ), + + TP_printk("pages: %ld reason: %d (%s)", + __entry->pages, + __entry->reason, + tlb_flush_reason_desc[__entry->reason]) +); + +#endif /* _TRACE_TLB_H */ + +/* This part must be outside protection */ +#include + diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile --- a/mm/Makefile~tlb-trace-flushes 2014-04-21 11:10:35.524867971 -0700 +++ b/mm/Makefile 2014-04-21 11:10:35.530868243 -0700 @@ -5,7 +5,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ - vmalloc.o pagewalk.o pgtable-generic.o + vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o ifdef 
CONFIG_CROSS_MEMORY_ATTACH mmu-$(CONFIG_MMU) += process_vm_access.o diff -puN /dev/null mm/trace_tlb.c --- /dev/null 2014-04-10 11:28:14.066815724 -0700 +++ b/mm/trace_tlb.c 2014-04-21 11:10:35.530868243 -0700 @@ -0,0 +1,12 @@ +#define CREATE_TRACE_POINTS +#include + +const char * const tlb_flush_reason_desc[] = { + __stringify(TLB_FLUSH_ON_TASK_SWITCH), + __stringify(TLB_REMOTE_SHOOTDOWN), + __stringify(TLB_LOCAL_SHOOTDOWN), + __stringify(TLB_LOCAL_SHOOTDOWN_DONE), + __stringify(TLB_LOCAL_MM_SHOOTDOWN), + __stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE), +}; + _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753558AbaDUSYW (ORCPT ); Mon, 21 Apr 2014 14:24:22 -0400 Received: from mga02.intel.com ([134.134.136.20]:13397 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752779AbaDUSYT (ORCPT ); Mon, 21 Apr 2014 14:24:19 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="496887614" Subject: [PATCH 0/6] x86: rework tlb range flushing code To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, Dave Hansen From: Dave Hansen Date: Mon, 21 Apr 2014 11:24:18 -0700 Message-Id: <20140421182418.81CF7519@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Changes from v2: * Added a brief comment above the ceiling tunable * Updated the documentation to mention large pages and say "individual flush" instead of invlpg in most cases. Reposting with an instrumentation patch, and a few minor tweaks. I'd love some more eyeballs on this, but I think it's ready for -mm. I've run this through a variety of systems in the LKP harness, as well as running it on my desktop for a few days. 
I'm yet to see if any performance regressions (or gains) show up.

Without the last (instrumentation/debugging) patch:

 arch/x86/include/asm/mmu_context.h |    6 ++
 arch/x86/include/asm/processor.h   |    1
 arch/x86/kernel/cpu/amd.c          |    7 --
 arch/x86/kernel/cpu/common.c       |   13 -----
 arch/x86/kernel/cpu/intel.c        |   26 ----------
 arch/x86/mm/tlb.c                  |   91 +++++++++++++++----------------------
 include/linux/mm_types.h           |   10 ++++
 mm/Makefile                        |    2
 8 files changed, 58 insertions(+), 98 deletions(-)

--

I originally went to look at this because I realized that newer CPUs
were not present in the intel_tlb_flushall_shift_set() code. I went to
try to figure out where to stick newer CPUs (do we consider them more
like SandyBridge or IvyBridge), and was not able to repeat the original
experiments.

Instead, this set does:
 1. Rework the code a bit to ready it for tracepoints
 2. Add tracepoints
 3. Add a new tunable and set it to a sane value

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754226AbaDUSZt (ORCPT ); Mon, 21 Apr 2014 14:25:49 -0400
Received: from mga03.intel.com ([143.182.124.21]:49118 "EHLO mga03.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753583AbaDUSYY (ORCPT ); Mon, 21 Apr 2014 14:24:24 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="421645834"
Subject: [PATCH 3/6] x86: mm: fix missed global TLB flush stat
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	mgorman@suse.de, ak@linux.intel.com, riel@redhat.com,
	alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:22 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182422.DE5E728F@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen If we take the if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } path out of flush_tlb_mm_range(), we will have flushed the tlb, but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the way out of the function so that we always take a single path when doing a full tlb flush. Signed-off-by: Dave Hansen --- b/arch/x86/mm/tlb.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff -puN arch/x86/mm/tlb.c~fix-missed-global-flush-stat arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~fix-missed-global-flush-stat 2014-04-21 11:10:35.176852256 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.190852888 -0700 @@ -172,8 +172,9 @@ unsigned long tlb_single_page_flush_ceil void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { - int need_flush_others_all = 1; unsigned long addr; + /* do a global flush by default */ + unsigned long base_pages_to_flush = TLB_FLUSH_ALL; preempt_disable(); if (current->active_mm != mm) @@ -184,16 +185,14 @@ void flush_tlb_mm_range(struct mm_struct goto out; } - if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { - local_flush_tlb(); - goto out; - } + if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) + base_pages_to_flush = (end - start) >> PAGE_SHIFT; - if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { + if (base_pages_to_flush > tlb_single_page_flush_ceiling) { + base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { - need_flush_others_all = 0; /* flush range by one by one 'invlpg' */ for (addr = start; addr < end; addr += PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE); @@ -201,7 +200,7 @@ void flush_tlb_mm_range(struct mm_struct } } out: - if (need_flush_others_all) { + if (base_pages_to_flush == TLB_FLUSH_ALL) { start = 0UL; end = TLB_FLUSH_ALL; } _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754238AbaDUS01 (ORCPT ); Mon, 21 Apr 2014 14:26:27 -0400
Received: from mga02.intel.com ([134.134.136.20]:13397 "EHLO mga02.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753276AbaDUSYU (ORCPT ); Mon, 21 Apr 2014 14:24:20 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,897,1389772800"; d="scan'208";a="496887639"
Subject: [PATCH 1/6] x86: mm: clean up tlb flushing code
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	mgorman@suse.de, ak@linux.intel.com, riel@redhat.com,
	alex.shi@linaro.org, Dave Hansen , dave.hansen@linux.intel.com
From: Dave Hansen
Date: Mon, 21 Apr 2014 11:24:20 -0700
References: <20140421182418.81CF7519@viggo.jf.intel.com>
In-Reply-To: <20140421182418.81CF7519@viggo.jf.intel.com>
Message-Id: <20140421182420.307A0C57@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Dave Hansen

The

	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels. This eliminates
one of the copy-n-paste versions. It also gives us a unified
exit point for each path through this function. We need this in
a minute for our tracepoint.
Signed-off-by: Dave Hansen --- b/arch/x86/mm/tlb.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c --- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-04-21 11:10:34.431818610 -0700 +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.435818791 -0700 @@ -161,23 +161,24 @@ void flush_tlb_current_task(void) void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { + int need_flush_others_all = 1; unsigned long addr; unsigned act_entries, tlb_entries = 0; unsigned long nr_base_pages; preempt_disable(); if (current->active_mm != mm) - goto flush_all; + goto out; if (!current->mm) { leave_mm(smp_processor_id()); - goto flush_all; + goto out; } if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 || vmflag & VM_HUGETLB) { local_flush_tlb(); - goto flush_all; + goto out; } /* In modern CPU, last level tlb used for both data/ins */ @@ -196,22 +197,20 @@ void flush_tlb_mm_range(struct mm_struct count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else { + need_flush_others_all = 0; /* flush range by one by one 'invlpg' */ for (addr = start; addr < end; addr += PAGE_SIZE) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE); __flush_tlb_single(addr); } - - if (cpumask_any_but(mm_cpumask(mm), - smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, end); - preempt_enable(); - return; } - -flush_all: +out: + if (need_flush_others_all) { + start = 0UL; + end = TLB_FLUSH_ALL; + } if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + flush_tlb_others(mm_cpumask(mm), mm, start, end); preempt_enable(); } _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933322AbaDVQzA (ORCPT ); Tue, 22 Apr 2014 12:55:00 -0400 Received: from mx1.redhat.com ([209.132.183.28]:57989 "EHLO 
mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932504AbaDVQy6 (ORCPT ); Tue, 22 Apr 2014 12:54:58 -0400 Message-ID: <53569ED3.2080206@redhat.com> Date: Tue, 22 Apr 2014 12:54:43 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Dave Hansen , x86@kernel.org CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> In-Reply-To: <20140421182421.DFAAD16A@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > I think the flush_tlb_mm_range() code that tries to tune the > flush sizes based on the CPU needs to get ripped out for > several reasons: > > 1. It is obviously buggy. It uses mm->total_vm to judge the > task's footprint in the TLB. It should certainly be using > some measure of RSS, *NOT* ->total_vm since only resident > memory can populate the TLB. > 2. Haswell, and several other CPUs are missing from the > intel_tlb_flushall_shift_set() function. Thus, it has been > demonstrated to bitrot quickly in practice. > 3. It is plain wrong in my vm: > [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 > [ 0.037444] tlb_flushall_shift: 6 > Which leads to it to never use invlpg. > 4. The assumptions about TLB refill costs are wrong: > http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com > (more on this in later patches) > 5. 
I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
>    I believe the sample times were too short. Running the
>    benchmark in a loop yields times that vary quite a bit.
>
> Note that this leaves us with a static ceiling of 1 page. This
> is a conservative, dumb setting, and will be revised in a later
> patch.
>
> Signed-off-by: Dave Hansen

Acked-by: Rik van Riel

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933298AbaDVQy6 (ORCPT ); Tue, 22 Apr 2014 12:54:58 -0400
Received: from mx1.redhat.com ([209.132.183.28]:63784 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932504AbaDVQy4 (ORCPT ); Tue, 22 Apr 2014 12:54:56 -0400
Message-ID: <53569EA4.2000308@redhat.com>
Date: Tue, 22 Apr 2014 12:53:56 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: Dave Hansen , x86@kernel.org
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org,
	dave.hansen@linux.intel.com
Subject: Re: [PATCH 1/6] x86: mm: clean up tlb flushing code
References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182420.307A0C57@viggo.jf.intel.com>
In-Reply-To: <20140421182420.307A0C57@viggo.jf.intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/21/2014 02:24 PM, Dave Hansen wrote:
> From: Dave Hansen
>
> The
>
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
>
> line of code is not exactly the easiest to audit, especially when
> it ends up at two different indentation levels. This eliminates
> one of the copy-n-paste versions.
It also gives us a unified > exit point for each path through this function. We need this in > a minute for our tracepoint. > > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933198AbaDVRPy (ORCPT ); Tue, 22 Apr 2014 13:15:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:40800 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932199AbaDVRPx (ORCPT ); Tue, 22 Apr 2014 13:15:53 -0400 Message-ID: <5356A3B6.30006@redhat.com> Date: Tue, 22 Apr 2014 13:15:34 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Dave Hansen , x86@kernel.org CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 3/6] x86: mm: fix missed global TLB flush stat References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182422.DE5E728F@viggo.jf.intel.com> In-Reply-To: <20140421182422.DE5E728F@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > If we take the > > if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { > local_flush_tlb(); > goto out; > } > > path out of flush_tlb_mm_range(), we will have flushed the tlb, > but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the > way out of the function so that we always take a single path when > doing a full tlb flush. 
> > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758657AbaDVVUG (ORCPT ); Tue, 22 Apr 2014 17:20:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:54449 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757662AbaDVVT7 (ORCPT ); Tue, 22 Apr 2014 17:19:59 -0400 Message-ID: <5356DCEF.3050506@redhat.com> Date: Tue, 22 Apr 2014 17:19:43 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Dave Hansen , x86@kernel.org CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> In-Reply-To: <20140421182425.93E696A3@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > We don't have any good way to figure out what kinds of flushes > are being attempted. Right now, we can try to use the vm > counters, but those only tell us what we actually did with the > hardware (one-by-one vs full) and don't tell us what was actually > _requested_. > > This allows us to select out "interesting" TLB flushes that we > might want to optimize (like the ranged ones) and ignore the ones > that we have very little control over (the ones at context > switch). > > Also, since we have a pair of tracepoint calls in > flush_tlb_mm_range(), we can time the deltas between them to make > sure that we got the "invlpg vs. 
global flush" balance correct in > practice. > > Signed-off-by: Dave Hansen Acked-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758836AbaDVVcK (ORCPT ); Tue, 22 Apr 2014 17:32:10 -0400 Received: from mx1.redhat.com ([209.132.183.28]:63655 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757716AbaDVVcI (ORCPT ); Tue, 22 Apr 2014 17:32:08 -0400 Message-ID: <5356DFC8.1060601@redhat.com> Date: Tue, 22 Apr 2014 17:31:52 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Dave Hansen , x86@kernel.org CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> In-Reply-To: <20140421182426.D6DD1E8F@viggo.jf.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/21/2014 02:24 PM, Dave Hansen wrote: > From: Dave Hansen > > Most of the logic here is in the documentation file. Please take > a look at it. > > I know we've come full-circle here back to a tunable, but this > new one is *WAY* simpler. I challenge anyone to describe in one > sentence how the old one worked. Here's the way the new one > works: > > If we are flushing more pages than the ceiling, we use > the full flush, otherwise we use per-page flushes. 
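For illustration only, the one-sentence rule quoted above can be sketched in user-space C. This is not the kernel implementation: `choose_flush`, `enum flush_kind`, and the `SKETCH_*` constants are hypothetical names, and 4 KB base pages are assumed.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
#define SKETCH_PAGE_SHIFT 12		/* assumes 4 KB base pages */
#define SKETCH_FLUSH_ALL  (~0UL)	/* "flush everything" marker */

enum flush_kind { FLUSH_PER_PAGE, FLUSH_FULL };

/* Pick a flush strategy for the byte range [start, end). */
enum flush_kind choose_flush(unsigned long start, unsigned long end,
			     unsigned long ceiling_pages)
{
	unsigned long pages;

	/* A request for everything always takes the full-flush path. */
	if (end == SKETCH_FLUSH_ALL)
		return FLUSH_FULL;

	assert(end >= start);
	pages = (end - start) >> SKETCH_PAGE_SHIFT;

	/* More pages than the ceiling: use one full flush instead. */
	return pages > ceiling_pages ? FLUSH_FULL : FLUSH_PER_PAGE;
}
```

With a ceiling of 33, a 33-page range would still be flushed page by page in this sketch, and a 34-page range would fall over to the full flush.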
>
> Signed-off-by: Dave Hansen

Acked-by: Rik van Riel

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758875AbaDVVeP (ORCPT ); Tue, 22 Apr 2014 17:34:15 -0400
Received: from mx1.redhat.com ([209.132.183.28]:13343 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757352AbaDVVeJ (ORCPT ); Tue, 22 Apr 2014 17:34:09 -0400
Message-ID: <5356E041.3060709@redhat.com>
Date: Tue, 22 Apr 2014 17:33:53 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: Dave Hansen , x86@kernel.org
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	mgorman@suse.de, ak@linux.intel.com, alex.shi@linaro.org,
	dave.hansen@linux.intel.com
Subject: Re: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33)
References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182428.FC2104C1@viggo.jf.intel.com>
In-Reply-To: <20140421182428.FC2104C1@viggo.jf.intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/21/2014 02:24 PM, Dave Hansen wrote:
> From: Dave Hansen
>
> This has been run through Intel's LKP tests across a wide range
> of modern systems and workloads and it wasn't shown to make a
> measurable performance difference positive or negative.
>
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
>
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page.
It breaks down like this:
> Signed-off-by: Dave Hansen

Acked-by: Rik van Riel

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752984AbaDXIdL (ORCPT ); Thu, 24 Apr 2014 04:33:11 -0400
Received: from cantor2.suse.de ([195.135.220.15]:46766 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751928AbaDXIdI (ORCPT ); Thu, 24 Apr 2014 04:33:08 -0400
Date: Thu, 24 Apr 2014 09:33:04 +0100
From: Mel Gorman
To: Dave Hansen
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org,
	dave.hansen@linux.intel.com
Subject: Re: [PATCH 1/6] x86: mm: clean up tlb flushing code
Message-ID: <20140424083304.GP23991@suse.de>
References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182420.307A0C57@viggo.jf.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20140421182420.307A0C57@viggo.jf.intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 21, 2014 at 11:24:20AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen
>
> The
>
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
>
> line of code is not exactly the easiest to audit, especially when
> it ends up at two different indentation levels. This eliminates
> one of the copy-n-paste versions. It also gives us a unified
> exit point for each path through this function. We need this in
> a minute for our tracepoint.
> > > Signed-off-by: Dave Hansen > --- > > b/arch/x86/mm/tlb.c | 23 +++++++++++------------ > 1 file changed, 11 insertions(+), 12 deletions(-) > > diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c > --- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-04-21 11:10:34.431818610 -0700 > +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.435818791 -0700 > @@ -161,23 +161,24 @@ void flush_tlb_current_task(void) > void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, > unsigned long end, unsigned long vmflag) > { > + int need_flush_others_all = 1; > unsigned long addr; > unsigned act_entries, tlb_entries = 0; > unsigned long nr_base_pages; > Could make that bool but otherwise Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753127AbaDXIqA (ORCPT ); Thu, 24 Apr 2014 04:46:00 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46999 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752140AbaDXIp4 (ORCPT ); Thu, 24 Apr 2014 04:45:56 -0400 Date: Thu, 24 Apr 2014 09:45:52 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Message-ID: <20140424084552.GQ23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182421.DFAAD16A@viggo.jf.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 21, 2014 at 11:24:21AM -0700, Dave Hansen wrote: > > From: Dave 
Hansen
>
> I think the flush_tlb_mm_range() code that tries to tune the
> flush sizes based on the CPU needs to get ripped out for
> several reasons:
>
> 1. It is obviously buggy. It uses mm->total_vm to judge the
>    task's footprint in the TLB. It should certainly be using
>    some measure of RSS, *NOT* ->total_vm since only resident
>    memory can populate the TLB.

Agreed. Even an RSS check is dodgy considering that it is not a reliable
indication of recent reference activity and how many relevant TLB entries
there may be for the task.

> 2. Haswell, and several other CPUs are missing from the
>    intel_tlb_flushall_shift_set() function. Thus, it has been
>    demonstrated to bitrot quickly in practice.

I also worried that the methodology used to set that shift on different
CPUs was different.

> 3. It is plain wrong in my vm:
> 	[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> 	[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
> 	[ 0.037444] tlb_flushall_shift: 6
>    Which leads to it to never use invlpg.
> 4. The assumptions about TLB refill costs are wrong:
>    http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
>    (more on this in later patches)
> 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
>    I believe the sample times were too short. Running the
>    benchmark in a loop yields times that vary quite a bit.
>

FWIW, when I last visited this topic I had to modify the test case
extensively and even then it was not driven by flush ranges measured
from "real" workloads.

> Note that this leaves us with a static ceiling of 1 page. This
> is a conservative, dumb setting, and will be revised in a later
> patch.
> > Signed-off-by: Dave Hansen > --- > > b/arch/x86/include/asm/processor.h | 1 > b/arch/x86/kernel/cpu/amd.c | 7 -- > b/arch/x86/kernel/cpu/common.c | 13 ----- > b/arch/x86/kernel/cpu/intel.c | 26 ---------- > b/arch/x86/mm/tlb.c | 91 ++++++------------------------------- > 5 files changed, 19 insertions(+), 119 deletions(-) > > diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h > --- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.813835861 -0700 > +++ b/arch/x86/include/asm/processor.h 2014-04-21 11:10:34.823836313 -0700 > @@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I > extern u16 __read_mostly tlb_lld_2m[NR_INFO]; > extern u16 __read_mostly tlb_lld_4m[NR_INFO]; > extern u16 __read_mostly tlb_lld_1g[NR_INFO]; > -extern s8 __read_mostly tlb_flushall_shift; > > /* > * CPU type and hardware bug flags. Kept separately for each CPU. > diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c > --- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.814835907 -0700 > +++ b/arch/x86/kernel/cpu/amd.c 2014-04-21 11:10:34.824836358 -0700 > @@ -741,11 +741,6 @@ static unsigned int amd_size_cache(struc > } > #endif > > -static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c) > -{ > - tlb_flushall_shift = 6; > -} > - > static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c) > { > u32 ebx, eax, ecx, edx; > @@ -793,8 +788,6 @@ static void cpu_detect_tlb_amd(struct cp > tlb_lli_2m[ENTRIES] = eax & mask; > > tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1; > - > - cpu_set_tlb_flushall_shift(c); > } > > static const struct cpu_dev amd_cpu_dev = { > diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c > --- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.816835998 -0700 > +++ 
b/arch/x86/kernel/cpu/common.c 2014-04-21 11:10:34.825836403 -0700 > @@ -479,26 +479,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO]; > u16 __read_mostly tlb_lld_4m[NR_INFO]; > u16 __read_mostly tlb_lld_1g[NR_INFO]; > > -/* > - * tlb_flushall_shift shows the balance point in replacing cr3 write > - * with multiple 'invlpg'. It will do this replacement when > - * flush_tlb_lines <= active_lines/2^tlb_flushall_shift. > - * If tlb_flushall_shift is -1, means the replacement will be disabled. > - */ > -s8 __read_mostly tlb_flushall_shift = -1; > - > void cpu_detect_tlb(struct cpuinfo_x86 *c) > { > if (this_cpu->c_detect_tlb) > this_cpu->c_detect_tlb(c); > > printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" > - "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n" > - "tlb_flushall_shift: %d\n", > + "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n", > tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES], > tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES], > tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES], > - tlb_lld_1g[ENTRIES], tlb_flushall_shift); > + tlb_lld_1g[ENTRIES]); > } > > void detect_ht(struct cpuinfo_x86 *c) > diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c > --- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.818836088 -0700 > +++ b/arch/x86/kernel/cpu/intel.c 2014-04-21 11:10:34.825836403 -0700 > @@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsig > } > } > > -static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c) > -{ > - switch ((c->x86 << 8) + c->x86_model) { > - case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */ > - case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */ > - case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */ > - case 0x61d: /* six-core 45 nm xeon "Dunnington" */ > - tlb_flushall_shift = -1; > - break; > - case 0x63a: /* Ivybridge */ > - 
tlb_flushall_shift = 2;
> -		break;
> -	case 0x61a: /* 45 nm nehalem, "Bloomfield" */
> -	case 0x61e: /* 45 nm nehalem, "Lynnfield" */
> -	case 0x625: /* 32 nm nehalem, "Clarkdale" */
> -	case 0x62c: /* 32 nm nehalem, "Gulftown" */
> -	case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
> -	case 0x62f: /* 32 nm Xeon E7 */
> -	case 0x62a: /* SandyBridge */
> -	case 0x62d: /* SandyBridge, "Romely-EP" */
> -	default:
> -		tlb_flushall_shift = 6;
> -	}
> -}
> -
>  static void intel_detect_tlb(struct cpuinfo_x86 *c)
>  {
>  	int i, j, n;
> @@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpui
>  		for (j = 1 ; j < 16 ; j++)
>  			intel_tlb_lookup(desc[j]);
>  	}
> -	intel_tlb_flushall_shift_set(c);
>  }
>
>  static const struct cpu_dev intel_cpu_dev = {
> diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c
> --- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-04-21 11:10:34.820836178 -0700
> +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:34.826836449 -0700
> @@ -158,13 +158,22 @@ void flush_tlb_current_task(void)
>  	preempt_enable();
>  }
>
> +/*
> + * See Documentation/x86/tlb.txt for details. We choose 33
> + * because it is large enough to cover the vast majority (at
> + * least 95%) of allocations, and is small enough that we are
> + * confident it will not cause too much overhead. Each single
> + * flush is about 100 cycles, so this caps the maximum overhead
> + * at _about_ 3,000 cycles.
> + */
> +/* in units of pages */
> +unsigned long tlb_single_page_flush_ceiling = 1;
> +

This comment is premature. The documentation file does not exist yet and
33 means nothing yet. Out of curiosity though, how confident are you that
a TLB flush is generally 100 cycles across different generations and
manufacturers of CPUs? I'm not suggesting you change it or auto-tune it,
am just curious.
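As a worked version of the arithmetic in the quoted comment: with a ceiling of 33 pages and roughly 100 cycles per individual flush, the worst case before the full-flush path takes over is 33 x 100 = 3,300 cycles, which the comment rounds to "about 3,000". A minimal sketch, where `max_single_flush_cycles` is a hypothetical helper and the per-flush cost is the estimate from the comment, not a measured value:

```c
/*
 * Upper bound on cycles spent issuing individual flushes for a range
 * that is at or below the ceiling. Both inputs are estimates; the
 * per-flush cost in particular may differ across CPU generations.
 */
unsigned long max_single_flush_cycles(unsigned long ceiling_pages,
				      unsigned long cycles_per_flush)
{
	return ceiling_pages * cycles_per_flush;
}
```

Plugging in a different per-flush cost shows how the bound would move on hardware where an individual flush is cheaper or more expensive than the assumed 100 cycles.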
> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, > unsigned long end, unsigned long vmflag) > { > int need_flush_others_all = 1; > unsigned long addr; > - unsigned act_entries, tlb_entries = 0; > - unsigned long nr_base_pages; > > preempt_disable(); > if (current->active_mm != mm) > @@ -175,25 +184,12 @@ void flush_tlb_mm_range(struct mm_struct > goto out; > } > > - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 > - || vmflag & VM_HUGETLB) { > + if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { > local_flush_tlb(); > goto out; > } > > - /* In modern CPU, last level tlb used for both data/ins */ > - if (vmflag & VM_EXEC) > - tlb_entries = tlb_lli_4k[ENTRIES]; > - else > - tlb_entries = tlb_lld_4k[ENTRIES]; > - > - /* Assume all of TLB entries was occupied by this task */ > - act_entries = tlb_entries >> tlb_flushall_shift; > - act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm; > - nr_base_pages = (end - start) >> PAGE_SHIFT; > - > - /* tlb_flushall_shift is on balance point, details in commit log */ > - if (nr_base_pages > act_entries) { > + if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > local_flush_tlb(); > } else {

We lose the different tuning based on whether the flush is for instructions or data. However, I cannot think of a good reason for keeping it, as I expect that flushes of instructions are relatively rare. The benefit, if any, will be marginal. Still, if you do another revision it would be nice to call this out in the changelog.
Otherwise Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752762AbaDXIt3 (ORCPT ); Thu, 24 Apr 2014 04:49:29 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47028 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751787AbaDXIt1 (ORCPT ); Thu, 24 Apr 2014 04:49:27 -0400 Date: Thu, 24 Apr 2014 09:49:23 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 3/6] x86: mm: fix missed global TLB flush stat Message-ID: <20140424084922.GR23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182422.DE5E728F@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182422.DE5E728F@viggo.jf.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 21, 2014 at 11:24:22AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > If we take the > > if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { > local_flush_tlb(); > goto out; > } > > path out of flush_tlb_mm_range(), we will have flushed the tlb, > but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the > way out of the function so that we always take a single path when > doing a full tlb flush. 
> > Signed-off-by: Dave Hansen Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755925AbaDXKOi (ORCPT ); Thu, 24 Apr 2014 06:14:38 -0400 Received: from cantor2.suse.de ([195.135.220.15]:48711 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755403AbaDXKOZ (ORCPT ); Thu, 24 Apr 2014 06:14:25 -0400 Date: Thu, 24 Apr 2014 11:14:20 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes Message-ID: <20140424101419.GS23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182425.93E696A3@viggo.jf.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 21, 2014 at 11:24:25AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > We don't have any good way to figure out what kinds of flushes > are being attempted. Right now, we can try to use the vm > counters, but those only tell us what we actually did with the > hardware (one-by-one vs full) and don't tell us what was actually > _requested_. > And when enabled they are a penalty even for those that don't care. > This allows us to select out "interesting" TLB flushes that we > might want to optimize (like the ranged ones) and ignore the ones > that we have very little control over (the ones at context > switch). 
> > Also, since we have a pair of tracepoint calls in > flush_tlb_mm_range(), we can time the deltas between them to make > sure that we got the "invlpg vs. global flush" balance correct in > practice. > > Signed-off-by: Dave Hansen > --- > > b/arch/x86/include/asm/mmu_context.h | 6 +++++ > b/arch/x86/mm/tlb.c | 12 +++++++++-- > b/include/linux/mm_types.h | 10 +++++++++ > b/include/trace/events/tlb.h | 37 +++++++++++++++++++++++++++++++++++ > b/mm/Makefile | 2 - > b/mm/trace_tlb.c | 12 +++++++++++ > 6 files changed, 76 insertions(+), 3 deletions(-) > > diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h > --- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-04-21 11:10:35.519867746 -0700 > +++ b/arch/x86/include/asm/mmu_context.h 2014-04-21 11:10:35.527868108 -0700 > @@ -3,6 +3,10 @@ > > #include > #include > +#include > + > +#include > + > #include > #include > #include > @@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s > > /* Re-load page tables */ > load_cr3(next->pgd); > + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > > /* Stop flush ipis for the previous mm */ > cpumask_clear_cpu(cpu, mm_cpumask(prev)); > @@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s > * to make sure to use no freed page tables. 
> */ > load_cr3(next->pgd); > + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > load_LDT_nolock(&next->context); > } > } > diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c > --- a/arch/x86/mm/tlb.c~tlb-trace-flushes 2014-04-21 11:10:35.520867791 -0700 > +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.528868153 -0700 > @@ -14,6 +14,8 @@ > #include > #include > > +#include > + > DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) > = { &init_mm, 0, }; > > @@ -49,6 +51,7 @@ void leave_mm(int cpu) > if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { > cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); > load_cr3(swapper_pg_dir); > + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > } > } > EXPORT_SYMBOL_GPL(leave_mm); > @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info) > > count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) { > - if (f->flush_end == TLB_FLUSH_ALL) > + if (f->flush_end == TLB_FLUSH_ALL) { > local_flush_tlb(); > - else if (!f->flush_end) > + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL); > + } else if (!f->flush_end) > __flush_tlb_single(f->flush_start); > else { > unsigned long addr;

Why is only the TLB_FLUSH_ALL case traced here and not the single flush or range of flushes? __native_flush_tlb_single() doesn't have a trace point, so I worry we are missing visibility on this part in particular:

	while (addr < f->flush_end) {
		__flush_tlb_single(addr);
		addr += PAGE_SIZE;
	}

> @@ -152,7 +156,9 @@ void flush_tlb_current_task(void) > preempt_disable(); > > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); > local_flush_tlb(); > + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL); > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) > flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); > preempt_enable();

Are the two tracepoints really useful?
Are they fine-grained enough to measure the cost of the TLB flush? It misses the refill, obviously, but there is not much we can do there.

> @@ -188,6 +194,7 @@ void flush_tlb_mm_range(struct mm_struct > if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) > base_pages_to_flush = (end - start) >> PAGE_SHIFT; > > + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush); > if (base_pages_to_flush > tlb_single_page_flush_ceiling) { > base_pages_to_flush = TLB_FLUSH_ALL; > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > @@ -199,6 +206,7 @@ void flush_tlb_mm_range(struct mm_struct > __flush_tlb_single(addr); > } > } > + trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush); > out: > if (base_pages_to_flush == TLB_FLUSH_ALL) { > start = 0UL; > diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h > --- a/include/linux/mm_types.h~tlb-trace-flushes 2014-04-21 11:10:35.522867881 -0700 > +++ b/include/linux/mm_types.h 2014-04-21 11:10:35.529868198 -0700 > @@ -510,4 +510,14 @@ static inline void clear_tlb_flush_pendi > } > #endif > > +enum tlb_flush_reason { > + TLB_FLUSH_ON_TASK_SWITCH, > + TLB_REMOTE_SHOOTDOWN, > + TLB_LOCAL_SHOOTDOWN, > + TLB_LOCAL_SHOOTDOWN_DONE, > + TLB_LOCAL_MM_SHOOTDOWN, > + TLB_LOCAL_MM_SHOOTDOWN_DONE, > + NR_TLB_FLUSH_REASONS, > +}; > +

Bonus points if you use string formatting similar to the reason field in events/writeback.h. You do something like that already, but there are already helpers for use with __print_symbolic, so you do not need to roll your own version. It should reduce the need to add trace_tlb.c if you include the header in something like memory.c instead.
> #endif /* _LINUX_MM_TYPES_H */ > diff -puN /dev/null include/trace/events/tlb.h > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 > @@ -0,0 +1,37 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM tlb > + > +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_TLB_H > + > +#include > +#include > + > +extern const char * const tlb_flush_reason_desc[]; > + > +TRACE_EVENT(tlb_flush, > + > + TP_PROTO(int reason, unsigned long pages), > + TP_ARGS(reason, pages), > + > + TP_STRUCT__entry( > + __field( int, reason) > + __field(unsigned long, pages) > + ), > + > + TP_fast_assign( > + __entry->reason = reason; > + __entry->pages = pages; > + ), > + > + TP_printk("pages: %ld reason: %d (%s)", > + __entry->pages, > + __entry->reason, > + tlb_flush_reason_desc[__entry->reason]) > +); > + I would also suggest you match the output formatting with writeback.h which would look like pages:%lu reason:%s The raw format should still have the integer while the string formatting would have something human readable. 
Instead > +#endif /* _TRACE_TLB_H */ > + > +/* This part must be outside protection */ > +#include > + > diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile > --- a/mm/Makefile~tlb-trace-flushes 2014-04-21 11:10:35.524867971 -0700 > +++ b/mm/Makefile 2014-04-21 11:10:35.530868243 -0700 > @@ -5,7 +5,7 @@ > mmu-y := nommu.o > mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ > mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ > - vmalloc.o pagewalk.o pgtable-generic.o > + vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o > > ifdef CONFIG_CROSS_MEMORY_ATTACH > mmu-$(CONFIG_MMU) += process_vm_access.o > diff -puN /dev/null mm/trace_tlb.c > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/mm/trace_tlb.c 2014-04-21 11:10:35.530868243 -0700 > @@ -0,0 +1,12 @@ > +#define CREATE_TRACE_POINTS > +#include > + > +const char * const tlb_flush_reason_desc[] = { > + __stringify(TLB_FLUSH_ON_TASK_SWITCH), > + __stringify(TLB_REMOTE_SHOOTDOWN), > + __stringify(TLB_LOCAL_SHOOTDOWN), > + __stringify(TLB_LOCAL_SHOOTDOWN_DONE), > + __stringify(TLB_LOCAL_MM_SHOOTDOWN), > + __stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE), > +}; > + > _ -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755752AbaDXKhi (ORCPT ); Thu, 24 Apr 2014 06:37:38 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49246 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755348AbaDXKhc (ORCPT ); Thu, 24 Apr 2014 06:37:32 -0400 Date: Thu, 24 Apr 2014 11:37:27 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. 
Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush Message-ID: <20140424103727.GT23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182426.D6DD1E8F@viggo.jf.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > Most of the logic here is in the documentation file. Please take > a look at it. > > I know we've come full-circle here back to a tunable, but this > new one is *WAY* simpler. I challenge anyone to describe in one > sentence how the old one worked. Challenge accepted. Based on the characteristics of the CPU and a given process, something semi-random will happen at flush time which may or may not benefit the workload. > Here's the way the new one > works: > > If we are flushing more pages than the ceiling, we use > the full flush, otherwise we use per-page flushes. 
> > Signed-off-by: Dave Hansen > --- > > b/Documentation/x86/tlb.txt | 72 ++++++++++++++++++++++++++++++++++++++++++++ > b/arch/x86/mm/tlb.c | 46 ++++++++++++++++++++++++++++ > 2 files changed, 118 insertions(+) > > diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c > --- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush 2014-04-21 11:10:35.901884997 -0700 > +++ b/arch/x86/mm/tlb.c 2014-04-21 11:10:35.905885179 -0700 > @@ -274,3 +274,49 @@ void flush_tlb_kernel_range(unsigned lon > on_each_cpu(do_kernel_range_flush, &info, 1); > } > } > + > +static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, > + size_t count, loff_t *ppos) > +{ > + char buf[32]; > + unsigned int len; > + > + len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling); > + return simple_read_from_buffer(user_buf, count, ppos, buf, len); > +} > + > +static ssize_t tlbflush_write_file(struct file *file, > + const char __user *user_buf, size_t count, loff_t *ppos) > +{ > + char buf[32]; > + ssize_t len; > + int ceiling; > + > + len = min(count, sizeof(buf) - 1); > + if (copy_from_user(buf, user_buf, len)) > + return -EFAULT; > + > + buf[len] = '\0'; > + if (kstrtoint(buf, 0, &ceiling)) > + return -EINVAL; > + > + if (ceiling < 0) > + return -EINVAL; > + > + tlb_single_page_flush_ceiling = ceiling; > + return count; > +} > + > +static const struct file_operations fops_tlbflush = { > + .read = tlbflush_read_file, > + .write = tlbflush_write_file, > + .llseek = default_llseek, > +}; > + > +static int __init create_tlb_single_page_flush_ceiling(void) > +{ > + debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, > + arch_debugfs_dir, NULL, &fops_tlbflush); > + return 0; > +} > +late_initcall(create_tlb_single_page_flush_ceiling); > diff -puN /dev/null Documentation/x86/tlb.txt > --- /dev/null 2014-04-10 11:28:14.066815724 -0700 > +++ b/Documentation/x86/tlb.txt 2014-04-21 11:10:35.924886036 -0700 > @@ -0,0 +1,72 @@ > 
+nWhen the kernel unmaps or modified the attributes of a range of > +memory, it has two choices: s/nWhen/When > + 1. Flush the entire TLB with a two-instruction sequence. This is > + a quick operation, but it causes collateral damage: TLB entries > + from areas other than the one we are trying to flush will be > + destroyed and must be refilled later, at some cost. > + 2. Use the invlpg instruction to invalidate a single page at a > + time. This could potentialy cost many more instructions, but > + it is a much more precise operation, causing no collateral > + damage to other TLB entries. > + It's not stated that there is no range flush instruction for x86 but anyone who cares about this area should know that. > +Which method to do depends on a few things: > + 1. The size of the flush being performed. A flush of the entire > + address space is obviously better performed by flushing the > + entire TLB than doing 2^48/PAGE_SIZE individual flushes. > + 2. The contents of the TLB. If the TLB is empty, then there will > + be no collateral damage caused by doing the global flush, and > + all of the individual flush will have ended up being wasted > + work. > + 3. The size of the TLB. The larger the TLB, the more collateral > + damage we do with a full flush. So, the larger the TLB, the > + more attrative an individual flush looks. Data and > + instructions have separate TLBs, as do different page sizes. > + 4. The microarchitecture. The TLB has become a multi-level > + cache on modern CPUs, and the global flushes have become more > + expensive relative to single-page flushes. > + > +There is obviously no way the kernel can know all these things, > +especially the contents of the TLB during a given flush. The > +sizes of the flush will vary greatly depending on the workload as > +well. There is essentially no "right" point to choose. 
> + > +You may be doing too many individual invalidations if you see the > +invlpg instruction (or instructions _near_ it) show up high in > +profiles. If you believe that individual invalidatoins being > +called too often, you can lower the tunable: > + s/invalidatoins/invalidations/ > + /sys/debug/kernel/x86/tlb_single_page_flush_ceiling > + You do not describe how to use the tracepoints but again anyone investigating this area should know how to do it already so *shrugs*. Rolling a systemtap script to display the information would be a short job. > +This will cause us to do the global flush for more cases. > +Lowering it to 0 will disable the use of the individual flushes. > +Setting it to 1 is a very conservative setting and it should > +never need to be 0 under normal circumstances. > + > +Despite the fact that a single individual flush on x86 is > +guaranteed to flush a full 2MB, hugetlbfs always uses the full > +flushes. THP is treated exactly the same as normal memory. > + You are the second person that told me this and I felt the manual was unclear on this subject. I was told that it might be a documentation bug but because this discussion was in a bar I completely failed to follow up on it. Specifically this part in 4.10.2.3 caused me problems when I last looked at the area. If the paging structures specify a translation using a page larger than 4 KBytes, some processors may choose to cache multiple smaller-page TLB entries for that translation. Each such TLB entry would be associated with a page number corresponding to the smaller page size (e.g., bits 47:12 of a linear address with IA-32e paging), even though part of that page number (e.g., bits 20:12) are part of the offset with respect to the page specified by the paging structures. 
The upper bits of the physical address in such a TLB entry are derived from the physical address in the PDE used to create the translation, while the lower bits come from the linear address of the access for which the translation is created. There is no way for software to be aware that multiple translations for smaller pages have been used for a large page. If software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses changes, the TLBs may subsequently contain multiple translations for the address range (one for each page size). A reference to a linear address in the address range may use any of these translations. Which translation is used may vary from one execution to another, and the choice may be implementation-specific. This was ambiguous to me because of "some processors may choose to cache multiple smaller-page TLB entries for that translation". The second paragraph appears to partially contradict that but I could not see an architectural guarantee that flushing a page address within a huge page entry was guaranteed to flush all entries. I understand that there are definite problems around the time of splitting/collapsing a large page where care has to be taken that old TLB entries are not present but that's a different case. > +You might see invlpg inside of flush_tlb_mm_range() show up in > +profiles, or you can use the trace_tlb_flush() tracepoints. to > +determine how long the flush operations are taking. > + > +Essentially, you are balancing the cycles you spend doing invlpg > +with the cycles that you spend refilling the TLB later. 
> + > +You can measure how expensive TLB refills are by using > +performance counters and 'perf stat', like this: > + > +perf stat -e > + cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, > + cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, > + cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, > + cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, > + cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, > + cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ > + > +That works on an IvyBridge-era CPU (i5-3320M). Different CPUs > +may have differently-named counters, but they should at least > +be there in some form. You can use pmu-tools 'ocperf list' > +(https://github.com/andikleen/pmu-tools) to find the right > +counters for a given CPU. > + -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755882AbaDXKrD (ORCPT ); Thu, 24 Apr 2014 06:47:03 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49412 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753083AbaDXKq5 (ORCPT ); Thu, 24 Apr 2014 06:46:57 -0400 Date: Thu, 24 Apr 2014 11:46:53 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 6/6] x86: mm: set TLB flush tunable to sane value (33) Message-ID: <20140424104147.GU23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182428.FC2104C1@viggo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140421182428.FC2104C1@viggo.jf.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org On Mon, Apr 21, 2014 at 11:24:28AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > This has been run through Intel's LKP tests across a wide range > of modern sytems and workloads and it wasn't shown to make a > measurable performance difference positive or negative. > > Now that we have some shiny new tracepoints, we can actually > figure out what the heck is going on. > Good stuff. This is the type of thing I should have done the last time to set the parameters for the tlbflush microbench. Nice one out of you! > During a kernel compile, 60% of the flush_tlb_mm_range() calls > are for a single page. It breaks down like this: > > size percent percent<= > V V V > GLOBAL: 2.20% 2.20% avg cycles: 2283 > 1: 56.92% 59.12% avg cycles: 1276 > 2: 13.78% 72.90% avg cycles: 1505 > 3: 8.26% 81.16% avg cycles: 1880 > 4: 7.41% 88.58% avg cycles: 2447 > 5: 1.73% 90.31% avg cycles: 2358 > 6: 1.32% 91.63% avg cycles: 2563 > 7: 1.14% 92.77% avg cycles: 2862 > 8: 0.62% 93.39% avg cycles: 3542 > 9: 0.08% 93.47% avg cycles: 3289 > 10: 0.43% 93.90% avg cycles: 3570 > 11: 0.20% 94.10% avg cycles: 3767 > 12: 0.08% 94.18% avg cycles: 3996 > 13: 0.03% 94.20% avg cycles: 4077 > 14: 0.02% 94.23% avg cycles: 4836 > 15: 0.04% 94.26% avg cycles: 5699 > 16: 0.06% 94.32% avg cycles: 5041 > 17: 0.57% 94.89% avg cycles: 5473 > 18: 0.02% 94.91% avg cycles: 5396 > 19: 0.03% 94.95% avg cycles: 5296 > 20: 0.02% 94.96% avg cycles: 6749 > 21: 0.18% 95.14% avg cycles: 6225 > 22: 0.01% 95.15% avg cycles: 6393 > 23: 0.01% 95.16% avg cycles: 6861 > 24: 0.12% 95.28% avg cycles: 6912 > 25: 0.05% 95.32% avg cycles: 7190 > 26: 0.01% 95.33% avg cycles: 7793 > 27: 0.01% 95.34% avg cycles: 7833 > 28: 0.01% 95.35% avg cycles: 8253 > 29: 0.08% 95.42% avg cycles: 8024 > 30: 0.03% 95.45% avg cycles: 9670 > 31: 0.01% 95.46% avg cycles: 8949 > 32: 0.01% 95.46% avg cycles: 9350 > 33: 3.11% 98.57% avg cycles: 8534 > 34: 0.02% 98.60% avg cycles: 10977 > 35: 0.02% 98.62% avg cycles: 11400 > > 
We get in to dimishing returns pretty quickly. On pre-IvyBridge > CPUs, we used to set the limit at 8 pages, and it was set at 128 > on IvyBrige. That 128 number looks pretty silly considering that > less than 0.5% of the flushes are that large. > > The previous code tried to size this number based on the size of > the TLB. Good idea, but it's error-prone, needs maintenance > (which it didn't get up to now), and probably would not matter in > practice much. > > Settting it to 33 means that we cover the mallopt > M_TRIM_THRESHOLD, which is the most universally common size to do > flushes. > A kernel compile is hardly a representative workload but I accept the logic of tuning it based on current settings for M_TRIM_THRESHOLD and the tools are there to do a more detailed analysis if tlb flush times for people are identified as being a problem. Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758564AbaDXQ6P (ORCPT ); Thu, 24 Apr 2014 12:58:15 -0400 Received: from www.sr71.net ([198.145.64.142]:52759 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758310AbaDXQ6N (ORCPT ); Thu, 24 Apr 2014 12:58:13 -0400 Message-ID: <535942A3.3020800@sr71.net> Date: Thu, 24 Apr 2014 09:58:11 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> In-Reply-To: <20140424084552.GQ23991@suse.de> Content-Type: 
text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On 04/24/2014 01:45 AM, Mel Gorman wrote:
>> +/* >> + * See Documentation/x86/tlb.txt for details. We choose 33 >> + * because it is large enough to cover the vast majority (at >> + * least 95%) of allocations, and is small enough that we are >> + * confident it will not cause too much overhead. Each single >> + * flush is about 100 cycles, so this caps the maximum overhead >> + * at _about_ 3,000 cycles. >> + */ >> +/* in units of pages */ >> +unsigned long tlb_single_page_flush_ceiling = 1; >> + > > This comment is premature. The documentation file does not exist yet and > 33 means nothing yet. Out of curiosity though, how confident are you > that a TLB flush is generally 100 cycles across different generations > and manufacturers of CPUs? I'm not suggesting you change it or auto-tune > it, am just curious.

Yeah, the comment belongs in the later patch where I set it to 33.

I looked at this on the last few generations of Intel CPUs. "100 cycles" was a very general statement, and not precise at all. My laptop averages out to 113 cycles overall, but the flushes of 25 pages averaged 96 cycles/page while the flushes of 2 averaged 219 cycles/page. Those cycles include some cost from the instrumentation as well.

I did not test on other CPU manufacturers, but this should be pretty easy to reproduce. I'm happy to help folks re-run it on other hardware.

I also believe with the modalias stuff we've got in sysfs for the CPU objects we can do this in the future with udev rules instead of hard-coding it in the kernel.
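As a rough illustration of those figures, here is a minimal userspace sketch of the arithmetic. It assumes the per-page averages quoted above (one laptop, instrumentation cost included) and, as a further assumption that is not a measurement, that flush cost is a fixed overhead plus a constant per-page cost. The struct and function names are made up for the example and are not kernel symbols.

```c
#include <assert.h>

/* One measured data point: flush size and average cycles per page. */
struct flush_sample {
	unsigned long pages;
	unsigned long cycles_per_page;
};

/* Total cycles for one flush of the sampled size. */
static unsigned long total_cycles(struct flush_sample s)
{
	return s.pages * s.cycles_per_page;
}

/*
 * Assumption (not a measurement): cost = fixed + per_page * pages.
 * Returns the per-page slope of the line through two samples,
 * rounded down to whole cycles.
 */
static unsigned long per_page_slope(struct flush_sample a,
				    struct flush_sample b)
{
	assert(b.pages > a.pages);
	assert(total_cycles(b) >= total_cycles(a));
	return (total_cycles(b) - total_cycles(a)) / (b.pages - a.pages);
}

/* Fixed overhead implied by the same two samples under that assumption. */
static unsigned long fixed_overhead(struct flush_sample a,
				    struct flush_sample b)
{
	unsigned long variable = per_page_slope(a, b) * a.pages;

	assert(variable <= total_cycles(a));
	return total_cycles(a) - variable;
}
```

With the 2-page and 25-page samples this gives 438 and 2400 total cycles, a slope of about 85 cycles per page and roughly 268 cycles of fixed cost, which would be consistent with the per-page average dropping as the flushes get larger. At a ceiling of 33 pages and about 100 cycles per flush, the cap works out to 3,300 cycles, in line with the "_about_ 3,000" in the comment.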
>> - /* In modern CPU, last level tlb used for both data/ins */ >> - if (vmflag & VM_EXEC) >> - tlb_entries = tlb_lli_4k[ENTRIES]; >> - else >> - tlb_entries = tlb_lld_4k[ENTRIES]; >> - >> - /* Assume all of TLB entries was occupied by this task */ >> - act_entries = tlb_entries >> tlb_flushall_shift; >> - act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm; >> - nr_base_pages = (end - start) >> PAGE_SHIFT; >> - >> - /* tlb_flushall_shift is on balance point, details in commit log */ >> - if (nr_base_pages > act_entries) { >> + if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) { >> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> local_flush_tlb(); >> } else { > > We lose the different tuning based on whether the flush is for instructions > or data. However, I cannot think of a good reason for keeping it as I > expect that flushes of instructions is relatively rare. The benefit, if > any, will be marginal. Still, if you do another revision it would be > nice to call this out in the changelog. Will do. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758230AbaDXRZ4 (ORCPT ); Thu, 24 Apr 2014 13:25:56 -0400 Received: from www.sr71.net ([198.145.64.142]:53014 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753419AbaDXRZy (ORCPT ); Thu, 24 Apr 2014 13:25:54 -0400 Message-ID: <53594920.8030203@sr71.net> Date: Thu, 24 Apr 2014 10:25:52 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. 
Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> In-Reply-To: <20140424103727.GT23991@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 03:37 AM, Mel Gorman wrote: > On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote: >> +This will cause us to do the global flush for more cases. >> +Lowering it to 0 will disable the use of the individual flushes. >> +Setting it to 1 is a very conservative setting and it should >> +never need to be 0 under normal circumstances. >> + >> +Despite the fact that a single individual flush on x86 is >> +guaranteed to flush a full 2MB, hugetlbfs always uses the full >> +flushes. THP is treated exactly the same as normal memory. >> + > > You are the second person that told me this and I felt the manual was > unclear on this subject. I was told that it might be a documentation bug > but because this discussion was in a bar I completely failed to follow up > on it. Specifically this part in 4.10.2.3 caused me problems when I last > looked at the area. My understanding comes from "4.10.4.2 Recommended Invalidation": • If software modifies a paging-structure entry that identifies the final page frame for a page number (either a PTE or a paging-structure entry in which the PS flag is 1), it should execute INVLPG for any linear address with a page number whose translation uses that PTE. 2 and especially the footnote: 2. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes. I do agree that it's ambiguous at best. I'll go see if anybody cares to update that bit. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758852AbaDXR4a (ORCPT ); Thu, 24 Apr 2014 13:56:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38904 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754765AbaDXR42 (ORCPT ); Thu, 24 Apr 2014 13:56:28 -0400 Message-ID: <53594FB3.9050505@redhat.com> Date: Thu, 24 Apr 2014 13:53:55 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Dave Hansen , Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53594920.8030203@sr71.net> In-Reply-To: <53594920.8030203@sr71.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 01:25 PM, Dave Hansen wrote: > On 04/24/2014 03:37 AM, Mel Gorman wrote: >> On Mon, Apr 21, 2014 at 11:24:26AM -0700, Dave Hansen wrote: >>> +This will cause us to do the global flush for more cases. >>> +Lowering it to 0 will disable the use of the individual flushes. >>> +Setting it to 1 is a very conservative setting and it should >>> +never need to be 0 under normal circumstances. >>> + >>> +Despite the fact that a single individual flush on x86 is >>> +guaranteed to flush a full 2MB, hugetlbfs always uses the full >>> +flushes. THP is treated exactly the same as normal memory. 
>>> + >> >> You are the second person that told me this and I felt the manual was >> unclear on this subject. I was told that it might be a documentation bug >> but because this discussion was in a bar I completely failed to follow up >> on it. Specifically this part in 4.10.2.3 caused me problems when I last >> looked at the area. > > > My understanding comes from "4.10.4.2 Recommended Invalidation": > > • If software modifies a paging-structure entry that identifies > the final page frame for a page number (either a PTE or a > paging-structure entry in which the PS flag is 1), it should > execute INVLPG for any linear address with a page number whose > translation uses that PTE. 2 > > and especially the footnote: > > 2. One execution of INVLPG is sufficient even for a page with > size greater than 4 KBytes. > > I do agree that it's ambiguous at best. I'll go see if anybody cares to > update that bit. I suspect that IF the TLB actually uses a 2MB entry for the translation, a single INVLPG will work. However, the CPU is free to cache the translations for a 2MB region with a bunch of 4kB entries, if it wanted to, so in the end we have no guarantee that an INVLPG will actually do the right thing... The same is definitely true for 1GB vs 2MB entries, with some CPUs being capable of parsing page tables with 1GB entries, but having no TLB entries for 1GB translations. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758767AbaDXSAh (ORCPT ); Thu, 24 Apr 2014 14:00:37 -0400 Received: from cantor2.suse.de ([195.135.220.15]:57312 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757849AbaDXSAg (ORCPT ); Thu, 24 Apr 2014 14:00:36 -0400 Date: Thu, 24 Apr 2014 19:00:30 +0100 From: Mel Gorman To: Dave Hansen Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Message-ID: <20140424180030.GX23991@suse.de> References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> <535942A3.3020800@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <535942A3.3020800@sr71.net> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 24, 2014 at 09:58:11AM -0700, Dave Hansen wrote: > On 04/24/2014 01:45 AM, Mel Gorman wrote: > >> +/* > >> + * See Documentation/x86/tlb.txt for details. We choose 33 > >> + * because it is large enough to cover the vast majority (at > >> + * least 95%) of allocations, and is small enough that we are > >> + * confident it will not cause too much overhead. Each single > >> + * flush is about 100 cycles, so this caps the maximum overhead > >> + * at _about_ 3,000 cycles. > >> + */ > >> +/* in units of pages */ > >> +unsigned long tlb_single_page_flush_ceiling = 1; > >> + > > > > This comment is premature. The documentation file does not exist yet and > > 33 means nothing yet. 
Out of curiosity though, how confident are you > > that a TLB flush is generally 100 cycles across different generations > > and manufacturers of CPUs? I'm not suggesting you change it or auto-tune > > it, am just curious. > > Yeah, the comment belongs in the later patch where I set it to 33. > > I looked at this on the last few generations of Intel CPUs. "100 > cycles" was a very general statement, and not precise at all. My laptop > averages out to 113 cycles overall, but the flushes of 25 pages averaged > 96 cycles/page while the flushes of 2 averaged 219/page. > > Those cycles include some costs from the instrumentation as well. > > I did not test on other CPU manufacturers, but this should be pretty > easy to reproduce. I'm happy to help folks re-run it on other hardware. > > I also believe with the modalias stuff we've got in sysfs for the CPU > objects we can do this in the future with udev rules instead of > hard-coding it in the kernel. > You convinced me. Regardless of whether you move the comment or update the changelog; Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752880AbaDXUmV (ORCPT ); Thu, 24 Apr 2014 16:42:21 -0400 Received: from www.sr71.net ([198.145.64.142]:54481 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751789AbaDXUmU (ORCPT ); Thu, 24 Apr 2014 16:42:20 -0400 Message-ID: <5359772A.8070108@sr71.net> Date: Thu, 24 Apr 2014 13:42:18 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 4/6] x86: mm: trace tlb flushes References: 
<20140421182418.81CF7519@viggo.jf.intel.com> <20140421182425.93E696A3@viggo.jf.intel.com> <20140424101419.GS23991@suse.de> In-Reply-To: <20140424101419.GS23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 03:14 AM, Mel Gorman wrote: > On Mon, Apr 21, 2014 at 11:24:25AM -0700, Dave Hansen wrote: >> @@ -105,9 +108,10 @@ static void flush_tlb_func(void *info) >> >> count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); >> if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) { >> - if (f->flush_end == TLB_FLUSH_ALL) >> + if (f->flush_end == TLB_FLUSH_ALL) { >> local_flush_tlb(); >> - else if (!f->flush_end) >> + trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL); >> + } else if (!f->flush_end) >> __flush_tlb_single(f->flush_start); >> else { >> unsigned long addr; > > Why is only the TLB_FLUSH_ALL case traced here and not the single flush > or range of flushes? __native_flush_tlb_single() doesn't have a trace > point so I worry we are missing visibility on this part, in particular > this part: > > while (addr < f->flush_end) { > __flush_tlb_single(addr); > addr += PAGE_SIZE; > } You're right, I missed that bit. I've corrected it in a later version of the patch. >> @@ -152,7 +156,9 @@ void flush_tlb_current_task(void) >> preempt_disable(); >> >> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); >> local_flush_tlb(); >> + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL); >> if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) >> flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); >> preempt_enable(); > > Are the two tracepoints really useful? Are they fine enough to measure > the cost of the TLB flush? It misses the refill obviously but not much > we can do there.
It's fine enough, but I did realize over time that the cost of the tracepoint is about 3x the cost of a 1-page tlb flush itself, so these are unusable for detailed measurements. I'll remove it for now. >> #endif /* _LINUX_MM_TYPES_H */ >> diff -puN /dev/null include/trace/events/tlb.h >> --- /dev/null 2014-04-10 11:28:14.066815724 -0700 >> +++ b/include/trace/events/tlb.h 2014-04-21 11:10:35.529868198 -0700 >> @@ -0,0 +1,37 @@ >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM tlb >> + >> +#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_TLB_H >> + >> +#include >> +#include >> + >> +extern const char * const tlb_flush_reason_desc[]; >> + >> +TRACE_EVENT(tlb_flush, >> + >> + TP_PROTO(int reason, unsigned long pages), >> + TP_ARGS(reason, pages), >> + >> + TP_STRUCT__entry( >> + __field( int, reason) >> + __field(unsigned long, pages) >> + ), >> + >> + TP_fast_assign( >> + __entry->reason = reason; >> + __entry->pages = pages; >> + ), >> + >> + TP_printk("pages: %ld reason: %d (%s)", >> + __entry->pages, >> + __entry->reason, >> + tlb_flush_reason_desc[__entry->reason]) >> +); >> + > > I would also suggest you match the output formatting with writeback.h > which would look like > > pages:%lu reason:%s > > The raw format should still have the integer while the string formatting > would have something human readable. I can do that. The only bummer with the human-readable strings is turning them back in to something that the filters can take. 
I think I'll just do: + TP_printk("pages:%ld reason:%s (%d)", + __entry->pages, + __print_symbolic(__entry->reason, TLB_FLUSH_REASON), + __entry->reason) +); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757216AbaDXWD5 (ORCPT ); Thu, 24 Apr 2014 18:03:57 -0400 Received: from www.sr71.net ([198.145.64.142]:55197 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755811AbaDXWDy (ORCPT ); Thu, 24 Apr 2014 18:03:54 -0400 Message-ID: <53598A48.2090909@sr71.net> Date: Thu, 24 Apr 2014 15:03:52 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Rik van Riel , Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53594920.8030203@sr71.net> <53594FB3.9050505@redhat.com> In-Reply-To: <53594FB3.9050505@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 10:53 AM, Rik van Riel wrote: >> I do agree that it's ambiguous at best. I'll go see if anybody cares to >> update that bit. > > I suspect that IF the TLB actually uses a 2MB entry for the > translation, a single INVLPG will work. > > However, the CPU is free to cache the translations for a 2MB > region with a bunch of 4kB entries, if it wanted to, so in > the end we have no guarantee that an INVLPG will actually do > the right thing... 
> > The same is definitely true for 1GB vs 2MB entries, with > some CPUs being capable of parsing page tables with 1GB > entries, but having no TLB entries for 1GB translations. I believe we _do_ have such a guarantee. There's another bit in the SDM that someone pointed out to me in a footnote in "4.10.4.1": 1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page (see Section 4.10.2.3), the instruction invalidates all of them. While that's not in the easiest-to-find place in the documents, it looks pretty clear. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751507AbaDYVkA (ORCPT ); Fri, 25 Apr 2014 17:40:00 -0400 Received: from www.sr71.net ([198.145.64.142]:37162 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750797AbaDYVj7 (ORCPT ); Fri, 25 Apr 2014 17:39:59 -0400 Message-ID: <535AD62D.20509@sr71.net> Date: Fri, 25 Apr 2014 14:39:57 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com Subject: Re: [PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182421.DFAAD16A@viggo.jf.intel.com> <20140424084552.GQ23991@suse.de> In-Reply-To: <20140424084552.GQ23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 01:45 AM, Mel Gorman wrote: >> > +/* >> > + * See Documentation/x86/tlb.txt for details. 
We choose 33 >> > + * because it is large enough to cover the vast majority (at >> > + * least 95%) of allocations, and is small enough that we are >> > + * confident it will not cause too much overhead. Each single >> > + * flush is about 100 cycles, so this caps the maximum overhead >> > + * at _about_ 3,000 cycles. >> > + */ >> > +/* in units of pages */ >> > +unsigned long tlb_single_page_flush_ceiling = 1; >> > + > This comment is premature. The documentation file does not exist yet and > 33 means nothing yet. Out of curiosity though, how confident are you > that a TLB flush is generally 100 cycles across different generations > and manufacturers of CPUs? I'm not suggesting you change it or auto-tune > it, am just curious. First of all, I changed the units here at some point, and I screwed up the comments. I meant 100 nanoseconds, *not* cycles. For the sake of completeness, here are the data on a Westmere CPU. I'm not _quite_ sure why the <=5 pages cases are so slow per-page compared to when we're flushing larger numbers of pages. (I also only printed out the flush sizes with >100 samples): The overall average was 151ns, and for 6 pages and up it was 107ns.
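As an illustration (not part of the original message): the avg/page figures in the data for this message are consistent with a truncating integer division of the third column by the product of the first two. The sketch below assumes the columns mean pages per flush, number of flushes, and total time in nanoseconds. That reading is inferred from the arithmetic and the "100 nanoseconds" remark above, not stated in the message. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed meaning of one data row: pages per flush, flush count, total ns. */
struct flush_sample {
	uint64_t pages;
	uint64_t flushes;
	uint64_t total_ns;
};

/* Truncating average cost per flushed page, in nanoseconds. */
static uint64_t avg_ns_per_page(struct flush_sample s)
{
	return s.total_ns / (s.pages * s.flushes);
}

/*
 * Rough upper bound on the time spent doing individual flushes when a
 * range is exactly at the ceiling, given an assumed per-page cost.
 */
static uint64_t worst_case_individual_ns(uint64_t ceiling_pages,
					 uint64_t ns_per_page)
{
	return ceiling_pages * ns_per_page;
}
```

For the 1-page row this gives 279861777 / 1560658 = 179, and for the 33-page row 11775818 / (33 * 3191) = 111, matching the avg/page values. With a ceiling of 33 pages and roughly 100 ns per page, the bound is 3300 ns, in line with the "_about_ 3,000" figure in the quoted comment once its units are read as nanoseconds.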
  1  1560658  279861777  avg/page: 179
  2   179981   85329139  avg/page: 237
  3    99797  146972011  avg/page: 490
  4   161470  133072233  avg/page: 206
  5    44150   42142670  avg/page: 190
  6    17364   12063833  avg/page: 115
  7    12325    9899412  avg/page: 114
  8     4202    3838077  avg/page: 114
  9      811     990320  avg/page: 135
 10     4448    4955283  avg/page: 111
 11    69051   86723229  avg/page: 114
 12      465     642204  avg/page: 115
 13      157     226814  avg/page: 111
 16      781    1741461  avg/page: 139
 17     1506    2778201  avg/page: 108
 18      110     211216  avg/page: 106
 19    13322   27941893  avg/page: 110
 21     1828    4092988  avg/page: 106
 24     1566    4057605  avg/page: 107
 25      246     646463  avg/page: 105
 29      411    1275101  avg/page: 106
 33     3191   11775818  avg/page: 111
 52     3096   17297873  avg/page: 107
 65     2244   15349445  avg/page: 105
129     2278   33246120  avg/page: 113
240    12181  305529055  avg/page: 104

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751813AbaGGRnk (ORCPT ); Mon, 7 Jul 2014 13:43:40 -0400 Received: from www.sr71.net ([198.145.64.142]:54539 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751109AbaGGRni (ORCPT ); Mon, 7 Jul 2014 13:43:38 -0400 Message-ID: <53BADC49.6000600@sr71.net> Date: Mon, 07 Jul 2014 10:43:37 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, alex.shi@linaro.org, dave.hansen@linux.intel.com, "H. 
Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> In-Reply-To: <20140424103727.GT23991@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/24/2014 03:37 AM, Mel Gorman wrote: >> +Despite the fact that a single individual flush on x86 is >> > +guaranteed to flush a full 2MB, hugetlbfs always uses the full >> > +flushes. THP is treated exactly the same as normal memory. >> > + > You are the second person that told me this and I felt the manual was > unclear on this subject. I was told that it might be a documentation bug > but because this discussion was in a bar I completely failed to follow up > on it. For the record... There's a new version of the Intel SDM out, and it contains some clarifications. They're the easiest to find in this document which highlights the deltas from the last version: > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developers-manual.pdf The documentation for invlpg itself has a new footnote, and there's also a little bit of new text in section "4.10.2.3 Details of TLB Use". The footnotes say: If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page (see Section 4.10.2.3), the instruction (invlpg) invalidates all of them I hope that clears up some of the ambiguity over invlpg. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755579AbaGHAns (ORCPT ); Mon, 7 Jul 2014 20:43:48 -0400 Received: from mail-pd0-f171.google.com ([209.85.192.171]:50395 "EHLO mail-pd0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753814AbaGHAnp (ORCPT ); Mon, 7 Jul 2014 20:43:45 -0400 Message-ID: <53BB3EBC.8050005@linaro.org> Date: Tue, 08 Jul 2014 08:43:40 +0800 From: Alex Shi User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Dave Hansen , Mel Gorman CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, ak@linux.intel.com, riel@redhat.com, dave.hansen@linux.intel.com, "H. Peter Anvin" Subject: Re: [PATCH 5/6] x86: mm: new tunable for single vs full TLB flush References: <20140421182418.81CF7519@viggo.jf.intel.com> <20140421182426.D6DD1E8F@viggo.jf.intel.com> <20140424103727.GT23991@suse.de> <53BADC49.6000600@sr71.net> In-Reply-To: <53BADC49.6000600@sr71.net> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/08/2014 01:43 AM, Dave Hansen wrote: > On 04/24/2014 03:37 AM, Mel Gorman wrote: >>> +Despite the fact that a single individual flush on x86 is >>>> +guaranteed to flush a full 2MB, hugetlbfs always uses the full >>>> +flushes. THP is treated exactly the same as normal memory. >>>> + >> You are the second person that told me this and I felt the manual was >> unclear on this subject. I was told that it might be a documentation bug >> but because this discussion was in a bar I completely failed to follow up >> on it. > > For the record... There's a new version of the Intel SDM out, and it > contains some clarifications. 
They're the easiest to find in this > document which highlights the deltas from the last version: > >> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developers-manual.pdf > > The documentation for invlpg itself has a new footnote, and there's also > a little bit of new text in section "4.10.2.3 Details of TLB Use". > > The footnotes say: > > If the paging structures map the linear address using a page > larger than 4 KBytes and there are multiple TLB entries for > that page (see Section 4.10.2.3), the instruction (invlpg) > invalidates all of them > > I hope that clears up some of the ambiguity over invlpg. > Uh, AFAICT, invlpg on a large page has no clear effect on data retrieval on any Intel CPU up to IvyBridge. I have not tested later CPUs. -- Thanks Alex