From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 911526B025F for ; Fri, 8 Jan 2016 18:15:37 -0500 (EST) Received: by mail-pa0-f47.google.com with SMTP id yy13so196753655pab.3 for ; Fri, 08 Jan 2016 15:15:37 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id f7si7938955pfd.188.2016.01.08.15.15.36 for ; Fri, 08 Jan 2016 15:15:36 -0800 (PST) From: Andy Lutomirski Subject: [RFC 00/13] x86/mm: PCID and INVPCID Date: Fri, 8 Jan 2016 15:15:18 -0800 Message-Id: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Here's my PCID and INVPCID work-in-progress. It seems to work well enough to play with it. (That is, I'm not aware of anything wrong with it, although it may eat your data.) PCID and INVPCID use are orthogonal here. INVPCID is a straightforward speedup for global TLB flushes. Other than that, I don't use INVPCID at all, since it seems slower than just manipulating CR3 carefully, at least on my Skylake laptop. Please play around and suggest (and run?) good benchmarks. It seems to save around 100ns on cross-process context switches for me. Unfortunately, we suck at context switches in general, so this is, at best, a little over a 10% speedup. Most of the time is spent in the scheduler, not in arch code. Andy Lutomirski (13): x86/paravirt: Turn KASAN off for parvirt.o x86/mm: Add INVPCID helpers x86/mm: Add a noinvpcid option to turn off INVPCID x86/mm: If INVPCID is available, use it to flush global mappings x86/mm: Add barriers and document switch_mm-vs-flush synchronization x86/mm: Disable PCID on 32-bit kernels x86/mm: Add nopcid to turn off PCID x86/mm: Teach CR3 readers about PCID x86/mm: Disable interrupts when flushing the TLB using CR3 x86/mm: Factor out remote TLB flushing x86/mm: Build arch/x86/mm/tlb.c even on !SMP x86/mm: Uninline switch_mm x86/mm: Try to preserve old TLB entries using PCID Documentation/kernel-parameters.txt | 4 + arch/x86/include/asm/disabled-features.h | 4 +- arch/x86/include/asm/mmu.h | 7 +- arch/x86/include/asm/mmu_context.h | 62 +----- arch/x86/include/asm/tlbflush.h | 86 ++++++++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/cpu/bugs.c | 6 + arch/x86/kernel/cpu/common.c | 38 ++++ arch/x86/kernel/head64.c | 3 +- arch/x86/kernel/ldt.c | 2 + arch/x86/kernel/process_64.c | 2 + arch/x86/mm/Makefile | 3 +- arch/x86/mm/fault.c | 8 +- arch/x86/mm/tlb.c | 324 +++++++++++++++++++++++++++++-- 14 files changed, 467 insertions(+), 83 deletions(-) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id 826456B0260 for ; Fri, 8 Jan 2016 18:15:38 -0500 (EST) Received: by mail-pa0-f42.google.com with SMTP id yy13so196753837pab.3 for ; Fri, 08 Jan 2016 15:15:38 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id n16si7938166pfj.185.2016.01.08.15.15.37 for ; Fri, 08 Jan 2016 15:15:37 -0800 (PST) From: Andy Lutomirski Subject: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Date: Fri, 8 Jan 2016 15:15:19 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Otherwise terrible things happen if some of the callbacks end up calling into KASAN in unexpected places. This has no obvious symptoms yet, but adding a memory reference to native_flush_tlb_global without this blows up on KASAN kernels. Signed-off-by: Andy Lutomirski --- arch/x86/kernel/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index b1b78ffe01d0..b7cd5bdf314b 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -19,6 +19,7 @@ endif KASAN_SANITIZE_head$(BITS).o := n KASAN_SANITIZE_dumpstack.o := n KASAN_SANITIZE_dumpstack_$(BITS).o := n +KASAN_SANITIZE_paravirt.o := n CFLAGS_irq.o := -I$(src)/../include/asm/trace -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id E40066B0261 for ; Fri, 8 Jan 2016 18:15:39 -0500 (EST) Received: by mail-pa0-f42.google.com with SMTP id yy13so196754090pab.3 for ; Fri, 08 Jan 2016 15:15:39 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id xj5si66138734pab.84.2016.01.08.15.15.39 for ; Fri, 08 Jan 2016 15:15:39 -0800 (PST) From: Andy Lutomirski Subject: [RFC 02/13] x86/mm: Add INVPCID helpers Date: Fri, 8 Jan 2016 15:15:20 -0800 Message-Id: <21ffa5af2491feec9927446451e3d23548a7cbc0.1452294700.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski This adds helpers for each of the four currently-specified INVPCID modes. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 6df2029405a3..20fc38d8478a 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -7,6 +7,47 @@ #include #include +static inline void __invpcid(unsigned long pcid, unsigned long addr, + unsigned long type) +{ + u64 desc[2] = { pcid, addr }; + + /* + * The memory clobber is because the whole point is to invalidate + * stale TLB entries and, especially if we're flushing global + * mappings, we don't want the compiler to reorder any subsequent + * memory accesses before the TLB flush. + */ + asm volatile ( + ".byte 0x66, 0x0f, 0x38, 0x82, 0x01" /* invpcid (%cx), %ax */ + : : "m" (desc), "a" (type), "c" (desc) : "memory"); +} + +/* Flush all mappings for a given pcid and addr, not including globals. */ +static inline void invpcid_flush_one(unsigned long pcid, + unsigned long addr) +{ + __invpcid(pcid, addr, 0); +} + +/* Flush all mappings for a given PCID, not including globals. */ +static inline void invpcid_flush_single_context(unsigned long pcid) +{ + __invpcid(pcid, 0, 1); +} + +/* Flush all mappings, including globals, for all PCIDs. */ +static inline void invpcid_flush_everything(void) +{ + __invpcid(0, 0, 2); +} + +/* Flush all mappings for all PCIDs except globals. */ +static inline void invpcid_flush_all_nonglobals(void) +{ + __invpcid(0, 0, 3); +} + #ifdef CONFIG_PARAVIRT #include #else -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id 225496B0261 for ; Fri, 8 Jan 2016 18:15:41 -0500 (EST) Received: by mail-pa0-f44.google.com with SMTP id uo6so271555522pac.1 for ; Fri, 08 Jan 2016 15:15:41 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id 7si7944282pfa.205.2016.01.08.15.15.40 for ; Fri, 08 Jan 2016 15:15:40 -0800 (PST) From: Andy Lutomirski Subject: [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID Date: Fri, 8 Jan 2016 15:15:21 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Signed-off-by: Andy Lutomirski --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 16 ++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 742f69d18fc8..b34e55e00bae 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2508,6 +2508,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nointroute [IA-64] + noinvpcid [X86] Disable the INVPCID cpu feature. + nojitter [IA-64] Disables jitter checking for ITC timers. no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index c2b7522cbf35..48196980c1c7 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -162,6 +162,22 @@ static int __init x86_mpx_setup(char *s) } __setup("nompx", x86_mpx_setup); +static int __init x86_noinvpcid_setup(char *s) +{ + /* noinvpcid doesn't accept parameters */ + if (s) + return -EINVAL; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_INVPCID)) + return 0; + + setup_clear_cpu_cap(X86_FEATURE_INVPCID); + pr_info("noinvpcid: INVPCID feature disabled\n"); + return 0; +} +early_param("noinvpcid", x86_noinvpcid_setup); + #ifdef CONFIG_X86_32 static int cachesize_override = -1; static int disable_x86_serial_nr = 1; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f181.google.com (mail-pf0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id EED29828E9 for ; Fri, 8 Jan 2016 18:15:42 -0500 (EST) Received: by mail-pf0-f181.google.com with SMTP id e65so15938292pfe.0 for ; Fri, 08 Jan 2016 15:15:42 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id d16si7950622pfb.152.2016.01.08.15.15.42 for ; Fri, 08 Jan 2016 15:15:42 -0800 (PST) From: Andy Lutomirski Subject: [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings Date: Fri, 8 Jan 2016 15:15:22 -0800 Message-Id: <1ddac818ae831c89dc4f680bbf4bfac7b6ba2b4c.1452294700.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski On my Skylake laptop, INVPCID function 2 (flush absolutely everything) takes about 376ns, whereas saving flags, twiddling CR4.PGE to flush global mappings, and restoring flags takes about 539ns. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 20fc38d8478a..4eba5164430d 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -145,6 +145,15 @@ static inline void __native_flush_tlb_global(void) { unsigned long flags; + if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { + /* + * Using INVPCID is considerably faster than a pair of writes + * to CR4 sandwiched inside an IRQ flag save/restore. + */ + invpcid_flush_everything(); + return; + } + /* * Read-modify-write to CR4 - protect it from preemption and * from interrupts. (Use the raw variant because this code can -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f182.google.com (mail-pf0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 7A9FD828E9 for ; Fri, 8 Jan 2016 18:15:44 -0500 (EST) Received: by mail-pf0-f182.google.com with SMTP id e65so15938525pfe.0 for ; Fri, 08 Jan 2016 15:15:44 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id h75si3513456pfd.5.2016.01.08.15.15.43 for ; Fri, 08 Jan 2016 15:15:43 -0800 (PST) From: Andy Lutomirski Subject: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Date: Fri, 8 Jan 2016 15:15:23 -0800 Message-Id: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski , stable@vger.kernel.org When switch_mm activates a new pgd, it also sets a bit that tells other CPUs that the pgd is in use so that tlb flush IPIs will be sent. In order for that to work correctly, the bit needs to be visible prior to loading the pgd and therefore starting to fill the local TLB. Document all the barriers that make this work correctly and add a couple that were missing. Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++- arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- 2 files changed, 58 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 379cd3658799..1edc9cd198b8 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, #endif cpumask_set_cpu(cpu, mm_cpumask(next)); - /* Re-load page tables */ + /* + * Re-load page tables. + * + * This logic has an ordering constraint: + * + * CPU 0: Write to a PTE for 'next' + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. + * CPU 1: set bit 1 in next's mm_cpumask + * CPU 1: load from the PTE that CPU 0 writes (implicit) + * + * We need to prevent an outcome in which CPU 1 observes + * the new PTE value and CPU 0 observes bit 1 clear in + * mm_cpumask. (If that occurs, then the IPI will never + * be sent, and CPU 0's TLB will contain a stale entry.) + * + * The bad outcome can occur if either CPU's load is + * reordered before that CPU's store, so both CPUs much + * execute full barriers to prevent this from happening. + * + * Thus, switch_mm needs a full barrier between the + * store to mm_cpumask and any operation that could load + * from next->pgd. This barrier synchronizes with + * remote TLB flushers. Fortunately, load_cr3 is + * serializing and thus acts as a full barrier. + * + */ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); /* Stop flush ipis for the previous mm */ @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, * schedule, protecting us from simultaneous changes. */ cpumask_set_cpu(cpu, mm_cpumask(next)); + /* * We were in lazy tlb mode and leave_mm disabled * tlb flush IPI delivery. We must reload CR3 * to make sure to use no freed page tables. + * + * As above, this is a barrier that forces + * TLB repopulation to be ordered after the + * store to mm_cpumask. */ load_cr3(next->pgd); trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8ddb5d0d66fb..8f4cc3dfac32 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -161,7 +161,10 @@ void flush_tlb_current_task(void) preempt_disable(); count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); + + /* This is an implicit full barrier that synchronizes with switch_mm. */ local_flush_tlb(); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long base_pages_to_flush = TLB_FLUSH_ALL; preempt_disable(); - if (current->active_mm != mm) + if (current->active_mm != mm) { + /* Synchronize with switch_mm. */ + smp_mb(); + goto out; + } if (!current->mm) { leave_mm(smp_processor_id()); + + /* Synchronize with switch_mm. */ + smp_mb(); + goto out; } if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) base_pages_to_flush = (end - start) >> PAGE_SHIFT; + /* + * Both branches below are implicit full barriers (MOV to CR or + * INVLPG) that synchronize with switch_mm. + */ if (base_pages_to_flush > tlb_single_page_flush_ceiling) { base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) preempt_disable(); if (current->active_mm == mm) { - if (current->mm) + if (current->mm) { + /* + * Implicit full barrier (INVLPG) that synchronizes + * with switch_mm. + */ __flush_tlb_one(start); - else + } else { leave_mm(smp_processor_id()); + + /* Synchronize with switch_mm. */ + smp_mb(); + } } if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 9E9A9828E9 for ; Fri, 8 Jan 2016 18:15:45 -0500 (EST) Received: by mail-pa0-f45.google.com with SMTP id uo6so271556292pac.1 for ; Fri, 08 Jan 2016 15:15:45 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id r5si7928437pfr.2.2016.01.08.15.15.44 for ; Fri, 08 Jan 2016 15:15:45 -0800 (PST) From: Andy Lutomirski Subject: [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels Date: Fri, 8 Jan 2016 15:15:24 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski 32-bit kernels on new hardware will see PCID in CPUID, but PCID can only be used in 64-bit mode. Rather than making all PCID code conditional, just disable the feature on 32-bit builds. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/disabled-features.h | 4 +++- arch/x86/kernel/cpu/bugs.c | 6 ++++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index f226df064660..8b17c2ad1048 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -21,11 +21,13 @@ # define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31)) # define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31)) # define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31)) +# define DISABLE_PCID 0 #else # define DISABLE_VME 0 # define DISABLE_K6_MTRR 0 # define DISABLE_CYRIX_ARR 0 # define DISABLE_CENTAUR_MCR 0 +# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31)) #endif /* CONFIG_X86_64 */ /* @@ -35,7 +37,7 @@ #define DISABLED_MASK1 0 #define DISABLED_MASK2 0 #define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR) -#define DISABLED_MASK4 0 +#define DISABLED_MASK4 (DISABLE_PCID) #define DISABLED_MASK5 0 #define DISABLED_MASK6 0 #define DISABLED_MASK7 0 diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index bd17db15a2c1..741d107e5376 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -19,6 +19,12 @@ void __init check_bugs(void) { + /* + * Regardless of whether PCID is enumerated, it can't be used in + * 32-bit mode. + */ + setup_clear_cpu_cap(X86_FEATURE_PCID); + identify_boot_cpu(); #ifndef CONFIG_SMP pr_info("CPU: "); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f171.google.com (mail-pf0-f171.google.com [209.85.192.171]) by kanga.kvack.org (Postfix) with ESMTP id E2860828E9 for ; Fri, 8 Jan 2016 18:15:46 -0500 (EST) Received: by mail-pf0-f171.google.com with SMTP id 65so15916589pff.2 for ; Fri, 08 Jan 2016 15:15:46 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id tj2si14588153pab.76.2016.01.08.15.15.46 for ; Fri, 08 Jan 2016 15:15:46 -0800 (PST) From: Andy Lutomirski Subject: [RFC 07/13] x86/mm: Add nopcid to turn off PCID Date: Fri, 8 Jan 2016 15:15:25 -0800 Message-Id: <7752daf506becd3b36e768bd2561f562f64d9d16.1452294700.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski The parameter is only present on x86_64 systems to save a few bytes, as PCID is always disabled on x86_32. Signed-off-by: Andy Lutomirski --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b34e55e00bae..2f61809706c5 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2544,6 +2544,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nopat [X86] Disable PAT (page attribute table extension of pagetables) support. + nopcid [X86-64] Disable the PCID cpu feature. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 48196980c1c7..7e1fc53a4ba5 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -162,6 +162,24 @@ static int __init x86_mpx_setup(char *s) } __setup("nompx", x86_mpx_setup); +#ifdef CONFIG_X86_64 +static int __init x86_pcid_setup(char *s) +{ + /* require an exact match without trailing characters */ + if (strlen(s)) + return 0; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_PCID)) + return 1; + + setup_clear_cpu_cap(X86_FEATURE_PCID); + pr_info("nopcid: PCID feature disabled\n"); + return 1; +} +__setup("nopcid", x86_pcid_setup); +#endif + static int __init x86_noinvpcid_setup(char *s) { /* noinvpcid doesn't accept parameters */ -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f173.google.com (mail-pf0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id 5B8D7828E9 for ; Fri, 8 Jan 2016 18:15:48 -0500 (EST) Received: by mail-pf0-f173.google.com with SMTP id q63so16321773pfb.1 for ; Fri, 08 Jan 2016 15:15:48 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id fd9si19601263pac.80.2016.01.08.15.15.47 for ; Fri, 08 Jan 2016 15:15:47 -0800 (PST) From: Andy Lutomirski Subject: [RFC 08/13] x86/mm: Teach CR3 readers about PCID Date: Fri, 8 Jan 2016 15:15:26 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski The kernel has several code paths that read CR3. Most of them assume that CR3 contains the PGD's physical address, whereas some of them awkwardly use PHYSICAL_PAGE_MASK to mask off low bits. Add explicit mask macros for CR3 and convert all of the CR3 readers. This will keep them from breaking when PCID is enabled. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 8 ++++++++ arch/x86/kernel/head64.c | 3 ++- arch/x86/mm/fault.c | 8 ++++---- 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4eba5164430d..3d905f12cda9 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -48,6 +48,14 @@ static inline void invpcid_flush_all_nonglobals(void) __invpcid(0, 0, 3); } +#ifdef CONFIG_X86_64 +#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull +#define CR3_PCID_MASK 0xFFFull +#else +#define CR3_ADDR_MASK 0xFFFFF000ull +#define CR3_PCID_MASK 0ull +#endif + #ifdef CONFIG_PARAVIRT #include #else diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index f129a9af6357..3d075ac01a47 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -60,7 +60,8 @@ int __init early_make_pgtable(unsigned long address) pmdval_t pmd, *pmd_p; /* Invalid address or early pgt is done ? */ - if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt)) + if (physaddr >= MAXMEM || + (read_cr3() & CR3_ADDR_MASK) != __pa_nodebug(early_level4_pgt)) return -1; again: diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index eef44d9a3f77..9ceae2dc9be1 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -282,7 +282,7 @@ static noinline int vmalloc_fault(unsigned long address) * Do _not_ use "current" here. We might be inside * an interrupt in the middle of a task switch.. */ - pgd_paddr = read_cr3(); + pgd_paddr = read_cr3() & CR3_ADDR_MASK; pmd_k = vmalloc_sync_one(__va(pgd_paddr), address); if (!pmd_k) return -1; @@ -321,7 +321,7 @@ static bool low_pfn(unsigned long pfn) static void dump_pagetable(unsigned long address) { - pgd_t *base = __va(read_cr3()); + pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK); pgd_t *pgd = &base[pgd_index(address)]; pmd_t *pmd; pte_t *pte; @@ -459,7 +459,7 @@ static int bad_address(void *p) static void dump_pagetable(unsigned long address) { - pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK); + pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK); pgd_t *pgd = base + pgd_index(address); pud_t *pud; pmd_t *pmd; @@ -595,7 +595,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, pgd_t *pgd; pte_t *pte; - pgd = __va(read_cr3() & PHYSICAL_PAGE_MASK); + pgd = __va(read_cr3() & CR3_ADDR_MASK); pgd += pgd_index(address); pte = lookup_address_in_pgd(pgd, address, &level); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id 66B83828E9 for ; Fri, 8 Jan 2016 18:15:49 -0500 (EST) Received: by mail-pa0-f49.google.com with SMTP id uo6so271556877pac.1 for ; Fri, 08 Jan 2016 15:15:49 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id ww1si19628688pab.181.2016.01.08.15.15.48 for ; Fri, 08 Jan 2016 15:15:48 -0800 (PST) From: Andy Lutomirski Subject: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Date: Fri, 8 Jan 2016 15:15:27 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 3d905f12cda9..32e3d8769a22 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -135,7 +135,17 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) static inline void __native_flush_tlb(void) { + unsigned long flags; + + /* + * We mustn't be preempted or handle an IPI while reading and + * writing CR3. Preemption could switch mms and switch back, and + * an IPI could call leave_mm. Either of those could cause our + * PCID to change asynchronously. + */ + raw_local_irq_save(flags); native_write_cr3(native_read_cr3()); + raw_local_irq_restore(flags); } static inline void __native_flush_tlb_global_irq_disabled(void) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f172.google.com (mail-pf0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id A0480828E9 for ; Fri, 8 Jan 2016 18:15:50 -0500 (EST) Received: by mail-pf0-f172.google.com with SMTP id q63so16322106pfb.1 for ; Fri, 08 Jan 2016 15:15:50 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id d3si21159548pas.116.2016.01.08.15.15.49 for ; Fri, 08 Jan 2016 15:15:49 -0800 (PST) From: Andy Lutomirski Subject: [RFC 10/13] x86/mm: Factor out remote TLB flushing Date: Fri, 8 Jan 2016 15:15:28 -0800 Message-Id: <357b13bf9d6d04e585894ff5dcf40fa14ea1d3a7.1452294700.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski There are three call sites that propagate TLB flushes, and they all do exactly the same thing. Factor the code out into a helper. Signed-off-by: Andy Lutomirski --- arch/x86/mm/tlb.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8f4cc3dfac32..b208a33571b0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -154,6 +154,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask, smp_call_function_many(cpumask, flush_tlb_func, &info, 1); } +static void propagate_tlb_flush(unsigned int this_cpu, + struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); +} + void flush_tlb_current_task(void) { struct mm_struct *mm = current->mm; @@ -166,8 +174,7 @@ void flush_tlb_current_task(void) local_flush_tlb(); trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + propagate_tlb_flush(smp_processor_id(), mm, 0UL, TLB_FLUSH_ALL); preempt_enable(); } @@ -231,8 +238,7 @@ out: start = 0UL; end = TLB_FLUSH_ALL; } - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, end); + propagate_tlb_flush(smp_processor_id(), mm, start, end); preempt_enable(); } @@ -257,8 +263,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) } } - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, 0UL); + propagate_tlb_flush(smp_processor_id(), mm, start, 0UL); preempt_enable(); } -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f175.google.com (mail-pf0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id C206F828E9 for ; Fri, 8 Jan 2016 18:15:51 -0500 (EST) Received: by mail-pf0-f175.google.com with SMTP id 65so15917253pff.2 for ; Fri, 08 Jan 2016 15:15:51 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id v23si7928826pfi.58.2016.01.08.15.15.51 for ; Fri, 08 Jan 2016 15:15:51 -0800 (PST) From: Andy Lutomirski Subject: [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Date: Fri, 8 Jan 2016 15:15:29 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Currently all of the functions that live in tlb.c are inlined on !SMP builds. One can debate whether this is a good idea (in many respects the code in tlb.c is better than the inlined UP code). Regardless, I want to add code that needs to be built on UP and SMP kernels and relates to tlb flushing, so arrange for tlb.c to be compiled unconditionally. Signed-off-by: Andy Lutomirski --- arch/x86/mm/Makefile | 3 +-- arch/x86/mm/tlb.c | 4 ++++ 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 65c47fda26fc..1ae7c141f778 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -1,5 +1,5 @@ obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \ - pat.o pgtable.o physaddr.o gup.o setup_nx.o + pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o # Make sure __phys_addr has no stackprotector nostackp := $(call cc-option, -fno-stack-protector) @@ -9,7 +9,6 @@ CFLAGS_setup_nx.o := $(nostackp) CFLAGS_fault.o := -I$(src)/../include/asm/trace obj-$(CONFIG_X86_PAT) += pat_rbtree.o -obj-$(CONFIG_SMP) += tlb.o obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index b208a33571b0..87fcc7a62e71 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -28,6 +28,8 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */ +#ifdef CONFIG_SMP + struct flush_tlb_info { struct mm_struct *flush_mm; unsigned long flush_start; @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void) return 0; } late_initcall(create_tlb_single_page_flush_ceiling); + +#endif /* CONFIG_SMP */ -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f171.google.com (mail-pf0-f171.google.com [209.85.192.171]) by kanga.kvack.org (Postfix) with ESMTP id C344E828E9 for ; Fri, 8 Jan 2016 18:15:53 -0500 (EST) Received: by mail-pf0-f171.google.com with SMTP id q63so16322503pfb.1 for ; Fri, 08 Jan 2016 15:15:53 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id cz10si57770478pad.0.2016.01.08.15.15.53 for ; Fri, 08 Jan 2016 15:15:53 -0800 (PST) From: Andy Lutomirski Subject: [RFC 12/13] x86/mm: Uninline switch_mm Date: Fri, 8 Jan 2016 15:15:30 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski It's fairly large and it has quite a few callers. This may also help untangle some headers down the road. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 93 +----------------------------------- arch/x86/mm/tlb.c | 97 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+), 91 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 1edc9cd198b8..05c4d0ab64bb 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -104,102 +104,13 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) #endif } -static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) -{ - unsigned cpu = smp_processor_id(); - - if (likely(prev != next)) { -#ifdef CONFIG_SMP - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - this_cpu_write(cpu_tlbstate.active_mm, next); -#endif - cpumask_set_cpu(cpu, mm_cpumask(next)); +extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk); - /* - * Re-load page tables. - * - * This logic has an ordering constraint: - * - * CPU 0: Write to a PTE for 'next' - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. - * CPU 1: set bit 1 in next's mm_cpumask - * CPU 1: load from the PTE that CPU 0 writes (implicit) - * - * We need to prevent an outcome in which CPU 1 observes - * the new PTE value and CPU 0 observes bit 1 clear in - * mm_cpumask. (If that occurs, then the IPI will never - * be sent, and CPU 0's TLB will contain a stale entry.) - * - * The bad outcome can occur if either CPU's load is - * reordered before that CPU's store, so both CPUs much - * execute full barriers to prevent this from happening. - * - * Thus, switch_mm needs a full barrier between the - * store to mm_cpumask and any operation that could load - * from next->pgd. This barrier synchronizes with - * remote TLB flushers. Fortunately, load_cr3 is - * serializing and thus acts as a full barrier. - * - */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - /* Stop flush ipis for the previous mm */ - cpumask_clear_cpu(cpu, mm_cpumask(prev)); - /* Load per-mm CR4 state */ - load_mm_cr4(next); -#ifdef CONFIG_MODIFY_LDT_SYSCALL - /* - * Load the LDT, if the LDT is different. - * - * It's possible that prev->context.ldt doesn't match - * the LDT register. This can happen if leave_mm(prev) - * was called and then modify_ldt changed - * prev->context.ldt but suppressed an IPI to this CPU. - * In this case, prev->context.ldt != NULL, because we - * never set context.ldt to NULL while the mm still - * exists. That means that next->context.ldt != - * prev->context.ldt, because mms never share an LDT. - */ - if (unlikely(prev->context.ldt != next->context.ldt)) - load_mm_ldt(next); -#endif - } -#ifdef CONFIG_SMP - else { - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); - - if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { - /* - * On established mms, the mm_cpumask is only changed - * from irq context, from ptep_clear_flush() while in - * lazy tlb mode, and here. Irqs are blocked during - * schedule, protecting us from simultaneous changes. - */ - cpumask_set_cpu(cpu, mm_cpumask(next)); - - /* - * We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload CR3 - * to make sure to use no freed page tables. - * - * As above, this is a barrier that forces - * TLB repopulation to be ordered after the - * store to mm_cpumask. - */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - load_mm_cr4(next); - load_mm_ldt(next); - } - } -#endif -} #define activate_mm(prev, next) \ do { \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 87fcc7a62e71..9790c9338e52 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -59,6 +59,103 @@ void leave_mm(int cpu) } EXPORT_SYMBOL_GPL(leave_mm); +void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) +{ + unsigned cpu = smp_processor_id(); + + if (likely(prev != next)) { +#ifdef CONFIG_SMP + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + this_cpu_write(cpu_tlbstate.active_mm, next); +#endif + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * Re-load page tables. + * + * This logic has an ordering constraint: + * + * CPU 0: Write to a PTE for 'next' + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. + * CPU 1: set bit 1 in next's mm_cpumask + * CPU 1: load from the PTE that CPU 0 writes (implicit) + * + * We need to prevent an outcome in which CPU 1 observes + * the new PTE value and CPU 0 observes bit 1 clear in + * mm_cpumask. (If that occurs, then the IPI will never + * be sent, and CPU 0's TLB will contain a stale entry.) + * + * The bad outcome can occur if either CPU's load is + * reordered before that CPU's store, so both CPUs much + * execute full barriers to prevent this from happening. + * + * Thus, switch_mm needs a full barrier between the + * store to mm_cpumask and any operation that could load + * from next->pgd. This barrier synchronizes with + * remote TLB flushers. Fortunately, load_cr3 is + * serializing and thus acts as a full barrier. + * + */ + load_cr3(next->pgd); + + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + + /* Stop flush ipis for the previous mm */ + cpumask_clear_cpu(cpu, mm_cpumask(prev)); + + /* Load per-mm CR4 state */ + load_mm_cr4(next); + +#ifdef CONFIG_MODIFY_LDT_SYSCALL + /* + * Load the LDT, if the LDT is different. + * + * It's possible that prev->context.ldt doesn't match + * the LDT register. This can happen if leave_mm(prev) + * was called and then modify_ldt changed + * prev->context.ldt but suppressed an IPI to this CPU. + * In this case, prev->context.ldt != NULL, because we + * never set context.ldt to NULL while the mm still + * exists. That means that next->context.ldt != + * prev->context.ldt, because mms never share an LDT. + */ + if (unlikely(prev->context.ldt != next->context.ldt)) + load_mm_ldt(next); +#endif + } +#ifdef CONFIG_SMP + else { + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); + + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { + /* + * On established mms, the mm_cpumask is only changed + * from irq context, from ptep_clear_flush() while in + * lazy tlb mode, and here. Irqs are blocked during + * schedule, protecting us from simultaneous changes. + */ + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * We were in lazy tlb mode and leave_mm disabled + * tlb flush IPI delivery. We must reload CR3 + * to make sure to use no freed page tables. + * + * As above, this is a barrier that forces + * TLB repopulation to be ordered after the + * store to mm_cpumask. + */ + load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + load_mm_cr4(next); + load_mm_ldt(next); + } + } +#endif +} + /* * The flush IPI assumes that a thread switch happens in this order: * [cpu0: the cpu that switches] -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f172.google.com (mail-pf0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id BFB2F828E9 for ; Fri, 8 Jan 2016 18:15:56 -0500 (EST) Received: by mail-pf0-f172.google.com with SMTP id q63so16322964pfb.1 for ; Fri, 08 Jan 2016 15:15:56 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id w17si7957791pfi.241.2016.01.08.15.15.55 for ; Fri, 08 Jan 2016 15:15:55 -0800 (PST) From: Andy Lutomirski Subject: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Date: Fri, 8 Jan 2016 15:15:31 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski PCID is a "process context ID" -- it's what other architectures call an address space ID. Every non-global TLB entry is tagged with a PCID, only TLB entries that match the currently selected PCID are used, and we can switch PGDs without flushing the TLB. x86's PCID is 12 bits. This is an unorthodox approach to using PCID. x86's PCID is far too short to uniquely identify a process, and we can't even really uniquely identify a running process because there are monster systems with over 4096 CPUs. To make matters worse, past attempts to use all 12 PCID bits have resulted in slowdowns instead of speedups. This patch uses PCID differently. We use a PCID to identify a recently-used mm on a per-cpu basis. An mm has no fixed PCID binding at all; instead, we give it a fresh PCID each time it's loaded except in cases where we want to preserve the TLB, in which case we reuse a recent value. In particular, we use PCIDs 1-7 for recently-used mms and we reserve PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI). Nothing ever switches to PCID 0 without flushing PCID 0 non-global pages, so PCID 0 conflicts won't cause problems. This also leaves the door open for UDEREF-style address space switching: the kernel will use PCID 0, and exits could simply switch back. (As a practical matter, an in-tree implementation of that feature would probably forego the full syscall fast path and just invoke some or all of switch_mm in prepare_exit_to_usermode.) This seems to save about 100ns on context switches between mms. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu.h | 7 +- arch/x86/include/asm/tlbflush.h | 18 ++++ arch/x86/kernel/cpu/common.c | 4 + arch/x86/kernel/ldt.c | 2 + arch/x86/kernel/process_64.c | 2 + arch/x86/mm/tlb.c | 195 +++++++++++++++++++++++++++++++++++++--- 6 files changed, 213 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 55234d5e7160..adb958d41bde 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -5,10 +5,13 @@ #include /* - * The x86 doesn't have a mmu context, but - * we put the segment information here. + * x86 has an MMU context if PCID is enabled, and x86 also has arch-specific + * MMU state beyond what lives in mm_struct. */ typedef struct { + /* See arch/x86/mm/tlb.c for details. */ + struct cpumask pcid_live_cpus; + #ifdef CONFIG_MODIFY_LDT_SYSCALL struct ldt_struct *ldt; #endif diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 32e3d8769a22..407c6f5dd4a6 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -56,6 +56,13 @@ static inline void invpcid_flush_all_nonglobals(void) #define CR3_PCID_MASK 0ull #endif +#ifdef CONFIG_SMP +extern void zap_old_pcids(void); +#else +/* Until PCID is implemented for !SMP, there's nothing to do. */ +static inline void zap_old_pcids(void) {} +#endif + #ifdef CONFIG_PARAVIRT #include #else @@ -195,6 +202,8 @@ static inline void __flush_tlb_all(void) __flush_tlb_global(); else __flush_tlb(); + + zap_old_pcids(); } static inline void __flush_tlb_one(unsigned long addr) @@ -238,6 +247,7 @@ static inline void flush_tlb_all(void) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); __flush_tlb_all(); + zap_old_pcids(); } static inline void flush_tlb(void) @@ -254,6 +264,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm) { if (mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void flush_tlb_page(struct vm_area_struct *vma, @@ -261,6 +273,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, { if (vma->vm_mm == current->active_mm) __flush_tlb_one(addr); + else + zap_old_pcids(); } static inline void flush_tlb_range(struct vm_area_struct *vma, @@ -268,6 +282,8 @@ static inline void flush_tlb_range(struct vm_area_struct *vma, { if (vma->vm_mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void flush_tlb_mm_range(struct mm_struct *mm, @@ -275,6 +291,8 @@ static inline void flush_tlb_mm_range(struct mm_struct *mm, { if (mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void native_flush_tlb_others(const struct cpumask *cpumask, diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 7e1fc53a4ba5..00bdf5806566 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -953,6 +953,10 @@ static void identify_cpu(struct cpuinfo_x86 *c) setup_smep(c); setup_smap(c); + /* Enable PCID if available. */ + if (cpu_has(c, X86_FEATURE_PCID)) + cr4_set_bits(X86_CR4_PCIDE); + /* * The vendor-specific functions might have changed features. * Now we do "generic changes." diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 6acc9dd91f36..3d73c0ddc773 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -109,6 +109,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) struct mm_struct *old_mm; int retval = 0; + cpumask_clear(&mm->context.pcid_live_cpus); + mutex_init(&mm->context.lock); old_mm = current->mm; if (!old_mm) { diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index e835d263a33b..2cdb3ba715e6 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -578,6 +578,8 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr) break; } + __flush_tlb(); + invpcid_flush_all_nonglobals(); default: ret = -EINVAL; break; diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9790c9338e52..eb84240b8c92 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -28,6 +28,92 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */ +/* + * An x86 PCID is what everyone else calls an ASID. TLB entries on each + * CPU are tagged with a PCID, and the current CR3 value stores a PCID. + * Switching CR3 can change the PCID and can optionally flush all TLB + * entries associated with the new PCID. + * + * The PCID is 12 bits, but we don't use anywhere near that many. + * + * The guiding principle of this code is that TLB entries that have + * survived more than a small number of context switches are mostly + * useless, so we don't try very hard not to evict them. + * + * PCID 0 is used for swapper_pg_dir and for any other special PGDs that + * are loaded directly by PCID-naive users of load_cr3. (This includes + * UEFI runtime services, etc.) Note that we never switch to PCID 0 with + * the preserve flag set -- the only TLB entries with PCID == 0 that are + * worth preserving have the global flag set. + * + * PCIDs 1 through NR_PCIDS are used for real user mms. + */ + +#define NR_DYNAMIC_PCIDS 7 + +struct pcid_cpu_entry { + const struct mm_struct *mm; +}; + +struct pcid_cpu_state { + /* This entire data structure fits in one cache line. */ + struct pcid_cpu_entry pcids[NR_DYNAMIC_PCIDS]; /* starts with PCID 1 */ + unsigned long next_pcid_minus_two; +}; + +static DEFINE_PER_CPU_ALIGNED(struct pcid_cpu_state, pcid_cpu_state); + +/* + * Allocate and return a fresh PCID for mm on this CPU. + */ +static unsigned long allocate_pcid(const struct mm_struct *mm) +{ + unsigned long pcid; + + pcid = this_cpu_add_return(pcid_cpu_state.next_pcid_minus_two, 1) + 1; + if (pcid > NR_DYNAMIC_PCIDS) { + pcid = 1; + this_cpu_write(pcid_cpu_state.next_pcid_minus_two, 0); + } + + this_cpu_write(pcid_cpu_state.pcids[pcid-1].mm, mm); + + /* + * We don't bother setting our cpu's bit on pcid_live_cpus. Any + * remote CPU that needs to shoot down one of our TLB entries will + * do it via IPI because we're all the way live. We'll take care + * of pcid_live_cpus when we remove ourselves from mm_cpumask. + */ + return pcid; +} + +/* + * Finds the PCID for the given pgd on this cpu. If that PCID was last + * used by this mm, the high bit will be set in the return value. Otherwise + * we claim ownership of the PCID and return the PCID with the high bit + * clear. + * + * This function must not be called if pgd has never been loaded on this + * CPU. Otherwise we might return a PCID allocated to a dead mm whose pgd + * page has been reused. + */ +static unsigned long choose_pcid(struct mm_struct *mm) +{ + unsigned long pcid; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return 0; + + if (cpumask_test_cpu(smp_processor_id(), &mm->context.pcid_live_cpus)) { + for (pcid = 0; pcid < NR_DYNAMIC_PCIDS; pcid++) { + if (this_cpu_read(pcid_cpu_state.pcids[pcid].mm) == mm) + return (pcid + 1) | (1UL << 63); + } + } + + return allocate_pcid(mm); +} + #ifdef CONFIG_SMP struct flush_tlb_info { @@ -37,6 +123,55 @@ struct flush_tlb_info { }; /* + * This effectively invalidates non-global mappings belonging to non-current + * PCIDs on the calling CPU. Rather than doing NR_DYNAMIC_PCIDS-1 INVPCIDs, + * it invalidates the mm-to-PCID mappings. + */ +void zap_old_pcids(void) +{ + struct mm_struct *active_mm; + int i; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return; + + active_mm = this_cpu_read(cpu_tlbstate.active_mm); + + for (i = 0; i < NR_DYNAMIC_PCIDS; i++) + if (this_cpu_read(pcid_cpu_state.pcids[i].mm) != active_mm) + this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL); +} +EXPORT_SYMBOL(zap_old_pcids); + +static void zap_local_inactive_mm(struct mm_struct *mm) +{ + int i; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return; + + for (i = 0; i < NR_DYNAMIC_PCIDS; i++) { + if (this_cpu_read(pcid_cpu_state.pcids[i].mm) == mm) { + this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL); + return; + } + } +} + +static void stop_tlbflush_ipis(int cpu, struct mm_struct *mm) +{ + /* + * Stop flush ipis for the previous mm. First mark us live in + * the PCID cache. We need our store to pcid_live_cpus to + * happen before remote CPUs stop sending us IPIs; the barrier + * here synchronizes with the barrier in flush_tlb_remote. + */ + cpumask_set_cpu(cpu, &mm->context.pcid_live_cpus); + smp_mb__before_atomic(); + cpumask_clear_cpu(cpu, mm_cpumask(mm)); +} + +/* * We cannot call mmdrop() because we are in interrupt context, * instead update mm->cpu_vm_mask. */ @@ -46,7 +181,7 @@ void leave_mm(int cpu) if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { - cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); + stop_tlbflush_ipis(cpu, active_mm); load_cr3(swapper_pg_dir); /* * This gets called in the idle path where RCU @@ -63,6 +198,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { unsigned cpu = smp_processor_id(); + unsigned long pcid; if (likely(prev != next)) { #ifdef CONFIG_SMP @@ -76,9 +212,6 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * * This logic has an ordering constraint: * - * CPU 0: Write to a PTE for 'next' - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. - * CPU 1: set bit 1 in next's mm_cpumask * CPU 1: load from the PTE that CPU 0 writes (implicit) * * We need to prevent an outcome in which CPU 1 observes @@ -97,12 +230,14 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * serializing and thus acts as a full barrier. * */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + pcid = choose_pcid(next); + write_cr3(__pa(next->pgd) | pcid); - /* Stop flush ipis for the previous mm */ - cpumask_clear_cpu(cpu, mm_cpumask(prev)); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, + (pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL); + + stop_tlbflush_ipis(cpu, prev); /* Load per-mm CR4 state */ load_mm_cr4(next); @@ -146,9 +281,18 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * As above, this is a barrier that forces * TLB repopulation to be ordered after the * store to mm_cpumask. + + * + * XXX: speedup possibility: if we end up preserving + * PCID data, then the write_cr3 is a no-op. */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + pcid = choose_pcid(next); + write_cr3(__pa(next->pgd) | pcid); + + trace_tlb_flush( + TLB_FLUSH_ON_TASK_SWITCH, + (pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL); +; load_mm_cr4(next); load_mm_ldt(next); } @@ -203,8 +347,24 @@ static void flush_tlb_func(void *info) inc_irq_stat(irq_tlb_count); - if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm)) + /* + * After all relevant CPUs call flush_tlb_func, pcid_live_cpus + * will be clear. That's not enough, though: if the mm we're + * targeting isn't active, we won't directly flush the TLB, but + * we need to guarantee that the mm won't be reactivated and + * therefore reinstate stale entries prior to cpumask_clear. + * + * Solve this problem by brute force: if we don't flush the TLB + * directly, zap the PCID mapping. (We zap it using + * pcid_cpu_state instead of pcid_live_cpus to avoid excessive + * cacheline bounding.) + */ + + if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm)) { + zap_local_inactive_mm(f->flush_mm); return; + } + if (!f->flush_end) f->flush_end = f->flush_start + PAGE_SIZE; @@ -224,9 +384,10 @@ static void flush_tlb_func(void *info) } trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages); } - } else + } else { leave_mm(smp_processor_id()); - + zap_local_inactive_mm(f->flush_mm); + } } void native_flush_tlb_others(const struct cpumask *cpumask, @@ -259,6 +420,14 @@ static void propagate_tlb_flush(unsigned int this_cpu, { if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + /* + * Synchronize with barrier in stop_tlbflush_ipis; cpumask_clear + * must not be overridden by the pcid_live_cpus write in + * stop_tlbflush_ipis unless we sent an IPI. + */ + smp_wmb(); + + cpumask_clear(&mm->context.pcid_live_cpus); } void flush_tlb_current_task(void) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f181.google.com (mail-io0-f181.google.com [209.85.223.181]) by kanga.kvack.org (Postfix) with ESMTP id 6C01D828ED for ; Fri, 8 Jan 2016 18:31:04 -0500 (EST) Received: by mail-io0-f181.google.com with SMTP id 77so270537687ioc.2 for ; Fri, 08 Jan 2016 15:31:04 -0800 (PST) Received: from mail-ig0-x230.google.com (mail-ig0-x230.google.com. [2607:f8b0:4001:c05::230]) by mx.google.com with ESMTPS id z9si17753309ioi.192.2016.01.08.15.31.03 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 15:31:03 -0800 (PST) Received: by mail-ig0-x230.google.com with SMTP id z14so67825059igp.1 for ; Fri, 08 Jan 2016 15:31:03 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:31:03 -0800 Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > > Please play around and suggest (and run?) good benchmarks. It seems > to save around 100ns on cross-process context switches for me. Interesting. There was reportedly (I never saw it) a test-patch to use pcids inside of Intel a couple of years ago, and it never got outside because it didn't make a difference. Either things have changed (newer hardware with more pcids perhaps?) or you did a better job at it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f44.google.com (mail-oi0-f44.google.com [209.85.218.44]) by kanga.kvack.org (Postfix) with ESMTP id DD989828ED for ; Fri, 8 Jan 2016 18:36:33 -0500 (EST) Received: by mail-oi0-f44.google.com with SMTP id k206so20687485oia.1 for ; Fri, 08 Jan 2016 15:36:33 -0800 (PST) Received: from mail-ob0-x235.google.com (mail-ob0-x235.google.com. [2607:f8b0:4003:c01::235]) by mx.google.com with ESMTPS id rl4si868471oeb.36.2016.01.08.15.36.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 15:36:33 -0800 (PST) Received: by mail-ob0-x235.google.com with SMTP id ba1so367197849obb.3 for ; Fri, 08 Jan 2016 15:36:33 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 8 Jan 2016 15:36:13 -0800 Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Andy Lutomirski , the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jan 8, 2016 at 3:31 PM, Linus Torvalds wrote: > On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: >> >> Please play around and suggest (and run?) good benchmarks. It seems >> to save around 100ns on cross-process context switches for me. > > Interesting. There was reportedly (I never saw it) a test-patch to use > pcids inside of Intel a couple of years ago, and it never got outside > because it didn't make a difference. I have a copy of that patch, and my code works very differently. I only use 3 bits of PCID in this series. I could probably reduce that to 2 with little loss. 4 or more would be a waste. > > Either things have changed (newer hardware with more pcids perhaps?) > or you did a better job at it. On my Skylake laptop, all of the PCID bits appear to have at least some effect. Whether this means it gets hashed or whether this means that all of the bits are real, I don't know. I'll fiddle with it on an older machine. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f170.google.com (mail-io0-f170.google.com [209.85.223.170]) by kanga.kvack.org (Postfix) with ESMTP id 35567828ED for ; Fri, 8 Jan 2016 18:41:11 -0500 (EST) Received: by mail-io0-f170.google.com with SMTP id g73so104441571ioe.3 for ; Fri, 08 Jan 2016 15:41:11 -0800 (PST) Received: from mail-ig0-x22a.google.com (mail-ig0-x22a.google.com. [2607:f8b0:4001:c05::22a]) by mx.google.com with ESMTPS id tg1si2294908igb.83.2016.01.08.15.41.10 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 15:41:10 -0800 (PST) Received: by mail-ig0-x22a.google.com with SMTP id z14so86055408igp.0 for ; Fri, 08 Jan 2016 15:41:10 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:41:10 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > + /* > + * We mustn't be preempted or handle an IPI while reading and > + * writing CR3. Preemption could switch mms and switch back, and > + * an IPI could call leave_mm. Either of those could cause our > + * PCID to change asynchronously. > + */ > + raw_local_irq_save(flags); > native_write_cr3(native_read_cr3()); > + raw_local_irq_restore(flags); This seems sad for two reasons: - it adds unnecessary overhead on non-pcid setups (32-bit being an example of that) - on pcid setups, wouldn't invpcid_flush_single_context() be better? So on the whole I hate it. Why isn't this something like if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { invpcid_flush_single_context(); return; } native_write_cr3(native_read_cr3()); *without* any flag saving crud? And yes, that means that we'd require X86_FEATURE_INVPCID in order to use X86_FEATURE_PCID, but that seems fine. Or is there some reason you wanted the odd flags version? If so, that should be documented. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f173.google.com (mail-ig0-f173.google.com [209.85.213.173]) by kanga.kvack.org (Postfix) with ESMTP id 3B4C6828ED for ; Fri, 8 Jan 2016 18:42:33 -0500 (EST) Received: by mail-ig0-f173.google.com with SMTP id z14so86070694igp.0 for ; Fri, 08 Jan 2016 15:42:33 -0800 (PST) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com. [2607:f8b0:4001:c05::232]) by mx.google.com with ESMTPS id k11si512218iof.167.2016.01.08.15.42.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 15:42:32 -0800 (PST) Received: by mail-ig0-x232.google.com with SMTP id ik10so89747403igb.1 for ; Fri, 08 Jan 2016 15:42:32 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:42:32 -0800 Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Andy Lutomirski , the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jan 8, 2016 at 3:36 PM, Andy Lutomirski wrote: >> >> Either things have changed (newer hardware with more pcids perhaps?) >> or you did a better job at it. > > On my Skylake laptop, all of the PCID bits appear to have at least > some effect. Whether this means it gets hashed or whether this means > that all of the bits are real, I don't know. They have always gotten hashed, and no the bits aren't real - hardware doesn't actually have as many bits in the pcid as there are in cr3. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id 09A91828DE for ; Fri, 8 Jan 2016 18:56:00 -0500 (EST) Received: by mail-pa0-f54.google.com with SMTP id yy13so197168082pab.3 for ; Fri, 08 Jan 2016 15:56:00 -0800 (PST) Received: from mga04.intel.com (mga04.intel.com. [192.55.52.120]) by mx.google.com with ESMTP id p28si8181471pfi.217.2016.01.08.15.55.53 for ; Fri, 08 Jan 2016 15:55:53 -0800 (PST) Subject: Re: [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP References: From: Dave Hansen Message-ID: <56904C89.8050205@linux.intel.com> Date: Fri, 8 Jan 2016 15:55:53 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" On 01/08/2016 03:15 PM, Andy Lutomirski wrote: > @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void) > return 0; > } > late_initcall(create_tlb_single_page_flush_ceiling); > + > +#endif /* CONFIG_SMP */ Heh, I was about to complain that you #ifdef'd out my lovely INVLPG tunable. But I guess on UP you just get flush_tlb_mm_range() from: > static inline void flush_tlb_mm_range(struct mm_struct *mm, > unsigned long start, unsigned long end, unsigned long vmflag) > { > if (mm == current->active_mm) > __flush_tlb_up(); > } which doesn't even do INVLPG. How sad. Poor UP. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f41.google.com (mail-oi0-f41.google.com [209.85.218.41]) by kanga.kvack.org (Postfix) with ESMTP id 70FF1828DE for ; Fri, 8 Jan 2016 19:18:28 -0500 (EST) Received: by mail-oi0-f41.google.com with SMTP id p187so3840967oia.2 for ; Fri, 08 Jan 2016 16:18:28 -0800 (PST) Received: from mail-ob0-x22d.google.com (mail-ob0-x22d.google.com. [2607:f8b0:4003:c01::22d]) by mx.google.com with ESMTPS id j16si2851600oes.38.2016.01.08.16.18.27 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 16:18:27 -0800 (PST) Received: by mail-ob0-x22d.google.com with SMTP id ba1so367783664obb.3 for ; Fri, 08 Jan 2016 16:18:27 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 8 Jan 2016 16:18:08 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Jan 8, 2016 3:41 PM, "Linus Torvalds" wrote: > > On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > > + /* > > + * We mustn't be preempted or handle an IPI while reading and > > + * writing CR3. Preemption could switch mms and switch back, and > > + * an IPI could call leave_mm. Either of those could cause our > > + * PCID to change asynchronously. > > + */ > > + raw_local_irq_save(flags); > > native_write_cr3(native_read_cr3()); > > + raw_local_irq_restore(flags); > > This seems sad for two reasons: > > - it adds unnecessary overhead on non-pcid setups (32-bit being an > example of that) I can certainly skip the flag saving on !PCID. > > - on pcid setups, wouldn't invpcid_flush_single_context() be better? > I played with that and it was slower. I don't pretend that makes any sense. > So on the whole I hate it. > > Why isn't this something like > > if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { > invpcid_flush_single_context(); > return; > } > native_write_cr3(native_read_cr3()); > > *without* any flag saving crud? > > And yes, that means that we'd require X86_FEATURE_INVPCID in order to > use X86_FEATURE_PCID, but that seems fine. I have an SNB "Extreme" with PCID but not INVPCID, and there could be a whole generation of servers like that. I think we should fully support them. We might be able to get away with just disabling preemption instead of IRQs, at least if mm == active_mm. > > Or is there some reason you wanted the odd flags version? If so, that > should be documented. What do you mean "odd"? --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f170.google.com (mail-pf0-f170.google.com [209.85.192.170]) by kanga.kvack.org (Postfix) with ESMTP id 461106B025B for ; Fri, 8 Jan 2016 19:27:40 -0500 (EST) Received: by mail-pf0-f170.google.com with SMTP id n128so16553502pfn.3 for ; Fri, 08 Jan 2016 16:27:40 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id sl3si20021783pac.220.2016.01.08.16.27.39 for ; Fri, 08 Jan 2016 16:27:39 -0800 (PST) Subject: Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID References: From: Dave Hansen Message-ID: <569053F9.7060002@linux.intel.com> Date: Fri, 8 Jan 2016 16:27:37 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" On 01/08/2016 03:15 PM, Andy Lutomirski wrote: > + * The guiding principle of this code is that TLB entries that have > + * survived more than a small number of context switches are mostly > + * useless, so we don't try very hard not to evict them. Big ack on that. The original approach tried to keep track of the full 4k worth of possible PCIDs, it also needed an additional cpumask (which it dynamically allocated) for where the PCID was active in addition to the normal "where has this mm been" mask. That's a lot of extra data to mangle, and I can definitely see your approach as being nicer, *IF* the hardware isn't doing something useful with the other 9 bits of PCID that you're throwing away. ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f175.google.com (mail-ob0-f175.google.com [209.85.214.175]) by kanga.kvack.org (Postfix) with ESMTP id 117EC828DE for ; Fri, 8 Jan 2016 21:20:19 -0500 (EST) Received: by mail-ob0-f175.google.com with SMTP id bx1so341693258obb.0 for ; Fri, 08 Jan 2016 18:20:19 -0800 (PST) Received: from mail-ob0-x235.google.com (mail-ob0-x235.google.com. [2607:f8b0:4003:c01::235]) by mx.google.com with ESMTPS id v85si9807625oif.102.2016.01.08.18.20.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 18:20:18 -0800 (PST) Received: by mail-ob0-x235.google.com with SMTP id bx1so341693066obb.0 for ; Fri, 08 Jan 2016 18:20:17 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <569053F9.7060002@linux.intel.com> References: <569053F9.7060002@linux.intel.com> From: Andy Lutomirski Date: Fri, 8 Jan 2016 18:19:58 -0800 Message-ID: Subject: Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jan 8, 2016 at 4:27 PM, Dave Hansen wrote: > On 01/08/2016 03:15 PM, Andy Lutomirski wrote: >> + * The guiding principle of this code is that TLB entries that have >> + * survived more than a small number of context switches are mostly >> + * useless, so we don't try very hard not to evict them. > > Big ack on that. The original approach tried to keep track of the full > 4k worth of possible PCIDs, it also needed an additional cpumask (which > it dynamically allocated) for where the PCID was active in addition to > the normal "where has this mm been" mask. My patch has a similar extra cpumask, but at least I didn't dynamically allocate it. I did it because I need a 100% reliable way to tell whether a given mm has a valid PCID in a cpu's PCID LRU list, as opposed to just matching due to struct mm reuse or similar. I also need the ability to blow away old mappings, which I can do by clearing the cpumask. This happens in init_new_context and in propagate_tlb_flush. The other way to do it would be to store some kind of generation counter in the per-cpu list. I could use a global 64-bit atomic counter to allocate never-reused mm ids (it's highly unlikely that a system will run long enough for such a counter to overflow -- it could only ever be incremented every few ns, giving hundreds of years of safety), but that's kind of expensive. I could use a per-cpu allocator, but 54 bits per cpu is uncomfortably small unless we have wraparound handling. We could do 64 bits per cpu for very cheap counter allocation, but then the "zap the pcid" logic gets much nastier in that neither the percpu entries nor the per-mm generation counter entries don't fit in a word any more. Maybe that's fine. What we can't do easily is have a per-mm generation counter, because freeing an mm and reallocating it in the same place needs to reliably zap the pcid on all CPUs. Anyway, this problem is clearly solvable, but I haven't thought of a straightforward solution that doesn't involve rarely-executed code paths, and that makes me a bit nervous. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f180.google.com (mail-io0-f180.google.com [209.85.223.180]) by kanga.kvack.org (Postfix) with ESMTP id 8EBC9828DE for ; Fri, 8 Jan 2016 21:20:25 -0500 (EST) Received: by mail-io0-f180.google.com with SMTP id 77so272854286ioc.2 for ; Fri, 08 Jan 2016 18:20:25 -0800 (PST) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com. [2607:f8b0:4001:c05::232]) by mx.google.com with ESMTPS id qg4si3365315igb.21.2016.01.08.18.20.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jan 2016 18:20:25 -0800 (PST) Received: by mail-ig0-x232.google.com with SMTP id t15so64001749igr.0 for ; Fri, 08 Jan 2016 18:20:25 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 18:20:24 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski wrote: >> >> - on pcid setups, wouldn't invpcid_flush_single_context() be better? > > I played with that and it was slower. I don't pretend that makes any sense. Ugh. I guess reading and writing cr3 has been optimized. >> And yes, that means that we'd require X86_FEATURE_INVPCID in order to >> use X86_FEATURE_PCID, but that seems fine. > > I have an SNB "Extreme" with PCID but not INVPCID, and there could be > a whole generation of servers like that. I think we should fully > support them. Can you check the timings? IOW, is it a win on SNB? I think originally Intel only had two actual bits of process context ID in the TLB, and it was meant to be used for virtualization or something. Together with the hashing (to make it always appear as 12 bits to software - a nice idea but also means that the hardware ends up invalidating more than software really expects), it may not work all that well. That _could_ explain why the original patch from intel didn't work. > We might be able to get away with just disabling preemption instead of > IRQs, at least if mm == active_mm. I'm not convinced it is all that much faster. Of course, it's nicer on non-preempt, but nobody seems to run things that way. >> Or is there some reason you wanted the odd flags version? If so, that >> should be documented. > > What do you mean "odd"? It's odd because it makes no sense for non-pcid (christ, I wish Intel had just called it "asid" instead, "pcid" always makes me react to "pci"), and I think it would make more sense to pair up the pcid case with the invpcid rather than have those preemption rules here. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f44.google.com (mail-wm0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 1AAAB828F3 for ; Sun, 10 Jan 2016 13:59:29 -0500 (EST) Received: by mail-wm0-f44.google.com with SMTP id u188so191127039wmu.1 for ; Sun, 10 Jan 2016 10:59:29 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id r138si16855772wmg.30.2016.01.10.10.59.27 for ; Sun, 10 Jan 2016 10:59:27 -0800 (PST) Date: Sun, 10 Jan 2016 19:59:16 +0100 From: Borislav Petkov Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Message-ID: <20160110185916.GD22896@pd.tnic> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andrey Ryabinin + Andrey. On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote: > Otherwise terrible things happen if some of the callbacks end up > calling into KASAN in unexpected places. > > This has no obvious symptoms yet, but adding a memory reference to > native_flush_tlb_global without this blows up on KASAN kernels. > > Signed-off-by: Andy Lutomirski > --- > arch/x86/kernel/Makefile | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile > index b1b78ffe01d0..b7cd5bdf314b 100644 > --- a/arch/x86/kernel/Makefile > +++ b/arch/x86/kernel/Makefile > @@ -19,6 +19,7 @@ endif > KASAN_SANITIZE_head$(BITS).o := n > KASAN_SANITIZE_dumpstack.o := n > KASAN_SANITIZE_dumpstack_$(BITS).o := n > +KASAN_SANITIZE_paravirt.o := n > > CFLAGS_irq.o := -I$(src)/../include/asm/trace Shouldn't we take this one irrespectively of what happens to the rest in the patchset? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f54.google.com (mail-wm0-f54.google.com [74.125.82.54]) by kanga.kvack.org (Postfix) with ESMTP id 56538828F3 for ; Mon, 11 Jan 2016 05:51:09 -0500 (EST) Received: by mail-wm0-f54.google.com with SMTP id l65so205686806wmf.1 for ; Mon, 11 Jan 2016 02:51:09 -0800 (PST) Received: from mail-wm0-x243.google.com (mail-wm0-x243.google.com. [2a00:1450:400c:c09::243]) by mx.google.com with ESMTPS id p13si22023736wmb.82.2016.01.11.02.51.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 02:51:08 -0800 (PST) Received: by mail-wm0-x243.google.com with SMTP id l65so25658006wmf.3 for ; Mon, 11 Jan 2016 02:51:08 -0800 (PST) Date: Mon, 11 Jan 2016 11:51:05 +0100 From: Ingo Molnar Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Message-ID: <20160111105105.GB29448@gmail.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Andy Lutomirski , Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst * Linus Torvalds wrote: > >> Or is there some reason you wanted the odd flags version? If so, that > >> should be documented. > > > > What do you mean "odd"? > > It's odd because it makes no sense for non-pcid (christ, I wish Intel had just > called it "asid" instead, "pcid" always makes me react to "pci"), and I think it > would make more sense to pair up the pcid case with the invpcid rather than have > those preemption rules here. The naming is really painful, so a trivial suggestion: could we just name all the Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID' nomenclature in the very lowest level code? I.e. rename pcid_live_cpus et al and most functions to the asid or ctx_id or asid naming scheme or so. That would hide most of the naming ugliness. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f45.google.com (mail-lf0-f45.google.com [209.85.215.45]) by kanga.kvack.org (Postfix) with ESMTP id 91C7C828F3 for ; Mon, 11 Jan 2016 07:50:45 -0500 (EST) Received: by mail-lf0-f45.google.com with SMTP id h129so17029055lfh.3 for ; Mon, 11 Jan 2016 04:50:45 -0800 (PST) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id g14si13826144lfd.41.2016.01.11.04.50.44 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 04:50:44 -0800 (PST) From: Andrey Ryabinin Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Date: Mon, 11 Jan 2016 15:51:17 +0300 Message-ID: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> In-Reply-To: <20160110185916.GD22896@pd.tnic> References: <20160110185916.GD22896@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov , Andy Lutomirski , x86@kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" Cc: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrey Ryabinin On 01/10/2016 09:59 PM, Borislav Petkov wrote: > + Andrey. > > On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote: >> Otherwise terrible things happen if some of the callbacks end up >> calling into KASAN in unexpected places. >> >> This has no obvious symptoms yet, but adding a memory reference to >> native_flush_tlb_global without this blows up on KASAN kernels. >> >> Signed-off-by: Andy Lutomirski >> --- >> arch/x86/kernel/Makefile | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile >> index b1b78ffe01d0..b7cd5bdf314b 100644 >> --- a/arch/x86/kernel/Makefile >> +++ b/arch/x86/kernel/Makefile >> @@ -19,6 +19,7 @@ endif >> KASAN_SANITIZE_head$(BITS).o := n >> KASAN_SANITIZE_dumpstack.o := n >> KASAN_SANITIZE_dumpstack_$(BITS).o := n >> +KASAN_SANITIZE_paravirt.o := n >> >> CFLAGS_irq.o := -I$(src)/../include/asm/trace > > Shouldn't we take this one irrespectively of what happens to the rest in > the patchset? > I don't think that this patch is the right way to solve the problem. The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" should fix Andy's problem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f169.google.com (mail-lb0-f169.google.com [209.85.217.169]) by kanga.kvack.org (Postfix) with ESMTP id 4406D828F3 for ; Mon, 11 Jan 2016 07:50:50 -0500 (EST) Received: by mail-lb0-f169.google.com with SMTP id sv6so245824935lbb.0 for ; Mon, 11 Jan 2016 04:50:50 -0800 (PST) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id h64si33935787lfb.126.2016.01.11.04.50.48 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 04:50:49 -0800 (PST) From: Andrey Ryabinin Subject: [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Date: Mon, 11 Jan 2016 15:51:18 +0300 Message-ID: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov , Andy Lutomirski , x86@kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" Cc: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrey Ryabinin Currently we clear kasan_zero_page before __flush_tlb_all(). This works with current implementation of native_flush_tlb[_global]() because it doesn't cause do any writes to kasan shadow memory. But any subtle change made in native_flush_tlb*() could break this. Also current code seems doesn't work for paravirt guests (lguest). Only after the TLB flush we can be sure that kasan_zero_page is not used as early shadow anymore (instrumented code will not write to it). So it should cleared it only after the TLB flush. Signed-off-by: Andrey Ryabinin --- arch/x86/mm/kasan_init_64.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index d470cf2..303e470 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -120,11 +120,16 @@ void __init kasan_init(void) kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END), (void *)KASAN_SHADOW_END); - memset(kasan_zero_page, 0, PAGE_SIZE); - load_cr3(init_level4_pgt); __flush_tlb_all(); - init_task.kasan_depth = 0; + /* + * kasan_zero_page has been used as early shadow memory, thus it may + * contain some garbage. Now we can clear it, since after the TLB flush + * no one should write to it. + */ + memset(kasan_zero_page, 0, PAGE_SIZE); + + init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); } -- 2.4.10 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f47.google.com (mail-lf0-f47.google.com [209.85.215.47]) by kanga.kvack.org (Postfix) with ESMTP id 10517828F3 for ; Mon, 11 Jan 2016 07:50:59 -0500 (EST) Received: by mail-lf0-f47.google.com with SMTP id 17so10407979lfz.1 for ; Mon, 11 Jan 2016 04:50:59 -0800 (PST) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id th5si68987388lbb.36.2016.01.11.04.50.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 04:50:57 -0800 (PST) From: Andrey Ryabinin Subject: [PATCH 2/2] x86/kasan: write protect kasan zero shadow Date: Mon, 11 Jan 2016 15:51:19 +0300 Message-ID: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov , Andy Lutomirski , x86@kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" Cc: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrey Ryabinin After kasan_init() executed, no one is allowed to write to kasan_zero_page, so write protect it. Signed-off-by: Andrey Ryabinin --- arch/x86/mm/kasan_init_64.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 303e470..1b1110f 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -125,10 +125,16 @@ void __init kasan_init(void) /* * kasan_zero_page has been used as early shadow memory, thus it may - * contain some garbage. Now we can clear it, since after the TLB flush - * no one should write to it. + * contain some garbage. Now we can clear and write protect it, since + * after the TLB flush no one should write to it. */ memset(kasan_zero_page, 0, PAGE_SIZE); + for (i = 0; i < PTRS_PER_PTE; i++) { + pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO); + set_pte(&kasan_zero_pte[i], pte); + } + /* Flush TLBs again to be sure that write protection applied. */ + __flush_tlb_all(); init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); -- 2.4.10 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f177.google.com (mail-ob0-f177.google.com [209.85.214.177]) by kanga.kvack.org (Postfix) with ESMTP id EA0E9828DF for ; Wed, 13 Jan 2016 18:33:17 -0500 (EST) Received: by mail-ob0-f177.google.com with SMTP id py5so101292005obc.2 for ; Wed, 13 Jan 2016 15:33:17 -0800 (PST) Received: from mail-oi0-x22f.google.com (mail-oi0-x22f.google.com. [2607:f8b0:4003:c06::22f]) by mx.google.com with ESMTPS id j186si4168147oib.4.2016.01.13.15.33.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 13 Jan 2016 15:33:17 -0800 (PST) Received: by mail-oi0-x22f.google.com with SMTP id w75so73247120oie.0 for ; Wed, 13 Jan 2016 15:33:16 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160111105105.GB29448@gmail.com> References: <20160111105105.GB29448@gmail.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:32:56 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Mon, Jan 11, 2016 at 2:51 AM, Ingo Molnar wrote: > > * Linus Torvalds wrote: > >> >> Or is there some reason you wanted the odd flags version? If so, that >> >> should be documented. >> > >> > What do you mean "odd"? >> >> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just >> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it >> would make more sense to pair up the pcid case with the invpcid rather than have >> those preemption rules here. > > The naming is really painful, so a trivial suggestion: could we just name all the > Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID' > nomenclature in the very lowest level code? I'd be okay with "pctx_id" or "pctxid" for this, I think. I'd like to at least make it somewhat obvious how it maps back to hardware. FWIW, I'd guess that Intel deviated from convention because their actual address space id is (vpid, pcid), and calling it (vpid, asid) might have been slightly confusing. Or not. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f179.google.com (mail-ob0-f179.google.com [209.85.214.179]) by kanga.kvack.org (Postfix) with ESMTP id 9CD17828DF for ; Wed, 13 Jan 2016 18:36:06 -0500 (EST) Received: by mail-ob0-f179.google.com with SMTP id py5so101341661obc.2 for ; Wed, 13 Jan 2016 15:36:06 -0800 (PST) Received: from mail-ob0-x231.google.com (mail-ob0-x231.google.com. [2607:f8b0:4003:c01::231]) by mx.google.com with ESMTPS id r5si4135541obf.99.2016.01.13.15.36.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 13 Jan 2016 15:36:06 -0800 (PST) Received: by mail-ob0-x231.google.com with SMTP id is5so69047860obc.0 for ; Wed, 13 Jan 2016 15:36:05 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:35:46 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds wrote: > On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski wrote: >>> >>> - on pcid setups, wouldn't invpcid_flush_single_context() be better? >> >> I played with that and it was slower. I don't pretend that makes any sense. > > Ugh. I guess reading and writing cr3 has been optimized. > >>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to >>> use X86_FEATURE_PCID, but that seems fine. >> >> I have an SNB "Extreme" with PCID but not INVPCID, and there could be >> a whole generation of servers like that. I think we should fully >> support them. > > Can you check the timings? IOW, is it a win on SNB? ~80ns gain on SNB. It's actually quite impressive on SNB: it knocks the penalty for mm switches down to 20ns or so, which I find to be fairly amazing. (This is at 3.8GHz or thereabouts.) > > I think originally Intel only had two actual bits of process context > ID in the TLB, and it was meant to be used for virtualization or > something. Together with the hashing (to make it always appear as 12 > bits to software - a nice idea but also means that the hardware ends > up invalidating more than software really expects), it may not work > all that well. > > That _could_ explain why the original patch from intel didn't work. > >> We might be able to get away with just disabling preemption instead of >> IRQs, at least if mm == active_mm. > > I'm not convinced it is all that much faster. Of course, it's nicer on > non-preempt, but nobody seems to run things that way. My current testing version has three different code paths now. If INVPCID and PCID are both available, then it uses INVPCID. If PCID is available but INVPCID is not, it does raw_local_irqsave. If PCID is not available, it just does the CR3 read/write. Yeah, it's ugly, and it's a big blob of code to do something trivial, but it seems to work and it should be the right thing to do in most cases. Can anyone here ask a hardware or microcode person what's going on with CR3 writes possibly being faster than INVPCID? Is there some trick to it? --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 3BD03828DF for ; Wed, 13 Jan 2016 18:43:39 -0500 (EST) Received: by mail-pa0-f47.google.com with SMTP id yy13so273524690pab.3 for ; Wed, 13 Jan 2016 15:43:39 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id m82si5045583pfi.249.2016.01.13.15.43.38 for ; Wed, 13 Jan 2016 15:43:38 -0800 (PST) Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 References: From: Dave Hansen Message-ID: <5696E129.9000804@linux.intel.com> Date: Wed, 13 Jan 2016 15:43:37 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , Linus Torvalds Cc: Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On 01/13/2016 03:35 PM, Andy Lutomirski wrote: > Can anyone here ask a hardware or microcode person what's going on > with CR3 writes possibly being faster than INVPCID? Is there some > trick to it? I just went and measured it myself this morning. "INVPCID Type 3" (all contexts no global) on a Skylake system was 15% slower than a CR3 write. Is that in the same ballpark from what you've observed? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f42.google.com (mail-oi0-f42.google.com [209.85.218.42]) by kanga.kvack.org (Postfix) with ESMTP id 5B5C2828DF for ; Wed, 13 Jan 2016 18:51:56 -0500 (EST) Received: by mail-oi0-f42.google.com with SMTP id k206so105147571oia.1 for ; Wed, 13 Jan 2016 15:51:56 -0800 (PST) Received: from mail-ob0-x232.google.com (mail-ob0-x232.google.com. [2607:f8b0:4003:c01::232]) by mx.google.com with ESMTPS id dp7si4187926obb.40.2016.01.13.15.51.55 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 13 Jan 2016 15:51:55 -0800 (PST) Received: by mail-ob0-x232.google.com with SMTP id vt7so69361904obb.1 for ; Wed, 13 Jan 2016 15:51:55 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <5696E129.9000804@linux.intel.com> References: <5696E129.9000804@linux.intel.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:51:35 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen wrote: > On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >> Can anyone here ask a hardware or microcode person what's going on >> with CR3 writes possibly being faster than INVPCID? Is there some >> trick to it? > > I just went and measured it myself this morning. "INVPCID Type 3" (all > contexts no global) on a Skylake system was 15% slower than a CR3 write. > > Is that in the same ballpark from what you've observed? > > It's similar, except that I was comparing "INVPCID Type 1" (single context no globals) to a CR3 write. Type 2, at least, is dramatically faster than the pair of CR4 writes it replaces. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f173.google.com (mail-pf0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id 475FA828DF for ; Wed, 13 Jan 2016 18:56:18 -0500 (EST) Received: by mail-pf0-f173.google.com with SMTP id e65so89470466pfe.0 for ; Wed, 13 Jan 2016 15:56:18 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id y11si5086103pas.239.2016.01.13.15.56.17 for ; Wed, 13 Jan 2016 15:56:17 -0800 (PST) Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 References: <5696E129.9000804@linux.intel.com> From: Dave Hansen Message-ID: <5696E420.9040704@linux.intel.com> Date: Wed, 13 Jan 2016 15:56:16 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On 01/13/2016 03:51 PM, Andy Lutomirski wrote: > On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen > wrote: >> On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >>> Can anyone here ask a hardware or microcode person what's going on >>> with CR3 writes possibly being faster than INVPCID? Is there some >>> trick to it? >> >> I just went and measured it myself this morning. "INVPCID Type 3" (all >> contexts no global) on a Skylake system was 15% slower than a CR3 write. >> >> Is that in the same ballpark from what you've observed? > > It's similar, except that I was comparing "INVPCID Type 1" (single > context no globals) to a CR3 write. Ahh, because you're using PCID... That one I saw as being ~1.85x the number of cycles that a CR3 write was. > Type 2, at least, is dramatically faster than the pair of CR4 writes > it replaces. Yeah, I saw the same thing. Type 2 was ~2.4x faster than the CR4 writes. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f181.google.com (mail-ob0-f181.google.com [209.85.214.181]) by kanga.kvack.org (Postfix) with ESMTP id 8E7476B025A for ; Wed, 13 Jan 2016 19:34:58 -0500 (EST) Received: by mail-ob0-f181.google.com with SMTP id py5so102344340obc.2 for ; Wed, 13 Jan 2016 16:34:58 -0800 (PST) Received: from mail-oi0-x232.google.com (mail-oi0-x232.google.com. [2607:f8b0:4003:c06::232]) by mx.google.com with ESMTPS id f10si4283423oek.62.2016.01.13.16.34.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 13 Jan 2016 16:34:57 -0800 (PST) Received: by mail-oi0-x232.google.com with SMTP id w75so74033005oie.0 for ; Wed, 13 Jan 2016 16:34:57 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <5696E420.9040704@linux.intel.com> References: <5696E129.9000804@linux.intel.com> <5696E420.9040704@linux.intel.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 16:34:37 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst On Wed, Jan 13, 2016 at 3:56 PM, Dave Hansen wrote: > On 01/13/2016 03:51 PM, Andy Lutomirski wrote: >> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen >> wrote: >>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >>>> Can anyone here ask a hardware or microcode person what's going on >>>> with CR3 writes possibly being faster than INVPCID? Is there some >>>> trick to it? >>> >>> I just went and measured it myself this morning. "INVPCID Type 3" (all >>> contexts no global) on a Skylake system was 15% slower than a CR3 write. >>> >>> Is that in the same ballpark from what you've observed? >> >> It's similar, except that I was comparing "INVPCID Type 1" (single >> context no globals) to a CR3 write. > > Ahh, because you're using PCID... That one I saw as being ~1.85x the > number of cycles that a CR3 write was. > I think that settles it, then: if (static_cpu_has_safe(X86_FEATURE_PCID)) { raw_local_irqsave(); native_write_cr3(native_read_cr3()); raw_local_irqrestore(); } else { native_write_cr3(native_read_cr3()); } I don't think it's worth hacking more complexity into switch_mm to make that annoyance go away. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f45.google.com (mail-oi0-f45.google.com [209.85.218.45]) by kanga.kvack.org (Postfix) with ESMTP id DB6AA6B0254 for ; Mon, 18 Jan 2016 17:24:26 -0500 (EST) Received: by mail-oi0-f45.google.com with SMTP id w75so151094470oie.0 for ; Mon, 18 Jan 2016 14:24:26 -0800 (PST) Received: from mail-ob0-x234.google.com (mail-ob0-x234.google.com. [2607:f8b0:4003:c01::234]) by mx.google.com with ESMTPS id cm5si28635227oeb.87.2016.01.18.14.24.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Jan 2016 14:24:26 -0800 (PST) Received: by mail-ob0-x234.google.com with SMTP id py5so204651110obc.2 for ; Mon, 18 Jan 2016 14:24:25 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> From: Andy Lutomirski Date: Mon, 18 Jan 2016 14:24:06 -0800 Message-ID: Subject: Re: [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andrey Ryabinin Cc: Borislav Petkov , Andy Lutomirski , X86 ML , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin wrote: > Currently we clear kasan_zero_page before __flush_tlb_all(). This > works with current implementation of native_flush_tlb[_global]() > because it doesn't cause do any writes to kasan shadow memory. > But any subtle change made in native_flush_tlb*() could break this. > Also current code seems doesn't work for paravirt guests (lguest). > > Only after the TLB flush we can be sure that kasan_zero_page is not > used as early shadow anymore (instrumented code will not write to it). > So it should cleared it only after the TLB flush. This seems to fix the issue with my patch set. Thanks. Tested-by: Andy Lutomirski --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f45.google.com (mail-oi0-f45.google.com [209.85.218.45]) by kanga.kvack.org (Postfix) with ESMTP id 95E436B0255 for ; Mon, 18 Jan 2016 17:24:52 -0500 (EST) Received: by mail-oi0-f45.google.com with SMTP id p187so164698151oia.2 for ; Mon, 18 Jan 2016 14:24:52 -0800 (PST) Received: from mail-ob0-x229.google.com (mail-ob0-x229.google.com. [2607:f8b0:4003:c01::229]) by mx.google.com with ESMTPS id p9si28624337oev.77.2016.01.18.14.24.52 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Jan 2016 14:24:52 -0800 (PST) Received: by mail-ob0-x229.google.com with SMTP id py5so204657910obc.2 for ; Mon, 18 Jan 2016 14:24:52 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> From: Andy Lutomirski Date: Mon, 18 Jan 2016 14:24:32 -0800 Message-ID: Subject: Re: [PATCH 2/2] x86/kasan: write protect kasan zero shadow Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andrey Ryabinin Cc: Borislav Petkov , Andy Lutomirski , X86 ML , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin wrote: > After kasan_init() executed, no one is allowed to write to kasan_zero_page, > so write protect it. This seems to work for me. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f49.google.com (mail-wm0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id ABF606B0009 for ; Fri, 29 Jan 2016 05:35:42 -0500 (EST) Received: by mail-wm0-f49.google.com with SMTP id 128so46919875wmz.1 for ; Fri, 29 Jan 2016 02:35:42 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id b63si6807674wme.9.2016.01.29.02.35.41 for ; Fri, 29 Jan 2016 02:35:41 -0800 (PST) Date: Fri, 29 Jan 2016 11:35:35 +0100 From: Borislav Petkov Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Message-ID: <20160129103535.GA10187@pd.tnic> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrey Ryabinin Cc: Andy Lutomirski , x86@kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Jan 11, 2016 at 03:51:17PM +0300, Andrey Ryabinin wrote: > I don't think that this patch is the right way to solve the problem. > The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" > should fix Andy's problem. Both applied, thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id 4AE896B007E for ; Fri, 3 Jun 2016 13:42:13 -0400 (EDT) Received: by mail-pa0-f70.google.com with SMTP id di3so102929316pab.0 for ; Fri, 03 Jun 2016 10:42:13 -0700 (PDT) Received: from mail-pa0-x244.google.com (mail-pa0-x244.google.com. [2607:f8b0:400e:c03::244]) by mx.google.com with ESMTPS id do10si6871150pac.124.2016.06.03.10.42.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Jun 2016 10:42:12 -0700 (PDT) Received: by mail-pa0-x244.google.com with SMTP id x1so6429090pav.1 for ; Fri, 03 Jun 2016 10:42:12 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization From: Nadav Amit In-Reply-To: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> Date: Fri, 3 Jun 2016 10:42:10 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: x86@kernel.org, LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Following this patch, if (current->active_mm !=3D mm), flush_tlb_page() = still doesn=E2=80=99t call smp_mb() before checking mm_cpumask(mm). In contrast, flush_tlb_mm_range() does call smp_mb(). Is there a reason for this discrepancy? Thanks, Nadav Andy Lutomirski wrote: > When switch_mm activates a new pgd, it also sets a bit that tells > other CPUs that the pgd is in use so that tlb flush IPIs will be > sent. In order for that to work correctly, the bit needs to be > visible prior to loading the pgd and therefore starting to fill the > local TLB. >=20 > Document all the barriers that make this work correctly and add a > couple that were missing. >=20 > Cc: stable@vger.kernel.org > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/mmu_context.h | 33 = ++++++++++++++++++++++++++++++++- > arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- > 2 files changed, 58 insertions(+), 4 deletions(-) >=20 > diff --git a/arch/x86/include/asm/mmu_context.h = b/arch/x86/include/asm/mmu_context.h > index 379cd3658799..1edc9cd198b8 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct = *prev, struct mm_struct *next, > #endif > cpumask_set_cpu(cpu, mm_cpumask(next)); >=20 > - /* Re-load page tables */ > + /* > + * Re-load page tables. > + * > + * This logic has an ordering constraint: > + * > + * CPU 0: Write to a PTE for 'next' > + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send = IPI. > + * CPU 1: set bit 1 in next's mm_cpumask > + * CPU 1: load from the PTE that CPU 0 writes = (implicit) > + * > + * We need to prevent an outcome in which CPU 1 observes > + * the new PTE value and CPU 0 observes bit 1 clear in > + * mm_cpumask. (If that occurs, then the IPI will never > + * be sent, and CPU 0's TLB will contain a stale entry.) > + * > + * The bad outcome can occur if either CPU's load is > + * reordered before that CPU's store, so both CPUs much > + * execute full barriers to prevent this from happening. > + * > + * Thus, switch_mm needs a full barrier between the > + * store to mm_cpumask and any operation that could load > + * from next->pgd. This barrier synchronizes with > + * remote TLB flushers. Fortunately, load_cr3 is > + * serializing and thus acts as a full barrier. > + * > + */ > load_cr3(next->pgd); > + > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, = TLB_FLUSH_ALL); >=20 > /* Stop flush ipis for the previous mm */ > @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct = *prev, struct mm_struct *next, > * schedule, protecting us from simultaneous = changes. > */ > cpumask_set_cpu(cpu, mm_cpumask(next)); > + > /* > * We were in lazy tlb mode and leave_mm = disabled > * tlb flush IPI delivery. We must reload CR3 > * to make sure to use no freed page tables. > + * > + * As above, this is a barrier that forces > + * TLB repopulation to be ordered after the > + * store to mm_cpumask. > */ > load_cr3(next->pgd); > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, = TLB_FLUSH_ALL); > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 8ddb5d0d66fb..8f4cc3dfac32 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -161,7 +161,10 @@ void flush_tlb_current_task(void) > preempt_disable(); >=20 > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > + > + /* This is an implicit full barrier that synchronizes with = switch_mm. */ > local_flush_tlb(); > + > trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < = nr_cpu_ids) > flush_tlb_others(mm_cpumask(mm), mm, 0UL, = TLB_FLUSH_ALL); > @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, = unsigned long start, > unsigned long base_pages_to_flush =3D TLB_FLUSH_ALL; >=20 > preempt_disable(); > - if (current->active_mm !=3D mm) > + if (current->active_mm !=3D mm) { > + /* Synchronize with switch_mm. */ > + smp_mb(); > + > goto out; > + } >=20 > if (!current->mm) { > leave_mm(smp_processor_id()); > + > + /* Synchronize with switch_mm. */ > + smp_mb(); > + > goto out; > } >=20 > if ((end !=3D TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) > base_pages_to_flush =3D (end - start) >> PAGE_SHIFT; >=20 > + /* > + * Both branches below are implicit full barriers (MOV to CR or > + * INVLPG) that synchronize with switch_mm. > + */ > if (base_pages_to_flush > tlb_single_page_flush_ceiling) { > base_pages_to_flush =3D TLB_FLUSH_ALL; > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, = unsigned long start) > preempt_disable(); >=20 > if (current->active_mm =3D=3D mm) { > - if (current->mm) > + if (current->mm) { > + /* > + * Implicit full barrier (INVLPG) that = synchronizes > + * with switch_mm. > + */ > __flush_tlb_one(start); > - else > + } else { > leave_mm(smp_processor_id()); > + > + /* Synchronize with switch_mm. */ > + smp_mb(); > + } > } >=20 > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < = nr_cpu_ids) > --=20 > 2.5.0 >=20 > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f199.google.com (mail-ig0-f199.google.com [209.85.213.199]) by kanga.kvack.org (Postfix) with ESMTP id 0B9346B007E for ; Thu, 9 Jun 2016 13:24:39 -0400 (EDT) Received: by mail-ig0-f199.google.com with SMTP id lp2so62321178igb.3 for ; Thu, 09 Jun 2016 10:24:39 -0700 (PDT) Received: from mail-oi0-x22b.google.com (mail-oi0-x22b.google.com. [2607:f8b0:4003:c06::22b]) by mx.google.com with ESMTPS id 62si3478105ots.136.2016.06.09.10.24.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Jun 2016 10:24:38 -0700 (PDT) Received: by mail-oi0-x22b.google.com with SMTP id p204so74077333oih.3 for ; Thu, 09 Jun 2016 10:24:38 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> From: Andy Lutomirski Date: Thu, 9 Jun 2016 10:24:18 -0700 Message-ID: Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , X86 ML , LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit wrote: > Following this patch, if (current->active_mm !=3D mm), flush_tlb_page() s= till > doesn=E2=80=99t call smp_mb() before checking mm_cpumask(mm). > > In contrast, flush_tlb_mm_range() does call smp_mb(). > > Is there a reason for this discrepancy? Not that I can remember. Is the remote flush case likely to be racy? --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id 441F86B0253 for ; Thu, 9 Jun 2016 15:46:00 -0400 (EDT) Received: by mail-pa0-f70.google.com with SMTP id fg1so67030443pad.1 for ; Thu, 09 Jun 2016 12:46:00 -0700 (PDT) Received: from mail-pf0-x234.google.com (mail-pf0-x234.google.com. [2607:f8b0:400e:c00::234]) by mx.google.com with ESMTPS id g195si9012685pfb.182.2016.06.09.12.45.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Jun 2016 12:45:59 -0700 (PDT) Received: by mail-pf0-x234.google.com with SMTP id 62so15841269pfd.1 for ; Thu, 09 Jun 2016 12:45:58 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization From: Nadav Amit In-Reply-To: Date: Thu, 9 Jun 2016 12:45:55 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Andy Lutomirski , X86 ML , LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Andy Lutomirski wrote: > On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit = wrote: >> Following this patch, if (current->active_mm !=3D mm), = flush_tlb_page() still >> doesn=E2=80=99t call smp_mb() before checking mm_cpumask(mm). >>=20 >> In contrast, flush_tlb_mm_range() does call smp_mb(). >>=20 >> Is there a reason for this discrepancy? >=20 > Not that I can remember. Is the remote flush case likely to be racy? You replied separately on another email that included a patch to fix this case. It turns out smp_mb is not needed on flush_tlb_page, since the PTE is always updated using an atomic operation. Yet, a compiler=20 barrier is still needed, so I added smp_mb__after_atomic instead. Nadav= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id C740A6B0038 for ; Mon, 5 Sep 2016 21:22:20 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id m139so68413405wma.0 for ; Mon, 05 Sep 2016 18:22:20 -0700 (PDT) Received: from mail-wm0-x243.google.com (mail-wm0-x243.google.com. [2a00:1450:400c:c09::243]) by mx.google.com with ESMTPS id v2si28195045wjh.115.2016.09.05.18.22.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Sep 2016 18:22:19 -0700 (PDT) Received: by mail-wm0-x243.google.com with SMTP id w12so6516612wmf.1 for ; Mon, 05 Sep 2016 18:22:19 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> From: Wanpeng Li Date: Tue, 6 Sep 2016 09:22:18 +0800 Message-ID: Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: the arch/x86 maintainers , "linux-kernel@vger.kernel.org" , Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , stable@vger.kernel.org Hi Andy, 2016-01-09 7:15 GMT+08:00 Andy Lutomirski : > When switch_mm activates a new pgd, it also sets a bit that tells > other CPUs that the pgd is in use so that tlb flush IPIs will be > sent. In order for that to work correctly, the bit needs to be > visible prior to loading the pgd and therefore starting to fill the > local TLB. > > Document all the barriers that make this work correctly and add a > couple that were missing. > > Cc: stable@vger.kernel.org > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++- > arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- > 2 files changed, 58 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index 379cd3658799..1edc9cd198b8 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > #endif > cpumask_set_cpu(cpu, mm_cpumask(next)); > > - /* Re-load page tables */ > + /* > + * Re-load page tables. > + * > + * This logic has an ordering constraint: > + * > + * CPU 0: Write to a PTE for 'next' > + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. > + * CPU 1: set bit 1 in next's mm_cpumask > + * CPU 1: load from the PTE that CPU 0 writes (implicit) > + * > + * We need to prevent an outcome in which CPU 1 observes > + * the new PTE value and CPU 0 observes bit 1 clear in > + * mm_cpumask. (If that occurs, then the IPI will never > + * be sent, and CPU 0's TLB will contain a stale entry.) I misunderstand this comments, CPU0 write to a PTE for 'next', and CPU0 observes bit 1 clear in mm_cpumask, so CPU0 won't kick IPI to CPU1, why CPU0's TLB will contain a stale entry instead of CPU1? Regards, Wanpeng Li -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756026AbcAHXPl (ORCPT ); Fri, 8 Jan 2016 18:15:41 -0500 Received: from mail.kernel.org ([198.145.29.136]:49567 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755770AbcAHXPi (ORCPT ); Fri, 8 Jan 2016 18:15:38 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Date: Fri, 8 Jan 2016 15:15:19 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Otherwise terrible things happen if some of the callbacks end up calling into KASAN in unexpected places. This has no obvious symptoms yet, but adding a memory reference to native_flush_tlb_global without this blows up on KASAN kernels. Signed-off-by: Andy Lutomirski --- arch/x86/kernel/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index b1b78ffe01d0..b7cd5bdf314b 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -19,6 +19,7 @@ endif KASAN_SANITIZE_head$(BITS).o := n KASAN_SANITIZE_dumpstack.o := n KASAN_SANITIZE_dumpstack_$(BITS).o := n +KASAN_SANITIZE_paravirt.o := n CFLAGS_irq.o := -I$(src)/../include/asm/trace -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756108AbcAHXPq (ORCPT ); Fri, 8 Jan 2016 18:15:46 -0500 Received: from mail.kernel.org ([198.145.29.136]:49640 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755814AbcAHXPm (ORCPT ); Fri, 8 Jan 2016 18:15:42 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings Date: Fri, 8 Jan 2016 15:15:22 -0800 Message-Id: <1ddac818ae831c89dc4f680bbf4bfac7b6ba2b4c.1452294700.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On my Skylake laptop, INVPCID function 2 (flush absolutely everything) takes about 376ns, whereas saving flags, twiddling CR4.PGE to flush global mappings, and restoring flags takes about 539ns. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 20fc38d8478a..4eba5164430d 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -145,6 +145,15 @@ static inline void __native_flush_tlb_global(void) { unsigned long flags; + if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { + /* + * Using INVPCID is considerably faster than a pair of writes + * to CR4 sandwiched inside an IRQ flag save/restore. + */ + invpcid_flush_everything(); + return; + } + /* * Read-modify-write to CR4 - protect it from preemption and * from interrupts. (Use the raw variant because this code can -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756158AbcAHXPt (ORCPT ); Fri, 8 Jan 2016 18:15:49 -0500 Received: from mail.kernel.org ([198.145.29.136]:49723 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756111AbcAHXPq (ORCPT ); Fri, 8 Jan 2016 18:15:46 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 07/13] x86/mm: Add nopcid to turn off PCID Date: Fri, 8 Jan 2016 15:15:25 -0800 Message-Id: <7752daf506becd3b36e768bd2561f562f64d9d16.1452294700.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The parameter is only present on x86_64 systems to save a few bytes, as PCID is always disabled on x86_32. Signed-off-by: Andy Lutomirski --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b34e55e00bae..2f61809706c5 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2544,6 +2544,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nopat [X86] Disable PAT (page attribute table extension of pagetables) support. + nopcid [X86-64] Disable the PCID cpu feature. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 48196980c1c7..7e1fc53a4ba5 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -162,6 +162,24 @@ static int __init x86_mpx_setup(char *s) } __setup("nompx", x86_mpx_setup); +#ifdef CONFIG_X86_64 +static int __init x86_pcid_setup(char *s) +{ + /* require an exact match without trailing characters */ + if (strlen(s)) + return 0; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_PCID)) + return 1; + + setup_clear_cpu_cap(X86_FEATURE_PCID); + pr_info("nopcid: PCID feature disabled\n"); + return 1; +} +__setup("nopcid", x86_pcid_setup); +#endif + static int __init x86_noinvpcid_setup(char *s) { /* noinvpcid doesn't accept parameters */ -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756225AbcAHXPw (ORCPT ); Fri, 8 Jan 2016 18:15:52 -0500 Received: from mail.kernel.org ([198.145.29.136]:49795 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756111AbcAHXPu (ORCPT ); Fri, 8 Jan 2016 18:15:50 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 10/13] x86/mm: Factor out remote TLB flushing Date: Fri, 8 Jan 2016 15:15:28 -0800 Message-Id: <357b13bf9d6d04e585894ff5dcf40fa14ea1d3a7.1452294700.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org There are three call sites that propagate TLB flushes, and they all do exactly the same thing. Factor the code out into a helper. Signed-off-by: Andy Lutomirski --- arch/x86/mm/tlb.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8f4cc3dfac32..b208a33571b0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -154,6 +154,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask, smp_call_function_many(cpumask, flush_tlb_func, &info, 1); } +static void propagate_tlb_flush(unsigned int this_cpu, + struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); +} + void flush_tlb_current_task(void) { struct mm_struct *mm = current->mm; @@ -166,8 +174,7 @@ void flush_tlb_current_task(void) local_flush_tlb(); trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + propagate_tlb_flush(smp_processor_id(), mm, 0UL, TLB_FLUSH_ALL); preempt_enable(); } @@ -231,8 +238,7 @@ out: start = 0UL; end = TLB_FLUSH_ALL; } - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, end); + propagate_tlb_flush(smp_processor_id(), mm, start, end); preempt_enable(); } @@ -257,8 +263,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) } } - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, 0UL); + propagate_tlb_flush(smp_processor_id(), mm, start, 0UL); preempt_enable(); } -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756335AbcAHXP5 (ORCPT ); Fri, 8 Jan 2016 18:15:57 -0500 Received: from mail.kernel.org ([198.145.29.136]:49846 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756227AbcAHXPx (ORCPT ); Fri, 8 Jan 2016 18:15:53 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 12/13] x86/mm: Uninline switch_mm Date: Fri, 8 Jan 2016 15:15:30 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It's fairly large and it has quite a few callers. This may also help untangle some headers down the road. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 93 +----------------------------------- arch/x86/mm/tlb.c | 97 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+), 91 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 1edc9cd198b8..05c4d0ab64bb 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -104,102 +104,13 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) #endif } -static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) -{ - unsigned cpu = smp_processor_id(); - - if (likely(prev != next)) { -#ifdef CONFIG_SMP - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - this_cpu_write(cpu_tlbstate.active_mm, next); -#endif - cpumask_set_cpu(cpu, mm_cpumask(next)); +extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk); - /* - * Re-load page tables. - * - * This logic has an ordering constraint: - * - * CPU 0: Write to a PTE for 'next' - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. - * CPU 1: set bit 1 in next's mm_cpumask - * CPU 1: load from the PTE that CPU 0 writes (implicit) - * - * We need to prevent an outcome in which CPU 1 observes - * the new PTE value and CPU 0 observes bit 1 clear in - * mm_cpumask. (If that occurs, then the IPI will never - * be sent, and CPU 0's TLB will contain a stale entry.) - * - * The bad outcome can occur if either CPU's load is - * reordered before that CPU's store, so both CPUs much - * execute full barriers to prevent this from happening. - * - * Thus, switch_mm needs a full barrier between the - * store to mm_cpumask and any operation that could load - * from next->pgd. This barrier synchronizes with - * remote TLB flushers. Fortunately, load_cr3 is - * serializing and thus acts as a full barrier. - * - */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - /* Stop flush ipis for the previous mm */ - cpumask_clear_cpu(cpu, mm_cpumask(prev)); - /* Load per-mm CR4 state */ - load_mm_cr4(next); -#ifdef CONFIG_MODIFY_LDT_SYSCALL - /* - * Load the LDT, if the LDT is different. - * - * It's possible that prev->context.ldt doesn't match - * the LDT register. This can happen if leave_mm(prev) - * was called and then modify_ldt changed - * prev->context.ldt but suppressed an IPI to this CPU. - * In this case, prev->context.ldt != NULL, because we - * never set context.ldt to NULL while the mm still - * exists. That means that next->context.ldt != - * prev->context.ldt, because mms never share an LDT. - */ - if (unlikely(prev->context.ldt != next->context.ldt)) - load_mm_ldt(next); -#endif - } -#ifdef CONFIG_SMP - else { - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); - - if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { - /* - * On established mms, the mm_cpumask is only changed - * from irq context, from ptep_clear_flush() while in - * lazy tlb mode, and here. Irqs are blocked during - * schedule, protecting us from simultaneous changes. - */ - cpumask_set_cpu(cpu, mm_cpumask(next)); - - /* - * We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload CR3 - * to make sure to use no freed page tables. - * - * As above, this is a barrier that forces - * TLB repopulation to be ordered after the - * store to mm_cpumask. - */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - load_mm_cr4(next); - load_mm_ldt(next); - } - } -#endif -} #define activate_mm(prev, next) \ do { \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 87fcc7a62e71..9790c9338e52 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -59,6 +59,103 @@ void leave_mm(int cpu) } EXPORT_SYMBOL_GPL(leave_mm); +void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) +{ + unsigned cpu = smp_processor_id(); + + if (likely(prev != next)) { +#ifdef CONFIG_SMP + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + this_cpu_write(cpu_tlbstate.active_mm, next); +#endif + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * Re-load page tables. + * + * This logic has an ordering constraint: + * + * CPU 0: Write to a PTE for 'next' + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. + * CPU 1: set bit 1 in next's mm_cpumask + * CPU 1: load from the PTE that CPU 0 writes (implicit) + * + * We need to prevent an outcome in which CPU 1 observes + * the new PTE value and CPU 0 observes bit 1 clear in + * mm_cpumask. (If that occurs, then the IPI will never + * be sent, and CPU 0's TLB will contain a stale entry.) + * + * The bad outcome can occur if either CPU's load is + * reordered before that CPU's store, so both CPUs much + * execute full barriers to prevent this from happening. + * + * Thus, switch_mm needs a full barrier between the + * store to mm_cpumask and any operation that could load + * from next->pgd. This barrier synchronizes with + * remote TLB flushers. Fortunately, load_cr3 is + * serializing and thus acts as a full barrier. + * + */ + load_cr3(next->pgd); + + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + + /* Stop flush ipis for the previous mm */ + cpumask_clear_cpu(cpu, mm_cpumask(prev)); + + /* Load per-mm CR4 state */ + load_mm_cr4(next); + +#ifdef CONFIG_MODIFY_LDT_SYSCALL + /* + * Load the LDT, if the LDT is different. + * + * It's possible that prev->context.ldt doesn't match + * the LDT register. This can happen if leave_mm(prev) + * was called and then modify_ldt changed + * prev->context.ldt but suppressed an IPI to this CPU. + * In this case, prev->context.ldt != NULL, because we + * never set context.ldt to NULL while the mm still + * exists. That means that next->context.ldt != + * prev->context.ldt, because mms never share an LDT. + */ + if (unlikely(prev->context.ldt != next->context.ldt)) + load_mm_ldt(next); +#endif + } +#ifdef CONFIG_SMP + else { + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); + + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { + /* + * On established mms, the mm_cpumask is only changed + * from irq context, from ptep_clear_flush() while in + * lazy tlb mode, and here. Irqs are blocked during + * schedule, protecting us from simultaneous changes. + */ + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * We were in lazy tlb mode and leave_mm disabled + * tlb flush IPI delivery. We must reload CR3 + * to make sure to use no freed page tables. + * + * As above, this is a barrier that forces + * TLB repopulation to be ordered after the + * store to mm_cpumask. + */ + load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + load_mm_cr4(next); + load_mm_ldt(next); + } + } +#endif +} + /* * The flush IPI assumes that a thread switch happens in this order: * [cpu0: the cpu that switches] -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756483AbcAHXQH (ORCPT ); Fri, 8 Jan 2016 18:16:07 -0500 Received: from mail.kernel.org ([198.145.29.136]:49893 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756237AbcAHXP4 (ORCPT ); Fri, 8 Jan 2016 18:15:56 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Date: Fri, 8 Jan 2016 15:15:31 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org PCID is a "process context ID" -- it's what other architectures call an address space ID. Every non-global TLB entry is tagged with a PCID, only TLB entries that match the currently selected PCID are used, and we can switch PGDs without flushing the TLB. x86's PCID is 12 bits. This is an unorthodox approach to using PCID. x86's PCID is far too short to uniquely identify a process, and we can't even really uniquely identify a running process because there are monster systems with over 4096 CPUs. To make matters worse, past attempts to use all 12 PCID bits have resulted in slowdowns instead of speedups. This patch uses PCID differently. We use a PCID to identify a recently-used mm on a per-cpu basis. An mm has no fixed PCID binding at all; instead, we give it a fresh PCID each time it's loaded except in cases where we want to preserve the TLB, in which case we reuse a recent value. In particular, we use PCIDs 1-7 for recently-used mms and we reserve PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI). Nothing ever switches to PCID 0 without flushing PCID 0 non-global pages, so PCID 0 conflicts won't cause problems. This also leaves the door open for UDEREF-style address space switching: the kernel will use PCID 0, and exits could simply switch back. (As a practical matter, an in-tree implementation of that feature would probably forego the full syscall fast path and just invoke some or all of switch_mm in prepare_exit_to_usermode.) This seems to save about 100ns on context switches between mms. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu.h | 7 +- arch/x86/include/asm/tlbflush.h | 18 ++++ arch/x86/kernel/cpu/common.c | 4 + arch/x86/kernel/ldt.c | 2 + arch/x86/kernel/process_64.c | 2 + arch/x86/mm/tlb.c | 195 +++++++++++++++++++++++++++++++++++++--- 6 files changed, 213 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 55234d5e7160..adb958d41bde 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -5,10 +5,13 @@ #include /* - * The x86 doesn't have a mmu context, but - * we put the segment information here. + * x86 has an MMU context if PCID is enabled, and x86 also has arch-specific + * MMU state beyond what lives in mm_struct. */ typedef struct { + /* See arch/x86/mm/tlb.c for details. */ + struct cpumask pcid_live_cpus; + #ifdef CONFIG_MODIFY_LDT_SYSCALL struct ldt_struct *ldt; #endif diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 32e3d8769a22..407c6f5dd4a6 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -56,6 +56,13 @@ static inline void invpcid_flush_all_nonglobals(void) #define CR3_PCID_MASK 0ull #endif +#ifdef CONFIG_SMP +extern void zap_old_pcids(void); +#else +/* Until PCID is implemented for !SMP, there's nothing to do. */ +static inline void zap_old_pcids(void) {} +#endif + #ifdef CONFIG_PARAVIRT #include #else @@ -195,6 +202,8 @@ static inline void __flush_tlb_all(void) __flush_tlb_global(); else __flush_tlb(); + + zap_old_pcids(); } static inline void __flush_tlb_one(unsigned long addr) @@ -238,6 +247,7 @@ static inline void flush_tlb_all(void) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); __flush_tlb_all(); + zap_old_pcids(); } static inline void flush_tlb(void) @@ -254,6 +264,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm) { if (mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void flush_tlb_page(struct vm_area_struct *vma, @@ -261,6 +273,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, { if (vma->vm_mm == current->active_mm) __flush_tlb_one(addr); + else + zap_old_pcids(); } static inline void flush_tlb_range(struct vm_area_struct *vma, @@ -268,6 +282,8 @@ static inline void flush_tlb_range(struct vm_area_struct *vma, { if (vma->vm_mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void flush_tlb_mm_range(struct mm_struct *mm, @@ -275,6 +291,8 @@ static inline void flush_tlb_mm_range(struct mm_struct *mm, { if (mm == current->active_mm) __flush_tlb_up(); + else + zap_old_pcids(); } static inline void native_flush_tlb_others(const struct cpumask *cpumask, diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 7e1fc53a4ba5..00bdf5806566 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -953,6 +953,10 @@ static void identify_cpu(struct cpuinfo_x86 *c) setup_smep(c); setup_smap(c); + /* Enable PCID if available. */ + if (cpu_has(c, X86_FEATURE_PCID)) + cr4_set_bits(X86_CR4_PCIDE); + /* * The vendor-specific functions might have changed features. * Now we do "generic changes." diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 6acc9dd91f36..3d73c0ddc773 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -109,6 +109,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) struct mm_struct *old_mm; int retval = 0; + cpumask_clear(&mm->context.pcid_live_cpus); + mutex_init(&mm->context.lock); old_mm = current->mm; if (!old_mm) { diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index e835d263a33b..2cdb3ba715e6 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -578,6 +578,8 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr) break; } + __flush_tlb(); + invpcid_flush_all_nonglobals(); default: ret = -EINVAL; break; diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9790c9338e52..eb84240b8c92 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -28,6 +28,92 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */ +/* + * An x86 PCID is what everyone else calls an ASID. TLB entries on each + * CPU are tagged with a PCID, and the current CR3 value stores a PCID. + * Switching CR3 can change the PCID and can optionally flush all TLB + * entries associated with the new PCID. + * + * The PCID is 12 bits, but we don't use anywhere near that many. + * + * The guiding principle of this code is that TLB entries that have + * survived more than a small number of context switches are mostly + * useless, so we don't try very hard not to evict them. + * + * PCID 0 is used for swapper_pg_dir and for any other special PGDs that + * are loaded directly by PCID-naive users of load_cr3. (This includes + * UEFI runtime services, etc.) Note that we never switch to PCID 0 with + * the preserve flag set -- the only TLB entries with PCID == 0 that are + * worth preserving have the global flag set. + * + * PCIDs 1 through NR_PCIDS are used for real user mms. + */ + +#define NR_DYNAMIC_PCIDS 7 + +struct pcid_cpu_entry { + const struct mm_struct *mm; +}; + +struct pcid_cpu_state { + /* This entire data structure fits in one cache line. */ + struct pcid_cpu_entry pcids[NR_DYNAMIC_PCIDS]; /* starts with PCID 1 */ + unsigned long next_pcid_minus_two; +}; + +static DEFINE_PER_CPU_ALIGNED(struct pcid_cpu_state, pcid_cpu_state); + +/* + * Allocate and return a fresh PCID for mm on this CPU. + */ +static unsigned long allocate_pcid(const struct mm_struct *mm) +{ + unsigned long pcid; + + pcid = this_cpu_add_return(pcid_cpu_state.next_pcid_minus_two, 1) + 1; + if (pcid > NR_DYNAMIC_PCIDS) { + pcid = 1; + this_cpu_write(pcid_cpu_state.next_pcid_minus_two, 0); + } + + this_cpu_write(pcid_cpu_state.pcids[pcid-1].mm, mm); + + /* + * We don't bother setting our cpu's bit on pcid_live_cpus. Any + * remote CPU that needs to shoot down one of our TLB entries will + * do it via IPI because we're all the way live. We'll take care + * of pcid_live_cpus when we remove ourselves from mm_cpumask. + */ + return pcid; +} + +/* + * Finds the PCID for the given pgd on this cpu. If that PCID was last + * used by this mm, the high bit will be set in the return value. Otherwise + * we claim ownership of the PCID and return the PCID with the high bit + * clear. + * + * This function must not be called if pgd has never been loaded on this + * CPU. Otherwise we might return a PCID allocated to a dead mm whose pgd + * page has been reused. + */ +static unsigned long choose_pcid(struct mm_struct *mm) +{ + unsigned long pcid; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return 0; + + if (cpumask_test_cpu(smp_processor_id(), &mm->context.pcid_live_cpus)) { + for (pcid = 0; pcid < NR_DYNAMIC_PCIDS; pcid++) { + if (this_cpu_read(pcid_cpu_state.pcids[pcid].mm) == mm) + return (pcid + 1) | (1UL << 63); + } + } + + return allocate_pcid(mm); +} + #ifdef CONFIG_SMP struct flush_tlb_info { @@ -37,6 +123,55 @@ struct flush_tlb_info { }; /* + * This effectively invalidates non-global mappings belonging to non-current + * PCIDs on the calling CPU. Rather than doing NR_DYNAMIC_PCIDS-1 INVPCIDs, + * it invalidates the mm-to-PCID mappings. + */ +void zap_old_pcids(void) +{ + struct mm_struct *active_mm; + int i; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return; + + active_mm = this_cpu_read(cpu_tlbstate.active_mm); + + for (i = 0; i < NR_DYNAMIC_PCIDS; i++) + if (this_cpu_read(pcid_cpu_state.pcids[i].mm) != active_mm) + this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL); +} +EXPORT_SYMBOL(zap_old_pcids); + +static void zap_local_inactive_mm(struct mm_struct *mm) +{ + int i; + + if (!static_cpu_has(X86_FEATURE_PCID)) + return; + + for (i = 0; i < NR_DYNAMIC_PCIDS; i++) { + if (this_cpu_read(pcid_cpu_state.pcids[i].mm) == mm) { + this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL); + return; + } + } +} + +static void stop_tlbflush_ipis(int cpu, struct mm_struct *mm) +{ + /* + * Stop flush ipis for the previous mm. First mark us live in + * the PCID cache. We need our store to pcid_live_cpus to + * happen before remote CPUs stop sending us IPIs; the barrier + * here synchronizes with the barrier in flush_tlb_remote. + */ + cpumask_set_cpu(cpu, &mm->context.pcid_live_cpus); + smp_mb__before_atomic(); + cpumask_clear_cpu(cpu, mm_cpumask(mm)); +} + +/* * We cannot call mmdrop() because we are in interrupt context, * instead update mm->cpu_vm_mask. */ @@ -46,7 +181,7 @@ void leave_mm(int cpu) if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { - cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); + stop_tlbflush_ipis(cpu, active_mm); load_cr3(swapper_pg_dir); /* * This gets called in the idle path where RCU @@ -63,6 +198,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { unsigned cpu = smp_processor_id(); + unsigned long pcid; if (likely(prev != next)) { #ifdef CONFIG_SMP @@ -76,9 +212,6 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * * This logic has an ordering constraint: * - * CPU 0: Write to a PTE for 'next' - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. - * CPU 1: set bit 1 in next's mm_cpumask * CPU 1: load from the PTE that CPU 0 writes (implicit) * * We need to prevent an outcome in which CPU 1 observes @@ -97,12 +230,14 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * serializing and thus acts as a full barrier. * */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + pcid = choose_pcid(next); + write_cr3(__pa(next->pgd) | pcid); - /* Stop flush ipis for the previous mm */ - cpumask_clear_cpu(cpu, mm_cpumask(prev)); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, + (pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL); + + stop_tlbflush_ipis(cpu, prev); /* Load per-mm CR4 state */ load_mm_cr4(next); @@ -146,9 +281,18 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * As above, this is a barrier that forces * TLB repopulation to be ordered after the * store to mm_cpumask. + + * + * XXX: speedup possibility: if we end up preserving + * PCID data, then the write_cr3 is a no-op. */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + pcid = choose_pcid(next); + write_cr3(__pa(next->pgd) | pcid); + + trace_tlb_flush( + TLB_FLUSH_ON_TASK_SWITCH, + (pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL); +; load_mm_cr4(next); load_mm_ldt(next); } @@ -203,8 +347,24 @@ static void flush_tlb_func(void *info) inc_irq_stat(irq_tlb_count); - if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm)) + /* + * After all relevant CPUs call flush_tlb_func, pcid_live_cpus + * will be clear. That's not enough, though: if the mm we're + * targeting isn't active, we won't directly flush the TLB, but + * we need to guarantee that the mm won't be reactivated and + * therefore reinstate stale entries prior to cpumask_clear. + * + * Solve this problem by brute force: if we don't flush the TLB + * directly, zap the PCID mapping. (We zap it using + * pcid_cpu_state instead of pcid_live_cpus to avoid excessive + * cacheline bounding.) + */ + + if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm)) { + zap_local_inactive_mm(f->flush_mm); return; + } + if (!f->flush_end) f->flush_end = f->flush_start + PAGE_SIZE; @@ -224,9 +384,10 @@ static void flush_tlb_func(void *info) } trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages); } - } else + } else { leave_mm(smp_processor_id()); - + zap_local_inactive_mm(f->flush_mm); + } } void native_flush_tlb_others(const struct cpumask *cpumask, @@ -259,6 +420,14 @@ static void propagate_tlb_flush(unsigned int this_cpu, { if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); + /* + * Synchronize with barrier in stop_tlbflush_ipis; cpumask_clear + * must not be overridden by the pcid_live_cpus write in + * stop_tlbflush_ipis unless we sent an IPI. + */ + smp_wmb(); + + cpumask_clear(&mm->context.pcid_live_cpus); } void flush_tlb_current_task(void) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756633AbcAHXQf (ORCPT ); Fri, 8 Jan 2016 18:16:35 -0500 Received: from mail.kernel.org ([198.145.29.136]:49817 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756177AbcAHXPv (ORCPT ); Fri, 8 Jan 2016 18:15:51 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Date: Fri, 8 Jan 2016 15:15:29 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently all of the functions that live in tlb.c are inlined on !SMP builds. One can debate whether this is a good idea (in many respects the code in tlb.c is better than the inlined UP code). Regardless, I want to add code that needs to be built on UP and SMP kernels and relates to tlb flushing, so arrange for tlb.c to be compiled unconditionally. Signed-off-by: Andy Lutomirski --- arch/x86/mm/Makefile | 3 +-- arch/x86/mm/tlb.c | 4 ++++ 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 65c47fda26fc..1ae7c141f778 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -1,5 +1,5 @@ obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \ - pat.o pgtable.o physaddr.o gup.o setup_nx.o + pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o # Make sure __phys_addr has no stackprotector nostackp := $(call cc-option, -fno-stack-protector) @@ -9,7 +9,6 @@ CFLAGS_setup_nx.o := $(nostackp) CFLAGS_fault.o := -I$(src)/../include/asm/trace obj-$(CONFIG_X86_PAT) += pat_rbtree.o -obj-$(CONFIG_SMP) += tlb.o obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index b208a33571b0..87fcc7a62e71 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -28,6 +28,8 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */ +#ifdef CONFIG_SMP + struct flush_tlb_info { struct mm_struct *flush_mm; unsigned long flush_start; @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void) return 0; } late_initcall(create_tlb_single_page_flush_ceiling); + +#endif /* CONFIG_SMP */ -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756681AbcAHXRI (ORCPT ); Fri, 8 Jan 2016 18:17:08 -0500 Received: from mail.kernel.org ([198.145.29.136]:49771 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756127AbcAHXPs (ORCPT ); Fri, 8 Jan 2016 18:15:48 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Date: Fri, 8 Jan 2016 15:15:27 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 3d905f12cda9..32e3d8769a22 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -135,7 +135,17 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) static inline void __native_flush_tlb(void) { + unsigned long flags; + + /* + * We mustn't be preempted or handle an IPI while reading and + * writing CR3. Preemption could switch mms and switch back, and + * an IPI could call leave_mm. Either of those could cause our + * PCID to change asynchronously. + */ + raw_local_irq_save(flags); native_write_cr3(native_read_cr3()); + raw_local_irq_restore(flags); } static inline void __native_flush_tlb_global_irq_disabled(void) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756716AbcAHXRd (ORCPT ); Fri, 8 Jan 2016 18:17:33 -0500 Received: from mail.kernel.org ([198.145.29.136]:49750 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755814AbcAHXPr (ORCPT ); Fri, 8 Jan 2016 18:15:47 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 08/13] x86/mm: Teach CR3 readers about PCID Date: Fri, 8 Jan 2016 15:15:26 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The kernel has several code paths that read CR3. Most of them assume that CR3 contains the PGD's physical address, whereas some of them awkwardly use PHYSICAL_PAGE_MASK to mask off low bits. Add explicit mask macros for CR3 and convert all of the CR3 readers. This will keep them from breaking when PCID is enabled. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 8 ++++++++ arch/x86/kernel/head64.c | 3 ++- arch/x86/mm/fault.c | 8 ++++---- 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4eba5164430d..3d905f12cda9 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -48,6 +48,14 @@ static inline void invpcid_flush_all_nonglobals(void) __invpcid(0, 0, 3); } +#ifdef CONFIG_X86_64 +#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull +#define CR3_PCID_MASK 0xFFFull +#else +#define CR3_ADDR_MASK 0xFFFFF000ull +#define CR3_PCID_MASK 0ull +#endif + #ifdef CONFIG_PARAVIRT #include #else diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index f129a9af6357..3d075ac01a47 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -60,7 +60,8 @@ int __init early_make_pgtable(unsigned long address) pmdval_t pmd, *pmd_p; /* Invalid address or early pgt is done ? */ - if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt)) + if (physaddr >= MAXMEM || + (read_cr3() & CR3_ADDR_MASK) != __pa_nodebug(early_level4_pgt)) return -1; again: diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index eef44d9a3f77..9ceae2dc9be1 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -282,7 +282,7 @@ static noinline int vmalloc_fault(unsigned long address) * Do _not_ use "current" here. We might be inside * an interrupt in the middle of a task switch.. */ - pgd_paddr = read_cr3(); + pgd_paddr = read_cr3() & CR3_ADDR_MASK; pmd_k = vmalloc_sync_one(__va(pgd_paddr), address); if (!pmd_k) return -1; @@ -321,7 +321,7 @@ static bool low_pfn(unsigned long pfn) static void dump_pagetable(unsigned long address) { - pgd_t *base = __va(read_cr3()); + pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK); pgd_t *pgd = &base[pgd_index(address)]; pmd_t *pmd; pte_t *pte; @@ -459,7 +459,7 @@ static int bad_address(void *p) static void dump_pagetable(unsigned long address) { - pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK); + pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK); pgd_t *pgd = base + pgd_index(address); pud_t *pud; pmd_t *pmd; @@ -595,7 +595,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, pgd_t *pgd; pte_t *pte; - pgd = __va(read_cr3() & PHYSICAL_PAGE_MASK); + pgd = __va(read_cr3() & CR3_ADDR_MASK); pgd += pgd_index(address); pte = lookup_address_in_pgd(pgd, address, &level); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756737AbcAHXRz (ORCPT ); Fri, 8 Jan 2016 18:17:55 -0500 Received: from mail.kernel.org ([198.145.29.136]:49689 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756064AbcAHXPp (ORCPT ); Fri, 8 Jan 2016 18:15:45 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels Date: Fri, 8 Jan 2016 15:15:24 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 32-bit kernels on new hardware will see PCID in CPUID, but PCID can only be used in 64-bit mode. Rather than making all PCID code conditional, just disable the feature on 32-bit builds. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/disabled-features.h | 4 +++- arch/x86/kernel/cpu/bugs.c | 6 ++++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index f226df064660..8b17c2ad1048 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -21,11 +21,13 @@ # define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31)) # define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31)) # define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31)) +# define DISABLE_PCID 0 #else # define DISABLE_VME 0 # define DISABLE_K6_MTRR 0 # define DISABLE_CYRIX_ARR 0 # define DISABLE_CENTAUR_MCR 0 +# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31)) #endif /* CONFIG_X86_64 */ /* @@ -35,7 +37,7 @@ #define DISABLED_MASK1 0 #define DISABLED_MASK2 0 #define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR) -#define DISABLED_MASK4 0 +#define DISABLED_MASK4 (DISABLE_PCID) #define DISABLED_MASK5 0 #define DISABLED_MASK6 0 #define DISABLED_MASK7 0 diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index bd17db15a2c1..741d107e5376 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -19,6 +19,12 @@ void __init check_bugs(void) { + /* + * Regardless of whether PCID is enumerated, it can't be used in + * 32-bit mode. + */ + setup_clear_cpu_cap(X86_FEATURE_PCID); + identify_boot_cpu(); #ifndef CONFIG_SMP pr_info("CPU: "); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756756AbcAHXSQ (ORCPT ); Fri, 8 Jan 2016 18:18:16 -0500 Received: from mail.kernel.org ([198.145.29.136]:49668 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756029AbcAHXPo (ORCPT ); Fri, 8 Jan 2016 18:15:44 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski , stable@vger.kernel.org Subject: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Date: Fri, 8 Jan 2016 15:15:23 -0800 Message-Id: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When switch_mm activates a new pgd, it also sets a bit that tells other CPUs that the pgd is in use so that tlb flush IPIs will be sent. In order for that to work correctly, the bit needs to be visible prior to loading the pgd and therefore starting to fill the local TLB. Document all the barriers that make this work correctly and add a couple that were missing. Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++- arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- 2 files changed, 58 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 379cd3658799..1edc9cd198b8 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, #endif cpumask_set_cpu(cpu, mm_cpumask(next)); - /* Re-load page tables */ + /* + * Re-load page tables. + * + * This logic has an ordering constraint: + * + * CPU 0: Write to a PTE for 'next' + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. + * CPU 1: set bit 1 in next's mm_cpumask + * CPU 1: load from the PTE that CPU 0 writes (implicit) + * + * We need to prevent an outcome in which CPU 1 observes + * the new PTE value and CPU 0 observes bit 1 clear in + * mm_cpumask. (If that occurs, then the IPI will never + * be sent, and CPU 0's TLB will contain a stale entry.) + * + * The bad outcome can occur if either CPU's load is + * reordered before that CPU's store, so both CPUs much + * execute full barriers to prevent this from happening. + * + * Thus, switch_mm needs a full barrier between the + * store to mm_cpumask and any operation that could load + * from next->pgd. This barrier synchronizes with + * remote TLB flushers. Fortunately, load_cr3 is + * serializing and thus acts as a full barrier. + * + */ load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); /* Stop flush ipis for the previous mm */ @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, * schedule, protecting us from simultaneous changes. */ cpumask_set_cpu(cpu, mm_cpumask(next)); + /* * We were in lazy tlb mode and leave_mm disabled * tlb flush IPI delivery. We must reload CR3 * to make sure to use no freed page tables. + * + * As above, this is a barrier that forces + * TLB repopulation to be ordered after the + * store to mm_cpumask. */ load_cr3(next->pgd); trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8ddb5d0d66fb..8f4cc3dfac32 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -161,7 +161,10 @@ void flush_tlb_current_task(void) preempt_disable(); count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); + + /* This is an implicit full barrier that synchronizes with switch_mm. */ local_flush_tlb(); + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long base_pages_to_flush = TLB_FLUSH_ALL; preempt_disable(); - if (current->active_mm != mm) + if (current->active_mm != mm) { + /* Synchronize with switch_mm. */ + smp_mb(); + goto out; + } if (!current->mm) { leave_mm(smp_processor_id()); + + /* Synchronize with switch_mm. */ + smp_mb(); + goto out; } if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) base_pages_to_flush = (end - start) >> PAGE_SHIFT; + /* + * Both branches below are implicit full barriers (MOV to CR or + * INVLPG) that synchronize with switch_mm. + */ if (base_pages_to_flush > tlb_single_page_flush_ceiling) { base_pages_to_flush = TLB_FLUSH_ALL; count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) preempt_disable(); if (current->active_mm == mm) { - if (current->mm) + if (current->mm) { + /* + * Implicit full barrier (INVLPG) that synchronizes + * with switch_mm. + */ __flush_tlb_one(start); - else + } else { leave_mm(smp_processor_id()); + + /* Synchronize with switch_mm. */ + smp_mb(); + } } if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756772AbcAHXSj (ORCPT ); Fri, 8 Jan 2016 18:18:39 -0500 Received: from mail.kernel.org ([198.145.29.136]:49610 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755752AbcAHXPk (ORCPT ); Fri, 8 Jan 2016 18:15:40 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID Date: Fri, 8 Jan 2016 15:15:21 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Signed-off-by: Andy Lutomirski --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 16 ++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 742f69d18fc8..b34e55e00bae 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2508,6 +2508,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nointroute [IA-64] + noinvpcid [X86] Disable the INVPCID cpu feature. + nojitter [IA-64] Disables jitter checking for ITC timers. no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index c2b7522cbf35..48196980c1c7 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -162,6 +162,22 @@ static int __init x86_mpx_setup(char *s) } __setup("nompx", x86_mpx_setup); +static int __init x86_noinvpcid_setup(char *s) +{ + /* noinvpcid doesn't accept parameters */ + if (s) + return -EINVAL; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_INVPCID)) + return 0; + + setup_clear_cpu_cap(X86_FEATURE_INVPCID); + pr_info("noinvpcid: INVPCID feature disabled\n"); + return 0; +} +early_param("noinvpcid", x86_noinvpcid_setup); + #ifdef CONFIG_X86_32 static int cachesize_override = -1; static int disable_x86_serial_nr = 1; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755960AbcAHXPj (ORCPT ); Fri, 8 Jan 2016 18:15:39 -0500 Received: from mail.kernel.org ([198.145.29.136]:49546 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755752AbcAHXPh (ORCPT ); Fri, 8 Jan 2016 18:15:37 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 00/13] x86/mm: PCID and INVPCID Date: Fri, 8 Jan 2016 15:15:18 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Here's my PCID and INVPCID work-in-progress. It seems to work well enough to play with it. (That is, I'm not aware of anything wrong with it, although it may eat your data.) PCID and INVPCID use are orthogonal here. INVPCID is a straightforward speedup for global TLB flushes. Other than that, I don't use INVPCID at all, since it seems slower than just manipulating CR3 carefully, at least on my Skylake laptop. Please play around and suggest (and run?) good benchmarks. It seems to save around 100ns on cross-process context switches for me. Unfortunately, we suck at context switches in general, so this is, at best, a little over a 10% speedup. Most of the time is spent in the scheduler, not in arch code. Andy Lutomirski (13): x86/paravirt: Turn KASAN off for parvirt.o x86/mm: Add INVPCID helpers x86/mm: Add a noinvpcid option to turn off INVPCID x86/mm: If INVPCID is available, use it to flush global mappings x86/mm: Add barriers and document switch_mm-vs-flush synchronization x86/mm: Disable PCID on 32-bit kernels x86/mm: Add nopcid to turn off PCID x86/mm: Teach CR3 readers about PCID x86/mm: Disable interrupts when flushing the TLB using CR3 x86/mm: Factor out remote TLB flushing x86/mm: Build arch/x86/mm/tlb.c even on !SMP x86/mm: Uninline switch_mm x86/mm: Try to preserve old TLB entries using PCID Documentation/kernel-parameters.txt | 4 + arch/x86/include/asm/disabled-features.h | 4 +- arch/x86/include/asm/mmu.h | 7 +- arch/x86/include/asm/mmu_context.h | 62 +----- arch/x86/include/asm/tlbflush.h | 86 ++++++++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/cpu/bugs.c | 6 + arch/x86/kernel/cpu/common.c | 38 ++++ arch/x86/kernel/head64.c | 3 +- arch/x86/kernel/ldt.c | 2 + arch/x86/kernel/process_64.c | 2 + arch/x86/mm/Makefile | 3 +- arch/x86/mm/fault.c | 8 +- arch/x86/mm/tlb.c | 324 +++++++++++++++++++++++++++++-- 14 files changed, 467 insertions(+), 83 deletions(-) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932379AbcAHXTF (ORCPT ); Fri, 8 Jan 2016 18:19:05 -0500 Received: from mail.kernel.org ([198.145.29.136]:49587 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755904AbcAHXPj (ORCPT ); Fri, 8 Jan 2016 18:15:39 -0500 From: Andy Lutomirski To: x86@kernel.org, linux-kernel@vger.kernel.org Cc: Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andy Lutomirski Subject: [RFC 02/13] x86/mm: Add INVPCID helpers Date: Fri, 8 Jan 2016 15:15:20 -0800 Message-Id: <21ffa5af2491feec9927446451e3d23548a7cbc0.1452294700.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This adds helpers for each of the four currently-specified INVPCID modes. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 6df2029405a3..20fc38d8478a 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -7,6 +7,47 @@ #include #include +static inline void __invpcid(unsigned long pcid, unsigned long addr, + unsigned long type) +{ + u64 desc[2] = { pcid, addr }; + + /* + * The memory clobber is because the whole point is to invalidate + * stale TLB entries and, especially if we're flushing global + * mappings, we don't want the compiler to reorder any subsequent + * memory accesses before the TLB flush. + */ + asm volatile ( + ".byte 0x66, 0x0f, 0x38, 0x82, 0x01" /* invpcid (%cx), %ax */ + : : "m" (desc), "a" (type), "c" (desc) : "memory"); +} + +/* Flush all mappings for a given pcid and addr, not including globals. */ +static inline void invpcid_flush_one(unsigned long pcid, + unsigned long addr) +{ + __invpcid(pcid, addr, 0); +} + +/* Flush all mappings for a given PCID, not including globals. */ +static inline void invpcid_flush_single_context(unsigned long pcid) +{ + __invpcid(pcid, 0, 1); +} + +/* Flush all mappings, including globals, for all PCIDs. */ +static inline void invpcid_flush_everything(void) +{ + __invpcid(0, 0, 2); +} + +/* Flush all mappings for all PCIDs except globals. */ +static inline void invpcid_flush_all_nonglobals(void) +{ + __invpcid(0, 0, 3); +} + #ifdef CONFIG_PARAVIRT #include #else -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755737AbcAHXbG (ORCPT ); Fri, 8 Jan 2016 18:31:06 -0500 Received: from mail-ig0-f175.google.com ([209.85.213.175]:37074 "EHLO mail-ig0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750835AbcAHXbF (ORCPT ); Fri, 8 Jan 2016 18:31:05 -0500 MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:31:03 -0800 X-Google-Sender-Auth: 0XHlwkitBWtPLCjLOVZ0vj9AYIw Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID From: Linus Torvalds To: Andy Lutomirski Cc: "the arch/x86 maintainers" , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > > Please play around and suggest (and run?) good benchmarks. It seems > to save around 100ns on cross-process context switches for me. Interesting. There was reportedly (I never saw it) a test-patch to use pcids inside of Intel a couple of years ago, and it never got outside because it didn't make a difference. Either things have changed (newer hardware with more pcids perhaps?) or you did a better job at it. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755766AbcAHXgg (ORCPT ); Fri, 8 Jan 2016 18:36:36 -0500 Received: from mail-ob0-f180.google.com ([209.85.214.180]:32908 "EHLO mail-ob0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755044AbcAHXgd (ORCPT ); Fri, 8 Jan 2016 18:36:33 -0500 MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 8 Jan 2016 15:36:13 -0800 Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID To: Linus Torvalds Cc: Andy Lutomirski , "the arch/x86 maintainers" , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 3:31 PM, Linus Torvalds wrote: > On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: >> >> Please play around and suggest (and run?) good benchmarks. It seems >> to save around 100ns on cross-process context switches for me. > > Interesting. There was reportedly (I never saw it) a test-patch to use > pcids inside of Intel a couple of years ago, and it never got outside > because it didn't make a difference. I have a copy of that patch, and my code works very differently. I only use 3 bits of PCID in this series. I could probably reduce that to 2 with little loss. 4 or more would be a waste. > > Either things have changed (newer hardware with more pcids perhaps?) > or you did a better job at it. On my Skylake laptop, all of the PCID bits appear to have at least some effect. Whether this means it gets hashed or whether this means that all of the bits are real, I don't know. I'll fiddle with it on an older machine. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932226AbcAHXlN (ORCPT ); Fri, 8 Jan 2016 18:41:13 -0500 Received: from mail-ig0-f180.google.com ([209.85.213.180]:33287 "EHLO mail-ig0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756073AbcAHXlL (ORCPT ); Fri, 8 Jan 2016 18:41:11 -0500 MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:41:10 -0800 X-Google-Sender-Auth: HEuhTgT14k7ArFSrfy7UTVCGSvQ Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 From: Linus Torvalds To: Andy Lutomirski Cc: "the arch/x86 maintainers" , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > + /* > + * We mustn't be preempted or handle an IPI while reading and > + * writing CR3. Preemption could switch mms and switch back, and > + * an IPI could call leave_mm. Either of those could cause our > + * PCID to change asynchronously. > + */ > + raw_local_irq_save(flags); > native_write_cr3(native_read_cr3()); > + raw_local_irq_restore(flags); This seems sad for two reasons: - it adds unnecessary overhead on non-pcid setups (32-bit being an example of that) - on pcid setups, wouldn't invpcid_flush_single_context() be better? So on the whole I hate it. Why isn't this something like if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { invpcid_flush_single_context(); return; } native_write_cr3(native_read_cr3()); *without* any flag saving crud? And yes, that means that we'd require X86_FEATURE_INVPCID in order to use X86_FEATURE_PCID, but that seems fine. Or is there some reason you wanted the odd flags version? If so, that should be documented. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932439AbcAHXme (ORCPT ); Fri, 8 Jan 2016 18:42:34 -0500 Received: from mail-ig0-f179.google.com ([209.85.213.179]:35666 "EHLO mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751191AbcAHXmd (ORCPT ); Fri, 8 Jan 2016 18:42:33 -0500 MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 15:42:32 -0800 X-Google-Sender-Auth: TcigFUhYS0rZbW3UN0-rUPFUigQ Message-ID: Subject: Re: [RFC 00/13] x86/mm: PCID and INVPCID From: Linus Torvalds To: Andy Lutomirski Cc: Andy Lutomirski , "the arch/x86 maintainers" , Linux Kernel Mailing List , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 3:36 PM, Andy Lutomirski wrote: >> >> Either things have changed (newer hardware with more pcids perhaps?) >> or you did a better job at it. > > On my Skylake laptop, all of the PCID bits appear to have at least > some effect. Whether this means it gets hashed or whether this means > that all of the bits are real, I don't know. They have always gotten hashed, and no the bits aren't real - hardware doesn't actually have as many bits in the pcid as there are in cr3. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756414AbcAHXzz (ORCPT ); Fri, 8 Jan 2016 18:55:55 -0500 Received: from mga11.intel.com ([192.55.52.93]:56777 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756312AbcAHXzy (ORCPT ); Fri, 8 Jan 2016 18:55:54 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,541,1444719600"; d="scan'208";a="886845871" Subject: Re: [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP To: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org References: Cc: Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" From: Dave Hansen Message-ID: <56904C89.8050205@linux.intel.com> Date: Fri, 8 Jan 2016 15:55:53 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/08/2016 03:15 PM, Andy Lutomirski wrote: > @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void) > return 0; > } > late_initcall(create_tlb_single_page_flush_ceiling); > + > +#endif /* CONFIG_SMP */ Heh, I was about to complain that you #ifdef'd out my lovely INVLPG tunable. But I guess on UP you just get flush_tlb_mm_range() from: > static inline void flush_tlb_mm_range(struct mm_struct *mm, > unsigned long start, unsigned long end, unsigned long vmflag) > { > if (mm == current->active_mm) > __flush_tlb_up(); > } which doesn't even do INVLPG. How sad. Poor UP. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932630AbcAIASa (ORCPT ); Fri, 8 Jan 2016 19:18:30 -0500 Received: from mail-ob0-f172.google.com ([209.85.214.172]:35390 "EHLO mail-ob0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753535AbcAIAS2 (ORCPT ); Fri, 8 Jan 2016 19:18:28 -0500 MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 8 Jan 2016 16:18:08 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Linus Torvalds Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Jan 8, 2016 3:41 PM, "Linus Torvalds" wrote: > > On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski wrote: > > + /* > > + * We mustn't be preempted or handle an IPI while reading and > > + * writing CR3. Preemption could switch mms and switch back, and > > + * an IPI could call leave_mm. Either of those could cause our > > + * PCID to change asynchronously. > > + */ > > + raw_local_irq_save(flags); > > native_write_cr3(native_read_cr3()); > > + raw_local_irq_restore(flags); > > This seems sad for two reasons: > > - it adds unnecessary overhead on non-pcid setups (32-bit being an > example of that) I can certainly skip the flag saving on !PCID. > > - on pcid setups, wouldn't invpcid_flush_single_context() be better? > I played with that and it was slower. I don't pretend that makes any sense. > So on the whole I hate it. > > Why isn't this something like > > if (static_cpu_has_safe(X86_FEATURE_INVPCID)) { > invpcid_flush_single_context(); > return; > } > native_write_cr3(native_read_cr3()); > > *without* any flag saving crud? > > And yes, that means that we'd require X86_FEATURE_INVPCID in order to > use X86_FEATURE_PCID, but that seems fine. I have an SNB "Extreme" with PCID but not INVPCID, and there could be a whole generation of servers like that. I think we should fully support them. We might be able to get away with just disabling preemption instead of IRQs, at least if mm == active_mm. > > Or is there some reason you wanted the odd flags version? If so, that > should be documented. What do you mean "odd"? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755042AbcAIA1l (ORCPT ); Fri, 8 Jan 2016 19:27:41 -0500 Received: from mga14.intel.com ([192.55.52.115]:33187 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752789AbcAIA1j (ORCPT ); Fri, 8 Jan 2016 19:27:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,541,1444719600"; d="scan'208";a="856783999" Subject: Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID To: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org References: Cc: Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" From: Dave Hansen Message-ID: <569053F9.7060002@linux.intel.com> Date: Fri, 8 Jan 2016 16:27:37 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/08/2016 03:15 PM, Andy Lutomirski wrote: > + * The guiding principle of this code is that TLB entries that have > + * survived more than a small number of context switches are mostly > + * useless, so we don't try very hard not to evict them. Big ack on that. The original approach tried to keep track of the full 4k worth of possible PCIDs, it also needed an additional cpumask (which it dynamically allocated) for where the PCID was active in addition to the normal "where has this mm been" mask. That's a lot of extra data to mangle, and I can definitely see your approach as being nicer, *IF* the hardware isn't doing something useful with the other 9 bits of PCID that you're throwing away. ;) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755119AbcAICUU (ORCPT ); Fri, 8 Jan 2016 21:20:20 -0500 Received: from mail-ob0-f178.google.com ([209.85.214.178]:34530 "EHLO mail-ob0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754111AbcAICUS (ORCPT ); Fri, 8 Jan 2016 21:20:18 -0500 MIME-Version: 1.0 In-Reply-To: <569053F9.7060002@linux.intel.com> References: <569053F9.7060002@linux.intel.com> From: Andy Lutomirski Date: Fri, 8 Jan 2016 18:19:58 -0800 Message-ID: Subject: Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID To: Dave Hansen Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , Borislav Petkov , Brian Gerst , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 4:27 PM, Dave Hansen wrote: > On 01/08/2016 03:15 PM, Andy Lutomirski wrote: >> + * The guiding principle of this code is that TLB entries that have >> + * survived more than a small number of context switches are mostly >> + * useless, so we don't try very hard not to evict them. > > Big ack on that. The original approach tried to keep track of the full > 4k worth of possible PCIDs, it also needed an additional cpumask (which > it dynamically allocated) for where the PCID was active in addition to > the normal "where has this mm been" mask. My patch has a similar extra cpumask, but at least I didn't dynamically allocate it. I did it because I need a 100% reliable way to tell whether a given mm has a valid PCID in a cpu's PCID LRU list, as opposed to just matching due to struct mm reuse or similar. I also need the ability to blow away old mappings, which I can do by clearing the cpumask. This happens in init_new_context and in propagate_tlb_flush. The other way to do it would be to store some kind of generation counter in the per-cpu list. I could use a global 64-bit atomic counter to allocate never-reused mm ids (it's highly unlikely that a system will run long enough for such a counter to overflow -- it could only ever be incremented every few ns, giving hundreds of years of safety), but that's kind of expensive. I could use a per-cpu allocator, but 54 bits per cpu is uncomfortably small unless we have wraparound handling. We could do 64 bits per cpu for very cheap counter allocation, but then the "zap the pcid" logic gets much nastier in that neither the percpu entries nor the per-mm generation counter entries don't fit in a word any more. Maybe that's fine. What we can't do easily is have a per-mm generation counter, because freeing an mm and reallocating it in the same place needs to reliably zap the pcid on all CPUs. Anyway, this problem is clearly solvable, but I haven't thought of a straightforward solution that doesn't involve rarely-executed code paths, and that makes me a bit nervous. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755289AbcAICU2 (ORCPT ); Fri, 8 Jan 2016 21:20:28 -0500 Received: from mail-ig0-f181.google.com ([209.85.213.181]:34185 "EHLO mail-ig0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755159AbcAICUZ (ORCPT ); Fri, 8 Jan 2016 21:20:25 -0500 MIME-Version: 1.0 In-Reply-To: References: Date: Fri, 8 Jan 2016 18:20:24 -0800 X-Google-Sender-Auth: dFmp5xcMpeL0lct60FXQ9FN44NU Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 From: Linus Torvalds To: Andy Lutomirski Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski wrote: >> >> - on pcid setups, wouldn't invpcid_flush_single_context() be better? > > I played with that and it was slower. I don't pretend that makes any sense. Ugh. I guess reading and writing cr3 has been optimized. >> And yes, that means that we'd require X86_FEATURE_INVPCID in order to >> use X86_FEATURE_PCID, but that seems fine. > > I have an SNB "Extreme" with PCID but not INVPCID, and there could be > a whole generation of servers like that. I think we should fully > support them. Can you check the timings? IOW, is it a win on SNB? I think originally Intel only had two actual bits of process context ID in the TLB, and it was meant to be used for virtualization or something. Together with the hashing (to make it always appear as 12 bits to software - a nice idea but also means that the hardware ends up invalidating more than software really expects), it may not work all that well. That _could_ explain why the original patch from intel didn't work. > We might be able to get away with just disabling preemption instead of > IRQs, at least if mm == active_mm. I'm not convinced it is all that much faster. Of course, it's nicer on non-preempt, but nobody seems to run things that way. >> Or is there some reason you wanted the odd flags version? If so, that >> should be documented. > > What do you mean "odd"? It's odd because it makes no sense for non-pcid (christ, I wish Intel had just called it "asid" instead, "pcid" always makes me react to "pci"), and I think it would make more sense to pair up the pcid case with the invpcid rather than have those preemption rules here. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757243AbcAJS73 (ORCPT ); Sun, 10 Jan 2016 13:59:29 -0500 Received: from mail.skyhub.de ([78.46.96.112]:50747 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756805AbcAJS72 (ORCPT ); Sun, 10 Jan 2016 13:59:28 -0500 Date: Sun, 10 Jan 2016 19:59:16 +0100 From: Borislav Petkov To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , Andrey Ryabinin Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Message-ID: <20160110185916.GD22896@pd.tnic> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org + Andrey. On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote: > Otherwise terrible things happen if some of the callbacks end up > calling into KASAN in unexpected places. > > This has no obvious symptoms yet, but adding a memory reference to > native_flush_tlb_global without this blows up on KASAN kernels. > > Signed-off-by: Andy Lutomirski > --- > arch/x86/kernel/Makefile | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile > index b1b78ffe01d0..b7cd5bdf314b 100644 > --- a/arch/x86/kernel/Makefile > +++ b/arch/x86/kernel/Makefile > @@ -19,6 +19,7 @@ endif > KASAN_SANITIZE_head$(BITS).o := n > KASAN_SANITIZE_dumpstack.o := n > KASAN_SANITIZE_dumpstack_$(BITS).o := n > +KASAN_SANITIZE_paravirt.o := n > > CFLAGS_irq.o := -I$(src)/../include/asm/trace Shouldn't we take this one irrespectively of what happens to the rest in the patchset? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759298AbcAKKvL (ORCPT ); Mon, 11 Jan 2016 05:51:11 -0500 Received: from mail-wm0-f65.google.com ([74.125.82.65]:33931 "EHLO mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750813AbcAKKvI (ORCPT ); Mon, 11 Jan 2016 05:51:08 -0500 Date: Mon, 11 Jan 2016 11:51:05 +0100 From: Ingo Molnar To: Linus Torvalds Cc: Andy Lutomirski , Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Message-ID: <20160111105105.GB29448@gmail.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds wrote: > >> Or is there some reason you wanted the odd flags version? If so, that > >> should be documented. > > > > What do you mean "odd"? > > It's odd because it makes no sense for non-pcid (christ, I wish Intel had just > called it "asid" instead, "pcid" always makes me react to "pci"), and I think it > would make more sense to pair up the pcid case with the invpcid rather than have > those preemption rules here. The naming is really painful, so a trivial suggestion: could we just name all the Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID' nomenclature in the very lowest level code? I.e. rename pcid_live_cpus et al and most functions to the asid or ctx_id or asid naming scheme or so. That would hide most of the naming ugliness. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759582AbcAKMuq (ORCPT ); Mon, 11 Jan 2016 07:50:46 -0500 Received: from relay.parallels.com ([195.214.232.42]:55076 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757478AbcAKMup (ORCPT ); Mon, 11 Jan 2016 07:50:45 -0500 From: Andrey Ryabinin To: Borislav Petkov , Andy Lutomirski , , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" CC: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , , , Andrey Ryabinin Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Date: Mon, 11 Jan 2016 15:51:17 +0300 Message-ID: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> X-Mailer: git-send-email 2.4.10 In-Reply-To: <20160110185916.GD22896@pd.tnic> References: <20160110185916.GD22896@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/10/2016 09:59 PM, Borislav Petkov wrote: > + Andrey. > > On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote: >> Otherwise terrible things happen if some of the callbacks end up >> calling into KASAN in unexpected places. >> >> This has no obvious symptoms yet, but adding a memory reference to >> native_flush_tlb_global without this blows up on KASAN kernels. >> >> Signed-off-by: Andy Lutomirski >> --- >> arch/x86/kernel/Makefile | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile >> index b1b78ffe01d0..b7cd5bdf314b 100644 >> --- a/arch/x86/kernel/Makefile >> +++ b/arch/x86/kernel/Makefile >> @@ -19,6 +19,7 @@ endif >> KASAN_SANITIZE_head$(BITS).o := n >> KASAN_SANITIZE_dumpstack.o := n >> KASAN_SANITIZE_dumpstack_$(BITS).o := n >> +KASAN_SANITIZE_paravirt.o := n >> >> CFLAGS_irq.o := -I$(src)/../include/asm/trace > > Shouldn't we take this one irrespectively of what happens to the rest in > the patchset? > I don't think that this patch is the right way to solve the problem. The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" should fix Andy's problem. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932727AbcAKMuw (ORCPT ); Mon, 11 Jan 2016 07:50:52 -0500 Received: from relay.parallels.com ([195.214.232.42]:55102 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757478AbcAKMut (ORCPT ); Mon, 11 Jan 2016 07:50:49 -0500 From: Andrey Ryabinin To: Borislav Petkov , Andy Lutomirski , , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" CC: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , , , Andrey Ryabinin Subject: [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Date: Mon, 11 Jan 2016 15:51:18 +0300 Message-ID: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> X-Mailer: git-send-email 2.4.10 In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently we clear kasan_zero_page before __flush_tlb_all(). This works with current implementation of native_flush_tlb[_global]() because it doesn't cause do any writes to kasan shadow memory. But any subtle change made in native_flush_tlb*() could break this. Also current code seems doesn't work for paravirt guests (lguest). Only after the TLB flush we can be sure that kasan_zero_page is not used as early shadow anymore (instrumented code will not write to it). So it should cleared it only after the TLB flush. Signed-off-by: Andrey Ryabinin --- arch/x86/mm/kasan_init_64.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index d470cf2..303e470 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -120,11 +120,16 @@ void __init kasan_init(void) kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END), (void *)KASAN_SHADOW_END); - memset(kasan_zero_page, 0, PAGE_SIZE); - load_cr3(init_level4_pgt); __flush_tlb_all(); - init_task.kasan_depth = 0; + /* + * kasan_zero_page has been used as early shadow memory, thus it may + * contain some garbage. Now we can clear it, since after the TLB flush + * no one should write to it. + */ + memset(kasan_zero_page, 0, PAGE_SIZE); + + init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); } -- 2.4.10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932774AbcAKMvB (ORCPT ); Mon, 11 Jan 2016 07:51:01 -0500 Received: from relay.parallels.com ([195.214.232.42]:55180 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932732AbcAKMu6 (ORCPT ); Mon, 11 Jan 2016 07:50:58 -0500 From: Andrey Ryabinin To: Borislav Petkov , Andy Lutomirski , , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" CC: Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , , , Andrey Ryabinin Subject: [PATCH 2/2] x86/kasan: write protect kasan zero shadow Date: Mon, 11 Jan 2016 15:51:19 +0300 Message-ID: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> X-Mailer: git-send-email 2.4.10 In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org After kasan_init() executed, no one is allowed to write to kasan_zero_page, so write protect it. Signed-off-by: Andrey Ryabinin --- arch/x86/mm/kasan_init_64.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 303e470..1b1110f 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -125,10 +125,16 @@ void __init kasan_init(void) /* * kasan_zero_page has been used as early shadow memory, thus it may - * contain some garbage. Now we can clear it, since after the TLB flush - * no one should write to it. + * contain some garbage. Now we can clear and write protect it, since + * after the TLB flush no one should write to it. */ memset(kasan_zero_page, 0, PAGE_SIZE); + for (i = 0; i < PTRS_PER_PTE; i++) { + pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO); + set_pte(&kasan_zero_pte[i], pte); + } + /* Flush TLBs again to be sure that write protection applied. */ + __flush_tlb_all(); init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); -- 2.4.10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752205AbcAMXdS (ORCPT ); Wed, 13 Jan 2016 18:33:18 -0500 Received: from mail-oi0-f50.google.com ([209.85.218.50]:35419 "EHLO mail-oi0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751107AbcAMXdR (ORCPT ); Wed, 13 Jan 2016 18:33:17 -0500 MIME-Version: 1.0 In-Reply-To: <20160111105105.GB29448@gmail.com> References: <20160111105105.GB29448@gmail.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:32:56 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Ingo Molnar Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 11, 2016 at 2:51 AM, Ingo Molnar wrote: > > * Linus Torvalds wrote: > >> >> Or is there some reason you wanted the odd flags version? If so, that >> >> should be documented. >> > >> > What do you mean "odd"? >> >> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just >> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it >> would make more sense to pair up the pcid case with the invpcid rather than have >> those preemption rules here. > > The naming is really painful, so a trivial suggestion: could we just name all the > Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID' > nomenclature in the very lowest level code? I'd be okay with "pctx_id" or "pctxid" for this, I think. I'd like to at least make it somewhat obvious how it maps back to hardware. FWIW, I'd guess that Intel deviated from convention because their actual address space id is (vpid, pcid), and calling it (vpid, asid) might have been slightly confusing. Or not. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752661AbcAMXgJ (ORCPT ); Wed, 13 Jan 2016 18:36:09 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:36135 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752464AbcAMXgH (ORCPT ); Wed, 13 Jan 2016 18:36:07 -0500 MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:35:46 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Linus Torvalds Cc: Oleg Nesterov , X86 ML , Dave Hansen , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds wrote: > On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski wrote: >>> >>> - on pcid setups, wouldn't invpcid_flush_single_context() be better? >> >> I played with that and it was slower. I don't pretend that makes any sense. > > Ugh. I guess reading and writing cr3 has been optimized. > >>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to >>> use X86_FEATURE_PCID, but that seems fine. >> >> I have an SNB "Extreme" with PCID but not INVPCID, and there could be >> a whole generation of servers like that. I think we should fully >> support them. > > Can you check the timings? IOW, is it a win on SNB? ~80ns gain on SNB. It's actually quite impressive on SNB: it knocks the penalty for mm switches down to 20ns or so, which I find to be fairly amazing. (This is at 3.8GHz or thereabouts.) > > I think originally Intel only had two actual bits of process context > ID in the TLB, and it was meant to be used for virtualization or > something. Together with the hashing (to make it always appear as 12 > bits to software - a nice idea but also means that the hardware ends > up invalidating more than software really expects), it may not work > all that well. > > That _could_ explain why the original patch from intel didn't work. > >> We might be able to get away with just disabling preemption instead of >> IRQs, at least if mm == active_mm. > > I'm not convinced it is all that much faster. Of course, it's nicer on > non-preempt, but nobody seems to run things that way. My current testing version has three different code paths now. If INVPCID and PCID are both available, then it uses INVPCID. If PCID is available but INVPCID is not, it does raw_local_irqsave. If PCID is not available, it just does the CR3 read/write. Yeah, it's ugly, and it's a big blob of code to do something trivial, but it seems to work and it should be the right thing to do in most cases. Can anyone here ask a hardware or microcode person what's going on with CR3 writes possibly being faster than INVPCID? Is there some trick to it? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753450AbcAMXnj (ORCPT ); Wed, 13 Jan 2016 18:43:39 -0500 Received: from mga09.intel.com ([134.134.136.24]:54865 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753378AbcAMXni (ORCPT ); Wed, 13 Jan 2016 18:43:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,291,1449561600"; d="scan'208";a="726498932" Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Andy Lutomirski , Linus Torvalds References: Cc: Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst From: Dave Hansen Message-ID: <5696E129.9000804@linux.intel.com> Date: Wed, 13 Jan 2016 15:43:37 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/13/2016 03:35 PM, Andy Lutomirski wrote: > Can anyone here ask a hardware or microcode person what's going on > with CR3 writes possibly being faster than INVPCID? Is there some > trick to it? I just went and measured it myself this morning. "INVPCID Type 3" (all contexts no global) on a Skylake system was 15% slower than a CR3 write. Is that in the same ballpark from what you've observed? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753660AbcAMXv5 (ORCPT ); Wed, 13 Jan 2016 18:51:57 -0500 Received: from mail-ob0-f182.google.com ([209.85.214.182]:35689 "EHLO mail-ob0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753596AbcAMXv4 (ORCPT ); Wed, 13 Jan 2016 18:51:56 -0500 MIME-Version: 1.0 In-Reply-To: <5696E129.9000804@linux.intel.com> References: <5696E129.9000804@linux.intel.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 15:51:35 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Dave Hansen Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen wrote: > On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >> Can anyone here ask a hardware or microcode person what's going on >> with CR3 writes possibly being faster than INVPCID? Is there some >> trick to it? > > I just went and measured it myself this morning. "INVPCID Type 3" (all > contexts no global) on a Skylake system was 15% slower than a CR3 write. > > Is that in the same ballpark from what you've observed? > > It's similar, except that I was comparing "INVPCID Type 1" (single context no globals) to a CR3 write. Type 2, at least, is dramatically faster than the pair of CR4 writes it replaces. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753943AbcAMX4W (ORCPT ); Wed, 13 Jan 2016 18:56:22 -0500 Received: from mga03.intel.com ([134.134.136.65]:46158 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753889AbcAMX4V (ORCPT ); Wed, 13 Jan 2016 18:56:21 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,291,1449561600"; d="scan'208";a="726504840" Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Andy Lutomirski References: <5696E129.9000804@linux.intel.com> Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst From: Dave Hansen Message-ID: <5696E420.9040704@linux.intel.com> Date: Wed, 13 Jan 2016 15:56:16 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/13/2016 03:51 PM, Andy Lutomirski wrote: > On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen > wrote: >> On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >>> Can anyone here ask a hardware or microcode person what's going on >>> with CR3 writes possibly being faster than INVPCID? Is there some >>> trick to it? >> >> I just went and measured it myself this morning. "INVPCID Type 3" (all >> contexts no global) on a Skylake system was 15% slower than a CR3 write. >> >> Is that in the same ballpark from what you've observed? > > It's similar, except that I was comparing "INVPCID Type 1" (single > context no globals) to a CR3 write. Ahh, because you're using PCID... That one I saw as being ~1.85x the number of cycles that a CR3 write was. > Type 2, at least, is dramatically faster than the pair of CR4 writes > it replaces. Yeah, I saw the same thing. Type 2 was ~2.4x faster than the CR4 writes. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754200AbcANAe7 (ORCPT ); Wed, 13 Jan 2016 19:34:59 -0500 Received: from mail-oi0-f44.google.com ([209.85.218.44]:35467 "EHLO mail-oi0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751657AbcANAe5 (ORCPT ); Wed, 13 Jan 2016 19:34:57 -0500 MIME-Version: 1.0 In-Reply-To: <5696E420.9040704@linux.intel.com> References: <5696E129.9000804@linux.intel.com> <5696E420.9040704@linux.intel.com> From: Andy Lutomirski Date: Wed, 13 Jan 2016 16:34:37 -0800 Message-ID: Subject: Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 To: Dave Hansen Cc: Linus Torvalds , Oleg Nesterov , X86 ML , Borislav Petkov , Linux Kernel Mailing List , "linux-mm@kvack.org" , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 13, 2016 at 3:56 PM, Dave Hansen wrote: > On 01/13/2016 03:51 PM, Andy Lutomirski wrote: >> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen >> wrote: >>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote: >>>> Can anyone here ask a hardware or microcode person what's going on >>>> with CR3 writes possibly being faster than INVPCID? Is there some >>>> trick to it? >>> >>> I just went and measured it myself this morning. "INVPCID Type 3" (all >>> contexts no global) on a Skylake system was 15% slower than a CR3 write. >>> >>> Is that in the same ballpark from what you've observed? >> >> It's similar, except that I was comparing "INVPCID Type 1" (single >> context no globals) to a CR3 write. > > Ahh, because you're using PCID... That one I saw as being ~1.85x the > number of cycles that a CR3 write was. > I think that settles it, then: if (static_cpu_has_safe(X86_FEATURE_PCID)) { raw_local_irqsave(); native_write_cr3(native_read_cr3()); raw_local_irqrestore(); } else { native_write_cr3(native_read_cr3()); } I don't think it's worth hacking more complexity into switch_mm to make that annoyance go away. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932471AbcARWY3 (ORCPT ); Mon, 18 Jan 2016 17:24:29 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:35334 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932372AbcARWY0 (ORCPT ); Mon, 18 Jan 2016 17:24:26 -0500 MIME-Version: 1.0 In-Reply-To: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> From: Andy Lutomirski Date: Mon, 18 Jan 2016 14:24:06 -0800 Message-ID: Subject: Re: [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush To: Andrey Ryabinin Cc: Borislav Petkov , Andy Lutomirski , X86 ML , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin wrote: > Currently we clear kasan_zero_page before __flush_tlb_all(). This > works with current implementation of native_flush_tlb[_global]() > because it doesn't cause do any writes to kasan shadow memory. > But any subtle change made in native_flush_tlb*() could break this. > Also current code seems doesn't work for paravirt guests (lguest). > > Only after the TLB flush we can be sure that kasan_zero_page is not > used as early shadow anymore (instrumented code will not write to it). > So it should cleared it only after the TLB flush. This seems to fix the issue with my patch set. Thanks. Tested-by: Andy Lutomirski --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932495AbcARWYy (ORCPT ); Mon, 18 Jan 2016 17:24:54 -0500 Received: from mail-ob0-f172.google.com ([209.85.214.172]:36619 "EHLO mail-ob0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932442AbcARWYw (ORCPT ); Mon, 18 Jan 2016 17:24:52 -0500 MIME-Version: 1.0 In-Reply-To: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> From: Andy Lutomirski Date: Mon, 18 Jan 2016 14:24:32 -0800 Message-ID: Subject: Re: [PATCH 2/2] x86/kasan: write protect kasan zero shadow To: Andrey Ryabinin Cc: Borislav Petkov , Andy Lutomirski , X86 ML , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin wrote: > After kasan_init() executed, no one is allowed to write to kasan_zero_page, > so write protect it. This seems to work for me. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755107AbcA2Kfr (ORCPT ); Fri, 29 Jan 2016 05:35:47 -0500 Received: from mail.skyhub.de ([78.46.96.112]:33515 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753272AbcA2Kfm (ORCPT ); Fri, 29 Jan 2016 05:35:42 -0500 Date: Fri, 29 Jan 2016 11:35:35 +0100 From: Borislav Petkov To: Andrey Ryabinin Cc: Andy Lutomirski , x86@kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Message-ID: <20160129103535.GA10187@pd.tnic> References: <20160110185916.GD22896@pd.tnic> <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1452516679-32040-1-git-send-email-aryabinin@virtuozzo.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 11, 2016 at 03:51:17PM +0300, Andrey Ryabinin wrote: > I don't think that this patch is the right way to solve the problem. > The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" > should fix Andy's problem. Both applied, thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757560AbcBIQIP (ORCPT ); Tue, 9 Feb 2016 11:08:15 -0500 Received: from terminus.zytor.com ([198.137.202.10]:38343 "EHLO terminus.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757510AbcBIQIJ (ORCPT ); Tue, 9 Feb 2016 11:08:09 -0500 Date: Tue, 9 Feb 2016 08:06:41 -0800 From: tip-bot for Andrey Ryabinin Message-ID: Cc: oleg@redhat.com, hpa@zytor.com, peterz@infradead.org, luto@kernel.org, tglx@linutronix.de, dave.hansen@linux.intel.com, dvlasenk@redhat.com, bp@alien8.de, mcgrof@suse.com, brgerst@gmail.com, bp@suse.de, mingo@kernel.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, luto@amacapital.net, toshi.kani@hp.com, aryabinin@virtuozzo.com, torvalds@linux-foundation.org Reply-To: torvalds@linux-foundation.org, toshi.kani@hp.com, aryabinin@virtuozzo.com, linux-kernel@vger.kernel.org, luto@amacapital.net, bp@suse.de, brgerst@gmail.com, akpm@linux-foundation.org, mingo@kernel.org, dvlasenk@redhat.com, dave.hansen@linux.intel.com, mcgrof@suse.com, bp@alien8.de, peterz@infradead.org, oleg@redhat.com, hpa@zytor.com, tglx@linutronix.de, luto@kernel.org In-Reply-To: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> References: <1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com> To: linux-tip-commits@vger.kernel.org Subject: [tip:x86/mm] x86/kasan: Clear kasan_zero_page after TLB flush Git-Commit-ID: 69e0210fd01ff157d332102219aaf5c26ca8069b X-Mailer: tip-git-log-daemon Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Content-Disposition: inline Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Commit-ID: 69e0210fd01ff157d332102219aaf5c26ca8069b Gitweb: http://git.kernel.org/tip/69e0210fd01ff157d332102219aaf5c26ca8069b Author: Andrey Ryabinin AuthorDate: Mon, 11 Jan 2016 15:51:18 +0300 Committer: Ingo Molnar CommitDate: Tue, 9 Feb 2016 13:33:14 +0100 x86/kasan: Clear kasan_zero_page after TLB flush Currently we clear kasan_zero_page before __flush_tlb_all(). This works with current implementation of native_flush_tlb[_global]() because it doesn't cause do any writes to kasan shadow memory. But any subtle change made in native_flush_tlb*() could break this. Also current code seems doesn't work for paravirt guests (lguest). Only after the TLB flush we can be sure that kasan_zero_page is not used as early shadow anymore (instrumented code will not write to it). So it should cleared it only after the TLB flush. Signed-off-by: Andrey Ryabinin Reviewed-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Luis R. Rodriguez Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar --- arch/x86/mm/kasan_init_64.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index d470cf2..303e470 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -120,11 +120,16 @@ void __init kasan_init(void) kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END), (void *)KASAN_SHADOW_END); - memset(kasan_zero_page, 0, PAGE_SIZE); - load_cr3(init_level4_pgt); __flush_tlb_all(); - init_task.kasan_depth = 0; + /* + * kasan_zero_page has been used as early shadow memory, thus it may + * contain some garbage. Now we can clear it, since after the TLB flush + * no one should write to it. + */ + memset(kasan_zero_page, 0, PAGE_SIZE); + + init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757661AbcBIQIg (ORCPT ); Tue, 9 Feb 2016 11:08:36 -0500 Received: from terminus.zytor.com ([198.137.202.10]:38361 "EHLO terminus.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757619AbcBIQI3 (ORCPT ); Tue, 9 Feb 2016 11:08:29 -0500 Date: Tue, 9 Feb 2016 08:07:10 -0800 From: tip-bot for Andrey Ryabinin Message-ID: Cc: bp@suse.de, torvalds@linux-foundation.org, toshi.kani@hp.com, dave.hansen@linux.intel.com, bp@alien8.de, mcgrof@suse.com, dvlasenk@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, luto@kernel.org, aryabinin@virtuozzo.com, oleg@redhat.com, peterz@infradead.org, hpa@zytor.com, brgerst@gmail.com, luto@amacapital.net, mingo@kernel.org, tglx@linutronix.de Reply-To: dave.hansen@linux.intel.com, bp@alien8.de, bp@suse.de, torvalds@linux-foundation.org, toshi.kani@hp.com, aryabinin@virtuozzo.com, luto@kernel.org, oleg@redhat.com, dvlasenk@redhat.com, mcgrof@suse.com, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, peterz@infradead.org, luto@amacapital.net, mingo@kernel.org, tglx@linutronix.de, hpa@zytor.com, brgerst@gmail.com In-Reply-To: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> References: <1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com> To: linux-tip-commits@vger.kernel.org Subject: [tip:x86/mm] x86/kasan: Write protect kasan zero shadow Git-Commit-ID: 063fb3e56f6dd29b2633b678b837e1d904200e6f X-Mailer: tip-git-log-daemon Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Content-Disposition: inline Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Commit-ID: 063fb3e56f6dd29b2633b678b837e1d904200e6f Gitweb: http://git.kernel.org/tip/063fb3e56f6dd29b2633b678b837e1d904200e6f Author: Andrey Ryabinin AuthorDate: Mon, 11 Jan 2016 15:51:19 +0300 Committer: Ingo Molnar CommitDate: Tue, 9 Feb 2016 13:33:14 +0100 x86/kasan: Write protect kasan zero shadow After kasan_init() executed, no one is allowed to write to kasan_zero_page, so write protect it. Signed-off-by: Andrey Ryabinin Reviewed-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Luis R. Rodriguez Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar --- arch/x86/mm/kasan_init_64.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 303e470..1b1110f 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -125,10 +125,16 @@ void __init kasan_init(void) /* * kasan_zero_page has been used as early shadow memory, thus it may - * contain some garbage. Now we can clear it, since after the TLB flush - * no one should write to it. + * contain some garbage. Now we can clear and write protect it, since + * after the TLB flush no one should write to it. */ memset(kasan_zero_page, 0, PAGE_SIZE); + for (i = 0; i < PTRS_PER_PTE; i++) { + pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO); + set_pte(&kasan_zero_pte[i], pte); + } + /* Flush TLBs again to be sure that write protection applied. */ + __flush_tlb_all(); init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932851AbcFCRmO (ORCPT ); Fri, 3 Jun 2016 13:42:14 -0400 Received: from mail-pa0-f67.google.com ([209.85.220.67]:36702 "EHLO mail-pa0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752522AbcFCRmM convert rfc822-to-8bit (ORCPT ); Fri, 3 Jun 2016 13:42:12 -0400 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization From: Nadav Amit In-Reply-To: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> Date: Fri, 3 Jun 2016 10:42:10 -0700 Cc: x86@kernel.org, LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Transfer-Encoding: 8BIT Message-Id: <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> To: Andy Lutomirski X-Mailer: Apple Mail (2.3124) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Following this patch, if (current->active_mm != mm), flush_tlb_page() still doesn’t call smp_mb() before checking mm_cpumask(mm). In contrast, flush_tlb_mm_range() does call smp_mb(). Is there a reason for this discrepancy? Thanks, Nadav Andy Lutomirski wrote: > When switch_mm activates a new pgd, it also sets a bit that tells > other CPUs that the pgd is in use so that tlb flush IPIs will be > sent. In order for that to work correctly, the bit needs to be > visible prior to loading the pgd and therefore starting to fill the > local TLB. > > Document all the barriers that make this work correctly and add a > couple that were missing. > > Cc: stable@vger.kernel.org > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++- > arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- > 2 files changed, 58 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index 379cd3658799..1edc9cd198b8 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > #endif > cpumask_set_cpu(cpu, mm_cpumask(next)); > > - /* Re-load page tables */ > + /* > + * Re-load page tables. > + * > + * This logic has an ordering constraint: > + * > + * CPU 0: Write to a PTE for 'next' > + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. > + * CPU 1: set bit 1 in next's mm_cpumask > + * CPU 1: load from the PTE that CPU 0 writes (implicit) > + * > + * We need to prevent an outcome in which CPU 1 observes > + * the new PTE value and CPU 0 observes bit 1 clear in > + * mm_cpumask. (If that occurs, then the IPI will never > + * be sent, and CPU 0's TLB will contain a stale entry.) > + * > + * The bad outcome can occur if either CPU's load is > + * reordered before that CPU's store, so both CPUs much > + * execute full barriers to prevent this from happening. > + * > + * Thus, switch_mm needs a full barrier between the > + * store to mm_cpumask and any operation that could load > + * from next->pgd. This barrier synchronizes with > + * remote TLB flushers. Fortunately, load_cr3 is > + * serializing and thus acts as a full barrier. > + * > + */ > load_cr3(next->pgd); > + > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > > /* Stop flush ipis for the previous mm */ > @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > * schedule, protecting us from simultaneous changes. > */ > cpumask_set_cpu(cpu, mm_cpumask(next)); > + > /* > * We were in lazy tlb mode and leave_mm disabled > * tlb flush IPI delivery. We must reload CR3 > * to make sure to use no freed page tables. > + * > + * As above, this is a barrier that forces > + * TLB repopulation to be ordered after the > + * store to mm_cpumask. > */ > load_cr3(next->pgd); > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 8ddb5d0d66fb..8f4cc3dfac32 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -161,7 +161,10 @@ void flush_tlb_current_task(void) > preempt_disable(); > > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > + > + /* This is an implicit full barrier that synchronizes with switch_mm. */ > local_flush_tlb(); > + > trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) > flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); > @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, > unsigned long base_pages_to_flush = TLB_FLUSH_ALL; > > preempt_disable(); > - if (current->active_mm != mm) > + if (current->active_mm != mm) { > + /* Synchronize with switch_mm. */ > + smp_mb(); > + > goto out; > + } > > if (!current->mm) { > leave_mm(smp_processor_id()); > + > + /* Synchronize with switch_mm. */ > + smp_mb(); > + > goto out; > } > > if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) > base_pages_to_flush = (end - start) >> PAGE_SHIFT; > > + /* > + * Both branches below are implicit full barriers (MOV to CR or > + * INVLPG) that synchronize with switch_mm. > + */ > if (base_pages_to_flush > tlb_single_page_flush_ceiling) { > base_pages_to_flush = TLB_FLUSH_ALL; > count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) > preempt_disable(); > > if (current->active_mm == mm) { > - if (current->mm) > + if (current->mm) { > + /* > + * Implicit full barrier (INVLPG) that synchronizes > + * with switch_mm. > + */ > __flush_tlb_one(start); > - else > + } else { > leave_mm(smp_processor_id()); > + > + /* Synchronize with switch_mm. */ > + smp_mb(); > + } > } > > if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) > -- > 2.5.0 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932911AbcFIRYo (ORCPT ); Thu, 9 Jun 2016 13:24:44 -0400 Received: from mail-oi0-f46.google.com ([209.85.218.46]:35619 "EHLO mail-oi0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932740AbcFIRYj convert rfc822-to-8bit (ORCPT ); Thu, 9 Jun 2016 13:24:39 -0400 MIME-Version: 1.0 In-Reply-To: <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> From: Andy Lutomirski Date: Thu, 9 Jun 2016 10:24:18 -0700 Message-ID: Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization To: Nadav Amit Cc: Andy Lutomirski , X86 ML , LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit wrote: > Following this patch, if (current->active_mm != mm), flush_tlb_page() still > doesn’t call smp_mb() before checking mm_cpumask(mm). > > In contrast, flush_tlb_mm_range() does call smp_mb(). > > Is there a reason for this discrepancy? Not that I can remember. Is the remote flush case likely to be racy? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751592AbcFITqA (ORCPT ); Thu, 9 Jun 2016 15:46:00 -0400 Received: from mail-pf0-f182.google.com ([209.85.192.182]:33280 "EHLO mail-pf0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751086AbcFITp7 convert rfc822-to-8bit (ORCPT ); Thu, 9 Jun 2016 15:45:59 -0400 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization From: Nadav Amit In-Reply-To: Date: Thu, 9 Jun 2016 12:45:55 -0700 Cc: Andy Lutomirski , X86 ML , LKML , Borislav Petkov , Brian Gerst , Dave Hansen , Oleg Nesterov , "linux-mm@kvack.org" Content-Transfer-Encoding: 8BIT Message-Id: References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> <8D80C93B-3DD6-469B-90D6-FBC71B917EAD@gmail.com> To: Andy Lutomirski X-Mailer: Apple Mail (2.3124) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andy Lutomirski wrote: > On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit wrote: >> Following this patch, if (current->active_mm != mm), flush_tlb_page() still >> doesn’t call smp_mb() before checking mm_cpumask(mm). >> >> In contrast, flush_tlb_mm_range() does call smp_mb(). >> >> Is there a reason for this discrepancy? > > Not that I can remember. Is the remote flush case likely to be racy? You replied separately on another email that included a patch to fix this case. It turns out smp_mb is not needed on flush_tlb_page, since the PTE is always updated using an atomic operation. Yet, a compiler barrier is still needed, so I added smp_mb__after_atomic instead. Nadav From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753301AbcIFBWX (ORCPT ); Mon, 5 Sep 2016 21:22:23 -0400 Received: from mail-wm0-f68.google.com ([74.125.82.68]:36372 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752887AbcIFBWV (ORCPT ); Mon, 5 Sep 2016 21:22:21 -0400 MIME-Version: 1.0 In-Reply-To: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> References: <95a853538da28c64dfc877c60549ec79ed7a5d69.1452294700.git.luto@kernel.org> From: Wanpeng Li Date: Tue, 6 Sep 2016 09:22:18 +0800 Message-ID: Subject: Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization To: Andy Lutomirski Cc: "the arch/x86 maintainers" , "linux-kernel@vger.kernel.org" , Borislav Petkov , Brian Gerst , Dave Hansen , Linus Torvalds , Oleg Nesterov , "linux-mm@kvack.org" , stable@vger.kernel.org Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Andy, 2016-01-09 7:15 GMT+08:00 Andy Lutomirski : > When switch_mm activates a new pgd, it also sets a bit that tells > other CPUs that the pgd is in use so that tlb flush IPIs will be > sent. In order for that to work correctly, the bit needs to be > visible prior to loading the pgd and therefore starting to fill the > local TLB. > > Document all the barriers that make this work correctly and add a > couple that were missing. > > Cc: stable@vger.kernel.org > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++- > arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++--- > 2 files changed, 58 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index 379cd3658799..1edc9cd198b8 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > #endif > cpumask_set_cpu(cpu, mm_cpumask(next)); > > - /* Re-load page tables */ > + /* > + * Re-load page tables. > + * > + * This logic has an ordering constraint: > + * > + * CPU 0: Write to a PTE for 'next' > + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. > + * CPU 1: set bit 1 in next's mm_cpumask > + * CPU 1: load from the PTE that CPU 0 writes (implicit) > + * > + * We need to prevent an outcome in which CPU 1 observes > + * the new PTE value and CPU 0 observes bit 1 clear in > + * mm_cpumask. (If that occurs, then the IPI will never > + * be sent, and CPU 0's TLB will contain a stale entry.) I misunderstand this comments, CPU0 write to a PTE for 'next', and CPU0 observes bit 1 clear in mm_cpumask, so CPU0 won't kick IPI to CPU1, why CPU0's TLB will contain a stale entry instead of CPU1? Regards, Wanpeng Li