From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 2B9E66B0005 for ; Thu, 30 Jun 2016 20:12:12 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id a69so204008350pfa.1 for ; Thu, 30 Jun 2016 17:12:12 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id z8si930759pff.143.2016.06.30.17.12.09 for ; Thu, 30 Jun 2016 17:12:10 -0700 (PDT) Subject: [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:09 -0700 Message-Id: <20160701001209.7DA24D1C@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , minchan@kernel.org The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights Landing) has an erratum where a processor thread setting the Accessed or Dirty bits may not do so atomically against its checks for the Present bit. This may cause a thread (which is about to page fault) to set A and/or D, even though the Present bit had already been atomically cleared. If the PTE is used for storing a swap index or a NUMA migration index, the A bit could be misinterpreted as part of the swap type. The stray bits being set cause a software-cleared PTE to be interpreted as a swap entry. In some cases (like when the swap index ends up being for a non-existent swapfile), the kernel detects the stray value and WARN()s about it, but there is no guarantee that the kernel can always detect it. This patch causes the page unmap path in vmscan/direct reclaim to flush remote TLBs after clearing each page, and also clears the PTE again after the flush. For reclaim, this brings the behavior (and associated reclaim performance) back to what it was before Mel's changes that increased TLB flush batching. For the unmap path, this patch may force some additional flushes, but they are limited to a maximum of one per PTE page. This patch clears these stray A/D bits before releasing the pagetable lock which prevents other parts of the kernel from observing the stray bits. Andi Kleen wrote the original version of this patch, and Dave Hansen added the batching. The original version was much simpler but it did too many extra TLB flushes which killed performance. v3: huge rework to keep batching working in unmap case v2: out of line. avoid single thread flush. cover more clear cases Cc: Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id C77146B0253 for ; Thu, 30 Jun 2016 20:12:12 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id e189so203833137pfa.2 for ; Thu, 30 Jun 2016 17:12:12 -0700 (PDT) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTP id p7si967228pfp.244.2016.06.30.17.12.11 for ; Thu, 30 Jun 2016 17:12:11 -0700 (PDT) Subject: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:10 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001210.AA77B917@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com, luto@kernel.org, stable@vger.kernel.org From: Dave Hansen epufeatures.h currently defines X86_BUG(9) twice on 32-bit: #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ ... #ifdef CONFIG_X86_32 #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif I think what happened was that this added the X86_BUG_ESPFIX, but in an #ifdef below most of the bugs: [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled Then this came along and added X86_BUG_NULL_SEG, but collided with the earlier one that did the bug below the main block defining all the X86_BUG()s. [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time Signed-off-by: Dave Hansen Acked-by: Andy Lutomirski Cc: stable@vger.kernel.org --- b/arch/x86/include/asm/cpufeatures.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 @@ -301,10 +301,6 @@ #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ - - #ifdef CONFIG_X86_32 /* * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional @@ -312,5 +308,7 @@ */ #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif +#define X86_BUG_NULL_SEG X86_BUG(10) /* Nulling a selector preserves the base */ +#define X86_BUG_SWAPGS_FENCE X86_BUG(11) /* SWAPGS without input dep on GS */ #endif /* _ASM_X86_CPUFEATURES_H */ _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id B02466B025E for ; Thu, 30 Jun 2016 20:12:30 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id g62so204227153pfb.3 for ; Thu, 30 Jun 2016 17:12:30 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id ai12si908080pac.139.2016.06.30.17.12.29 for ; Thu, 30 Jun 2016 17:12:29 -0700 (PDT) Subject: [PATCH 3/6] mm: add force_batch_flush to mmu_gather From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:14 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001214.94D8F14C@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Currently, zap_pte_range() has a local variable called 'force_flush'. It is set when the zapping code runs in to an entry that requires flushing before the ptl is released. Currently, there are two reasons we might do that: 1. The TLB batching in __tlb_remove_page() has run out of space and can no longer record new pages being flushed. 2. An entry for a dirty page was flushed, and we need to ensure that software walking the page tables can not observe the cleared bit before we flush the TLB. We need the x86 code to be able to force a flush for an additional reason: if the PTE being cleared might have a stray Accessed or Dirty bit set on it because of a hardware erratum. For these purposes, we also need to flush before the ptl is released. So, we move the 'force_flush' variable into the mmu_gather structure where we can set it from arch code. Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 1 + b/arch/ia64/include/asm/tlb.h | 1 + b/arch/s390/include/asm/tlb.h | 1 + b/arch/sh/include/asm/tlb.h | 1 + b/arch/um/include/asm/tlb.h | 1 + b/include/asm-generic/tlb.h | 3 +++ b/mm/memory.c | 22 ++++++++++++---------- 7 files changed, 20 insertions(+), 10 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.021222437 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:42.037223163 -0700 @@ -69,6 +69,7 @@ struct mmu_gather { unsigned int need_flush; #endif unsigned int fullmm; + unsigned int force_batch_flush; struct vm_area_struct *vma; unsigned long start, end; unsigned long range_start; diff -puN arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.022222482 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:42.037223163 -0700 @@ -57,6 +57,7 @@ struct mmu_gather { unsigned int nr; unsigned int max; unsigned char fullmm; /* non-zero means full mm flush */ + unsigned int force_batch_flush; /* stop batching and flush */ unsigned char need_flush; /* really unmapped some PTEs? */ unsigned long start, end; unsigned long start_addr; diff -puN arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.024222573 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:42.038223208 -0700 @@ -32,6 +32,7 @@ struct mmu_gather { struct mm_struct *mm; struct mmu_table_batch *batch; unsigned int fullmm; + unsigned int force_batch_flush; /* stop batching and flush */ unsigned long start, end; }; diff -puN arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.025222618 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:42.038223208 -0700 @@ -21,6 +21,7 @@ struct mmu_gather { struct mm_struct *mm; unsigned int fullmm; + unsigned int force_batch_flush; /* stop batching and flush */ unsigned long start, end; }; diff -puN arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.027222709 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:42.039223253 -0700 @@ -20,6 +20,7 @@ struct mmu_gather { unsigned long start; unsigned long end; unsigned int fullmm; /* non-zero means full mm flush */ + unsigned int force_batch_flush; /* stop batching and flush */ }; static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, diff -puN include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.028222754 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:42.039223253 -0700 @@ -102,6 +102,9 @@ struct mmu_gather { /* we have performed an operation which * requires a complete flush of the tlb */ need_flush_all : 1, + /* need to flush the tlb and stop batching + * before we release ptl */ + force_batch_flush : 1, /* we cleared a PTE bit which may potentially * get set by hardware */ saw_unset_a_or_d: 1; diff -puN mm/memory.c~knl-leak-30-tlb_force_flush mm/memory.c --- a/mm/memory.c~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.031222890 -0700 +++ b/mm/memory.c 2016-06-30 17:10:42.040223299 -0700 @@ -225,6 +225,7 @@ void tlb_gather_mmu(struct mmu_gather *t tlb->fullmm = !(start | (end+1)); tlb->need_flush_all = 0; tlb->saw_unset_a_or_d = 0; + tlb->force_batch_flush = 0; tlb->local.next = NULL; tlb->local.nr = 0; @@ -1105,7 +1106,6 @@ static unsigned long zap_pte_range(struc struct zap_details *details) { struct mm_struct *mm = tlb->mm; - int force_flush = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; @@ -1151,7 +1151,7 @@ again: */ if (unlikely(details && details->ignore_dirty)) continue; - force_flush = 1; + tlb->force_batch_flush = 1; set_page_dirty(page); } if (pte_young(ptent) && @@ -1163,7 +1163,7 @@ again: if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); if (unlikely(!__tlb_remove_page(tlb, page))) { - force_flush = 1; + tlb->force_batch_flush = 1; addr += PAGE_SIZE; break; } @@ -1191,18 +1191,20 @@ again: arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (force_flush) + if (tlb->force_batch_flush) tlb_flush_mmu_tlbonly(tlb); pte_unmap_unlock(start_pte, ptl); /* - * If we forced a TLB flush (either due to running out of - * batch buffers or because we needed to flush dirty TLB - * entries before releasing the ptl), free the batched - * memory too. Restart if we didn't do everything. + * If we forced a TLB flush, free the batched + * memory too. Restart if we didn't do everything. + * + * We force this due to running out of batch buffers, + * needing to flush dirty TLB entries before releasing + * the ptl, or for arch-specific reasons. */ - if (force_flush) { - force_flush = 0; + if (tlb->force_batch_flush) { + tlb->force_batch_flush = 0; tlb_flush_mmu_free(tlb); if (addr != end) _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id A48D66B025F for ; Thu, 30 Jun 2016 20:12:31 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id g62so204227816pfb.3 for ; Thu, 30 Jun 2016 17:12:31 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id ai12si908080pac.139.2016.06.30.17.12.29 for ; Thu, 30 Jun 2016 17:12:30 -0700 (PDT) Subject: [PATCH 4/6] mm: move flush in madvise_free_pte_range() From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:15 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001215.C0689717@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com, minchan@kernel.org From: Dave Hansen I think this code is OK and does not *need* to be patched. We are just rewriting the PTE without the Accessed and Dirty bits. The hardware could come along and set them at any time with or without the erratum that this series addresses But this does make the ptep_get_and_clear_full() and tlb_remove_tlb_entry() calls here more consistent with the other places they are used together and look *obviously* the same between call-sites. Signed-off-by: Dave Hansen Cc: Minchan Kim --- b/mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush mm/madvise.c --- a/mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush 2016-06-30 17:10:42.557246755 -0700 +++ b/mm/madvise.c 2016-06-30 17:10:42.561246936 -0700 @@ -369,13 +369,13 @@ static int madvise_free_pte_range(pmd_t */ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + tlb_remove_tlb_entry(tlb, pte, addr); ptent = pte_mkold(ptent); ptent = pte_mkclean(ptent); set_pte_at(mm, addr, pte, ptent); if (PageActive(page)) deactivate_page(page); - tlb_remove_tlb_entry(tlb, pte, addr); } } out: _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 3BE4B6B0260 for ; Thu, 30 Jun 2016 20:12:34 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id g62so204229962pfb.3 for ; Thu, 30 Jun 2016 17:12:34 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id ai12si908080pac.139.2016.06.30.17.12.30 for ; Thu, 30 Jun 2016 17:12:30 -0700 (PDT) Subject: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:18 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001218.3D316260@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights Landing) has an erratum where a processor thread setting the Accessed or Dirty bits may not do so atomically against its checks for the Present bit. This may cause a thread (which is about to page fault) to set A and/or D, even though the Present bit had already been atomically cleared. If the PTE is used for storing a swap index or a NUMA migration index, the A bit could be misinterpreted as part of the swap type. The stray bits being set cause a software-cleared PTE to be interpreted as a swap entry. In some cases (like when the swap index ends up being for a non-existent swapfile), the kernel detects the stray value and WARN()s about it, but there is no guarantee that the kernel can always detect it. There are basically three things we have to do to work around this erratum: 1. Extra TLB flushes when we clear PTEs that might be affected by this erratum. 2. Extra pass back over the suspect PTEs after the TLBs have been flushed to clear stray bits. 3. Make sure to hold ptl in pte_unmap_same() (underneath do_swap_page()) to keep the swap code from observing the bad entries between #1 and #2. Notes: * The little pte++ in zap_pte_range() is to ensure that 'pte' points _past_ the last PTE that was cleared so that the whole range can be cleaned up. * We could do more of the new arch_*() helpers inside the existing TLB flush callers if we passed the old 'ptent' in as an argument. That would require a more invasive rework, though. * change_pte_range() does not need to be modified. It fully writes back over the PTE after clearing it. It also does this inside the ptl, so the cleared PTE potentially containing stray bits is never visible to anyone. * move_ptes() does remote TLB flushes after each PTE clear. This is slow, but mremap() is not as important as munmap() Leave it simple for now. * could apply A/D optimization to huge pages too * As far as I can tell, sites that just change PTE permissions are OK. They generally do some variant of: ptent = ptep_get_and_clear(...); ptent = pte_mksomething(ptent); set_pte_at(mm, addr, pte, ptent); tlb_flush...(); This is OK because the cleared PTE (which might contain the stray bits) is written over by set_pte_at() and this all happens under the pagetable lock. Examples of this: * madvise_free_pte_range() * hugetlb_change_protection() * clear_soft_dirty() * move_ptes() - new PTE will not fault so will not hit erratum * change_pte_range() Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 1 b/arch/ia64/include/asm/tlb.h | 1 b/arch/s390/include/asm/tlb.h | 1 b/arch/sh/include/asm/tlb.h | 1 b/arch/um/include/asm/tlb.h | 2 b/arch/x86/include/asm/cpufeatures.h | 1 b/arch/x86/include/asm/pgtable.h | 32 +++++++++ b/arch/x86/include/asm/tlbflush.h | 37 +++++++++++ b/arch/x86/kernel/cpu/intel.c | 113 +++++++++++++++++++++++++++++++++++ b/include/asm-generic/tlb.h | 4 + b/include/linux/mm.h | 17 +++++ b/mm/memory.c | 12 ++- b/mm/mremap.c | 9 ++ b/mm/rmap.c | 4 + b/mm/vmalloc.c | 1 15 files changed, 231 insertions(+), 5 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.455287496 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:43.483288766 -0700 @@ -264,6 +264,7 @@ tlb_remove_pmd_tlb_entry(struct mmu_gath #define pud_free_tlb(tlb, pudp, addr) pud_free((tlb)->mm, pudp) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #endif /* CONFIG_MMU */ #endif diff -puN arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.457287587 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -254,6 +254,7 @@ __tlb_remove_tlb_entry (struct mmu_gathe } #define tlb_migrate_finish(mm) platform_tlb_migrate_finish(mm) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) diff -puN arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.460287723 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -146,5 +146,6 @@ static inline void pud_free_tlb(struct m #define tlb_remove_tlb_entry(tlb, ptep, addr) do { } while (0) #define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr) do { } while (0) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #endif /* _S390_TLB_H */ diff -puN arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.461287768 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -116,6 +116,7 @@ static inline void tlb_remove_page(struc #define pud_free_tlb(tlb, pudp, addr) pud_free((tlb)->mm, pudp) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #if defined(CONFIG_CPU_SH4) || defined(CONFIG_SUPERH64) extern void tlb_wire_entry(struct vm_area_struct *, unsigned long, pte_t); diff -puN arch/um/include/asm/tlb.h~knl-leak-60-actual-fix arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.463287859 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:43.485288857 -0700 @@ -133,4 +133,6 @@ static inline void tlb_remove_page(struc #define tlb_migrate_finish(mm) do {} while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) + #endif diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix arch/x86/include/asm/cpufeatures.h --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.464287904 -0700 +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:43.485288857 -0700 @@ -310,5 +310,6 @@ #endif #define X86_BUG_NULL_SEG X86_BUG(10) /* Nulling a selector preserves the base */ #define X86_BUG_SWAPGS_FENCE X86_BUG(11) /* SWAPGS without input dep on GS */ +#define X86_BUG_PTE_LEAK X86_BUG(12) /* PTE may leak A/D bits after clear */ #endif /* _ASM_X86_CPUFEATURES_H */ diff -puN arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.466287995 -0700 +++ b/arch/x86/include/asm/pgtable.h 2016-06-30 17:10:43.486288902 -0700 @@ -794,6 +794,12 @@ extern int ptep_test_and_clear_young(str extern int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep); +#ifdef CONFIG_CPU_SUP_INTEL +#define __HAVE_ARCH_PTEP_CLEAR_FLUSH +extern pte_t ptep_clear_flush(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep); +#endif + #define __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -956,6 +962,32 @@ static inline u16 pte_flags_pkey(unsigne #endif } +#ifdef CONFIG_CPU_SUP_INTEL +/* + * These are all specific to working around an Intel-specific + * bug and the out-of-line code is all defined in intel.c. + */ +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep); +#define ARCH_HAS_FIX_PTE_LEAK 1 +static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) +{ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) + fix_pte_leak(mm, addr, ptep); +} +#define ARCH_HAS_NEEDS_SWAP_PTL 1 +static inline bool arch_needs_swap_ptl(void) +{ + return static_cpu_has_bug(X86_BUG_PTE_LEAK); +} +#define ARCH_DISABLE_DEFERRED_FLUSH 1 +static inline bool arch_disable_deferred_flush(void) +{ + return static_cpu_has_bug(X86_BUG_PTE_LEAK); +} +#endif /* CONFIG_CPU_SUP_INTEL */ + #include #endif /* __ASSEMBLY__ */ diff -puN arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix arch/x86/include/asm/tlbflush.h --- a/arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.468288086 -0700 +++ b/arch/x86/include/asm/tlbflush.h 2016-06-30 17:10:43.486288902 -0700 @@ -324,4 +324,41 @@ static inline void reset_lazy_tlbstate(v native_flush_tlb_others(mask, mm, start, end) #endif +static inline bool intel_detect_leaked_pte(pte_t ptent) +{ + /* + * Hardware sets the dirty bit only on writable ptes and + * only on ptes where dirty is unset. + */ + if (pte_write(ptent) && !pte_dirty(ptent)) + return true; + /* + * The leak occurs when the hardware sees a unset A/D bit + * and tries to set it. If the PTE has both A/D bits + * set, then the hardware will not be going to try to set + * it and we have no chance for a leak. + */ + if (!pte_young(ptent)) + return true; + + return false; +} + +#ifdef CONFIG_CPU_SUP_INTEL +void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep, + pte_t *end_ptep); +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do { \ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) \ + intel_cleanup_pte_range(tlb, start_ptep, end_ptep); \ +} while (0) +#define ARCH_FLUSH_CLEARED_PTE 1 +#define arch_flush_cleared_pte(tlb, ptent) do { \ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK) && \ + intel_detect_leaked_pte(ptent)) { \ + tlb->saw_unset_a_or_d = 1; \ + tlb->force_batch_flush = 1; \ + } \ +} while (0) +#endif /* CONFIG_CPU_SUP_INTEL */ + #endif /* _ASM_X86_TLBFLUSH_H */ diff -puN arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix arch/x86/kernel/cpu/intel.c --- a/arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.469288131 -0700 +++ b/arch/x86/kernel/cpu/intel.c 2016-06-30 17:10:43.487288948 -0700 @@ -9,10 +9,14 @@ #include #include +#include #include #include #include #include +#include + +#include #ifdef CONFIG_X86_64 #include @@ -181,6 +185,11 @@ static void early_init_intel(struct cpui } } + if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) { + pr_info_once("x86/intel: Enabling PTE leaking workaround\n"); + set_cpu_bug(c, X86_BUG_PTE_LEAK); + } + /* * Intel Quark Core DevMan_001.pdf section 6.4.11 * "The operating system also is required to invalidate (i.e., flush) @@ -820,3 +829,107 @@ static const struct cpu_dev intel_cpu_de cpu_dev_register(intel_cpu_dev); +/* + * Workaround for KNL issue: + * + * A thread that is going to page fault due to P=0, may still + * non atomically set A or D bits, which could corrupt swap entries. + * Always flush the other CPUs and clear the PTE again to avoid + * this leakage. We are excluded using the pagetable lock. + * + * This only needs to be called on processors that might "leak" + * A or D bits and have X86_BUG_PTE_LEAK set. + */ +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep) +{ + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) { + flush_tlb_others(mm_cpumask(mm), mm, addr, + addr + PAGE_SIZE); + set_pte(ptep, __pte(0)); + } +} + +/* + * We batched a bunch of PTE clears up. After the TLB has been + * flushed for the whole batch, we might have had some leaked + * A or D bits and need to clear them here. + * + * This should be called with the page table lock still held. + * + * This only needs to be called on processors that might "leak" + * A or D bits and have X86_BUG_PTE_LEAK set. + */ +void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep, + pte_t *end_ptep) +{ + pte_t *pte; + + /* + * fullmm means nobody will care that we have leaked bits + * laying around. We also skip TLB flushes when doing + * fullmm teardown, so the additional pte clearing would + * not help the issue. + */ + if (tlb->fullmm) + return; + + /* + * If none of the PTEs hit inside intel_detect_leaked_pte(), + * then we have nothing that might have been leaked and + * nothing to clear. + */ + if (!tlb->saw_unset_a_or_d) + return; + + /* + * Contexts calling us with NULL ptep's do not have any + * PTEs for us to go clear because they did not do any + * actual TLB invalidation. + */ + if (!start_ptep || !end_ptep) + return; + + /* + * Mark that the workaround is no longer needed for + * this batch. + */ + tlb->saw_unset_a_or_d = 0; + + /* + * Ensure that the compiler orders our set_pte() + * after the preceding TLB flush no matter what. + */ + barrier(); + + /* + * Re-clear out all the PTEs into which the hardware + * may have leaked Accessed or Dirty bits. + */ + for (pte = start_ptep; pte < end_ptep; pte++) + set_pte(pte, __pte(0)); +} + +/* + * Kinda weird to define this in here, but we only use it for + * an Intel-specific issue. This will get used on all + * processors (even non-Intel) if CONFIG_CPU_SUP_INTEL=y. + */ +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, + pte_t *ptep) +{ + struct mm_struct *mm = vma->vm_mm; + pte_t pte; + + pte = ptep_get_and_clear(mm, address, ptep); + if (pte_accessible(mm, pte)) { + flush_tlb_page(vma, address); + /* + * Ensure that the compiler orders our set_pte() + * after the flush_tlb_page() no matter what. + */ + barrier(); + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) + set_pte(ptep, __pte(0)); + } + return pte; +} diff -puN include/asm-generic/tlb.h~knl-leak-60-actual-fix include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.471288222 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:43.489289039 -0700 @@ -226,4 +226,8 @@ static inline void __tlb_reset_range(str #define tlb_migrate_finish(mm) do {} while (0) +#ifndef __tlb_cleanup_pte_range +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) +#endif + #endif /* _ASM_GENERIC__TLB_H */ diff -puN include/linux/mm.h~knl-leak-60-actual-fix include/linux/mm.h --- a/include/linux/mm.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.472288267 -0700 +++ b/include/linux/mm.h 2016-06-30 17:10:43.492289175 -0700 @@ -2404,6 +2404,23 @@ static inline bool debug_guardpage_enabl static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +#ifndef ARCH_HAS_NEEDS_SWAP_PTL +static inline bool arch_needs_swap_ptl(void) { return false; } +#endif + +#ifndef ARCH_HAS_FIX_PTE_LEAK +static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) {} +#endif + +#ifndef ARCH_DISABLE_DEFERRED_FLUSH +static inline bool arch_disable_deferred_flush(void) { return false; } +#endif + +#ifndef ARCH_FLUSH_CLEARED_PTE +static inline void arch_flush_cleared_pte(struct mmu_gather *tlb, pte_t ptent) {} +#endif + #if MAX_NUMNODES > 1 void __init setup_nr_node_ids(void); #else diff -puN mm/memory.c~knl-leak-60-actual-fix mm/memory.c --- a/mm/memory.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.474288358 -0700 +++ b/mm/memory.c 2016-06-30 17:10:43.496289356 -0700 @@ -1141,6 +1141,7 @@ again: ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); + arch_flush_cleared_pte(tlb, ptent); if (unlikely(!page)) continue; @@ -1166,6 +1167,7 @@ again: if (unlikely(!__tlb_remove_page(tlb, page))) { tlb->force_batch_flush = 1; addr += PAGE_SIZE; + pte++; break; } continue; @@ -1192,8 +1194,11 @@ again: arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (tlb->force_batch_flush) - tlb_flush_mmu_tlbonly(tlb); + if (tlb->force_batch_flush) { + bool did_flush = tlb_flush_mmu_tlbonly(tlb); + if (did_flush) + __tlb_cleanup_pte_range(tlb, start_pte, pte); + } pte_unmap_unlock(start_pte, ptl); /* @@ -1965,7 +1970,8 @@ static inline int pte_unmap_same(struct { int same = 1; #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT) - if (sizeof(pte_t) > sizeof(unsigned long)) { + if (unlikely(arch_needs_swap_ptl() || + sizeof(pte_t) > sizeof(unsigned long))) { spinlock_t *ptl = pte_lockptr(mm, pmd); spin_lock(ptl); same = pte_same(*page_table, orig_pte); diff -puN mm/mremap.c~knl-leak-60-actual-fix mm/mremap.c --- a/mm/mremap.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.476288449 -0700 +++ b/mm/mremap.c 2016-06-30 17:10:43.497289401 -0700 @@ -24,6 +24,7 @@ #include #include +#include #include #include "internal.h" @@ -144,10 +145,14 @@ static void move_ptes(struct vm_area_str for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { + pte_t old_ptent; + if (pte_none(*old_pte)) continue; - pte = ptep_get_and_clear(mm, old_addr, old_pte); - pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); + old_ptent = ptep_get_and_clear(mm, old_addr, old_pte); + pte = move_pte(old_ptent, new_vma->vm_page_prot, + old_addr, new_addr); + arch_fix_pte_leak(mm, old_addr, old_pte); pte = move_soft_dirty_pte(pte); set_pte_at(mm, new_addr, new_pte, pte); } diff -puN mm/rmap.c~knl-leak-60-actual-fix mm/rmap.c --- a/mm/rmap.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.478288539 -0700 +++ b/mm/rmap.c 2016-06-30 17:10:43.498289447 -0700 @@ -633,6 +633,10 @@ static bool should_defer_flush(struct mm { bool should_defer = false; + /* x86 may need an immediate flush after a pte clear */ + if (arch_disable_deferred_flush()) + return false; + if (!(flags & TTU_BATCH_FLUSH)) return false; diff -puN mm/vmalloc.c~knl-leak-60-actual-fix mm/vmalloc.c --- a/mm/vmalloc.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.479288585 -0700 +++ b/mm/vmalloc.c 2016-06-30 17:10:43.500289538 -0700 @@ -66,6 +66,7 @@ static void vunmap_pte_range(pmd_t *pmd, pte = pte_offset_kernel(pmd, addr); do { pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); + arch_fix_pte_leak(&init_mm, addr, pte); WARN_ON(!pte_none(ptent) && !pte_present(ptent)); } while (pte++, addr += PAGE_SIZE, addr != end); } _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 66DC46B0261 for ; Thu, 30 Jun 2016 20:12:36 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id 143so204384767pfx.0 for ; Thu, 30 Jun 2016 17:12:36 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id ai12si908080pac.139.2016.06.30.17.12.33 for ; Thu, 30 Jun 2016 17:12:33 -0700 (PDT) Subject: [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:16 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001216.9CFEC460@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen When batching TLB operations, we track the scope of the needed TLB invalidation inside the 'struct mmu_gather'. With mmu_gather->end, we indicate whether any operations required a flush. If the flush was not performed, then we know that no pte_present() PTEs were found and no workaround is needed. But, we do not know this unless tlb_flush_mmu_tlbonly() tells us whether it flushed or not. Have it return a bool to tell us. Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 3 ++- b/arch/ia64/include/asm/tlb.h | 3 ++- b/arch/s390/include/asm/tlb.h | 3 ++- b/arch/sh/include/asm/tlb.h | 1 + b/arch/um/include/asm/tlb.h | 3 ++- b/mm/memory.c | 5 +++-- 6 files changed, 12 insertions(+), 6 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.947264449 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:42.958264947 -0700 @@ -126,12 +126,13 @@ static inline void __tlb_alloc_page(stru } } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { tlb_flush(tlb); #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb_table_flush(tlb); #endif + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.948264494 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:42.959264993 -0700 @@ -219,9 +219,10 @@ static inline int __tlb_remove_page(stru return tlb->max - tlb->nr; } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { ia64_tlb_flush_mmu_tlbonly(tlb, tlb->start_addr, tlb->end_addr); + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.950264585 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -60,9 +60,10 @@ static inline void tlb_gather_mmu(struct tlb->batch = NULL; } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { __tlb_flush_mm_lazy(tlb->mm); + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.952264675 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -89,6 +89,7 @@ tlb_end_vma(struct mmu_gather *tlb, stru static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.953264721 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -59,10 +59,11 @@ tlb_gather_mmu(struct mmu_gather *tlb, s extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end); -static inline void +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end); + return true; } static inline void diff -puN mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly mm/memory.c --- a/mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.955264811 -0700 +++ b/mm/memory.c 2016-06-30 17:10:42.962265129 -0700 @@ -240,10 +240,10 @@ void tlb_gather_mmu(struct mmu_gather *t __tlb_reset_range(tlb); } -static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { if (!tlb->end) - return; + return false; tlb_flush(tlb); mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); @@ -251,6 +251,7 @@ static void tlb_flush_mmu_tlbonly(struct tlb_table_flush(tlb); #endif __tlb_reset_range(tlb); + return true; } static void tlb_flush_mmu_free(struct mmu_gather *tlb) _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 6E2556B0262 for ; Thu, 30 Jun 2016 20:12:38 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id e189so203853593pfa.2 for ; Thu, 30 Jun 2016 17:12:38 -0700 (PDT) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTP id pt9si922369pab.278.2016.06.30.17.12.34 for ; Thu, 30 Jun 2016 17:12:34 -0700 (PDT) Subject: [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:12 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001212.3001F812@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Add a field (->saw_unset_a_or_d) to the asm-generic version of mmu_gather. We will use this on x86 to indicate when a PTE got cleared that might potentially have a stray Accessed or Dirty bit set. Note that since ->saw_unset_a_or_d shares space in a bitfield with ->fullmm and ->need_flush_all, there's no incremental storage cost. In addition, since it is initialized to 0 like ->need_flush_all, they can likely be initialized together, leading to no real cost for having ->saw_unset_a_or_d around. Signed-off-by: Dave Hansen --- b/include/asm-generic/tlb.h | 7 ++++++- b/mm/memory.c | 6 ++++-- 2 files changed, 10 insertions(+), 3 deletions(-) diff -puN include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d 2016-06-30 17:10:41.606203608 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:41.611203835 -0700 @@ -101,7 +101,12 @@ struct mmu_gather { unsigned int fullmm : 1, /* we have performed an operation which * requires a complete flush of the tlb */ - need_flush_all : 1; + need_flush_all : 1, + /* we cleared a PTE bit which may potentially + * get set by hardware */ + saw_unset_a_or_d: 1; + + struct mmu_gather_batch *active; struct mmu_gather_batch local; diff -puN mm/memory.c~knl-leak-20-saw_unset_a_or_d mm/memory.c --- a/mm/memory.c~knl-leak-20-saw_unset_a_or_d 2016-06-30 17:10:41.607203654 -0700 +++ b/mm/memory.c 2016-06-30 17:10:41.614203971 -0700 @@ -222,8 +222,10 @@ void tlb_gather_mmu(struct mmu_gather *t tlb->mm = mm; /* Is it from 0 to ~0? */ - tlb->fullmm = !(start | (end+1)); - tlb->need_flush_all = 0; + tlb->fullmm = !(start | (end+1)); + tlb->need_flush_all = 0; + tlb->saw_unset_a_or_d = 0; + tlb->local.next = NULL; tlb->local.nr = 0; tlb->local.max = ARRAY_SIZE(tlb->__pages); _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 964106B0005 for ; Thu, 30 Jun 2016 21:50:24 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id g62so208141558pfb.3 for ; Thu, 30 Jun 2016 18:50:24 -0700 (PDT) Received: from mail-pa0-x244.google.com (mail-pa0-x244.google.com. [2607:f8b0:400e:c03::244]) by mx.google.com with ESMTPS id j18si1435299pfk.49.2016.06.30.18.50.23 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 18:50:23 -0700 (PDT) Received: by mail-pa0-x244.google.com with SMTP id hf6so8471083pac.2 for ; Thu, 30 Jun 2016 18:50:23 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs From: Nadav Amit In-Reply-To: <20160701001218.3D316260@viggo.jf.intel.com> Date: Thu, 30 Jun 2016 18:50:21 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: LKML , x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, dave.hansen@linux.intel.com Dave Hansen wrote: > +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long = address, > + pte_t *ptep) > +{ > + struct mm_struct *mm =3D vma->vm_mm; > + pte_t pte; > + > + pte =3D ptep_get_and_clear(mm, address, ptep); > + if (pte_accessible(mm, pte)) { > + flush_tlb_page(vma, address); > + /* > + * Ensure that the compiler orders our set_pte() > + * after the flush_tlb_page() no matter what. > + */ > + barrier(); I don=E2=80=99t think such a barrier (after remote TLB flush) is needed. Eventually, if a remote flush takes place, you get csd_lock_wait() to be called, and then smp_rmb() is called (which is essentially a barrier() call on x86). Regards, Nadav -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f69.google.com (mail-pa0-f69.google.com [209.85.220.69]) by kanga.kvack.org (Postfix) with ESMTP id A03306B0005 for ; Thu, 30 Jun 2016 21:55:02 -0400 (EDT) Received: by mail-pa0-f69.google.com with SMTP id he1so178567463pac.0 for ; Thu, 30 Jun 2016 18:55:02 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id g70si1054781pfb.119.2016.06.30.18.55.01 for ; Thu, 30 Jun 2016 18:55:01 -0700 (PDT) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> From: Dave Hansen Message-ID: <5775CD73.2070809@sr71.net> Date: Thu, 30 Jun 2016 18:54:59 -0700 MIME-Version: 1.0 In-Reply-To: <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: LKML , x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, dave.hansen@linux.intel.com On 06/30/2016 06:50 PM, Nadav Amit wrote: > Dave Hansen wrote: >> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, >> + pte_t *ptep) >> +{ >> + struct mm_struct *mm = vma->vm_mm; >> + pte_t pte; >> + >> + pte = ptep_get_and_clear(mm, address, ptep); >> + if (pte_accessible(mm, pte)) { >> + flush_tlb_page(vma, address); >> + /* >> + * Ensure that the compiler orders our set_pte() >> + * after the flush_tlb_page() no matter what. >> + */ >> + barrier(); > > I dona??t think such a barrier (after remote TLB flush) is needed. > Eventually, if a remote flush takes place, you get csd_lock_wait() to be > called, and then smp_rmb() is called (which is essentially a barrier() > call on x86). Andi really wanted to make sure this got in here. He said there was a bug that bit him really badly once where a function got reordered. Granted, a call _should_ be sufficient to keep the compiler from reordering things, but this makes double sure. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 3DEAC6B0005 for ; Thu, 30 Jun 2016 22:56:00 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id d132so104892007oig.0 for ; Thu, 30 Jun 2016 19:56:00 -0700 (PDT) Received: from mail-oi0-x22c.google.com (mail-oi0-x22c.google.com. [2607:f8b0:4003:c06::22c]) by mx.google.com with ESMTPS id 194si567229oie.137.2016.06.30.19.55.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 19:55:59 -0700 (PDT) Received: by mail-oi0-x22c.google.com with SMTP id r2so93462370oih.2 for ; Thu, 30 Jun 2016 19:55:59 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20160701001218.3D316260@viggo.jf.intel.com> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Linus Torvalds Date: Thu, 30 Jun 2016 19:55:58 -0700 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: > > From: Dave Hansen > > The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights > Landing) has an erratum where a processor thread setting the Accessed > or Dirty bits may not do so atomically against its checks for the > Present bit. This may cause a thread (which is about to page fault) > to set A and/or D, even though the Present bit had already been > atomically cleared. So I don't think your approach is wrong, but I suspect this is overkill, and what we should instead just do is to not use the A/D bits at all in the swap representation. The swap-entry representation was a bit tight on 32-bit page table entries, but in 64-bit ones, I think we have tons of bits, don't we? So we could decide just to not use those two bits on x86. It's not like anybody will ever care about 32-bit page tables on Knights Landing anyway. So rather than add this kind of complexity and worry, how about just simplifying the problem? Or was there some discussion or implication I missed? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id BEF936B0005 for ; Thu, 30 Jun 2016 23:06:57 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id f6so17127303ith.1 for ; Thu, 30 Jun 2016 20:06:57 -0700 (PDT) Received: from mail-oi0-x230.google.com (mail-oi0-x230.google.com. [2607:f8b0:4003:c06::230]) by mx.google.com with ESMTPS id p137si573611oic.252.2016.06.30.20.06.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 20:06:57 -0700 (PDT) Received: by mail-oi0-x230.google.com with SMTP id s66so93640343oif.1 for ; Thu, 30 Jun 2016 20:06:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Brian Gerst Date: Thu, 30 Jun 2016 23:06:55 -0400 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On Thu, Jun 30, 2016 at 10:55 PM, Linus Torvalds wrote: > On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: >> >> From: Dave Hansen >> >> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights >> Landing) has an erratum where a processor thread setting the Accessed >> or Dirty bits may not do so atomically against its checks for the >> Present bit. This may cause a thread (which is about to page fault) >> to set A and/or D, even though the Present bit had already been >> atomically cleared. > > So I don't think your approach is wrong, but I suspect this is > overkill, and what we should instead just do is to not use the A/D > bits at all in the swap representation. > > The swap-entry representation was a bit tight on 32-bit page table > entries, but in 64-bit ones, I think we have tons of bits, don't we? > So we could decide just to not use those two bits on x86. > > It's not like anybody will ever care about 32-bit page tables on > Knights Landing anyway. Could this affect a 32-bit guest VM? -- Brian Gerst -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f197.google.com (mail-io0-f197.google.com [209.85.223.197]) by kanga.kvack.org (Postfix) with ESMTP id 8624C6B0005 for ; Thu, 30 Jun 2016 23:21:16 -0400 (EDT) Received: by mail-io0-f197.google.com with SMTP id s63so197153879ioi.1 for ; Thu, 30 Jun 2016 20:21:16 -0700 (PDT) Received: from mail-oi0-x232.google.com (mail-oi0-x232.google.com. [2607:f8b0:4003:c06::232]) by mx.google.com with ESMTPS id m15si593792oik.250.2016.06.30.20.21.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 20:21:15 -0700 (PDT) Received: by mail-oi0-x232.google.com with SMTP id f189so93986018oig.3 for ; Thu, 30 Jun 2016 20:21:15 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> Date: Thu, 30 Jun 2016 20:21:15 -0700 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs From: Linus Torvalds Content-Type: multipart/alternative; boundary=001a113defaab627c405368a7ad3 Sender: owner-linux-mm@kvack.org List-ID: To: Brian Gerst Cc: Dave Hansen , Michal Hocko , Borislav Petkov , Andrew Morton , the arch/x86 maintainers , Linux Kernel Mailing List , Andi Kleen , linux-mm , Dave Hansen --001a113defaab627c405368a7ad3 Content-Type: text/plain; charset=UTF-8 On Jun 30, 2016 8:06 PM, "Brian Gerst" wrote: > > Could this affect a 32-bit guest VM? Even on 32-bit, all the distros do PAE to get NX and access to more physical memory, and that has a 64-bit page table entry. The 32-bit page table case is pretty unusual. Linus --001a113defaab627c405368a7ad3 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


On Jun 30, 2016 8:06 PM, "Brian Gerst" <brgerst@gmail.com> wrote:
>
> Could this affect a 32-bit guest VM?

Even on 32-bit, all the distros do PAE to get NX and access = to more physical memory, and that has a 64-bit page table entry.

The 32-bit page table case is pretty unusual.

=C2=A0=C2=A0=C2=A0=C2=A0 Linus

--001a113defaab627c405368a7ad3-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id F11436B0253 for ; Fri, 1 Jul 2016 00:39:54 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id 143so214447628pfx.0 for ; Thu, 30 Jun 2016 21:39:54 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id kb6si2093146pab.202.2016.06.30.21.39.54 for ; Thu, 30 Jun 2016 21:39:54 -0700 (PDT) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Dave Hansen Message-ID: <5775F418.2000803@sr71.net> Date: Thu, 30 Jun 2016 21:39:52 -0700 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On 06/30/2016 07:55 PM, Linus Torvalds wrote: > On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: >> From: Dave Hansen >> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights >> Landing) has an erratum where a processor thread setting the Accessed >> or Dirty bits may not do so atomically against its checks for the >> Present bit. This may cause a thread (which is about to page fault) >> to set A and/or D, even though the Present bit had already been >> atomically cleared. > > So I don't think your approach is wrong, but I suspect this is > overkill, and what we should instead just do is to not use the A/D > bits at all in the swap representation. We actually don't even use Dirty today. It's (implicitly) used to determine pte_none(), but it ends up being masked out for the swp_offset/type() calculations entirely, much to my surprise. I think what you suggest will work if we don't consider A/D in pte_none(). I think there are a bunch of code path where assume that !pte_present() && !pte_none() means swap. > The swap-entry representation was a bit tight on 32-bit page table > entries, but in 64-bit ones, I think we have tons of bits, don't we? > So we could decide just to not use those two bits on x86. Yeah, we've definitely got space. I'll go poke around and make sure that this works everywhere. I agree that throwing 32-bit non-PAE under the bus is definitely worth it here. Nobody will care about that in a million years. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f71.google.com (mail-oi0-f71.google.com [209.85.218.71]) by kanga.kvack.org (Postfix) with ESMTP id 2D3CE828E4 for ; Fri, 1 Jul 2016 01:43:11 -0400 (EDT) Received: by mail-oi0-f71.google.com with SMTP id u81so108894709oia.3 for ; Thu, 30 Jun 2016 22:43:11 -0700 (PDT) Received: from mail-oi0-x22d.google.com (mail-oi0-x22d.google.com. [2607:f8b0:4003:c06::22d]) by mx.google.com with ESMTPS id c54si831676otd.71.2016.06.30.22.43.10 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 22:43:10 -0700 (PDT) Received: by mail-oi0-x22d.google.com with SMTP id s66so97073074oif.1 for ; Thu, 30 Jun 2016 22:43:10 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <5775F418.2000803@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> From: Linus Torvalds Date: Thu, 30 Jun 2016 22:43:09 -0700 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: > > I think what you suggest will work if we don't consider A/D in > pte_none(). I think there are a bunch of code path where assume that > !pte_present() && !pte_none() means swap. Yeah, we would need to change pte_none() to mask off D/A, but I think that might be the only real change needed (other than making sure that we don't use the bits in the swap entries, I didn't look at that part at all) Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 458F46B0253 for ; Fri, 1 Jul 2016 10:37:38 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id e189so239443408pfa.2 for ; Fri, 01 Jul 2016 07:37:38 -0700 (PDT) Received: from out03.mta.xmission.com (out03.mta.xmission.com. [166.70.13.233]) by mx.google.com with ESMTPS id 68si4330382pff.189.2016.07.01.07.37.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 07:37:37 -0700 (PDT) From: ebiederm@xmission.com (Eric W. Biederman) References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> Date: Fri, 01 Jul 2016 09:25:10 -0500 In-Reply-To: (Linus Torvalds's message of "Thu, 30 Jun 2016 22:43:09 -0700") Message-ID: <874m89cu61.fsf@x220.int.ebiederm.org> MIME-Version: 1.0 Content-Type: text/plain Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Linus Torvalds writes: > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >> >> I think what you suggest will work if we don't consider A/D in >> pte_none(). I think there are a bunch of code path where assume that >> !pte_present() && !pte_none() means swap. > > Yeah, we would need to change pte_none() to mask off D/A, but I think > that might be the only real change needed (other than making sure that > we don't use the bits in the swap entries, I didn't look at that part > at all) It looks like __pte_to_swp_entry also needs to be changed to mask out those bits when the swap code reads pte entries. For all of the same reasons as pte_none. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 30C0682958 for ; Fri, 1 Jul 2016 11:51:39 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id g62so243607113pfb.3 for ; Fri, 01 Jul 2016 08:51:39 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id d20si4507602pfk.240.2016.07.01.08.51.38 for ; Fri, 01 Jul 2016 08:51:38 -0700 (PDT) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <874m89cu61.fsf@x220.int.ebiederm.org> From: Dave Hansen Message-ID: <57769188.9060708@sr71.net> Date: Fri, 1 Jul 2016 08:51:36 -0700 MIME-Version: 1.0 In-Reply-To: <874m89cu61.fsf@x220.int.ebiederm.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Eric W. Biederman" , Linus Torvalds Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On 07/01/2016 07:25 AM, Eric W. Biederman wrote: > Linus Torvalds writes: >> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >>> >> >>> >> I think what you suggest will work if we don't consider A/D in >>> >> pte_none(). I think there are a bunch of code path where assume that >>> >> !pte_present() && !pte_none() means swap. >> > >> > Yeah, we would need to change pte_none() to mask off D/A, but I think >> > that might be the only real change needed (other than making sure that >> > we don't use the bits in the swap entries, I didn't look at that part >> > at all) > It looks like __pte_to_swp_entry also needs to be changed to mask out > those bits when the swap code reads pte entries. For all of the same > reasons as pte_none. I guess that would be nice, but isn't it redundant? static inline swp_entry_t pte_to_swp_entry(pte_t pte) { ... arch_entry = __pte_to_swp_entry(pte); return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); } As long as __swp_type() and __swp_offset() don't let A/D through, then we should be OK. This site is the only call to __pte_to_swp_entry() that I can find in the entire codebase. Or am I missing something? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f197.google.com (mail-io0-f197.google.com [209.85.223.197]) by kanga.kvack.org (Postfix) with ESMTP id 89E0E82958 for ; Fri, 1 Jul 2016 12:07:12 -0400 (EDT) Received: by mail-io0-f197.google.com with SMTP id s63so226605106ioi.1 for ; Fri, 01 Jul 2016 09:07:12 -0700 (PDT) Received: from mail-oi0-x243.google.com (mail-oi0-x243.google.com. [2607:f8b0:4003:c06::243]) by mx.google.com with ESMTPS id t13si2013200ott.132.2016.07.01.09.07.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 09:07:11 -0700 (PDT) Received: by mail-oi0-x243.google.com with SMTP id d132so13402441oig.2 for ; Fri, 01 Jul 2016 09:07:11 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <5775F418.2000803@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> From: Linus Torvalds Date: Fri, 1 Jul 2016 09:07:10 -0700 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: > > I think what you suggest will work if we don't consider A/D in > pte_none(). I think there are a bunch of code path where assume that > !pte_present() && !pte_none() means swap. Hmm. Thinking about it some more, I still think it's a good idea to avoid A/D bits in the swap entries and in pte_none() and friends, and it might simplify some of this all. But I also started worrying about us just losing sight of the dirty bit in particular. It's not enough that we ignore the dirty bit - we'd still want to make sure that the underlying backing page gets marked dirty, even if the CPU is buggy and ends doing it "delayed" after we've already unmapped the page. So I get this feeling that we may need a fair chunk of your patch-series anyway. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 296D16B026A for ; Fri, 1 Jul 2016 12:14:15 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id e189so244504525pfa.2 for ; Fri, 01 Jul 2016 09:14:15 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id tp2si4810481pab.52.2016.07.01.09.14.14 for ; Fri, 01 Jul 2016 09:14:14 -0700 (PDT) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> From: Dave Hansen Message-ID: <577696D5.2010609@sr71.net> Date: Fri, 1 Jul 2016 09:14:13 -0700 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On 07/01/2016 09:07 AM, Linus Torvalds wrote: > But I also started worrying about us just losing sight of the dirty > bit in particular. It's not enough that we ignore the dirty bit - we'd > still want to make sure that the underlying backing page gets marked > dirty, even if the CPU is buggy and ends doing it "delayed" after > we've already unmapped the page. > > So I get this feeling that we may need a fair chunk of your > patch-series anyway. As I understand it, the erratum only affects a thread which is about to page fault. The write associated with the dirty bit being set never actually gets executed. So, the bit really *is* stray and isn't something we need to preserve. Otherwise, we'd be really screwed because we couldn't ever simply clear it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f72.google.com (mail-it0-f72.google.com [209.85.214.72]) by kanga.kvack.org (Postfix) with ESMTP id 1FB99828E4 for ; Fri, 1 Jul 2016 12:25:03 -0400 (EDT) Received: by mail-it0-f72.google.com with SMTP id j185so30861147ith.0 for ; Fri, 01 Jul 2016 09:25:03 -0700 (PDT) Received: from mail-oi0-x22e.google.com (mail-oi0-x22e.google.com. [2607:f8b0:4003:c06::22e]) by mx.google.com with ESMTPS id g84si2035077oib.116.2016.07.01.09.25.02 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 09:25:02 -0700 (PDT) Received: by mail-oi0-x22e.google.com with SMTP id s66so117401255oif.1 for ; Fri, 01 Jul 2016 09:25:02 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <577696D5.2010609@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <577696D5.2010609@sr71.net> From: Linus Torvalds Date: Fri, 1 Jul 2016 09:25:01 -0700 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On Fri, Jul 1, 2016 at 9:14 AM, Dave Hansen wrote: > > As I understand it, the erratum only affects a thread which is about to > page fault. The write associated with the dirty bit being set never > actually gets executed. So, the bit really *is* stray and isn't > something we need to preserve. Ok, good. > Otherwise, we'd be really screwed because we couldn't ever simply clear it. Oh, we could do the whole "clear the pte, then flush the tlb, then go back and clear the stale dirty bits and move them into the backing page". I was afraid we might have to do something like that. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f72.google.com (mail-vk0-f72.google.com [209.85.213.72]) by kanga.kvack.org (Postfix) with ESMTP id 899E3828E4 for ; Fri, 1 Jul 2016 12:30:58 -0400 (EDT) Received: by mail-vk0-f72.google.com with SMTP id j65so73978148vkg.3 for ; Fri, 01 Jul 2016 09:30:58 -0700 (PDT) Received: from mail-vk0-x231.google.com (mail-vk0-x231.google.com. [2607:f8b0:400c:c05::231]) by mx.google.com with ESMTPS id u142si1629310vkb.121.2016.07.01.09.30.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 09:30:57 -0700 (PDT) Received: by mail-vk0-x231.google.com with SMTP id m127so100234677vkb.3 for ; Fri, 01 Jul 2016 09:30:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20160701092300.GD4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> From: Andy Lutomirski Date: Fri, 1 Jul 2016 09:30:37 -0700 Message-ID: Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Michal Hocko , "linux-kernel@vger.kernel.org" , stable , Andi Kleen , "linux-mm@kvack.org" , Linus Torvalds , Andrew Morton , X86 ML , Dave Hansen , Dave Hansen On Jul 1, 2016 2:23 AM, "Borislav Petkov" wrote: > > On Thu, Jun 30, 2016 at 05:12:10PM -0700, Dave Hansen wrote: > > > > From: Dave Hansen > > > > epufeatures.h currently defines X86_BUG(9) twice on 32-bit: > > > > #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > > ... > > #ifdef CONFIG_X86_32 > > #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ > > #endif > > > > I think what happened was that this added the X86_BUG_ESPFIX, but > > in an #ifdef below most of the bugs: > > > > [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled > > > > Then this came along and added X86_BUG_NULL_SEG, but collided > > with the earlier one that did the bug below the main block > > defining all the X86_BUG()s. > > > > [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time > > > > Signed-off-by: Dave Hansen > > Acked-by: Andy Lutomirski > > Cc: stable@vger.kernel.org > > --- > > > > b/arch/x86/include/asm/cpufeatures.h | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h > > --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 > > +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 > > @@ -301,10 +301,6 @@ > > #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ > > #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ > > #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ > > -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > > -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ > > - > > - > > #ifdef CONFIG_X86_32 > > /* > > * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional > > So I'd remove the "#ifdef CONFIG_X86_32" ifdeffery too and make that bit > unconditional - so what, we have enough free bits. But I'd leave the > comment to still avoid the confusion :) > I put the ifdef there to prevent anyone from accidentally using it in a 64-bit code path, not to save a bit. We could put in the middle of the list to make the mistake much less likely to be repeated, I suppose. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 17C71828E1 for ; Fri, 1 Jul 2016 14:25:18 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id 143so251616152pfx.0 for ; Fri, 01 Jul 2016 11:25:18 -0700 (PDT) Received: from out01.mta.xmission.com (out01.mta.xmission.com. [166.70.13.231]) by mx.google.com with ESMTPS id hj10si5234457pac.0.2016.07.01.11.25.17 for (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 01 Jul 2016 11:25:17 -0700 (PDT) From: ebiederm@xmission.com (Eric W. Biederman) References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <874m89cu61.fsf@x220.int.ebiederm.org> <57769188.9060708@sr71.net> Date: Fri, 01 Jul 2016 13:12:55 -0500 In-Reply-To: <57769188.9060708@sr71.net> (Dave Hansen's message of "Fri, 1 Jul 2016 08:51:36 -0700") Message-ID: <878txlb520.fsf@x220.int.ebiederm.org> MIME-Version: 1.0 Content-Type: text/plain Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Linus Torvalds , Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Dave Hansen writes: > On 07/01/2016 07:25 AM, Eric W. Biederman wrote: >> Linus Torvalds writes: >>> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >>>> >> >>>> >> I think what you suggest will work if we don't consider A/D in >>>> >> pte_none(). I think there are a bunch of code path where assume that >>>> >> !pte_present() && !pte_none() means swap. >>> > >>> > Yeah, we would need to change pte_none() to mask off D/A, but I think >>> > that might be the only real change needed (other than making sure that >>> > we don't use the bits in the swap entries, I didn't look at that part >>> > at all) >> It looks like __pte_to_swp_entry also needs to be changed to mask out >> those bits when the swap code reads pte entries. For all of the same >> reasons as pte_none. > > I guess that would be nice, but isn't it redundant? > > static inline swp_entry_t pte_to_swp_entry(pte_t pte) > { > ... > arch_entry = __pte_to_swp_entry(pte); > return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); > } > > As long as __swp_type() and __swp_offset() don't let A/D through, then > we should be OK. This site is the only call to __pte_to_swp_entry() > that I can find in the entire codebase. > > Or am I missing something? Given that __pte_to_swp_entry on x86_64 is just __pte_val or pte.pte it does no filtering. Similarly __swp_type(arch_entry) is a >> and swp_entry(type, ...) is a << of what appears to be same amount for the swap type. So any corruption in the upper bits of the pte will be preserved as a swap type. In fact I strongly suspect that the compiler can optimize out all of the work done by "swp_entry(__swp_type(arch_entry), _swp_offset(arch_entry))". Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f198.google.com (mail-yw0-f198.google.com [209.85.161.198]) by kanga.kvack.org (Postfix) with ESMTP id 420EA6B0005 for ; Sun, 3 Jul 2016 10:36:51 -0400 (EDT) Received: by mail-yw0-f198.google.com with SMTP id i12so63430005ywa.0 for ; Sun, 03 Jul 2016 07:36:51 -0700 (PDT) Received: from mail-vk0-x22c.google.com (mail-vk0-x22c.google.com. [2607:f8b0:400c:c05::22c]) by mx.google.com with ESMTPS id 16si633464uae.134.2016.07.03.07.36.50 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 07:36:50 -0700 (PDT) Received: by mail-vk0-x22c.google.com with SMTP id k68so119179571vkb.0 for ; Sun, 03 Jul 2016 07:36:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20160701164656.GG4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> <20160701164656.GG4593@pd.tnic> From: Andy Lutomirski Date: Sun, 3 Jul 2016 07:36:30 -0700 Message-ID: Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Dave Hansen , Michal Hocko , X86 ML , "linux-mm@kvack.org" , stable , Andrew Morton , "linux-kernel@vger.kernel.org" , Dave Hansen , Linus Torvalds , Andi Kleen On Jul 1, 2016 9:47 AM, "Borislav Petkov" wrote: > > On Fri, Jul 01, 2016 at 09:30:37AM -0700, Andy Lutomirski wrote: > > I put the ifdef there to prevent anyone from accidentally using it in > > a 64-bit code path, not to save a bit. We could put in the middle of > > the list to make the mistake much less likely to be repeated, I > > suppose. > > Well, if someone does, someone will notice pretty soon, no? Dunno. ESPFIX was broken under KVM for years and no one notices. > > I just don't see the reason to worry but maybe I'm missing it. > > And we can call it X86_BUG_ESPFIX_X86_32 or so too... We could do that, too, I guess. But the current solution is only two extra lines of code. We could reorder the things so that it's in the middle instead of at the end, I suppose. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id BFC7A6B0005 for ; Sun, 3 Jul 2016 13:10:45 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id a69so347942873pfa.1 for ; Sun, 03 Jul 2016 10:10:45 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id b2si4679711pav.216.2016.07.03.10.10.42 for ; Sun, 03 Jul 2016 10:10:42 -0700 (PDT) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Dave Hansen Message-ID: <5779470F.8020205@sr71.net> Date: Sun, 3 Jul 2016 10:10:39 -0700 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Brian Gerst , Linus Torvalds Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen On 06/30/2016 08:06 PM, Brian Gerst wrote: >> > It's not like anybody will ever care about 32-bit page tables on >> > Knights Landing anyway. > Could this affect a 32-bit guest VM? This isn't about 32-bit *mode*. It's about using the the 32-bit 2-level _paging_ mode that supports only 4GB virtual and 4GB physical addresses. That mode also doesn't support the No-eXecute (NX) bit, which basically everyone needs today for its security benefits. Even the little Quark CPU supports PAE (64-bit page tables). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Borislav Petkov Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Date: Fri, 1 Jul 2016 11:23:01 +0200 Message-ID: <20160701092300.GD4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Return-path: Content-Disposition: inline In-Reply-To: <20160701001210.AA77B917@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org To: Dave Hansen Cc: linux-kernel@vger.kernel.org, x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, ak@linux.intel.com, mhocko@suse.com, dave.hansen@linux.intel.com, luto@kernel.org, stable@vger.kernel.org List-Id: linux-mm.kvack.org On Thu, Jun 30, 2016 at 05:12:10PM -0700, Dave Hansen wrote: > > From: Dave Hansen > > epufeatures.h currently defines X86_BUG(9) twice on 32-bit: > > #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > ... > #ifdef CONFIG_X86_32 > #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ > #endif > > I think what happened was that this added the X86_BUG_ESPFIX, but > in an #ifdef below most of the bugs: > > [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled > > Then this came along and added X86_BUG_NULL_SEG, but collided > with the earlier one that did the bug below the main block > defining all the X86_BUG()s. > > [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time > > Signed-off-by: Dave Hansen > Acked-by: Andy Lutomirski > Cc: stable@vger.kernel.org > --- > > b/arch/x86/include/asm/cpufeatures.h | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h > --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 > +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 > @@ -301,10 +301,6 @@ > #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ > #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ > #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ > -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ > - > - > #ifdef CONFIG_X86_32 > /* > * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional So I'd remove the "#ifdef CONFIG_X86_32" ifdeffery too and make that bit unconditional - so what, we have enough free bits. But I'd leave the comment to still avoid the confusion :) -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Borislav Petkov Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Date: Fri, 1 Jul 2016 18:46:56 +0200 Message-ID: <20160701164656.GG4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Return-path: Content-Disposition: inline In-Reply-To: Sender: stable-owner@vger.kernel.org To: Andy Lutomirski Cc: Michal Hocko , "linux-kernel@vger.kernel.org" , stable , Andi Kleen , "linux-mm@kvack.org" , Linus Torvalds , Andrew Morton , X86 ML , Dave Hansen , Dave Hansen List-Id: linux-mm.kvack.org On Fri, Jul 01, 2016 at 09:30:37AM -0700, Andy Lutomirski wrote: > I put the ifdef there to prevent anyone from accidentally using it in > a 64-bit code path, not to save a bit. We could put in the middle of > the list to make the mistake much less likely to be repeated, I > suppose. Well, if someone does, someone will notice pretty soon, no? I just don't see the reason to worry but maybe I'm missing it. And we can call it X86_BUG_ESPFIX_X86_32 or so too... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Borislav Petkov Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Date: Sun, 3 Jul 2016 20:44:18 +0200 Message-ID: <20160703184418.GC1781@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> <20160701164656.GG4593@pd.tnic> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Return-path: Content-Disposition: inline In-Reply-To: Sender: stable-owner@vger.kernel.org To: Andy Lutomirski Cc: Dave Hansen , Michal Hocko , X86 ML , "linux-mm@kvack.org" , stable , Andrew Morton , "linux-kernel@vger.kernel.org" , Dave Hansen , Linus Torvalds , Andi Kleen List-Id: linux-mm.kvack.org On Sun, Jul 03, 2016 at 07:36:30AM -0700, Andy Lutomirski wrote: > Dunno. ESPFIX was broken under KVM for years and no one notices. Ah, so this really was the case already. :-\ > We could do that, too, I guess. But the current solution is only two > extra lines of code. We could reorder the things so that it's in the > middle instead of at the end, I suppose. Yeah, sounds good. Especially if it was already broken - I missed that fact. Thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752025AbcGAANO (ORCPT ); Thu, 30 Jun 2016 20:13:14 -0400 Received: from mga04.intel.com ([192.55.52.120]:50021 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751432AbcGAANM (ORCPT ); Thu, 30 Jun 2016 20:13:12 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="838365145" Subject: [PATCH 4/6] mm: move flush in madvise_free_pte_range() To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com, minchan@kernel.org From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:15 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001215.C0689717@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen I think this code is OK and does not *need* to be patched. We are just rewriting the PTE without the Accessed and Dirty bits. The hardware could come along and set them at any time with or without the erratum that this series addresses But this does make the ptep_get_and_clear_full() and tlb_remove_tlb_entry() calls here more consistent with the other places they are used together and look *obviously* the same between call-sites. Signed-off-by: Dave Hansen Cc: Minchan Kim --- b/mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush mm/madvise.c --- a/mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush 2016-06-30 17:10:42.557246755 -0700 +++ b/mm/madvise.c 2016-06-30 17:10:42.561246936 -0700 @@ -369,13 +369,13 @@ static int madvise_free_pte_range(pmd_t */ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + tlb_remove_tlb_entry(tlb, pte, addr); ptent = pte_mkold(ptent); ptent = pte_mkclean(ptent); set_pte_at(mm, addr, pte, ptent); if (PageActive(page)) deactivate_page(page); - tlb_remove_tlb_entry(tlb, pte, addr); } } out: _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752284AbcGAANR (ORCPT ); Thu, 30 Jun 2016 20:13:17 -0400 Received: from mga09.intel.com ([134.134.136.24]:24761 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750924AbcGAANM (ORCPT ); Thu, 30 Jun 2016 20:13:12 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="131603401" Subject: [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:12 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001212.3001F812@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen Add a field (->saw_unset_a_or_d) to the asm-generic version of mmu_gather. We will use this on x86 to indicate when a PTE got cleared that might potentially have a stray Accessed or Dirty bit set. Note that since ->saw_unset_a_or_d shares space in a bitfield with ->fullmm and ->need_flush_all, there's no incremental storage cost. In addition, since it is initialized to 0 like ->need_flush_all, they can likely be initialized together, leading to no real cost for having ->saw_unset_a_or_d around. Signed-off-by: Dave Hansen --- b/include/asm-generic/tlb.h | 7 ++++++- b/mm/memory.c | 6 ++++-- 2 files changed, 10 insertions(+), 3 deletions(-) diff -puN include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d 2016-06-30 17:10:41.606203608 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:41.611203835 -0700 @@ -101,7 +101,12 @@ struct mmu_gather { unsigned int fullmm : 1, /* we have performed an operation which * requires a complete flush of the tlb */ - need_flush_all : 1; + need_flush_all : 1, + /* we cleared a PTE bit which may potentially + * get set by hardware */ + saw_unset_a_or_d: 1; + + struct mmu_gather_batch *active; struct mmu_gather_batch local; diff -puN mm/memory.c~knl-leak-20-saw_unset_a_or_d mm/memory.c --- a/mm/memory.c~knl-leak-20-saw_unset_a_or_d 2016-06-30 17:10:41.607203654 -0700 +++ b/mm/memory.c 2016-06-30 17:10:41.614203971 -0700 @@ -222,8 +222,10 @@ void tlb_gather_mmu(struct mmu_gather *t tlb->mm = mm; /* Is it from 0 to ~0? */ - tlb->fullmm = !(start | (end+1)); - tlb->need_flush_all = 0; + tlb->fullmm = !(start | (end+1)); + tlb->need_flush_all = 0; + tlb->saw_unset_a_or_d = 0; + tlb->local.next = NULL; tlb->local.nr = 0; tlb->local.max = ARRAY_SIZE(tlb->__pages); _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752148AbcGAANP (ORCPT ); Thu, 30 Jun 2016 20:13:15 -0400 Received: from mga11.intel.com ([192.55.52.93]:6019 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751052AbcGAANM (ORCPT ); Thu, 30 Jun 2016 20:13:12 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="986740986" Subject: [PATCH 3/6] mm: add force_batch_flush to mmu_gather To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:14 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001214.94D8F14C@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen Currently, zap_pte_range() has a local variable called 'force_flush'. It is set when the zapping code runs in to an entry that requires flushing before the ptl is released. Currently, there are two reasons we might do that: 1. The TLB batching in __tlb_remove_page() has run out of space and can no longer record new pages being flushed. 2. An entry for a dirty page was flushed, and we need to ensure that software walking the page tables can not observe the cleared bit before we flush the TLB. We need the x86 code to be able to force a flush for an additional reason: if the PTE being cleared might have a stray Accessed or Dirty bit set on it because of a hardware erratum. For these purposes, we also need to flush before the ptl is released. So, we move the 'force_flush' variable into the mmu_gather structure where we can set it from arch code. Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 1 + b/arch/ia64/include/asm/tlb.h | 1 + b/arch/s390/include/asm/tlb.h | 1 + b/arch/sh/include/asm/tlb.h | 1 + b/arch/um/include/asm/tlb.h | 1 + b/include/asm-generic/tlb.h | 3 +++ b/mm/memory.c | 22 ++++++++++++---------- 7 files changed, 20 insertions(+), 10 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.021222437 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:42.037223163 -0700 @@ -69,6 +69,7 @@ struct mmu_gather { unsigned int need_flush; #endif unsigned int fullmm; + unsigned int force_batch_flush; struct vm_area_struct *vma; unsigned long start, end; unsigned long range_start; diff -puN arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.022222482 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:42.037223163 -0700 @@ -57,6 +57,7 @@ struct mmu_gather { unsigned int nr; unsigned int max; unsigned char fullmm; /* non-zero means full mm flush */ + unsigned int force_batch_flush; /* stop batching and flush */ unsigned char need_flush; /* really unmapped some PTEs? */ unsigned long start, end; unsigned long start_addr; diff -puN arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.024222573 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:42.038223208 -0700 @@ -32,6 +32,7 @@ struct mmu_gather { struct mm_struct *mm; struct mmu_table_batch *batch; unsigned int fullmm; + unsigned int force_batch_flush; /* stop batching and flush */ unsigned long start, end; }; diff -puN arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.025222618 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:42.038223208 -0700 @@ -21,6 +21,7 @@ struct mmu_gather { struct mm_struct *mm; unsigned int fullmm; + unsigned int force_batch_flush; /* stop batching and flush */ unsigned long start, end; }; diff -puN arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.027222709 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:42.039223253 -0700 @@ -20,6 +20,7 @@ struct mmu_gather { unsigned long start; unsigned long end; unsigned int fullmm; /* non-zero means full mm flush */ + unsigned int force_batch_flush; /* stop batching and flush */ }; static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, diff -puN include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.028222754 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:42.039223253 -0700 @@ -102,6 +102,9 @@ struct mmu_gather { /* we have performed an operation which * requires a complete flush of the tlb */ need_flush_all : 1, + /* need to flush the tlb and stop batching + * before we release ptl */ + force_batch_flush : 1, /* we cleared a PTE bit which may potentially * get set by hardware */ saw_unset_a_or_d: 1; diff -puN mm/memory.c~knl-leak-30-tlb_force_flush mm/memory.c --- a/mm/memory.c~knl-leak-30-tlb_force_flush 2016-06-30 17:10:42.031222890 -0700 +++ b/mm/memory.c 2016-06-30 17:10:42.040223299 -0700 @@ -225,6 +225,7 @@ void tlb_gather_mmu(struct mmu_gather *t tlb->fullmm = !(start | (end+1)); tlb->need_flush_all = 0; tlb->saw_unset_a_or_d = 0; + tlb->force_batch_flush = 0; tlb->local.next = NULL; tlb->local.nr = 0; @@ -1105,7 +1106,6 @@ static unsigned long zap_pte_range(struc struct zap_details *details) { struct mm_struct *mm = tlb->mm; - int force_flush = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; @@ -1151,7 +1151,7 @@ again: */ if (unlikely(details && details->ignore_dirty)) continue; - force_flush = 1; + tlb->force_batch_flush = 1; set_page_dirty(page); } if (pte_young(ptent) && @@ -1163,7 +1163,7 @@ again: if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); if (unlikely(!__tlb_remove_page(tlb, page))) { - force_flush = 1; + tlb->force_batch_flush = 1; addr += PAGE_SIZE; break; } @@ -1191,18 +1191,20 @@ again: arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (force_flush) + if (tlb->force_batch_flush) tlb_flush_mmu_tlbonly(tlb); pte_unmap_unlock(start_pte, ptl); /* - * If we forced a TLB flush (either due to running out of - * batch buffers or because we needed to flush dirty TLB - * entries before releasing the ptl), free the batched - * memory too. Restart if we didn't do everything. + * If we forced a TLB flush, free the batched + * memory too. Restart if we didn't do everything. + * + * We force this due to running out of batch buffers, + * needing to flush dirty TLB entries before releasing + * the ptl, or for arch-specific reasons. */ - if (force_flush) { - force_flush = 0; + if (tlb->force_batch_flush) { + tlb->force_batch_flush = 0; tlb_flush_mmu_free(tlb); if (addr != end) _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752700AbcGAAOF (ORCPT ); Thu, 30 Jun 2016 20:14:05 -0400 Received: from mga04.intel.com ([192.55.52.120]:30254 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751673AbcGAANM (ORCPT ); Thu, 30 Jun 2016 20:13:12 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="727652146" Subject: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:18 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001218.3D316260@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights Landing) has an erratum where a processor thread setting the Accessed or Dirty bits may not do so atomically against its checks for the Present bit. This may cause a thread (which is about to page fault) to set A and/or D, even though the Present bit had already been atomically cleared. If the PTE is used for storing a swap index or a NUMA migration index, the A bit could be misinterpreted as part of the swap type. The stray bits being set cause a software-cleared PTE to be interpreted as a swap entry. In some cases (like when the swap index ends up being for a non-existent swapfile), the kernel detects the stray value and WARN()s about it, but there is no guarantee that the kernel can always detect it. There are basically three things we have to do to work around this erratum: 1. Extra TLB flushes when we clear PTEs that might be affected by this erratum. 2. Extra pass back over the suspect PTEs after the TLBs have been flushed to clear stray bits. 3. Make sure to hold ptl in pte_unmap_same() (underneath do_swap_page()) to keep the swap code from observing the bad entries between #1 and #2. Notes: * The little pte++ in zap_pte_range() is to ensure that 'pte' points _past_ the last PTE that was cleared so that the whole range can be cleaned up. * We could do more of the new arch_*() helpers inside the existing TLB flush callers if we passed the old 'ptent' in as an argument. That would require a more invasive rework, though. * change_pte_range() does not need to be modified. It fully writes back over the PTE after clearing it. It also does this inside the ptl, so the cleared PTE potentially containing stray bits is never visible to anyone. * move_ptes() does remote TLB flushes after each PTE clear. This is slow, but mremap() is not as important as munmap() Leave it simple for now. * could apply A/D optimization to huge pages too * As far as I can tell, sites that just change PTE permissions are OK. They generally do some variant of: ptent = ptep_get_and_clear(...); ptent = pte_mksomething(ptent); set_pte_at(mm, addr, pte, ptent); tlb_flush...(); This is OK because the cleared PTE (which might contain the stray bits) is written over by set_pte_at() and this all happens under the pagetable lock. Examples of this: * madvise_free_pte_range() * hugetlb_change_protection() * clear_soft_dirty() * move_ptes() - new PTE will not fault so will not hit erratum * change_pte_range() Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 1 b/arch/ia64/include/asm/tlb.h | 1 b/arch/s390/include/asm/tlb.h | 1 b/arch/sh/include/asm/tlb.h | 1 b/arch/um/include/asm/tlb.h | 2 b/arch/x86/include/asm/cpufeatures.h | 1 b/arch/x86/include/asm/pgtable.h | 32 +++++++++ b/arch/x86/include/asm/tlbflush.h | 37 +++++++++++ b/arch/x86/kernel/cpu/intel.c | 113 +++++++++++++++++++++++++++++++++++ b/include/asm-generic/tlb.h | 4 + b/include/linux/mm.h | 17 +++++ b/mm/memory.c | 12 ++- b/mm/mremap.c | 9 ++ b/mm/rmap.c | 4 + b/mm/vmalloc.c | 1 15 files changed, 231 insertions(+), 5 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.455287496 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:43.483288766 -0700 @@ -264,6 +264,7 @@ tlb_remove_pmd_tlb_entry(struct mmu_gath #define pud_free_tlb(tlb, pudp, addr) pud_free((tlb)->mm, pudp) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #endif /* CONFIG_MMU */ #endif diff -puN arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.457287587 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -254,6 +254,7 @@ __tlb_remove_tlb_entry (struct mmu_gathe } #define tlb_migrate_finish(mm) platform_tlb_migrate_finish(mm) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) diff -puN arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.460287723 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -146,5 +146,6 @@ static inline void pud_free_tlb(struct m #define tlb_remove_tlb_entry(tlb, ptep, addr) do { } while (0) #define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr) do { } while (0) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #endif /* _S390_TLB_H */ diff -puN arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.461287768 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:43.484288812 -0700 @@ -116,6 +116,7 @@ static inline void tlb_remove_page(struc #define pud_free_tlb(tlb, pudp, addr) pud_free((tlb)->mm, pudp) #define tlb_migrate_finish(mm) do { } while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) #if defined(CONFIG_CPU_SH4) || defined(CONFIG_SUPERH64) extern void tlb_wire_entry(struct vm_area_struct *, unsigned long, pte_t); diff -puN arch/um/include/asm/tlb.h~knl-leak-60-actual-fix arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.463287859 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:43.485288857 -0700 @@ -133,4 +133,6 @@ static inline void tlb_remove_page(struc #define tlb_migrate_finish(mm) do {} while (0) +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) + #endif diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix arch/x86/include/asm/cpufeatures.h --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.464287904 -0700 +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:43.485288857 -0700 @@ -310,5 +310,6 @@ #endif #define X86_BUG_NULL_SEG X86_BUG(10) /* Nulling a selector preserves the base */ #define X86_BUG_SWAPGS_FENCE X86_BUG(11) /* SWAPGS without input dep on GS */ +#define X86_BUG_PTE_LEAK X86_BUG(12) /* PTE may leak A/D bits after clear */ #endif /* _ASM_X86_CPUFEATURES_H */ diff -puN arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.466287995 -0700 +++ b/arch/x86/include/asm/pgtable.h 2016-06-30 17:10:43.486288902 -0700 @@ -794,6 +794,12 @@ extern int ptep_test_and_clear_young(str extern int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep); +#ifdef CONFIG_CPU_SUP_INTEL +#define __HAVE_ARCH_PTEP_CLEAR_FLUSH +extern pte_t ptep_clear_flush(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep); +#endif + #define __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -956,6 +962,32 @@ static inline u16 pte_flags_pkey(unsigne #endif } +#ifdef CONFIG_CPU_SUP_INTEL +/* + * These are all specific to working around an Intel-specific + * bug and the out-of-line code is all defined in intel.c. + */ +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep); +#define ARCH_HAS_FIX_PTE_LEAK 1 +static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) +{ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) + fix_pte_leak(mm, addr, ptep); +} +#define ARCH_HAS_NEEDS_SWAP_PTL 1 +static inline bool arch_needs_swap_ptl(void) +{ + return static_cpu_has_bug(X86_BUG_PTE_LEAK); +} +#define ARCH_DISABLE_DEFERRED_FLUSH 1 +static inline bool arch_disable_deferred_flush(void) +{ + return static_cpu_has_bug(X86_BUG_PTE_LEAK); +} +#endif /* CONFIG_CPU_SUP_INTEL */ + #include #endif /* __ASSEMBLY__ */ diff -puN arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix arch/x86/include/asm/tlbflush.h --- a/arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.468288086 -0700 +++ b/arch/x86/include/asm/tlbflush.h 2016-06-30 17:10:43.486288902 -0700 @@ -324,4 +324,41 @@ static inline void reset_lazy_tlbstate(v native_flush_tlb_others(mask, mm, start, end) #endif +static inline bool intel_detect_leaked_pte(pte_t ptent) +{ + /* + * Hardware sets the dirty bit only on writable ptes and + * only on ptes where dirty is unset. + */ + if (pte_write(ptent) && !pte_dirty(ptent)) + return true; + /* + * The leak occurs when the hardware sees a unset A/D bit + * and tries to set it. If the PTE has both A/D bits + * set, then the hardware will not be going to try to set + * it and we have no chance for a leak. + */ + if (!pte_young(ptent)) + return true; + + return false; +} + +#ifdef CONFIG_CPU_SUP_INTEL +void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep, + pte_t *end_ptep); +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do { \ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) \ + intel_cleanup_pte_range(tlb, start_ptep, end_ptep); \ +} while (0) +#define ARCH_FLUSH_CLEARED_PTE 1 +#define arch_flush_cleared_pte(tlb, ptent) do { \ + if (static_cpu_has_bug(X86_BUG_PTE_LEAK) && \ + intel_detect_leaked_pte(ptent)) { \ + tlb->saw_unset_a_or_d = 1; \ + tlb->force_batch_flush = 1; \ + } \ +} while (0) +#endif /* CONFIG_CPU_SUP_INTEL */ + #endif /* _ASM_X86_TLBFLUSH_H */ diff -puN arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix arch/x86/kernel/cpu/intel.c --- a/arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.469288131 -0700 +++ b/arch/x86/kernel/cpu/intel.c 2016-06-30 17:10:43.487288948 -0700 @@ -9,10 +9,14 @@ #include #include +#include #include #include #include #include +#include + +#include #ifdef CONFIG_X86_64 #include @@ -181,6 +185,11 @@ static void early_init_intel(struct cpui } } + if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) { + pr_info_once("x86/intel: Enabling PTE leaking workaround\n"); + set_cpu_bug(c, X86_BUG_PTE_LEAK); + } + /* * Intel Quark Core DevMan_001.pdf section 6.4.11 * "The operating system also is required to invalidate (i.e., flush) @@ -820,3 +829,107 @@ static const struct cpu_dev intel_cpu_de cpu_dev_register(intel_cpu_dev); +/* + * Workaround for KNL issue: + * + * A thread that is going to page fault due to P=0, may still + * non atomically set A or D bits, which could corrupt swap entries. + * Always flush the other CPUs and clear the PTE again to avoid + * this leakage. We are excluded using the pagetable lock. + * + * This only needs to be called on processors that might "leak" + * A or D bits and have X86_BUG_PTE_LEAK set. + */ +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep) +{ + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) { + flush_tlb_others(mm_cpumask(mm), mm, addr, + addr + PAGE_SIZE); + set_pte(ptep, __pte(0)); + } +} + +/* + * We batched a bunch of PTE clears up. After the TLB has been + * flushed for the whole batch, we might have had some leaked + * A or D bits and need to clear them here. + * + * This should be called with the page table lock still held. + * + * This only needs to be called on processors that might "leak" + * A or D bits and have X86_BUG_PTE_LEAK set. + */ +void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep, + pte_t *end_ptep) +{ + pte_t *pte; + + /* + * fullmm means nobody will care that we have leaked bits + * laying around. We also skip TLB flushes when doing + * fullmm teardown, so the additional pte clearing would + * not help the issue. + */ + if (tlb->fullmm) + return; + + /* + * If none of the PTEs hit inside intel_detect_leaked_pte(), + * then we have nothing that might have been leaked and + * nothing to clear. + */ + if (!tlb->saw_unset_a_or_d) + return; + + /* + * Contexts calling us with NULL ptep's do not have any + * PTEs for us to go clear because they did not do any + * actual TLB invalidation. + */ + if (!start_ptep || !end_ptep) + return; + + /* + * Mark that the workaround is no longer needed for + * this batch. + */ + tlb->saw_unset_a_or_d = 0; + + /* + * Ensure that the compiler orders our set_pte() + * after the preceding TLB flush no matter what. + */ + barrier(); + + /* + * Re-clear out all the PTEs into which the hardware + * may have leaked Accessed or Dirty bits. + */ + for (pte = start_ptep; pte < end_ptep; pte++) + set_pte(pte, __pte(0)); +} + +/* + * Kinda weird to define this in here, but we only use it for + * an Intel-specific issue. This will get used on all + * processors (even non-Intel) if CONFIG_CPU_SUP_INTEL=y. + */ +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, + pte_t *ptep) +{ + struct mm_struct *mm = vma->vm_mm; + pte_t pte; + + pte = ptep_get_and_clear(mm, address, ptep); + if (pte_accessible(mm, pte)) { + flush_tlb_page(vma, address); + /* + * Ensure that the compiler orders our set_pte() + * after the flush_tlb_page() no matter what. + */ + barrier(); + if (static_cpu_has_bug(X86_BUG_PTE_LEAK)) + set_pte(ptep, __pte(0)); + } + return pte; +} diff -puN include/asm-generic/tlb.h~knl-leak-60-actual-fix include/asm-generic/tlb.h --- a/include/asm-generic/tlb.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.471288222 -0700 +++ b/include/asm-generic/tlb.h 2016-06-30 17:10:43.489289039 -0700 @@ -226,4 +226,8 @@ static inline void __tlb_reset_range(str #define tlb_migrate_finish(mm) do {} while (0) +#ifndef __tlb_cleanup_pte_range +#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0) +#endif + #endif /* _ASM_GENERIC__TLB_H */ diff -puN include/linux/mm.h~knl-leak-60-actual-fix include/linux/mm.h --- a/include/linux/mm.h~knl-leak-60-actual-fix 2016-06-30 17:10:43.472288267 -0700 +++ b/include/linux/mm.h 2016-06-30 17:10:43.492289175 -0700 @@ -2404,6 +2404,23 @@ static inline bool debug_guardpage_enabl static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +#ifndef ARCH_HAS_NEEDS_SWAP_PTL +static inline bool arch_needs_swap_ptl(void) { return false; } +#endif + +#ifndef ARCH_HAS_FIX_PTE_LEAK +static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) {} +#endif + +#ifndef ARCH_DISABLE_DEFERRED_FLUSH +static inline bool arch_disable_deferred_flush(void) { return false; } +#endif + +#ifndef ARCH_FLUSH_CLEARED_PTE +static inline void arch_flush_cleared_pte(struct mmu_gather *tlb, pte_t ptent) {} +#endif + #if MAX_NUMNODES > 1 void __init setup_nr_node_ids(void); #else diff -puN mm/memory.c~knl-leak-60-actual-fix mm/memory.c --- a/mm/memory.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.474288358 -0700 +++ b/mm/memory.c 2016-06-30 17:10:43.496289356 -0700 @@ -1141,6 +1141,7 @@ again: ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); + arch_flush_cleared_pte(tlb, ptent); if (unlikely(!page)) continue; @@ -1166,6 +1167,7 @@ again: if (unlikely(!__tlb_remove_page(tlb, page))) { tlb->force_batch_flush = 1; addr += PAGE_SIZE; + pte++; break; } continue; @@ -1192,8 +1194,11 @@ again: arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (tlb->force_batch_flush) - tlb_flush_mmu_tlbonly(tlb); + if (tlb->force_batch_flush) { + bool did_flush = tlb_flush_mmu_tlbonly(tlb); + if (did_flush) + __tlb_cleanup_pte_range(tlb, start_pte, pte); + } pte_unmap_unlock(start_pte, ptl); /* @@ -1965,7 +1970,8 @@ static inline int pte_unmap_same(struct { int same = 1; #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT) - if (sizeof(pte_t) > sizeof(unsigned long)) { + if (unlikely(arch_needs_swap_ptl() || + sizeof(pte_t) > sizeof(unsigned long))) { spinlock_t *ptl = pte_lockptr(mm, pmd); spin_lock(ptl); same = pte_same(*page_table, orig_pte); diff -puN mm/mremap.c~knl-leak-60-actual-fix mm/mremap.c --- a/mm/mremap.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.476288449 -0700 +++ b/mm/mremap.c 2016-06-30 17:10:43.497289401 -0700 @@ -24,6 +24,7 @@ #include #include +#include #include #include "internal.h" @@ -144,10 +145,14 @@ static void move_ptes(struct vm_area_str for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { + pte_t old_ptent; + if (pte_none(*old_pte)) continue; - pte = ptep_get_and_clear(mm, old_addr, old_pte); - pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); + old_ptent = ptep_get_and_clear(mm, old_addr, old_pte); + pte = move_pte(old_ptent, new_vma->vm_page_prot, + old_addr, new_addr); + arch_fix_pte_leak(mm, old_addr, old_pte); pte = move_soft_dirty_pte(pte); set_pte_at(mm, new_addr, new_pte, pte); } diff -puN mm/rmap.c~knl-leak-60-actual-fix mm/rmap.c --- a/mm/rmap.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.478288539 -0700 +++ b/mm/rmap.c 2016-06-30 17:10:43.498289447 -0700 @@ -633,6 +633,10 @@ static bool should_defer_flush(struct mm { bool should_defer = false; + /* x86 may need an immediate flush after a pte clear */ + if (arch_disable_deferred_flush()) + return false; + if (!(flags & TTU_BATCH_FLUSH)) return false; diff -puN mm/vmalloc.c~knl-leak-60-actual-fix mm/vmalloc.c --- a/mm/vmalloc.c~knl-leak-60-actual-fix 2016-06-30 17:10:43.479288585 -0700 +++ b/mm/vmalloc.c 2016-06-30 17:10:43.500289538 -0700 @@ -66,6 +66,7 @@ static void vunmap_pte_range(pmd_t *pmd, pte = pte_offset_kernel(pmd, addr); do { pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); + arch_fix_pte_leak(&init_mm, addr, pte); WARN_ON(!pte_none(ptent) && !pte_present(ptent)); } while (pte++, addr += PAGE_SIZE, addr != end); } _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752629AbcGAAOE (ORCPT ); Thu, 30 Jun 2016 20:14:04 -0400 Received: from mga03.intel.com ([134.134.136.65]:13863 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751699AbcGAANM (ORCPT ); Thu, 30 Jun 2016 20:13:12 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="998432633" Subject: [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:16 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001216.9CFEC460@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen When batching TLB operations, we track the scope of the needed TLB invalidation inside the 'struct mmu_gather'. With mmu_gather->end, we indicate whether any operations required a flush. If the flush was not performed, then we know that no pte_present() PTEs were found and no workaround is needed. But, we do not know this unless tlb_flush_mmu_tlbonly() tells us whether it flushed or not. Have it return a bool to tell us. Signed-off-by: Dave Hansen --- b/arch/arm/include/asm/tlb.h | 3 ++- b/arch/ia64/include/asm/tlb.h | 3 ++- b/arch/s390/include/asm/tlb.h | 3 ++- b/arch/sh/include/asm/tlb.h | 1 + b/arch/um/include/asm/tlb.h | 3 ++- b/mm/memory.c | 5 +++-- 6 files changed, 12 insertions(+), 6 deletions(-) diff -puN arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/arm/include/asm/tlb.h --- a/arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.947264449 -0700 +++ b/arch/arm/include/asm/tlb.h 2016-06-30 17:10:42.958264947 -0700 @@ -126,12 +126,13 @@ static inline void __tlb_alloc_page(stru } } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { tlb_flush(tlb); #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb_table_flush(tlb); #endif + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/ia64/include/asm/tlb.h --- a/arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.948264494 -0700 +++ b/arch/ia64/include/asm/tlb.h 2016-06-30 17:10:42.959264993 -0700 @@ -219,9 +219,10 @@ static inline int __tlb_remove_page(stru return tlb->max - tlb->nr; } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { ia64_tlb_flush_mmu_tlbonly(tlb, tlb->start_addr, tlb->end_addr); + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/s390/include/asm/tlb.h --- a/arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.950264585 -0700 +++ b/arch/s390/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -60,9 +60,10 @@ static inline void tlb_gather_mmu(struct tlb->batch = NULL; } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { __tlb_flush_mm_lazy(tlb->mm); + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/sh/include/asm/tlb.h --- a/arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.952264675 -0700 +++ b/arch/sh/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -89,6 +89,7 @@ tlb_end_vma(struct mmu_gather *tlb, stru static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { + return true; } static inline void tlb_flush_mmu_free(struct mmu_gather *tlb) diff -puN arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/um/include/asm/tlb.h --- a/arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.953264721 -0700 +++ b/arch/um/include/asm/tlb.h 2016-06-30 17:10:42.960265038 -0700 @@ -59,10 +59,11 @@ tlb_gather_mmu(struct mmu_gather *tlb, s extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end); -static inline void +static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end); + return true; } static inline void diff -puN mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly mm/memory.c --- a/mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly 2016-06-30 17:10:42.955264811 -0700 +++ b/mm/memory.c 2016-06-30 17:10:42.962265129 -0700 @@ -240,10 +240,10 @@ void tlb_gather_mmu(struct mmu_gather *t __tlb_reset_range(tlb); } -static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { if (!tlb->end) - return; + return false; tlb_flush(tlb); mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); @@ -251,6 +251,7 @@ static void tlb_flush_mmu_tlbonly(struct tlb_table_flush(tlb); #endif __tlb_reset_range(tlb); + return true; } static void tlb_flush_mmu_free(struct mmu_gather *tlb) _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752225AbcGAAQI (ORCPT ); Thu, 30 Jun 2016 20:16:08 -0400 Received: from mga03.intel.com ([134.134.136.65]:52105 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751693AbcGAAQF (ORCPT ); Thu, 30 Jun 2016 20:16:05 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="727652073" Subject: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , dave.hansen@linux.intel.com, luto@kernel.org, stable@vger.kernel.org From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:10 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001210.AA77B917@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen epufeatures.h currently defines X86_BUG(9) twice on 32-bit: #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ ... #ifdef CONFIG_X86_32 #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif I think what happened was that this added the X86_BUG_ESPFIX, but in an #ifdef below most of the bugs: [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled Then this came along and added X86_BUG_NULL_SEG, but collided with the earlier one that did the bug below the main block defining all the X86_BUG()s. [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time Signed-off-by: Dave Hansen Acked-by: Andy Lutomirski Cc: stable@vger.kernel.org --- b/arch/x86/include/asm/cpufeatures.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 @@ -301,10 +301,6 @@ #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ - - #ifdef CONFIG_X86_32 /* * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional @@ -312,5 +308,7 @@ */ #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif +#define X86_BUG_NULL_SEG X86_BUG(10) /* Nulling a selector preserves the base */ +#define X86_BUG_SWAPGS_FENCE X86_BUG(11) /* SWAPGS without input dep on GS */ #endif /* _ASM_X86_CPUFEATURES_H */ _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752116AbcGAAQH (ORCPT ); Thu, 30 Jun 2016 20:16:07 -0400 Received: from mga03.intel.com ([134.134.136.65]:57371 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751652AbcGAAQF (ORCPT ); Thu, 30 Jun 2016 20:16:05 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,554,1459839600"; d="scan'208";a="998432575" Subject: [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, Dave Hansen , minchan@kernel.org From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:09 -0700 Message-Id: <20160701001209.7DA24D1C@viggo.jf.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights Landing) has an erratum where a processor thread setting the Accessed or Dirty bits may not do so atomically against its checks for the Present bit. This may cause a thread (which is about to page fault) to set A and/or D, even though the Present bit had already been atomically cleared. If the PTE is used for storing a swap index or a NUMA migration index, the A bit could be misinterpreted as part of the swap type. The stray bits being set cause a software-cleared PTE to be interpreted as a swap entry. In some cases (like when the swap index ends up being for a non-existent swapfile), the kernel detects the stray value and WARN()s about it, but there is no guarantee that the kernel can always detect it. This patch causes the page unmap path in vmscan/direct reclaim to flush remote TLBs after clearing each page, and also clears the PTE again after the flush. For reclaim, this brings the behavior (and associated reclaim performance) back to what it was before Mel's changes that increased TLB flush batching. For the unmap path, this patch may force some additional flushes, but they are limited to a maximum of one per PTE page. This patch clears these stray A/D bits before releasing the pagetable lock which prevents other parts of the kernel from observing the stray bits. Andi Kleen wrote the original version of this patch, and Dave Hansen added the batching. The original version was much simpler but it did too many extra TLB flushes which killed performance. v3: huge rework to keep batching working in unmap case v2: out of line. avoid single thread flush. cover more clear cases Cc: Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752138AbcGABz1 (ORCPT ); Thu, 30 Jun 2016 21:55:27 -0400 Received: from www.sr71.net ([198.145.64.142]:38333 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751938AbcGABzC (ORCPT ); Thu, 30 Jun 2016 21:55:02 -0400 Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Nadav Amit References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> Cc: LKML , x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, dave.hansen@linux.intel.com From: Dave Hansen Message-ID: <5775CD73.2070809@sr71.net> Date: Thu, 30 Jun 2016 18:54:59 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 MIME-Version: 1.0 In-Reply-To: <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/30/2016 06:50 PM, Nadav Amit wrote: > Dave Hansen wrote: >> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, >> + pte_t *ptep) >> +{ >> + struct mm_struct *mm = vma->vm_mm; >> + pte_t pte; >> + >> + pte = ptep_get_and_clear(mm, address, ptep); >> + if (pte_accessible(mm, pte)) { >> + flush_tlb_page(vma, address); >> + /* >> + * Ensure that the compiler orders our set_pte() >> + * after the flush_tlb_page() no matter what. >> + */ >> + barrier(); > > I don’t think such a barrier (after remote TLB flush) is needed. > Eventually, if a remote flush takes place, you get csd_lock_wait() to be > called, and then smp_rmb() is called (which is essentially a barrier() > call on x86). Andi really wanted to make sure this got in here. He said there was a bug that bit him really badly once where a function got reordered. Granted, a call _should_ be sufficient to keep the compiler from reordering things, but this makes double sure. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752213AbcGACEo (ORCPT ); Thu, 30 Jun 2016 22:04:44 -0400 Received: from mail-pa0-f68.google.com ([209.85.220.68]:35401 "EHLO mail-pa0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751505AbcGACEm convert rfc822-to-8bit (ORCPT ); Thu, 30 Jun 2016 22:04:42 -0400 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs From: Nadav Amit In-Reply-To: <20160701001218.3D316260@viggo.jf.intel.com> Date: Thu, 30 Jun 2016 18:50:21 -0700 Cc: LKML , x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org, akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com, mhocko@suse.com, dave.hansen@linux.intel.com Content-Transfer-Encoding: 8BIT Message-Id: <5A585093-4E0D-49BC-A9CA-0072BB83A71C@gmail.com> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> To: Dave Hansen X-Mailer: Apple Mail (2.3124) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Hansen wrote: > +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, > + pte_t *ptep) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pte_t pte; > + > + pte = ptep_get_and_clear(mm, address, ptep); > + if (pte_accessible(mm, pte)) { > + flush_tlb_page(vma, address); > + /* > + * Ensure that the compiler orders our set_pte() > + * after the flush_tlb_page() no matter what. > + */ > + barrier(); I don’t think such a barrier (after remote TLB flush) is needed. Eventually, if a remote flush takes place, you get csd_lock_wait() to be called, and then smp_rmb() is called (which is essentially a barrier() call on x86). Regards, Nadav From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752335AbcGAC4i (ORCPT ); Thu, 30 Jun 2016 22:56:38 -0400 Received: from mail-oi0-f44.google.com ([209.85.218.44]:32915 "EHLO mail-oi0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751948AbcGAC4f (ORCPT ); Thu, 30 Jun 2016 22:56:35 -0400 MIME-Version: 1.0 In-Reply-To: <20160701001218.3D316260@viggo.jf.intel.com> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Linus Torvalds Date: Thu, 30 Jun 2016 19:55:58 -0700 X-Google-Sender-Auth: PduFFoO6ScsxMIVd2O7JbXdcaCI Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Dave Hansen Cc: Linux Kernel Mailing List , "the arch/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: > > From: Dave Hansen > > The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights > Landing) has an erratum where a processor thread setting the Accessed > or Dirty bits may not do so atomically against its checks for the > Present bit. This may cause a thread (which is about to page fault) > to set A and/or D, even though the Present bit had already been > atomically cleared. So I don't think your approach is wrong, but I suspect this is overkill, and what we should instead just do is to not use the A/D bits at all in the swap representation. The swap-entry representation was a bit tight on 32-bit page table entries, but in 64-bit ones, I think we have tons of bits, don't we? So we could decide just to not use those two bits on x86. It's not like anybody will ever care about 32-bit page tables on Knights Landing anyway. So rather than add this kind of complexity and worry, how about just simplifying the problem? Or was there some discussion or implication I missed? Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752455AbcGADG7 (ORCPT ); Thu, 30 Jun 2016 23:06:59 -0400 Received: from mail-oi0-f45.google.com ([209.85.218.45]:36409 "EHLO mail-oi0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751900AbcGADG5 (ORCPT ); Thu, 30 Jun 2016 23:06:57 -0400 MIME-Version: 1.0 In-Reply-To: References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> From: Brian Gerst Date: Thu, 30 Jun 2016 23:06:55 -0400 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Linus Torvalds Cc: Dave Hansen , Linux Kernel Mailing List , "the arch/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 30, 2016 at 10:55 PM, Linus Torvalds wrote: > On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: >> >> From: Dave Hansen >> >> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights >> Landing) has an erratum where a processor thread setting the Accessed >> or Dirty bits may not do so atomically against its checks for the >> Present bit. This may cause a thread (which is about to page fault) >> to set A and/or D, even though the Present bit had already been >> atomically cleared. > > So I don't think your approach is wrong, but I suspect this is > overkill, and what we should instead just do is to not use the A/D > bits at all in the swap representation. > > The swap-entry representation was a bit tight on 32-bit page table > entries, but in 64-bit ones, I think we have tons of bits, don't we? > So we could decide just to not use those two bits on x86. > > It's not like anybody will ever care about 32-bit page tables on > Knights Landing anyway. Could this affect a 32-bit guest VM? -- Brian Gerst From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752643AbcGAEqR (ORCPT ); Fri, 1 Jul 2016 00:46:17 -0400 Received: from www.sr71.net ([198.145.64.142]:39119 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750769AbcGAEqP (ORCPT ); Fri, 1 Jul 2016 00:46:15 -0400 Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Linus Torvalds References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen From: Dave Hansen Message-ID: <5775F418.2000803@sr71.net> Date: Thu, 30 Jun 2016 21:39:52 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/30/2016 07:55 PM, Linus Torvalds wrote: > On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen wrote: >> From: Dave Hansen >> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights >> Landing) has an erratum where a processor thread setting the Accessed >> or Dirty bits may not do so atomically against its checks for the >> Present bit. This may cause a thread (which is about to page fault) >> to set A and/or D, even though the Present bit had already been >> atomically cleared. > > So I don't think your approach is wrong, but I suspect this is > overkill, and what we should instead just do is to not use the A/D > bits at all in the swap representation. We actually don't even use Dirty today. It's (implicitly) used to determine pte_none(), but it ends up being masked out for the swp_offset/type() calculations entirely, much to my surprise. I think what you suggest will work if we don't consider A/D in pte_none(). I think there are a bunch of code path where assume that !pte_present() && !pte_none() means swap. > The swap-entry representation was a bit tight on 32-bit page table > entries, but in 64-bit ones, I think we have tons of bits, don't we? > So we could decide just to not use those two bits on x86. Yeah, we've definitely got space. I'll go poke around and make sure that this works everywhere. I agree that throwing 32-bit non-PAE under the bus is definitely worth it here. Nobody will care about that in a million years. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751953AbcGAFty (ORCPT ); Fri, 1 Jul 2016 01:49:54 -0400 Received: from mail-oi0-f53.google.com ([209.85.218.53]:35213 "EHLO mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751052AbcGAFtw (ORCPT ); Fri, 1 Jul 2016 01:49:52 -0400 MIME-Version: 1.0 In-Reply-To: <5775F418.2000803@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> From: Linus Torvalds Date: Thu, 30 Jun 2016 22:43:09 -0700 X-Google-Sender-Auth: dCnfxsdpyb5mxPACokAhOIbWzWw Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Dave Hansen Cc: Linux Kernel Mailing List , "the arch/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: > > I think what you suggest will work if we don't consider A/D in > pte_none(). I think there are a bunch of code path where assume that > !pte_present() && !pte_none() means swap. Yeah, we would need to change pte_none() to mask off D/A, but I think that might be the only real change needed (other than making sure that we don't use the bits in the swap entries, I didn't look at that part at all) Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752371AbcGAPb6 (ORCPT ); Fri, 1 Jul 2016 11:31:58 -0400 Received: from out03.mta.xmission.com ([166.70.13.233]:37546 "EHLO out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752134AbcGAPb4 (ORCPT ); Fri, 1 Jul 2016 11:31:56 -0400 From: ebiederm@xmission.com (Eric W. Biederman) To: Linus Torvalds Cc: Dave Hansen , Linux Kernel Mailing List , "the arch\/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> Date: Fri, 01 Jul 2016 09:25:10 -0500 In-Reply-To: (Linus Torvalds's message of "Thu, 30 Jun 2016 22:43:09 -0700") Message-ID: <874m89cu61.fsf@x220.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1bIzZV-0003MG-MZ;;;mid=<874m89cu61.fsf@x220.int.ebiederm.org>;;;hst=in02.mta.xmission.com;;;ip=67.3.204.119;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX19ODGWV2O+sqOFwxW1RqB783xKLzql+i8c= X-SA-Exim-Connect-IP: 67.3.204.119 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 1.5 XMNoVowels Alpha-numberic number with no vowels * 0.7 XMSubLong Long Subject * 0.0 TVD_RCVD_IP Message was received from an IP address * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.4994] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa02 1397; Body=1 Fuz1=1 Fuz2=1] * 0.0 T_TooManySym_01 4+ unique symbols in subject X-Spam-DCC: XMission; sa02 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: **;Linus Torvalds X-Spam-Relay-Country: X-Spam-Timing: total 562 ms - load_scoreonly_sql: 0.08 (0.0%), signal_user_changed: 4.4 (0.8%), b_tie_ro: 3.1 (0.5%), parse: 1.32 (0.2%), extract_message_metadata: 29 (5.2%), get_uri_detail_list: 1.55 (0.3%), tests_pri_-1000: 12 (2.1%), tests_pri_-950: 2.4 (0.4%), tests_pri_-900: 1.99 (0.4%), tests_pri_-400: 31 (5.5%), check_bayes: 29 (5.1%), b_tokenize: 9 (1.7%), b_tok_get_all: 7 (1.3%), b_comp_prob: 3.6 (0.6%), b_tok_touch_all: 4.4 (0.8%), b_finish: 0.92 (0.2%), tests_pri_0: 466 (82.9%), check_dkim_signature: 0.99 (0.2%), check_dkim_adsp: 90 (16.1%), tests_pri_500: 7 (1.3%), rewrite_mail: 0.00 (0.0%) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Linus Torvalds writes: > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >> >> I think what you suggest will work if we don't consider A/D in >> pte_none(). I think there are a bunch of code path where assume that >> !pte_present() && !pte_none() means swap. > > Yeah, we would need to change pte_none() to mask off D/A, but I think > that might be the only real change needed (other than making sure that > we don't use the bits in the swap entries, I didn't look at that part > at all) It looks like __pte_to_swp_entry also needs to be changed to mask out those bits when the swap code reads pte entries. For all of the same reasons as pte_none. Eric From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752151AbcGAPvo (ORCPT ); Fri, 1 Jul 2016 11:51:44 -0400 Received: from www.sr71.net ([198.145.64.142]:43799 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751673AbcGAPvn (ORCPT ); Fri, 1 Jul 2016 11:51:43 -0400 Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: "Eric W. Biederman" , Linus Torvalds References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <874m89cu61.fsf@x220.int.ebiederm.org> Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen From: Dave Hansen Message-ID: <57769188.9060708@sr71.net> Date: Fri, 1 Jul 2016 08:51:36 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 MIME-Version: 1.0 In-Reply-To: <874m89cu61.fsf@x220.int.ebiederm.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/01/2016 07:25 AM, Eric W. Biederman wrote: > Linus Torvalds writes: >> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >>> >> >>> >> I think what you suggest will work if we don't consider A/D in >>> >> pte_none(). I think there are a bunch of code path where assume that >>> >> !pte_present() && !pte_none() means swap. >> > >> > Yeah, we would need to change pte_none() to mask off D/A, but I think >> > that might be the only real change needed (other than making sure that >> > we don't use the bits in the swap entries, I didn't look at that part >> > at all) > It looks like __pte_to_swp_entry also needs to be changed to mask out > those bits when the swap code reads pte entries. For all of the same > reasons as pte_none. I guess that would be nice, but isn't it redundant? static inline swp_entry_t pte_to_swp_entry(pte_t pte) { ... arch_entry = __pte_to_swp_entry(pte); return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); } As long as __swp_type() and __swp_offset() don't let A/D through, then we should be OK. This site is the only call to __pte_to_swp_entry() that I can find in the entire codebase. Or am I missing something? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751990AbcGAQHc (ORCPT ); Fri, 1 Jul 2016 12:07:32 -0400 Received: from mail-oi0-f66.google.com ([209.85.218.66]:36794 "EHLO mail-oi0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751533AbcGAQHb (ORCPT ); Fri, 1 Jul 2016 12:07:31 -0400 MIME-Version: 1.0 In-Reply-To: <5775F418.2000803@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> From: Linus Torvalds Date: Fri, 1 Jul 2016 09:07:10 -0700 X-Google-Sender-Auth: ai7kTlyDBSjzP7eHreRnOQQaRb0 Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Dave Hansen Cc: Linux Kernel Mailing List , "the arch/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: > > I think what you suggest will work if we don't consider A/D in > pte_none(). I think there are a bunch of code path where assume that > !pte_present() && !pte_none() means swap. Hmm. Thinking about it some more, I still think it's a good idea to avoid A/D bits in the swap entries and in pte_none() and friends, and it might simplify some of this all. But I also started worrying about us just losing sight of the dirty bit in particular. It's not enough that we ignore the dirty bit - we'd still want to make sure that the underlying backing page gets marked dirty, even if the CPU is buggy and ends doing it "delayed" after we've already unmapped the page. So I get this feeling that we may need a fair chunk of your patch-series anyway. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752316AbcGAQOR (ORCPT ); Fri, 1 Jul 2016 12:14:17 -0400 Received: from www.sr71.net ([198.145.64.142]:43993 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750702AbcGAQOO (ORCPT ); Fri, 1 Jul 2016 12:14:14 -0400 Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Linus Torvalds References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen From: Dave Hansen Message-ID: <577696D5.2010609@sr71.net> Date: Fri, 1 Jul 2016 09:14:13 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/01/2016 09:07 AM, Linus Torvalds wrote: > But I also started worrying about us just losing sight of the dirty > bit in particular. It's not enough that we ignore the dirty bit - we'd > still want to make sure that the underlying backing page gets marked > dirty, even if the CPU is buggy and ends doing it "delayed" after > we've already unmapped the page. > > So I get this feeling that we may need a fair chunk of your > patch-series anyway. As I understand it, the erratum only affects a thread which is about to page fault. The write associated with the dirty bit being set never actually gets executed. So, the bit really *is* stray and isn't something we need to preserve. Otherwise, we'd be really screwed because we couldn't ever simply clear it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752255AbcGAQZG (ORCPT ); Fri, 1 Jul 2016 12:25:06 -0400 Received: from mail-oi0-f51.google.com ([209.85.218.51]:36361 "EHLO mail-oi0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751193AbcGAQZD (ORCPT ); Fri, 1 Jul 2016 12:25:03 -0400 MIME-Version: 1.0 In-Reply-To: <577696D5.2010609@sr71.net> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <577696D5.2010609@sr71.net> From: Linus Torvalds Date: Fri, 1 Jul 2016 09:25:01 -0700 X-Google-Sender-Auth: bWUk9zIsZyrnXzFkAr796JJvvDY Message-ID: Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Dave Hansen Cc: Linux Kernel Mailing List , "the arch/x86 maintainers" , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 1, 2016 at 9:14 AM, Dave Hansen wrote: > > As I understand it, the erratum only affects a thread which is about to > page fault. The write associated with the dirty bit being set never > actually gets executed. So, the bit really *is* stray and isn't > something we need to preserve. Ok, good. > Otherwise, we'd be really screwed because we couldn't ever simply clear it. Oh, we could do the whole "clear the pte, then flush the tlb, then go back and clear the stale dirty bits and move them into the backing page". I was afraid we might have to do something like that. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752223AbcGAQa7 (ORCPT ); Fri, 1 Jul 2016 12:30:59 -0400 Received: from mail-vk0-f41.google.com ([209.85.213.41]:35250 "EHLO mail-vk0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751426AbcGAQa6 (ORCPT ); Fri, 1 Jul 2016 12:30:58 -0400 MIME-Version: 1.0 In-Reply-To: <20160701092300.GD4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> From: Andy Lutomirski Date: Fri, 1 Jul 2016 09:30:37 -0700 Message-ID: Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro To: Borislav Petkov Cc: Michal Hocko , "linux-kernel@vger.kernel.org" , stable , Andi Kleen , "linux-mm@kvack.org" , Linus Torvalds , Andrew Morton , X86 ML , Dave Hansen , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Jul 1, 2016 2:23 AM, "Borislav Petkov" wrote: > > On Thu, Jun 30, 2016 at 05:12:10PM -0700, Dave Hansen wrote: > > > > From: Dave Hansen > > > > epufeatures.h currently defines X86_BUG(9) twice on 32-bit: > > > > #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > > ... > > #ifdef CONFIG_X86_32 > > #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ > > #endif > > > > I think what happened was that this added the X86_BUG_ESPFIX, but > > in an #ifdef below most of the bugs: > > > > [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled > > > > Then this came along and added X86_BUG_NULL_SEG, but collided > > with the earlier one that did the bug below the main block > > defining all the X86_BUG()s. > > > > [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time > > > > Signed-off-by: Dave Hansen > > Acked-by: Andy Lutomirski > > Cc: stable@vger.kernel.org > > --- > > > > b/arch/x86/include/asm/cpufeatures.h | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h > > --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 > > +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 > > @@ -301,10 +301,6 @@ > > #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ > > #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ > > #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ > > -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ > > -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ > > - > > - > > #ifdef CONFIG_X86_32 > > /* > > * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional > > So I'd remove the "#ifdef CONFIG_X86_32" ifdeffery too and make that bit > unconditional - so what, we have enough free bits. But I'd leave the > comment to still avoid the confusion :) > I put the ifdef there to prevent anyone from accidentally using it in a 64-bit code path, not to save a bit. We could put in the middle of the list to make the mistake much less likely to be repeated, I suppose. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752509AbcGASZY (ORCPT ); Fri, 1 Jul 2016 14:25:24 -0400 Received: from out01.mta.xmission.com ([166.70.13.231]:50342 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752459AbcGASZW (ORCPT ); Fri, 1 Jul 2016 14:25:22 -0400 From: ebiederm@xmission.com (Eric W. Biederman) To: Dave Hansen Cc: Linus Torvalds , Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> <5775F418.2000803@sr71.net> <874m89cu61.fsf@x220.int.ebiederm.org> <57769188.9060708@sr71.net> Date: Fri, 01 Jul 2016 13:12:55 -0500 In-Reply-To: <57769188.9060708@sr71.net> (Dave Hansen's message of "Fri, 1 Jul 2016 08:51:36 -0700") Message-ID: <878txlb520.fsf@x220.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1bJ37o-0001CK-Iz;;;mid=<878txlb520.fsf@x220.int.ebiederm.org>;;;hst=in02.mta.xmission.com;;;ip=67.3.204.119;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX1/imSLN9UxcbH5QX/0TmP66VitSUzKciHo= X-SA-Exim-Connect-IP: 67.3.204.119 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 0.0 TVD_RCVD_IP Message was received from an IP address * 1.5 XMNoVowels Alpha-numberic number with no vowels * 0.7 XMSubLong Long Subject * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.5000] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa07 1397; Body=1 Fuz1=1 Fuz2=1] * 0.0 T_TooManySym_01 4+ unique symbols in subject X-Spam-DCC: XMission; sa07 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: **;Dave Hansen X-Spam-Relay-Country: X-Spam-Timing: total 1177 ms - load_scoreonly_sql: 0.10 (0.0%), signal_user_changed: 4.4 (0.4%), b_tie_ro: 3.1 (0.3%), parse: 1.17 (0.1%), extract_message_metadata: 18 (1.5%), get_uri_detail_list: 1.90 (0.2%), tests_pri_-1000: 7 (0.6%), tests_pri_-950: 1.28 (0.1%), tests_pri_-900: 1.07 (0.1%), tests_pri_-400: 23 (1.9%), check_bayes: 21 (1.8%), b_tokenize: 6 (0.6%), b_tok_get_all: 7 (0.6%), b_comp_prob: 2.3 (0.2%), b_tok_touch_all: 3.0 (0.3%), b_finish: 0.88 (0.1%), tests_pri_0: 229 (19.5%), check_dkim_signature: 0.55 (0.0%), check_dkim_adsp: 3.2 (0.3%), tests_pri_500: 890 (75.6%), poll_dns_idle: 860 (73.0%), rewrite_mail: 0.00 (0.0%) Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Hansen writes: > On 07/01/2016 07:25 AM, Eric W. Biederman wrote: >> Linus Torvalds writes: >>> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >>>> >> >>>> >> I think what you suggest will work if we don't consider A/D in >>>> >> pte_none(). I think there are a bunch of code path where assume that >>>> >> !pte_present() && !pte_none() means swap. >>> > >>> > Yeah, we would need to change pte_none() to mask off D/A, but I think >>> > that might be the only real change needed (other than making sure that >>> > we don't use the bits in the swap entries, I didn't look at that part >>> > at all) >> It looks like __pte_to_swp_entry also needs to be changed to mask out >> those bits when the swap code reads pte entries. For all of the same >> reasons as pte_none. > > I guess that would be nice, but isn't it redundant? > > static inline swp_entry_t pte_to_swp_entry(pte_t pte) > { > ... > arch_entry = __pte_to_swp_entry(pte); > return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); > } > > As long as __swp_type() and __swp_offset() don't let A/D through, then > we should be OK. This site is the only call to __pte_to_swp_entry() > that I can find in the entire codebase. > > Or am I missing something? Given that __pte_to_swp_entry on x86_64 is just __pte_val or pte.pte it does no filtering. Similarly __swp_type(arch_entry) is a >> and swp_entry(type, ...) is a << of what appears to be same amount for the swap type. So any corruption in the upper bits of the pte will be preserved as a swap type. In fact I strongly suspect that the compiler can optimize out all of the work done by "swp_entry(__swp_type(arch_entry), _swp_offset(arch_entry))". Eric From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752734AbcGCOgx (ORCPT ); Sun, 3 Jul 2016 10:36:53 -0400 Received: from mail-vk0-f53.google.com ([209.85.213.53]:34769 "EHLO mail-vk0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752632AbcGCOgv (ORCPT ); Sun, 3 Jul 2016 10:36:51 -0400 MIME-Version: 1.0 In-Reply-To: <20160701164656.GG4593@pd.tnic> References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001210.AA77B917@viggo.jf.intel.com> <20160701092300.GD4593@pd.tnic> <20160701164656.GG4593@pd.tnic> From: Andy Lutomirski Date: Sun, 3 Jul 2016 07:36:30 -0700 Message-ID: Subject: Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro To: Borislav Petkov Cc: Dave Hansen , Michal Hocko , X86 ML , "linux-mm@kvack.org" , stable , Andrew Morton , "linux-kernel@vger.kernel.org" , Dave Hansen , Linus Torvalds , Andi Kleen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Jul 1, 2016 9:47 AM, "Borislav Petkov" wrote: > > On Fri, Jul 01, 2016 at 09:30:37AM -0700, Andy Lutomirski wrote: > > I put the ifdef there to prevent anyone from accidentally using it in > > a 64-bit code path, not to save a bit. We could put in the middle of > > the list to make the mistake much less likely to be repeated, I > > suppose. > > Well, if someone does, someone will notice pretty soon, no? Dunno. ESPFIX was broken under KVM for years and no one notices. > > I just don't see the reason to worry but maybe I'm missing it. > > And we can call it X86_BUG_ESPFIX_X86_32 or so too... We could do that, too, I guess. But the current solution is only two extra lines of code. We could reorder the things so that it's in the middle instead of at the end, I suppose. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752858AbcGCRKr (ORCPT ); Sun, 3 Jul 2016 13:10:47 -0400 Received: from www.sr71.net ([198.145.64.142]:56503 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752720AbcGCRKq (ORCPT ); Sun, 3 Jul 2016 13:10:46 -0400 Subject: Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs To: Brian Gerst , Linus Torvalds References: <20160701001209.7DA24D1C@viggo.jf.intel.com> <20160701001218.3D316260@viggo.jf.intel.com> Cc: Linux Kernel Mailing List , the arch/x86 maintainers , linux-mm , Andrew Morton , Borislav Petkov , Andi Kleen , Michal Hocko , Dave Hansen From: Dave Hansen Message-ID: <5779470F.8020205@sr71.net> Date: Sun, 3 Jul 2016 10:10:39 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/30/2016 08:06 PM, Brian Gerst wrote: >> > It's not like anybody will ever care about 32-bit page tables on >> > Knights Landing anyway. > Could this affect a 32-bit guest VM? This isn't about 32-bit *mode*. It's about using the the 32-bit 2-level _paging_ mode that supports only 4GB virtual and 4GB physical addresses. That mode also doesn't support the No-eXecute (NX) bit, which basically everyone needs today for its security benefits. Even the little Quark CPU supports PAE (64-bit page tables). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro To: linux-kernel@vger.kernel.org Cc: x86@kernel.org,linux-mm@kvack.org,torvalds@linux-foundation.org,akpm@linux-foundation.org,bp@alien8.de,ak@linux.intel.com,mhocko@suse.com,Dave Hansen ,dave.hansen@linux.intel.com,luto@kernel.org,stable@vger.kernel.org From: Dave Hansen Date: Thu, 30 Jun 2016 17:12:10 -0700 References: <20160701001209.7DA24D1C@viggo.jf.intel.com> In-Reply-To: <20160701001209.7DA24D1C@viggo.jf.intel.com> Message-Id: <20160701001210.AA77B917@viggo.jf.intel.com> Sender: owner-linux-mm@kvack.org List-ID: From: Dave Hansen epufeatures.h currently defines X86_BUG(9) twice on 32-bit: #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ ... #ifdef CONFIG_X86_32 #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif I think what happened was that this added the X86_BUG_ESPFIX, but in an #ifdef below most of the bugs: [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled Then this came along and added X86_BUG_NULL_SEG, but collided with the earlier one that did the bug below the main block defining all the X86_BUG()s. [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time Signed-off-by: Dave Hansen Acked-by: Andy Lutomirski Cc: stable@vger.kernel.org --- b/arch/x86/include/asm/cpufeatures.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros 2016-06-30 17:10:41.215185869 -0700 +++ b/arch/x86/include/asm/cpufeatures.h 2016-06-30 17:10:41.218186005 -0700 @@ -301,10 +301,6 @@ #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */ #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */ #define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */ -#define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */ - - #ifdef CONFIG_X86_32 /* * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional @@ -312,5 +308,7 @@ */ #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif +#define X86_BUG_NULL_SEG X86_BUG(10) /* Nulling a selector preserves the base */ +#define X86_BUG_SWAPGS_FENCE X86_BUG(11) /* SWAPGS without input dep on GS */ #endif /* _ASM_X86_CPUFEATURES_H */ _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org