From mboxrd@z Thu Jan 1 00:00:00 1970 Return-path: Received: from mga03.intel.com ([134.134.136.65]) by Galois.linutronix.de with esmtps (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1fGX2a-00089a-Ho for speck@linutronix.de; Wed, 09 May 2018 23:54:29 +0200 Date: Wed, 9 May 2018 14:54:25 -0700 From: Andi Kleen Subject: [MODERATED] Re: [PATCH v4 0/8] L1TFv4 0 Message-ID: <20180509215425.GA31444@tassilo.jf.intel.com> References: MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/mixed; boundary="mP3DRpeJDSE+ciuQ" Content-Disposition: inline To: speck@linutronix.de List-ID: --mP3DRpeJDSE+ciuQ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline And here's a mbox for easier review/applying --mP3DRpeJDSE+ciuQ Content-Type: application/vnd.wolfram.mathematica.package Content-Disposition: attachment; filename=m Content-Transfer-Encoding: quoted-printable =46rom 0c2eb2235d5476b216693f1e9ec8394d58af20b3 Mon Sep 17 00:00:00 2001=0A= =46rom: Andi Kleen =0ADate: Thu, 3 May 2018 08:35:42 -0= 700=0ASubject: [PATCH 1/8] x86, l1tf: Increase 32bit PAE __PHYSICAL_PAGE_MA= SK=0AStatus: RO=0AContent-Length: 1575=0ALines: 43=0A=0AOn 32bit PAE the ma= x PTE mask is currently set to 44 bit because that is=0Athe limit imposed b= y 32bit unsigned long PFNs in the VMs.=0A=0AThe L1TF PROT_NONE protection c= ode uses the PTE masks to determine=0Awhat bits to invert to make sure the = higher bits are set for unmapped=0Aentries to prevent L1TF speculation atta= cks against EPT inside guests.=0A=0ABut our inverted mask has to match the = host, and the host is likely=0A64bit and may use more than 43 bits of memor= y. We want to set=0Aall possible bits to be safe here.=0A=0ASo increase the= mask on 32bit PAE to 52 to match 64bit. 
The real=0Alimit is still 44 bits = but outside the inverted PTEs these=0Ahigher bits are set, so a bigger mask= s don't cause any problems.=0A=0ASigned-off-by: Andi Kleen =0A---=0A arch/x86/include/asm/page_32_types.h | 9 +++++++--=0A 1 file = changed, 7 insertions(+), 2 deletions(-)=0A=0Adiff --git a/arch/x86/include= /asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h=0Aindex aa30c32= 41ea7..0d5c739eebd7 100644=0A--- a/arch/x86/include/asm/page_32_types.h=0A+= ++ b/arch/x86/include/asm/page_32_types.h=0A@@ -29,8 +29,13 @@=0A #define N= _EXCEPTION_STACKS 1=0A =0A #ifdef CONFIG_X86_PAE=0A-/* 44=3D32+12, the limi= t we can fit into an unsigned long pfn */=0A-#define __PHYSICAL_MASK_SHIFT = 44=0A+/*=0A+ * This is beyond the 44 bit limit imposed by the 32bit long pf= ns,=0A+ * but we need the full mask to make sure inverted PROT_NONE=0A+ * e= ntries have all the host bits set in a guest.=0A+ * The real limit is still= 44 bits.=0A+ */=0A+#define __PHYSICAL_MASK_SHIFT 52=0A #define __VIRTUAL_M= ASK_SHIFT 32=0A =0A #else /* !CONFIG_X86_PAE */=0A-- =0A2.14.3=0A=0A=0AFro= m 1bef0e393f925379b76cb689bfb3fdbfc052e716 Mon Sep 17 00:00:00 2001=0AFrom:= Linus Torvalds =0ADate: Fri, 27 Apr 2018 09= :06:34 -0700=0ASubject: [PATCH 2/8] x86, l1tf: Protect swap entries against= L1TF=0AStatus: RO=0AContent-Length: 4505=0ALines: 108=0A=0AWith L1 termina= l fault the CPU speculates into unmapped PTEs, and=0Aresulting side effects= allow to read the memory the PTE is pointing=0Atoo, if its values are stil= l in the L1 cache.=0A=0AFor swapped out pages Linux uses unmapped PTEs and = stores a swap entry=0Ainto them.=0A=0AWe need to make sure the swap entry i= s not pointing to valid memory,=0Awhich requires setting higher bits (betwe= en bit 36 and bit 45) that=0Aare inside the CPUs physical address space, bu= t outside any real=0Amemory.=0A=0ATo do this we invert the offset to make s= ure the higher bits are always=0Aset, as long as the swap file is not too b= ig.=0A=0AHere's a 
patch that switches the order of "type" and=0A"offset" in= the x86-64 encoding, in addition to doing the binary 'not' on=0Athe offset= =2E=0A=0AThat means that now the offset is bits 9-58 in the page table, and= that=0Athe offset is in the bits that hardware generally doesn't care abou= t.=0A=0AThat, in turn, means that if you have a desktop chip with only 40 b= its of=0Aphysical addressing, now that the offset starts at bit 9, you stil= l have=0Ato have 30 bits of offset actually *in use* until bit 39 ends up b= eing=0Aclear.=0A=0ASo that's 4 terabyte of swap space (because the offset i= s counted in=0Apages, so 30 bits of offset is 42 bits of actual coverage). = With bigger=0Aphysical addressing, that obviously grows further, until you = hit the limit=0Aof the offset (at 50 bits of offset - 62 bits of actual swa= p file=0Acoverage).=0A=0ANote there is no workaround for 32bit !PAE, or on = systems which=0Ahave more than MAX_PA/2 memory. The later case is very unli= kely=0Ato happen on real systems.=0A=0A[updated description and minor tweak= s by AK]=0A=0ASigned-off-by: Linus Torvalds = =0ASigned-off-by: Andi Kleen =0ATested-by: Andi Kleen <= ak@linux.intel.com>=0AAcked-by: Michal Hocko =0A---=0A arc= h/x86/include/asm/pgtable_64.h | 36 +++++++++++++++++++++++++-----------=0A= 1 file changed, 25 insertions(+), 11 deletions(-)=0A=0Adiff --git a/arch/x= 86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h=0Aindex 877= bc27718ae..593c3cf259dd 100644=0A--- a/arch/x86/include/asm/pgtable_64.h=0A= +++ b/arch/x86/include/asm/pgtable_64.h=0A@@ -273,7 +273,7 @@ static inline= int pgd_large(pgd_t pgd) { return 0; }=0A *=0A * | ... | = 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number=0A * | ... 
|= SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names=0A- * | OFFSET (14->63) | TY= PE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry=0A+ * | TYPE (59-63) | ~OFFS= ET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry=0A *=0A * G (8) is aliased= and used as a PROT_NONE indicator for=0A * !present ptes. We need to sta= rt storing swap entries above=0A@@ -286,20 +286,34 @@ static inline int pgd= _large(pgd_t pgd) { return 0; }=0A *=0A * Bit 7 in swp entry should be 0 = because pmd_present checks not only P,=0A * but also L and G.=0A+ *=0A+ * = The offset is inverted by a binary not operation to make the high=0A+ * phy= sical bits set.=0A */=0A-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + = 1)=0A-#define SWP_TYPE_BITS 5=0A-/* Place the offset above the type: */=0A-= #define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)=0A+#defin= e SWP_TYPE_BITS 5=0A+=0A+#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE = + 1)=0A+=0A+/* We always extract/encode the offset by shifting it all the w= ay up, and then down again */=0A+#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST= _BIT+SWP_TYPE_BITS)=0A =0A #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_S= WAPFILES_SHIFT > SWP_TYPE_BITS)=0A =0A-#define __swp_type(x) (((x).val >>= (SWP_TYPE_FIRST_BIT)) \=0A- & ((1U << SWP_TYPE_BITS) - 1))=0A-#define= __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT)=0A-#define __swp_entry= (type, offset) ((swp_entry_t) { \=0A- ((type) << (SWP_TYPE_FIRST_BIT))= \=0A- | ((offset) << SWP_OFFSET_FIRST_BIT) })=0A+/* Extract the high = bits for type */=0A+#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))= =0A+=0A+/* Shift up (to get rid of type), then down to get value */=0A+#def= ine __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)=0A+=0A+= /*=0A+ * Shift the offset up "too far" by TYPE bits, then down again=0A+ * = The offset is inverted by a binary not operation to make the high=0A+ * phy= sical bits set.=0A+ */=0A+#define __swp_entry(type, offset) ((swp_entry_t) = { \=0A+ (~(unsigned 
long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \= =0A+ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })=0A+=0A #define __pt= e_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })=0A #define __pmd_to= _swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })=0A #define __swp_entry_= to_pte(x) ((pte_t) { .pte =3D (x).val })=0A-- =0A2.14.3=0A=0A=0AFrom 07a23= 314494bcaf78e47852462364a6d57e9b3b1 Mon Sep 17 00:00:00 2001=0AFrom: Andi K= leen =0ADate: Fri, 27 Apr 2018 09:47:37 -0700=0ASubject= : [PATCH 3/8] x86, l1tf: Protect PROT_NONE PTEs against speculation=0AStatu= s: O=0AContent-Length: 8094=0ALines: 254=0A=0AWe also need to protect PTEs = that are set to PROT_NONE against=0AL1TF speculation attacks.=0A=0AThis is = important inside guests, because L1TF speculation=0Abypasses physical page = remapping. While the VM has its own=0Amigitations preventing leaking data f= rom other VMs into=0Athe guest, this would still risk leaking the wrong pag= e=0Ainside the current guest.=0A=0AThis uses the same technique as Linus' s= wap entry patch:=0Awhile an entry is is in PROTNONE state we invert the=0Ac= omplete PFN part part of it. This ensures that the=0Athe highest bit will p= oint to non existing memory.=0A=0AThe invert is done by pte/pmd/pud_modify = and pfn/pmd/pud_pte for=0APROTNONE and pte/pmd/pud_pfn undo it.=0A=0AWe ass= ume that noone tries to touch the PFN part of=0Aa PTE without using these p= rimitives.=0A=0AThis doesn't handle the case that MMIO is on the top=0Aof t= he CPU physical memory. If such an MMIO region=0Awas exposed by an unprivil= edged driver for mmap=0Ait would be possible to attack some real memory.=0A= However this situation is all rather unlikely.=0A=0AFor 32bit non PAE we do= n't try inversion because=0Athere are really not enough bits to protect any= thing.=0A=0AQ: Why does the guest need to be protected when the=0AHyperViso= r already has L1TF mitigations?=0AA: Here's an example:=0AYou have physical= pages 1 2. 
They get mapped into a guest as=0AGPA 1 -> PA 2=0AGPA 2 -> PA 1= =0Athrough EPT.=0A=0AThe L1TF speculation ignores the EPT remapping.=0A=0AN= ow the guest kernel maps GPA 1 to process A and GPA 2 to process B,=0Aand t= hey belong to different users and should be isolated.=0A=0AA sets the GPA 1= PA 2 PTE to PROT_NONE to bypass the EPT remapping=0Aand gets read access t= o the underlying physical page. Which=0Ain this case points to PA 2, so it = can read process B's data,=0Aif it happened to be in L1.=0A=0ASo we broke i= solation inside the guest.=0A=0AThere's nothing the hypervisor can do about= this. This=0Amitigation has to be done in the guest.=0A=0Av2: Use new help= er to generate XOR mask to invert (Linus)=0Av3: Use inline helper for protn= one mask checking=0ASigned-off-by: Andi Kleen =0AAcked-= by: Michal Hocko =0A---=0A arch/x86/include/asm/pgtable-2l= evel.h | 17 ++++++++++++++=0A arch/x86/include/asm/pgtable-3level.h | 2 ++= =0A arch/x86/include/asm/pgtable-invert.h | 32 +++++++++++++++++++++++++=0A= arch/x86/include/asm/pgtable.h | 44 ++++++++++++++++++++++++-------= ----=0A arch/x86/include/asm/pgtable_64.h | 2 ++=0A 5 files changed, 8= 4 insertions(+), 13 deletions(-)=0A create mode 100644 arch/x86/include/asm= /pgtable-invert.h=0A=0Adiff --git a/arch/x86/include/asm/pgtable-2level.h b= /arch/x86/include/asm/pgtable-2level.h=0Aindex 685ffe8a0eaf..60d0f9015317 1= 00644=0A--- a/arch/x86/include/asm/pgtable-2level.h=0A+++ b/arch/x86/includ= e/asm/pgtable-2level.h=0A@@ -95,4 +95,21 @@ static inline unsigned long pte= _bitop(unsigned long value, unsigned int rightshi=0A #define __pte_to_swp_e= ntry(pte) ((swp_entry_t) { (pte).pte_low })=0A #define __swp_entry_to_pte(= x) ((pte_t) { .pte =3D (x).val })=0A =0A+/* No inverted PFNs on 2 level pa= ge tables */=0A+=0A+static inline u64 protnone_mask(u64 val)=0A+{=0A+ retur= n 0;=0A+}=0A+=0A+static inline u64 flip_protnone_guard(u64 oldval, u64 val,= u64 mask)=0A+{=0A+ return val;=0A+}=0A+=0A+static inline bool 
__pte_needs_= invert(u64 val)=0A+{=0A+ return false;=0A+}=0A+=0A #endif /* _ASM_X86_PGTAB= LE_2LEVEL_H */=0Adiff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/= x86/include/asm/pgtable-3level.h=0Aindex f24df59c40b2..76ab26a99e6e 100644= =0A--- a/arch/x86/include/asm/pgtable-3level.h=0A+++ b/arch/x86/include/asm= /pgtable-3level.h=0A@@ -295,4 +295,6 @@ static inline pte_t gup_get_pte(pte= _t *ptep)=0A return pte;=0A }=0A =0A+#include =0A+= =0A #endif /* _ASM_X86_PGTABLE_3LEVEL_H */=0Adiff --git a/arch/x86/include/= asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h=0Anew file mod= e 100644=0Aindex 000000000000..c740606b0c02=0A--- /dev/null=0A+++ b/arch/x8= 6/include/asm/pgtable-invert.h=0A@@ -0,0 +1,32 @@=0A+/* SPDX-License-Identi= fier: GPL-2.0 */=0A+#ifndef _ASM_PGTABLE_INVERT_H=0A+#define _ASM_PGTABLE_I= NVERT_H 1=0A+=0A+#ifndef __ASSEMBLY__=0A+=0A+static inline bool __pte_needs= _invert(u64 val)=0A+{=0A+ return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) =3D= =3D _PAGE_PROTNONE;=0A+}=0A+=0A+/* Get a mask to xor with the page table en= try to get the correct pfn. */=0A+static inline u64 protnone_mask(u64 val)= =0A+{=0A+ return __pte_needs_invert(val) ? 
~0ull : 0;=0A+}=0A+=0A+static i= nline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)=0A+{=0A+ /*=0A= + * When a PTE transitions from NONE to !NONE or vice-versa=0A+ * invert = the PFN part to stop speculation.=0A+ * pte_pfn undoes this when needed.= =0A+ */=0A+ if ((oldval & _PAGE_PROTNONE) !=3D (val & _PAGE_PROTNONE))=0A+= val =3D (val & ~mask) | (~val & mask);=0A+ return val;=0A+}=0A+=0A+#endif= /* __ASSEMBLY__ */=0A+=0A+#endif=0Adiff --git a/arch/x86/include/asm/pgtab= le.h b/arch/x86/include/asm/pgtable.h=0Aindex 5f49b4ff0c24..f811e3257e87 10= 0644=0A--- a/arch/x86/include/asm/pgtable.h=0A+++ b/arch/x86/include/asm/pg= table.h=0A@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)=0A= return pte_flags(pte) & _PAGE_SPECIAL;=0A }=0A =0A+/* Entries that were s= et to PROT_NONE are inverted */=0A+=0A+static inline u64 protnone_mask(u64 = val);=0A+=0A static inline unsigned long pte_pfn(pte_t pte)=0A {=0A- return= (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;=0A+ unsigned long pfn =3D pte= _val(pte);=0A+ pfn ^=3D protnone_mask(pfn);=0A+ return (pfn & PTE_PFN_MASK)= >> PAGE_SHIFT;=0A }=0A =0A static inline unsigned long pmd_pfn(pmd_t pmd)= =0A {=0A- return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;=0A+ unsi= gned long pfn =3D pmd_val(pmd);=0A+ pfn ^=3D protnone_mask(pfn);=0A+ return= (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;=0A }=0A =0A static inline unsigne= d long pud_pfn(pud_t pud)=0A {=0A- return (pud_val(pud) & pud_pfn_mask(pud)= ) >> PAGE_SHIFT;=0A+ unsigned long pfn =3D pud_val(pud);=0A+ pfn ^=3D protn= one_mask(pfn);=0A+ return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;=0A }=0A = =0A static inline unsigned long p4d_pfn(p4d_t p4d)=0A@@ -545,25 +555,33 @@ = static inline pgprotval_t check_pgprot(pgprot_t pgprot)=0A =0A static inlin= e pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)=0A {=0A- return __p= te(((phys_addr_t)page_nr << PAGE_SHIFT) |=0A- check_pgprot(pgprot));= =0A+ phys_addr_t pfn =3D page_nr << PAGE_SHIFT;=0A+ pfn ^=3D 
protnone_mask(= pgprot_val(pgprot));=0A+ pfn &=3D PTE_PFN_MASK;=0A+ return __pte(pfn | chec= k_pgprot(pgprot));=0A }=0A =0A static inline pmd_t pfn_pmd(unsigned long pa= ge_nr, pgprot_t pgprot)=0A {=0A- return __pmd(((phys_addr_t)page_nr << PAGE= _SHIFT) |=0A- check_pgprot(pgprot));=0A+ phys_addr_t pfn =3D page_nr = << PAGE_SHIFT;=0A+ pfn ^=3D protnone_mask(pgprot_val(pgprot));=0A+ pfn &=3D= PHYSICAL_PMD_PAGE_MASK;=0A+ return __pmd(pfn | check_pgprot(pgprot));=0A }= =0A =0A static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)= =0A {=0A- return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |=0A- che= ck_pgprot(pgprot));=0A+ phys_addr_t pfn =3D page_nr << PAGE_SHIFT;=0A+ pfn = ^=3D protnone_mask(pgprot_val(pgprot));=0A+ pfn &=3D PHYSICAL_PUD_PAGE_MASK= ;=0A+ return __pud(pfn | check_pgprot(pgprot));=0A }=0A =0A+static inline u= 64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);=0A+=0A static inline= pte_t pte_modify(pte_t pte, pgprot_t newprot)=0A {=0A- pteval_t val =3D pt= e_val(pte);=0A+ pteval_t val =3D pte_val(pte), oldval =3D val;=0A =0A /*= =0A * Chop off the NX bit (if present), and add the NX portion of=0A@@ -5= 71,17 +589,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot= )=0A */=0A val &=3D _PAGE_CHG_MASK;=0A val |=3D check_pgprot(newprot) &= ~_PAGE_CHG_MASK;=0A-=0A+ val =3D flip_protnone_guard(oldval, val, PTE_PFN_= MASK);=0A return __pte(val);=0A }=0A =0A static inline pmd_t pmd_modify(pm= d_t pmd, pgprot_t newprot)=0A {=0A- pmdval_t val =3D pmd_val(pmd);=0A+ pmdv= al_t val =3D pmd_val(pmd), oldval =3D val;=0A =0A val &=3D _HPAGE_CHG_MASK= ;=0A val |=3D check_pgprot(newprot) & ~_HPAGE_CHG_MASK;=0A-=0A+ val =3D fl= ip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);=0A return __pmd(va= l);=0A }=0A =0Adiff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/in= clude/asm/pgtable_64.h=0Aindex 593c3cf259dd..ea99272ab63e 100644=0A--- a/ar= ch/x86/include/asm/pgtable_64.h=0A+++ b/arch/x86/include/asm/pgtable_64.h= =0A@@ 
-357,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long = start, int nr_pages,=0A return true;=0A }=0A =0A+#include =0A+=0A #endif /* !__ASSEMBLY__ */=0A #endif /* _ASM_X86_PGTABLE_64_H= */=0A-- =0A2.14.3=0A=0A=0AFrom c75da7960a5888721ae8921a49dd485e8c97b3c3 Mo= n Sep 17 00:00:00 2001=0AFrom: Andi Kleen =0ADate: Mon,= 23 Apr 2018 15:57:54 -0700=0ASubject: [PATCH 4/8] x86, l1tf: Make sure the= first page is always reserved=0AStatus: O=0AContent-Length: 985=0ALines: 3= 1=0A=0AThe L1TF workaround doesn't make any attempt to mitigate speculate= =0Aaccesses to the first physical page for zeroed PTEs. Normally=0Ait only = contains some data from the early real mode BIOS.=0A=0AI couldn't convince = myself we always reserve the first page in=0Aall configurations, so add an = extra reservation call to=0Amake sure it is really reserved. In most config= urations (e.g.=0Awith the standard reservations) it's likely a nop.=0A=0ASi= gned-off-by: Andi Kleen =0A---=0A arch/x86/kernel/setup= =2Ec | 3 +++=0A 1 file changed, 3 insertions(+)=0A=0Adiff --git a/arch/x86/= kernel/setup.c b/arch/x86/kernel/setup.c=0Aindex 6285697b6e56..fadbd41094d2= 100644=0A--- a/arch/x86/kernel/setup.c=0A+++ b/arch/x86/kernel/setup.c=0A@= @ -817,6 +817,9 @@ void __init setup_arch(char **cmdline_p)=0A memblock_re= serve(__pa_symbol(_text),=0A (unsigned long)__bss_stop - (unsigned long= )_text);=0A =0A+ /* Make sure page 0 is always reserved */=0A+ memblock_res= erve(0, PAGE_SIZE);=0A+=0A early_reserve_initrd();=0A =0A /*=0A-- =0A2.14= =2E3=0A=0A=0AFrom c94c11d610008319d373292207356675438627e8 Mon Sep 17 00:00= :00 2001=0AFrom: Andi Kleen =0ADate: Fri, 27 Apr 2018 1= 4:44:53 -0700=0ASubject: [PATCH 5/8] x86, l1tf: Add sysfs reporting for l1t= f=0AStatus: O=0AContent-Length: 5334=0ALines: 141=0A=0AL1TF core kernel wor= karounds are cheap and normally always enabled,=0AHowever we still want to = report in sysfs if the system is vulnerable=0Aor mitigated. 
Add the necessa= ry checks.=0A=0A- We use the same checks as Meltdown to determine if the sy= stem is=0Avulnerable. This excludes some Atom CPUs which don't have this=0A= problem.=0A- We check for the (very unlikely) memory > MAX_PA/2 case=0A- We= check for 32bit non PAE and warn=0A=0ANote this patch will likely conflict= with some other workaround patches=0Afloating around, but should be straig= ht forward to fix.=0A=0Av2: Use positive instead of negative flag for WA. F= ix override=0Areporting.=0Av3: Fix L1TF_WA flag settting=0ASigned-off-by: A= ndi Kleen =0A---=0A arch/x86/include/asm/cpufeatures.h = | 2 ++=0A arch/x86/kernel/cpu/bugs.c | 11 +++++++++++=0A arch/x86/= kernel/cpu/common.c | 15 ++++++++++++++-=0A drivers/base/cpu.c = | 8 ++++++++=0A include/linux/cpu.h | 2 ++=0A 5 = files changed, 37 insertions(+), 1 deletion(-)=0A=0Adiff --git a/arch/x86/i= nclude/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h=0Aindex d554c= 11e01ff..f1bfe8a37b84 100644=0A--- a/arch/x86/include/asm/cpufeatures.h=0A+= ++ b/arch/x86/include/asm/cpufeatures.h=0A@@ -214,6 +214,7 @@=0A =0A #defin= e X86_FEATURE_USE_IBPB ( 7*32+21) /* "" Indirect Branch Prediction Barrier= enabled */=0A #define X86_FEATURE_USE_IBRS_FW ( 7*32+22) /* "" Use IBRS d= uring runtime firmware calls */=0A+#define X86_FEATURE_L1TF_WA ( 7*32+23) = /* "" L1TF workaround used */=0A =0A /* Virtualization flags: Linux defined= , word 8 */=0A #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shad= ow */=0A@@ -362,5 +363,6 @@=0A #define X86_BUG_CPU_MELTDOWN X86_BUG(14) /*= CPU is affected by meltdown attack and needs kernel page table isolation *= /=0A #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre = variant 1 attack with conditional branches */=0A #define X86_BUG_SPECTRE_V2= X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect = branches */=0A+#define X86_BUG_L1TF X86_BUG(17) /* CPU is affected by L1 = Terminal Fault */=0A =0A #endif /* _ASM_X86_CPUFEATURES_H 
*/=0Adiff --git a= /arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c=0Aindex bfca937bdc= c3..e1f67b7c5217 100644=0A--- a/arch/x86/kernel/cpu/bugs.c=0A+++ b/arch/x86= /kernel/cpu/bugs.c=0A@@ -340,4 +340,15 @@ ssize_t cpu_show_spectre_v2(struc= t device *dev, struct device_attribute *attr, c=0A boot_cpu_has(X8= 6_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "",=0A spectre_v2_module_st= ring());=0A }=0A+=0A+ssize_t cpu_show_l1tf(struct device *dev, struct devic= e_attribute *attr, char *buf)=0A+{=0A+ if (!boot_cpu_has_bug(X86_BUG_L1TF))= =0A+ return sprintf(buf, "Not affected\n");=0A+=0A+ if (boot_cpu_has(X86_F= EATURE_L1TF_WA))=0A+ return sprintf(buf, "Mitigated\n");=0A+=0A+ return sp= rintf(buf, "Mitigation Unavailable\n");=0A+}=0A #endif=0Adiff --git a/arch/= x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c=0Aindex 8a5b185735e1= =2E.8bb14ccb2f4b 100644=0A--- a/arch/x86/kernel/cpu/common.c=0A+++ b/arch/x= 86/kernel/cpu/common.c=0A@@ -940,6 +940,15 @@ static bool __init cpu_vulner= able_to_meltdown(struct cpuinfo_x86 *c)=0A return true;=0A }=0A =0A+static= bool __init l1tf_wa_possible(void)=0A+{=0A+#if CONFIG_PGTABLE_LEVELS =3D= =3D 2=0A+ pr_warn("Kernel not compiled for PAE. 
No workaround for L1TF\n");= =0A+ return false;=0A+#endif=0A+ return true;=0A+}=0A+=0A /*=0A * Do minim= um CPU detection early.=0A * Fields really needed: vendor, cpuid_level, fa= mily, model, mask,=0A@@ -989,8 +998,12 @@ static void __init early_identify= _cpu(struct cpuinfo_x86 *c)=0A setup_force_cpu_cap(X86_FEATURE_ALWAYS);=0A= =0A if (!x86_match_cpu(cpu_no_speculation)) {=0A- if (cpu_vulnerable_to_= meltdown(c))=0A+ if (cpu_vulnerable_to_meltdown(c)) {=0A setup_force_cp= u_bug(X86_BUG_CPU_MELTDOWN);=0A+ setup_force_cpu_bug(X86_BUG_L1TF);=0A+ = if (l1tf_wa_possible())=0A+ setup_force_cpu_cap(X86_FEATURE_L1TF_WA);= =0A+ }=0A setup_force_cpu_bug(X86_BUG_SPECTRE_V1);=0A setup_force_cpu_= bug(X86_BUG_SPECTRE_V2);=0A }=0Adiff --git a/drivers/base/cpu.c b/drivers/= base/cpu.c=0Aindex 2da998baa75c..ed7b8591d461 100644=0A--- a/drivers/base/c= pu.c=0A+++ b/drivers/base/cpu.c=0A@@ -534,14 +534,22 @@ ssize_t __weak cpu_= show_spectre_v2(struct device *dev,=0A return sprintf(buf, "Not affected\n= ");=0A }=0A =0A+ssize_t __weak cpu_show_l1tf(struct device *dev,=0A+ = struct device_attribute *attr, char *buf)=0A+{=0A+ return sprintf(buf, "Not= affected\n");=0A+}=0A+=0A static DEVICE_ATTR(meltdown, 0444, cpu_show_melt= down, NULL);=0A static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, N= ULL);=0A static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);= =0A+static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);=0A =0A static stru= ct attribute *cpu_root_vulnerabilities_attrs[] =3D {=0A &dev_attr_meltdown= =2Eattr,=0A &dev_attr_spectre_v1.attr,=0A &dev_attr_spectre_v2.attr,=0A+ = &dev_attr_l1tf.attr,=0A NULL=0A };=0A =0Adiff --git a/include/linux/cpu.h = b/include/linux/cpu.h=0Aindex 7b01bc11c692..75c430046ca0 100644=0A--- a/inc= lude/linux/cpu.h=0A+++ b/include/linux/cpu.h=0A@@ -53,6 +53,8 @@ extern ssi= ze_t cpu_show_spectre_v1(struct device *dev,=0A struct device_attrib= ute *attr, char *buf);=0A extern ssize_t cpu_show_spectre_v2(struct device = *dev,=0A 
struct device_attribute *attr, char *buf);=0A+extern ssize_= t cpu_show_l1tf(struct device *dev,=0A+ struct device_attribute *attr= , char *buf);=0A =0A extern __printf(4, 5)=0A struct device *cpu_device_cre= ate(struct device *parent, void *drvdata,=0A-- =0A2.14.3=0A=0A=0AFrom 524e0= 68e6b7286121da0b3979bb20fd5a2b3fe38 Mon Sep 17 00:00:00 2001=0AFrom: Andi K= leen =0ADate: Fri, 9 Feb 2018 10:36:15 -0800=0ASubject:= [PATCH 6/8] x86, l1tf: Report if too much memory for L1TF workaround=0ASta= tus: RO=0AContent-Length: 2703=0ALines: 87=0A=0AIf the system has more than= MAX_PA/2 physical memory the=0Ainvert page workarounds don't protect the s= ystem against=0Athe L1TF attack anymore, because an inverted physical addre= ss=0Awill point to valid memory.=0A=0AWe cannot do much here, after all use= rs want to use the=0Amemory, but at least print a warning and report the sy= stem as=0Avulnerable in sysfs=0A=0ANote this is all extremely unlikely to h= appen on a real machine=0Abecause they typically have far more MAX_PA than = DIMM slots=0A=0ASome VMs also report fairly small PAs to guest, e.g. 
only 3= 6bits.=0AIn this case the threshold will be lower, but applies only=0Ato th= e maximum guest size.=0A=0ASince this needs to clear a feature bit that has= been forced=0Aearlier add a special "unforce" macro that supports this.=0A= =0ASigned-off-by: Andi Kleen =0A---=0A arch/x86/include= /asm/cpufeature.h | 5 +++++=0A arch/x86/kernel/setup.c | 25 ++++= ++++++++++++++++++++-=0A 2 files changed, 29 insertions(+), 1 deletion(-)= =0A=0Adiff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm= /cpufeature.h=0Aindex b27da9602a6d..f78bfd2464c1 100644=0A--- a/arch/x86/in= clude/asm/cpufeature.h=0A+++ b/arch/x86/include/asm/cpufeature.h=0A@@ -138,= 6 +138,11 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int = bit);=0A set_bit(bit, (unsigned long *)cpu_caps_set); \=0A } while (0)=0A = =0A+#define setup_unforce_cpu_cap(bit) do { \=0A+ clear_cpu_cap(&boot_cpu_d= ata, bit); \=0A+ clear_bit(bit, (unsigned long *)cpu_caps_set); \=0A+} whil= e (0)=0A+=0A #define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)=0A = =0A /*=0Adiff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c=0Ai= ndex fadbd41094d2..b49fcb3e3a97 100644=0A--- a/arch/x86/kernel/setup.c=0A++= + b/arch/x86/kernel/setup.c=0A@@ -779,7 +779,28 @@ static void __init trim_= low_memory_range(void)=0A {=0A memblock_reserve(0, ALIGN(reserve_low, PAGE= _SIZE));=0A }=0A- =0A+=0A+static __init void check_maxpa_memory(void)=0A+{= =0A+ u64 len;=0A+=0A+ if (!boot_cpu_has(X86_BUG_L1TF))=0A+ return;=0A+=0A+= len =3D BIT_ULL(boot_cpu_data.x86_phys_bits - 1) - 1;=0A+=0A+ /*=0A+ * Th= is is extremely unlikely to happen because systems near always have far=0A+= * more MAX_PA than DIMM slots.=0A+ */=0A+ if (e820__mapped_any(len, ULLO= NG_MAX - len,=0A+ E820_TYPE_RAM)) {=0A+ pr_warn("System has more t= han MAX_PA/2 memory. Disabled L1TF workaround\n");=0A+ /* Was forced earli= er, so now unforce it. 
*/=0A+ setup_unforce_cpu_cap(X86_FEATURE_L1TF_WA);= =0A+ }=0A+}=0A+=0A /*=0A * Dump out kernel offset information on panic.=0A= */=0A@@ -1016,6 +1037,8 @@ void __init setup_arch(char **cmdline_p)=0A i= nsert_resource(&iomem_resource, &data_resource);=0A insert_resource(&iomem= _resource, &bss_resource);=0A =0A+ check_maxpa_memory();=0A+=0A e820_add_k= ernel_range();=0A trim_bios_range();=0A #ifdef CONFIG_X86_32=0A-- =0A2.14.= 3=0A=0A=0AFrom b31f6dd0e2447e3cbc0959209a946a5224d10499 Mon Sep 17 00:00:00= 2001=0AFrom: Andi Kleen =0ADate: Fri, 27 Apr 2018 15:2= 9:17 -0700=0ASubject: [PATCH 7/8] x86, l1tf: Limit swap file size to MAX_PA= /2=0AStatus: O=0AContent-Length: 5291=0ALines: 148=0A=0AFor the L1TF workar= ound we want to limit the swap file size to below=0AMAX_PA/2, so that the h= igher bits of the swap offset inverted never=0Apoint to valid memory.=0A=0A= Add a way for the architecture to override the swap file=0Asize check in sw= apfile.c and add a x86 specific max swapfile check=0Afunction that enforces= that limit.=0A=0AThe check is only enabled if the CPU is vulnerable to L1T= F.=0A=0AIn VMs with 42bit MAX_PA the typical limit is 2TB now,=0Aon a nativ= e system with 46bit PA it is 32TB. 
The limit=0Ais only per individual swap = file, so it's always possible=0Ato exceed these limits with multiple swap f= iles or=0Apartitions.=0A=0Av2: Use new helper for maxpa_mask computation.= =0ASigned-off-by: Andi Kleen =0A---=0A arch/x86/include= /asm/processor.h | 5 +++++=0A arch/x86/mm/init.c | 15 ++++++= ++++++++=0A include/linux/swapfile.h | 2 ++=0A mm/swapfile.c = | 44 +++++++++++++++++++++++++---------------=0A 4 files chan= ged, 50 insertions(+), 16 deletions(-)=0A=0Adiff --git a/arch/x86/include/a= sm/processor.h b/arch/x86/include/asm/processor.h=0Aindex 21a114914ba4..2bd= 676e450cf 100644=0A--- a/arch/x86/include/asm/processor.h=0A+++ b/arch/x86/= include/asm/processor.h=0A@@ -181,6 +181,11 @@ extern const struct seq_oper= ations cpuinfo_op;=0A =0A extern void cpu_detect(struct cpuinfo_x86 *c);=0A= =0A+static inline u64 maxpa_pfn_bit(int offset)=0A+{=0A+ return BIT_ULL(bo= ot_cpu_data.x86_phys_bits - offset - PAGE_SHIFT);=0A+}=0A+=0A extern void e= arly_cpu_init(void);=0A extern void identify_boot_cpu(void);=0A extern void= identify_secondary_cpu(struct cpuinfo_x86 *);=0Adiff --git a/arch/x86/mm/i= nit.c b/arch/x86/mm/init.c=0Aindex fec82b577c18..b4078eb05ca0 100644=0A--- = a/arch/x86/mm/init.c=0A+++ b/arch/x86/mm/init.c=0A@@ -4,6 +4,8 @@=0A #inclu= de =0A #include =0A #include /* for max_low_pfn */=0A+#include =0A+#include =0A =0A #include =0A #include =0A@@ -878,3 +880,16 @@ void update_cache_mode_entry(unsigned entry, enum= page_cache_mode cache)=0A __cachemode2pte_tbl[cache] =3D __cm_idx2pte(ent= ry);=0A __pte2cachemode_tbl[entry] =3D cache;=0A }=0A+=0A+unsigned long ma= x_swapfile_size(void)=0A+{=0A+ unsigned long pages;=0A+=0A+ pages =3D gener= ic_max_swapfile_size();=0A+=0A+ if (boot_cpu_has(X86_BUG_L1TF)) {=0A+ /* L= imit the swap file size to MAX_PA/2 for L1TF workaround */=0A+ pages =3D m= in_t(unsigned long, maxpa_pfn_bit(1), pages);=0A+ }=0A+ return pages;=0A+}= =0Adiff --git a/include/linux/swapfile.h b/include/linux/swapfile.h=0Aindex= 
06bd7b096167..e06febf62978 100644=0A--- a/include/linux/swapfile.h=0A+++ b= /include/linux/swapfile.h=0A@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;= =0A extern struct plist_head swap_active_head;=0A extern struct swap_info_s= truct *swap_info[];=0A extern int try_to_unuse(unsigned int, bool, unsigned= long);=0A+extern unsigned long generic_max_swapfile_size(void);=0A+extern = unsigned long max_swapfile_size(void);=0A =0A #endif /* _LINUX_SWAPFILE_H *= /=0Adiff --git a/mm/swapfile.c b/mm/swapfile.c=0Aindex cc2cf04d9018..413f48= 424194 100644=0A--- a/mm/swapfile.c=0A+++ b/mm/swapfile.c=0A@@ -2909,6 +290= 9,33 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode = *inode)=0A return 0;=0A }=0A =0A+=0A+/*=0A+ * Find out how many pages are = allowed for a single swap=0A+ * device. There are two limiting factors: 1) = the number=0A+ * of bits for the swap offset in the swp_entry_t type, and= =0A+ * 2) the number of bits in the swap pte as defined by the=0A+ * differ= ent architectures. In order to find the=0A+ * largest possible bit mask, a = swap entry with swap type 0=0A+ * and swap offset ~0UL is created, encoded = to a swap pte,=0A+ * decoded to a swp_entry_t again, and finally the swap= =0A+ * offset is extracted. This will mask all the bits from=0A+ * the init= ial ~0UL mask that can't be encoded in either=0A+ * the swp_entry_t or the = architecture definition of a=0A+ * swap pte.=0A+ */=0A+unsigned long generi= c_max_swapfile_size(void)=0A+{=0A+ return swp_offset(pte_to_swp_entry(=0A+ = swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;=0A+}=0A+=0A+/* Can be overrid= den by an architecture for additional checks. 
*/=0A+__weak unsigned long ma= x_swapfile_size(void)=0A+{=0A+ return generic_max_swapfile_size();=0A+}=0A+= =0A static unsigned long read_swap_header(struct swap_info_struct *p,=0A = union swap_header *swap_header,=0A struct inode *inode)=0A@@ -2944,= 22 +2971,7 @@ static unsigned long read_swap_header(struct swap_info_struct= *p,=0A p->cluster_next =3D 1;=0A p->cluster_nr =3D 0;=0A =0A- /*=0A- * = Find out how many pages are allowed for a single swap=0A- * device. There = are two limiting factors: 1) the number=0A- * of bits for the swap offset = in the swp_entry_t type, and=0A- * 2) the number of bits in the swap pte a= s defined by the=0A- * different architectures. In order to find the=0A- = * largest possible bit mask, a swap entry with swap type 0=0A- * and swap = offset ~0UL is created, encoded to a swap pte,=0A- * decoded to a swp_entr= y_t again, and finally the swap=0A- * offset is extracted. This will mask = all the bits from=0A- * the initial ~0UL mask that can't be encoded in eit= her=0A- * the swp_entry_t or the architecture definition of a=0A- * swap = pte.=0A- */=0A- maxpages =3D swp_offset(pte_to_swp_entry(=0A- swp_entry_= to_pte(swp_entry(0, ~0UL)))) + 1;=0A+ maxpages =3D max_swapfile_size();=0A = last_page =3D swap_header->info.last_page;=0A if (!last_page) {=0A pr_w= arn("Empty swap-file\n");=0A-- =0A2.14.3=0A=0A=0AFrom 76d1413d7854087f1c2c0= 870eeedc77507c2f25a Mon Sep 17 00:00:00 2001=0AFrom: Andi Kleen =0ADate: Thu, 3 May 2018 16:39:51 -0700=0ASubject: [PATCH 8/8] mm,= l1tf: Disallow non privileged high MMIO PROT_NONE=0A mappings=0AStatus: O= =0AContent-Length: 9351=0ALines: 291=0A=0AFor L1TF PROT_NONE mappings are p= rotected by inverting the PFN in the=0Apage table entry. This sets the high= bits in the CPU's address space,=0Athus making sure to point to not point = an unmapped entry to valid=0Acached memory.=0A=0ASome server system BIOS pu= t the MMIO mappings high up in the physical=0Aaddress space. 
If such a hig= h mapping were mapped to an unprivileged=0Auser, they could attack low memory= by setting such a mapping to=0APROT_NONE. This could happen through a spec= ial device driver=0Awhich is not access protected. Normal /dev/mem is of co= urse=0Aaccess protected.=0A=0ATo avoid this we forbid PROT_NONE mappings or m= protect for high MMIO=0Amappings.=0A=0AValid page mappings are allowed beca= use the system is then unsafe=0Aanyways.=0A=0AWe don't expect users to comm= only use PROT_NONE on MMIO. But=0Ato minimize any impact here we only do th= is if the mapping actually=0Arefers to a high MMIO address (defined as the = MAX_PA-1 bit being set),=0Aand also skip the check for root.=0A=0AFor mmaps= this is straightforward and can be handled in vm_insert_pfn=0Aand in rema= p_pfn_range().=0A=0AFor mprotect it's a bit trickier. At the point we're lo= oking at the=0Aactual PTEs a lot of state has been changed and would be dif= ficult=0Ato undo on an error. Since this is an uncommon case we use a separa= te=0Aearly page table walk pass for MMIO PROT_NONE mappings that=0Achecks fo= r this condition early. 
For non MMIO and non PROT_NONE=0Athere are no chang= es.=0A=0Av2: Use new helpers added earlier=0ASigned-off-by: Andi Kleen =0A---=0A arch/x86/include/asm/pgtable.h | 4 ++++=0A arch/= x86/mm/mmap.c | 19 ++++++++++++++++=0A include/asm-generic/pgta= ble.h | 12 +++++++++++=0A mm/memory.c | 37 ++++++++++++= ++++++++++---------=0A mm/mprotect.c | 49 ++++++++++++++++= ++++++++++++++++++++++++++=0A 5 files changed, 111 insertions(+), 10 deleti= ons(-)=0A=0Adiff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/= asm/pgtable.h=0Aindex f811e3257e87..338897c3b36f 100644=0A--- a/arch/x86/in= clude/asm/pgtable.h=0A+++ b/arch/x86/include/asm/pgtable.h=0A@@ -1333,6 +13= 33,10 @@ static inline bool pud_access_permitted(pud_t pud, bool write)=0A = return __pte_access_permitted(pud_val(pud), write);=0A }=0A =0A+#define __= HAVE_ARCH_PFN_MODIFY_ALLOWED 1=0A+extern bool pfn_modify_allowed(unsigned l= ong pfn, pgprot_t prot);=0A+static inline bool arch_has_pfn_modify_check(vo= id) { return true; }=0A+=0A #include =0A #endif /* _= _ASSEMBLY__ */=0A =0Adiff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c= =0Aindex 48c591251600..369b67226f81 100644=0A--- a/arch/x86/mm/mmap.c=0A+++= b/arch/x86/mm/mmap.c=0A@@ -240,3 +240,22 @@ int valid_mmap_phys_addr_range= (unsigned long pfn, size_t count)=0A =0A return phys_addr_valid(addr + cou= nt - 1);=0A }=0A+=0A+/*=0A+ * Only allow root to set high MMIO mappings to = PROT_NONE.=0A+ * This prevents an unpriv. 
user to set them to PROT_NONE and= invert=0A+ * them, then pointing to valid memory for L1TF speculation.=0A+= */=0A+bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)=0A+{=0A+ i= f (!boot_cpu_has(X86_BUG_L1TF))=0A+ return true;=0A+ if (__pte_needs_inver= t(pgprot_val(prot)))=0A+ return true;=0A+ /* If it's real memory always al= low */=0A+ if (pfn_valid(pfn))=0A+ return true;=0A+ if ((pfn & maxpa_pfn_b= it(1)) && !capable(CAP_SYS_ADMIN))=0A+ return false;=0A+ return true;=0A+}= =0Adiff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable= =2Eh=0Aindex f59639afaa39..0ecc1197084b 100644=0A--- a/include/asm-generic/= pgtable.h=0A+++ b/include/asm-generic/pgtable.h=0A@@ -1097,4 +1097,16 @@ st= atic inline void init_espfix_bsp(void) { }=0A #endif=0A #endif=0A =0A+#ifnd= ef __HAVE_ARCH_PFN_MODIFY_ALLOWED=0A+static inline bool pfn_modify_allowed(= unsigned long pfn, pgprot_t prot)=0A+{=0A+ return true;=0A+}=0A+=0A+static = inline bool arch_has_pfn_modify_check(void)=0A+{=0A+ return false;=0A+}=0A+= #endif=0A+=0A #endif /* _ASM_GENERIC_PGTABLE_H */=0Adiff --git a/mm/memory.= c b/mm/memory.c=0Aindex 01f5464e0fd2..fe497cecd2ab 100644=0A--- a/mm/memory= =2Ec=0A+++ b/mm/memory.c=0A@@ -1891,6 +1891,9 @@ int vm_insert_pfn_prot(str= uct vm_area_struct *vma, unsigned long addr,=0A if (addr < vma->vm_start |= | addr >=3D vma->vm_end)=0A return -EFAULT;=0A =0A+ if (!pfn_modify_allow= ed(pfn, pgprot))=0A+ return -EACCES;=0A+=0A track_pfn_insert(vma, &pgprot= , __pfn_to_pfn_t(pfn, PFN_DEV));=0A =0A ret =3D insert_pfn(vma, addr, __pf= n_to_pfn_t(pfn, PFN_DEV), pgprot,=0A@@ -1926,6 +1929,9 @@ static int __vm_i= nsert_mixed(struct vm_area_struct *vma, unsigned long addr,=0A =0A track_p= fn_insert(vma, &pgprot, pfn);=0A =0A+ if (!pfn_modify_allowed(pfn_t_to_pfn(= pfn), pgprot))=0A+ return -EACCES;=0A+=0A /*=0A * If we don't have pte = special, then we have to use the pfn_valid()=0A * based VM_MIXEDMAP schem= e (see vm_normal_page), and thus we *must*=0A@@ -1973,6 
+1979,7 @@ static i= nt remap_pte_range(struct mm_struct *mm, pmd_t *pmd,=0A {=0A pte_t *pte;= =0A spinlock_t *ptl;=0A+ int err =3D 0;=0A =0A pte =3D pte_alloc_map_lock= (mm, pmd, addr, &ptl);=0A if (!pte)=0A@@ -1980,12 +1987,16 @@ static int r= emap_pte_range(struct mm_struct *mm, pmd_t *pmd,=0A arch_enter_lazy_mmu_mo= de();=0A do {=0A BUG_ON(!pte_none(*pte));=0A+ if (!pfn_modify_allowed(p= fn, prot)) {=0A+ err =3D -EACCES;=0A+ break;=0A+ }=0A set_pte_at(mm,= addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));=0A pfn++;=0A } while (pt= e++, addr +=3D PAGE_SIZE, addr !=3D end);=0A arch_leave_lazy_mmu_mode();= =0A pte_unmap_unlock(pte - 1, ptl);=0A- return 0;=0A+ return err;=0A }=0A = =0A static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,=0A@= @ -1994,6 +2005,7 @@ static inline int remap_pmd_range(struct mm_struct *mm= , pud_t *pud,=0A {=0A pmd_t *pmd;=0A unsigned long next;=0A+ int err;=0A = =0A pfn -=3D addr >> PAGE_SHIFT;=0A pmd =3D pmd_alloc(mm, pud, addr);=0A@= @ -2002,9 +2014,10 @@ static inline int remap_pmd_range(struct mm_struct *m= m, pud_t *pud,=0A VM_BUG_ON(pmd_trans_huge(*pmd));=0A do {=0A next =3D = pmd_addr_end(addr, end);=0A- if (remap_pte_range(mm, pmd, addr, next,=0A- = pfn + (addr >> PAGE_SHIFT), prot))=0A- return -ENOMEM;=0A+ err =3D re= map_pte_range(mm, pmd, addr, next,=0A+ pfn + (addr >> PAGE_SHIFT), prot)= ;=0A+ if (err)=0A+ return err;=0A } while (pmd++, addr =3D next, addr != =3D end);=0A return 0;=0A }=0A@@ -2015,6 +2028,7 @@ static inline int rema= p_pud_range(struct mm_struct *mm, p4d_t *p4d,=0A {=0A pud_t *pud;=0A unsi= gned long next;=0A+ int err;=0A =0A pfn -=3D addr >> PAGE_SHIFT;=0A pud = =3D pud_alloc(mm, p4d, addr);=0A@@ -2022,9 +2036,10 @@ static inline int re= map_pud_range(struct mm_struct *mm, p4d_t *p4d,=0A return -ENOMEM;=0A do= {=0A next =3D pud_addr_end(addr, end);=0A- if (remap_pmd_range(mm, pud,= addr, next,=0A- pfn + (addr >> PAGE_SHIFT), prot))=0A- return -ENOMEM= ;=0A+ err =3D remap_pmd_range(mm, pud, 
addr, next,=0A+ pfn + (addr >> P= AGE_SHIFT), prot);=0A+ if (err)=0A+ return err;=0A } while (pud++, addr= =3D next, addr !=3D end);=0A return 0;=0A }=0A@@ -2035,6 +2050,7 @@ stati= c inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,=0A {=0A p4d= _t *p4d;=0A unsigned long next;=0A+ int err;=0A =0A pfn -=3D addr >> PAGE= _SHIFT;=0A p4d =3D p4d_alloc(mm, pgd, addr);=0A@@ -2042,9 +2058,10 @@ stat= ic inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,=0A return= -ENOMEM;=0A do {=0A next =3D p4d_addr_end(addr, end);=0A- if (remap_pu= d_range(mm, p4d, addr, next,=0A- pfn + (addr >> PAGE_SHIFT), prot))=0A- = return -ENOMEM;=0A+ err =3D remap_pud_range(mm, p4d, addr, next,=0A+ = pfn + (addr >> PAGE_SHIFT), prot);=0A+ if (err)=0A+ return err;=0A } wh= ile (p4d++, addr =3D next, addr !=3D end);=0A return 0;=0A }=0Adiff --git = a/mm/mprotect.c b/mm/mprotect.c=0Aindex 625608bc8962..6d331620b9e5 100644= =0A--- a/mm/mprotect.c=0A+++ b/mm/mprotect.c=0A@@ -306,6 +306,42 @@ unsigne= d long change_protection(struct vm_area_struct *vma, unsigned long start,= =0A return pages;=0A }=0A =0A+static int prot_none_pte_entry(pte_t *pte, u= nsigned long addr,=0A+ unsigned long next, struct mm_walk *walk)= =0A+{=0A+ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->priv= ate)) ?=0A+ 0 : -EACCES;=0A+}=0A+=0A+static int prot_none_hugetlb_entry(pt= e_t *pte, unsigned long hmask,=0A+ unsigned long addr, unsigned long = next,=0A+ struct mm_walk *walk)=0A+{=0A+ return pfn_modify_allowed(pt= e_pfn(*pte), *(pgprot_t *)(walk->private)) ?=0A+ 0 : -EACCES;=0A+}=0A+=0A+= static int prot_none_test(unsigned long addr, unsigned long next,=0A+ s= truct mm_walk *walk)=0A+{=0A+ return 0;=0A+}=0A+=0A+static int prot_none_wa= lk(struct vm_area_struct *vma, unsigned long start,=0A+ unsigned long = end, unsigned long newflags)=0A+{=0A+ pgprot_t new_pgprot =3D vm_get_page_p= rot(newflags);=0A+ struct mm_walk prot_none_walk =3D {=0A+ .pte_entry =3D = prot_none_pte_entry,=0A+ 
.hugetlb_entry =3D prot_none_hugetlb_entry,=0A+ = =2Etest_walk =3D prot_none_test,=0A+ .mm =3D current->mm,=0A+ .private = =3D &new_pgprot,=0A+ };=0A+=0A+ return walk_page_range(start, end, &prot_no= ne_walk);=0A+}=0A+=0A int=0A mprotect_fixup(struct vm_area_struct *vma, str= uct vm_area_struct **pprev,=0A unsigned long start, unsigned long end, uns= igned long newflags)=0A@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_s= truct *vma, struct vm_area_struct **pprev,=0A return 0;=0A }=0A =0A+ /*= =0A+ * Do PROT_NONE PFN permission checks here when we can still=0A+ * ba= il out without undoing a lot of state. This is a rather=0A+ * uncommon cas= e, so doesn't need to be very optimized.=0A+ */=0A+ if (arch_has_pfn_modif= y_check() &&=0A+ (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&=0A+ (= newflags & (VM_READ|VM_WRITE|VM_EXEC)) =3D=3D 0) {=0A+ error =3D prot_none= _walk(vma, start, end, newflags);=0A+ if (error)=0A+ return error;=0A+ }= =0A+=0A /*=0A * If we make a private mapping writable we increase our co= mmit;=0A * but (without finer accounting) cannot reduce our commit if we= =0A-- =0A2.14.3=0A=0A --mP3DRpeJDSE+ciuQ--