From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <1455034131.2925.79.camel@hpe.com>
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
From: Toshi Kani
Date: Tue, 09 Feb 2016 09:08:51 -0700
In-Reply-To: <20160209132645.55971eff@md1em3qc>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
 <20160209091003.GA10774@gmail.com>
 <20160209105325.0ce9a104@md1em3qc>
 <20160209102235.GA9885@gmail.com>
 <20160209132645.55971eff@md1em3qc>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
To: Henning Schild, Ingo Molnar
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de,
 linux-nvdimm@lists.01.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
List-ID:

On Tue, 2016-02-09 at 13:26 +0100, Henning Schild wrote:
> On Tue, 9 Feb 2016 11:22:35 +0100
> Ingo Molnar wrote:
>
> > * Henning Schild wrote:
> >
> > > On Tue, 9 Feb 2016 10:10:03 +0100
> > > Ingo Molnar wrote:
> > >
> > > > * Toshi Kani wrote:
> > > >
> > > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > > range is limited to pte mappings.
> > > > >
> > > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > > fork(), which makes user processes share the same page tables
> > > > > for the kernel ranges.  When a call to ioremap() is made at
> > > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > > updated kernel pgd entry with vmalloc_fault().
> > > > >
> > > > > Following changes are made to vmalloc_fault().
> > > >
> > > > So what were the effects of this shortcoming? Were large page
> > > > ioremap()s unusable? Was this harmless because no driver used this
> > > > facility?
> > >
> > > Drivers do use huge ioremap()s.
> > > Now if a pre-existing mm is used to
> > > access the device memory a #PF and the call to vmalloc_fault would
> > > eventually make the kernel treat device memory as if it was a
> > > pagetable.
> > > The results are illegal reads/writes on iomem and dereferencing
> > > iomem content like it was a pointer to a lower level pagetable.
> > > - #PF if you are lucky

#PF -> vmalloc_fault -> oops

> > > - funny modification of arbitrary memory possible
> > > - can be abused with uio or regular userland ??
>
> Looking over the code again i am not sure the last two are even
> possible, it is just the pointer deref that can cause a #PF.
> If the pointer turns out to "work" the code will just read and
> eventually BUG().

The last two cases are not possible.

> > Ok, so this is a serious live bug exposed to drivers, that also
> > requires a Cc: stable tag.

Yes, the fix should go to stable as well.

Thanks,
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <1455033795.2925.74.camel@hpe.com>
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
From: Toshi Kani
Date: Tue, 09 Feb 2016 09:03:15 -0700
In-Reply-To: <20160209091003.GA10774@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
 <20160209091003.GA10774@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de,
 henning.schild@siemens.com, linux-nvdimm@lists.01.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
List-ID:

On Tue, 2016-02-09 at 10:10 +0100, Ingo Molnar wrote:
> * Toshi Kani wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64
> > and PAE.
> > vmalloc_fault() however assumes that the vmalloc range is
> > limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
> > makes user processes share the same page tables for the kernel
> > ranges.  When a call to ioremap() is made at run-time that leads to
> > allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user
> > process needs to re-sync with the updated kernel pgd entry with
> > vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page ioremap()s
> unusable? Was this harmless because no driver used this facility?
>
> If so then the changelog needs to spell this out clearly ...

Large page support of ioremap() has been used for persistent memory
mappings for a while.

In order to hit this problem, i.e. causing a vmalloc fault, a large
amount of ioremap allocations at run-time is required.  The following
example repeats the allocation of a 16GB range.

# cat /proc/vmallocinfo | grep memremap
0xffffc90040000000-0xffffc90440001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc90480000000-0xffffc90880001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc908c0000000-0xffffc90cc0001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc90d00000000-0xffffc91100001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc91140000000-0xffffc91540001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
  :
0xffffc97300000000-0xffffc97700001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc97740000000-0xffffc97b40001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc97b80000000-0xffffc97f80001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc97fc0000000-0xffffc983c0001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap

The last ioremap call above crossed a 512GB boundary (0x8000000000), which
allocated a new pud table and updated the
kernel pgd entry to point to it.  Because the user process's page table
does not have this pgd entry update, a read/write syscall request to the
range will hit a vmalloc fault.  Since vmalloc_fault() does not handle a
large page properly, this causes an Oops as follows.

 BUG: unable to handle kernel paging request at ffff880840000ff8
 IP: [] vmalloc_fault+0x1be/0x300
 PGD c7f03a067 PUD 0
 Oops: 0000 [#1] SM
   :
 Call Trace:
 [] __do_page_fault+0x285/0x3e0
 [] do_page_fault+0x2f/0x80
 [] ? put_prev_entity+0x35/0x7a0
 [] page_fault+0x28/0x30
 [] ? memcpy_erms+0x6/0x10
 [] ? schedule+0x35/0x80
 [] ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
 [] ? schedule_timeout+0x183/0x240
 [] btt_log_read+0x63/0x140 [nd_btt]
   :
 [] ? __symbol_put+0x60/0x60
 [] ? kernel_read+0x50/0x80
 [] SyS_finit_module+0xb9/0xf0
 [] entry_SYSCALL_64_fastpath+0x1a/0xa4

Note that this issue is limited to 64-bit.  32-bit only uses index 3 of the
pgd entry to cover the entire vmalloc range, which is always valid.

I will add this information to the change log.

Thanks,
-Toshi
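The boundary arithmetic in the message above can be checked with a few lines of userspace C. This is only a sketch, not kernel code: the shift and table-size values assume x86_64 with 4-level paging (one pgd entry per 512GB) and 32-bit PAE with the common 3G/1G split (kernel addresses from 0xC0000000), and the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: x86_64, 4-level paging. One pgd entry covers 2^39 bytes (512GB). */
#define PGDIR_SHIFT_64   39
#define PTRS_PER_PGD_64  512

/* Assumed: 32-bit PAE. One of the 4 pgd entries covers 2^30 bytes (1GB). */
#define PGDIR_SHIFT_PAE  30
#define PTRS_PER_PGD_PAE 4

/* Hypothetical helper: index of the pgd entry that maps a 64-bit address. */
static inline unsigned int pgd_index_64(uint64_t addr)
{
	return (unsigned int)((addr >> PGDIR_SHIFT_64) & (PTRS_PER_PGD_64 - 1));
}

/* Hypothetical helper: same calculation for a 32-bit PAE address. */
static inline unsigned int pgd_index_pae(uint32_t addr)
{
	return (addr >> PGDIR_SHIFT_PAE) & (PTRS_PER_PGD_PAE - 1);
}

/* True if [start, end) touches more than one pgd entry (end is exclusive). */
static inline int crosses_pgd_boundary(uint64_t start, uint64_t end)
{
	assert(end > start);
	return pgd_index_64(start) != pgd_index_64(end - 1);
}
```

Fed with the ranges from the /proc/vmallocinfo listing, the last mapping (0xffffc97fc0000000-0xffffc983c0001000) starts under pgd index 402 and ends under index 403, while the one before it stays entirely under 402. In the PAE case every address from 0xC0000000 upwards yields index 3, which matches the note that 32-bit is not affected.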
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 13:26:45 +0100
From: Henning Schild
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209132645.55971eff@md1em3qc>
In-Reply-To: <20160209102235.GA9885@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
 <20160209091003.GA10774@gmail.com>
 <20160209105325.0ce9a104@md1em3qc>
 <20160209102235.GA9885@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: Toshi Kani, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
 bp@alien8.de, linux-nvdimm@lists.01.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
List-ID:

On Tue, 9 Feb 2016 11:22:35 +0100
Ingo Molnar wrote:

> * Henning Schild wrote:
>
> > On Tue, 9 Feb 2016 10:10:03 +0100
> > Ingo Molnar wrote:
> >
> > > * Toshi Kani wrote:
> > >
> > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > range is limited to pte mappings.
> > > >
> > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > fork(), which makes user processes share the same page tables
> > > > for the kernel ranges. When a call to ioremap() is made at
> > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > updated kernel pgd entry with vmalloc_fault().
> > > >
> > > > Following changes are made to vmalloc_fault().
> > >
> > > So what were the effects of this shortcoming? Were large page
> > > ioremap()s unusable? Was this harmless because no driver used this
> > > facility?
> >
> > Drivers do use huge ioremap()s.
> > Now if a pre-existing mm is used to
> > access the device memory a #PF and the call to vmalloc_fault would
> > eventually make the kernel treat device memory as if it was a
> > pagetable.
> > The results are illegal reads/writes on iomem and dereferencing
> > iomem content like it was a pointer to a lower level pagetable.
> > - #PF if you are lucky
> > - funny modification of arbitrary memory possible
> > - can be abused with uio or regular userland ??

Looking over the code again i am not sure the last two are even
possible, it is just the pointer deref that can cause a #PF.
If the pointer turns out to "work" the code will just read and
eventually BUG().

> Ok, so this is a serious live bug exposed to drivers, that also
> requires a Cc: stable tag.
>
> All of this should have been in the changelog!
>
> Thanks,
>
> Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 11:22:35 +0100
From: Ingo Molnar
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209102235.GA9885@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
 <20160209091003.GA10774@gmail.com>
 <20160209105325.0ce9a104@md1em3qc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160209105325.0ce9a104@md1em3qc>
Sender: owner-linux-mm@kvack.org
To: Henning Schild
Cc: Toshi Kani, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
 bp@alien8.de, linux-nvdimm@lists.01.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
List-ID:

* Henning Schild wrote:

> On Tue, 9 Feb 2016 10:10:03 +0100
> Ingo Molnar wrote:
>
> > * Toshi Kani wrote:
> >
> > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > x86_64 and PAE.
> > > vmalloc_fault() however assumes that the vmalloc
> > > range is limited to pte mappings.
> > >
> > > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > > which makes user processes share the same page tables for the
> > > kernel ranges. When a call to ioremap() is made at run-time that
> > > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > > PAE), user process needs to re-sync with the updated kernel pgd
> > > entry with vmalloc_fault().
> > >
> > > Following changes are made to vmalloc_fault().
> >
> > So what were the effects of this shortcoming? Were large page
> > ioremap()s unusable? Was this harmless because no driver used this
> > facility?
>
> Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> access the device memory a #PF and the call to vmalloc_fault would
> eventually make the kernel treat device memory as if it was a
> pagetable.
> The results are illegal reads/writes on iomem and dereferencing iomem
> content like it was a pointer to a lower level pagetable.
> - #PF if you are lucky
> - funny modification of arbitrary memory possible
> - can be abused with uio or regular userland ??

Ok, so this is a serious live bug exposed to drivers, that also requires a
Cc: stable tag.

All of this should have been in the changelog!

Thanks,

Ingo
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 10:53:25 +0100
From: Henning Schild
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209105325.0ce9a104@md1em3qc>
In-Reply-To: <20160209091003.GA10774@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
 <20160209091003.GA10774@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: Toshi Kani, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
 bp@alien8.de, linux-nvdimm@lists.01.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
List-ID:

On Tue, 9 Feb 2016 10:10:03 +0100
Ingo Molnar wrote:

> * Toshi Kani wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > range is limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > which makes user processes share the same page tables for the
> > kernel ranges. When a call to ioremap() is made at run-time that
> > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > PAE), user process needs to re-sync with the updated kernel pgd
> > entry with vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page
> ioremap()s unusable? Was this harmless because no driver used this
> facility?

Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
access the device memory a #PF and the call to vmalloc_fault would
eventually make the kernel treat device memory as if it was a
pagetable.
The results are illegal reads/writes on iomem and dereferencing iomem
content like it was a pointer to a lower level pagetable.
- #PF if you are lucky
- funny modification of arbitrary memory possible
- can be abused with uio or regular userland ??

Henning

> If so then the changelog needs to spell this out clearly ...
> Thanks,
>
> Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 10:10:03 +0100
From: Ingo Molnar
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209091003.GA10774@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
Sender: owner-linux-mm@kvack.org
To: Toshi Kani
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de,
 henning.schild@siemens.com, linux-nvdimm@lists.01.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
List-ID:

* Toshi Kani wrote:

> Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE.
> vmalloc_fault() however assumes that the vmalloc range is limited to pte
> mappings.
>
> pgd_ctor() sets the kernel's pgd entries to user's during fork(), which makes
> user processes share the same page tables for the kernel ranges. When a call to
> ioremap() is made at run-time that leads to allocate a new 2nd level table (pud
> in 64-bit and pmd in PAE), user process needs to re-sync with the updated kernel
> pgd entry with vmalloc_fault().
>
> Following changes are made to vmalloc_fault().

So what were the effects of this shortcoming? Were large page ioremap()s
unusable? Was this harmless because no driver used this facility?

If so then the changelog needs to spell this out clearly ...

Thanks,

Ingo
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Toshi Kani
Subject: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Date: Mon, 8 Feb 2016 17:00:38 -0700
Message-Id: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
Sender: owner-linux-mm@kvack.org
To: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de
Cc: henning.schild@siemens.com, linux-nvdimm@lists.01.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Toshi Kani
List-ID:

Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and
PAE. vmalloc_fault() however assumes that the vmalloc range is limited to
pte mappings.

pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
makes user processes share the same page tables for the kernel ranges.
When a call to ioremap() is made at run-time that leads to allocate a new
2nd level table (pud in 64-bit and pmd in PAE), user process needs to
re-sync with the updated kernel pgd entry with vmalloc_fault().

Following changes are made to vmalloc_fault().

64-bit:
 - No change for the sync operation as set_pgd() takes care of
   huge pages as well.
 - Add pud_huge() and pmd_huge() to the validation code to
   handle huge pages.
 - Change pud_page_vaddr() to pud_pfn() since an ioremap range
   is not directly mapped (although the if-statement still works
   with a bogus addr).
 - Change pmd_page() to pmd_pfn() since an ioremap range is not
   backed by struct page table (although the if-statement still
   works with a bogus addr).

PAE:
 - No change for the sync operation since the index3 pgd entry
   covers the entire vmalloc range, which is always valid.
   (A separate change will be needed if this assumption gets
   changed regardless of the page size.)
 - Add pmd_huge() to the validation code to handle huge pages.
   This is only for completeness since vmalloc_fault() won't
   happen for ioremap'd ranges as its pgd entry is always valid.
   (I was unable to test this part of the changes as a result.)

Reported-by: Henning Schild
Signed-off-by: Toshi Kani
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
---
When this patch is accepted, please copy to stable up to 4.1.
---
 arch/x86/mm/fault.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..e830c71 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -287,6 +287,9 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (!pmd_k)
 		return -1;
 
+	if (pmd_huge(*pmd_k))
+		return 0;
+
 	pte_k = pte_offset_kernel(pmd_k, address);
 	if (!pte_present(*pte_k))
 		return -1;
@@ -360,8 +363,6 @@ void vmalloc_sync_all(void)
  * 64-bit:
  *
  *   Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
  */
 static noinline int vmalloc_fault(unsigned long address)
 {
@@ -403,17 +404,23 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (pud_none(*pud_ref))
 		return -1;
 
-	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+	if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
 		BUG();
 
+	if (pud_huge(*pud))
+		return 0;
+
 	pmd = pmd_offset(pud, address);
 	pmd_ref = pmd_offset(pud_ref, address);
 	if (pmd_none(*pmd_ref))
 		return -1;
 
-	if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
+	if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
 		BUG();
 
+	if (pmd_huge(*pmd))
+		return 0;
+
 	pte_ref = pte_offset_kernel(pmd_ref, address);
 	if (!pte_present(*pte_ref))
 		return -1;
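The per-level check that the 64-bit hunk arrives at can be modelled in userspace as follows. This is a toy sketch under stated assumptions, not the kernel implementation: a page-table entry is a plain uint64_t, only the x86 "present" (bit 0) and "page size" (bit 7) flags are modelled, and every name below (make_ent, ent_huge, check_level, the LEVEL_* values) is hypothetical. It shows the order of the tests: a missing reference entry is a real fault, mismatching page frame numbers correspond to BUG(), and a large-page entry ends the walk instead of being followed as if it pointed to a lower-level table.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: flag positions follow x86, everything else is made up. */
#define ENT_PRESENT  (1ULL << 0)            /* entry is valid */
#define ENT_PSE      (1ULL << 7)            /* entry maps a large page */
#define ENT_PFN_MASK 0x000ffffffffff000ULL  /* bits 12..51 hold the pfn */

enum level_result {
	LEVEL_FAULT   = -1, /* reference entry is empty: report a real fault */
	LEVEL_DONE    =  0, /* large page: nothing below this level to check */
	LEVEL_DESCEND =  1, /* entry points to a table: check the next level */
	LEVEL_BUG     =  2, /* process and reference disagree: BUG() in the kernel */
};

/* Build a present entry for the given page frame number and extra flags. */
static inline uint64_t make_ent(uint64_t pfn, uint64_t flags)
{
	assert(pfn < (1ULL << 40));
	return (pfn << 12) | ENT_PRESENT | flags;
}

static inline int ent_none(uint64_t e) { return e == 0; }
static inline int ent_huge(uint64_t e) { return (e & ENT_PSE) != 0; }
static inline uint64_t ent_pfn(uint64_t e) { return (e & ENT_PFN_MASK) >> 12; }

/*
 * Compare one level of the process's table (ent) against the kernel
 * reference table (ent_ref), in the same order as the patched code.
 */
static inline enum level_result check_level(uint64_t ent, uint64_t ent_ref)
{
	if (ent_none(ent_ref))
		return LEVEL_FAULT;
	if (ent_none(ent) || ent_pfn(ent) != ent_pfn(ent_ref))
		return LEVEL_BUG;
	if (ent_huge(ent))
		return LEVEL_DONE;
	return LEVEL_DESCEND;
}
```

Comparing page frame numbers rather than derived virtual addresses or struct page pointers keeps the comparison meaningful for ioremap ranges, which is the reason the patch description gives for switching to pud_pfn() and pmd_pfn().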
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f180.google.com (mail-ob0-f180.google.com [209.85.214.180]) by kanga.kvack.org (Postfix) with ESMTP id 2B77B6B0253 for ; Tue, 9 Feb 2016 10:10:10 -0500 (EST) Received: by mail-ob0-f180.google.com with SMTP id xk3so187674329obc.2 for ; Tue, 09 Feb 2016 07:10:10 -0800 (PST) Received: from g4t3426.houston.hp.com (g4t3426.houston.hp.com. [15.201.208.54]) by mx.google.com with ESMTPS id s77si3643989ois.16.2016.02.09.07.10.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Feb 2016 07:10:09 -0800 (PST) Message-ID: <1455033795.2925.74.camel@hpe.com> Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages From: Toshi Kani Date: Tue, 09 Feb 2016 09:03:15 -0700 In-Reply-To: <20160209091003.GA10774@gmail.com> References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com> <20160209091003.GA10774@gmail.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de, henning.schild@siemens.com, linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, 2016-02-09 at 10:10 +0100, Ingo Molnar wrote: > * Toshi Kani wrote: > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 > > and PAE. A vmalloc_fault() however assumes that the vmalloc range is > > limited to pte mappings. > > > > pgd_ctor() sets the kernel's pgd entries to user's during fork(), which > > makes user processes share the same page tables for the kernel > > ranges.A A When a call to ioremap() is made at run-time that leads to > > allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user > > process needs to re-sync with the updated kernel pgd entry with > > vmalloc_fault(). > > > > Following changes are made to vmalloc_fault(). 
> > So what were the effects of this shortcoming? Were large page ioremap()s > unusable? Was this harmless because no driver used this facility? > > If so then the changelog needs to spell this out clearly ... Large page support of ioremap() has been used for persistent memory mappings for a while. In order to hit this problem, i.e. causing a vmalloc fault, a large mount of ioremap allocations at run-time is required. A The following example repeats allocation of 16GB range. # cat /proc/vmallocinfo | grep memremap 0xffffc90040000000-0xffffc90440001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap 0xffffc90480000000-0xffffc90880001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap 0xffffc908c0000000-0xffffc90cc0001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap 0xffffc90d00000000-0xffffc91100001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap 0xffffc91140000000-0xffffc91540001000 17179873280 memremap+0xb4/0x110A phys=480000000 ioremap A : 0xffffc97300000000-0xffffc97700001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap 0xffffc97740000000-0xffffc97b40001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap 0xffffc97b80000000-0xffffc97f80001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap 0xffffc97fc0000000-0xffffc983c0001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap The last ioremap call above crossed a 512GB boundary (0x8000000000), which allocated a new pud table and updated the kernel pgd entry to point it. A Because user process's page table does not have this pgd entry update, a read/write syscall request to the range will hit a vmalloc fault. A Since vmalloc_fault() does not handle a large page properly, this causes an Oops as follows. A BUG: unable to handle kernel paging request at ffff880840000ff8 A IP: [] vmalloc_fault+0x1be/0x300 A PGD c7f03a067 PUD 0A A Oops: 0000 [#1] SM A A : A Call Trace: A [] __do_page_fault+0x285/0x3e0 A [] do_page_fault+0x2f/0x80 A [] ? 
put_prev_entity+0x35/0x7a0 A [] page_fault+0x28/0x30 A [] ? memcpy_erms+0x6/0x10 A [] ? schedule+0x35/0x80 A [] ? pmem_rw_bytes+0x6a/0x190 [nd_pmem] A [] ? schedule_timeout+0x183/0x240 A [] btt_log_read+0x63/0x140 [nd_btt] A A : A [] ? __symbol_put+0x60/0x60 A [] ? kernel_read+0x50/0x80 A [] SyS_finit_module+0xb9/0xf0 A [] entry_SYSCALL_64_fastpath+0x1a/0xa4 Note that this issue is limited to 64-bit. A 32-bit only uses index 3 of the pgd entry to cover the entire vmalloc range, which is always valid. I will add this information to the change log. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f173.google.com (mail-ob0-f173.google.com [209.85.214.173]) by kanga.kvack.org (Postfix) with ESMTP id 2EA8C6B0255 for ; Tue, 9 Feb 2016 10:15:44 -0500 (EST) Received: by mail-ob0-f173.google.com with SMTP id is5so186916211obc.0 for ; Tue, 09 Feb 2016 07:15:44 -0800 (PST) Received: from g4t3426.houston.hp.com (g4t3426.houston.hp.com. 
[15.201.208.54]) by mx.google.com with ESMTPS id ps3si21389912obb.57.2016.02.09.07.15.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Feb 2016 07:15:43 -0800 (PST) Message-ID: <1455034131.2925.79.camel@hpe.com> Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages From: Toshi Kani Date: Tue, 09 Feb 2016 09:08:51 -0700 In-Reply-To: <20160209132645.55971eff@md1em3qc> References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com> <20160209091003.GA10774@gmail.com> <20160209105325.0ce9a104@md1em3qc> <20160209102235.GA9885@gmail.com> <20160209132645.55971eff@md1em3qc> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Henning Schild , Ingo Molnar Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de, linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, 2016-02-09 at 13:26 +0100, Henning Schild wrote: > On Tue, 9 Feb 2016 11:22:35 +0100 > Ingo Molnar wrote: > > > * Henning Schild wrote: > > > > > On Tue, 9 Feb 2016 10:10:03 +0100 > > > Ingo Molnar wrote: > > > A A > > > > * Toshi Kani wrote: > > > > A A > > > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in > > > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc > > > > > range is limited to pte mappings. > > > > > > > > > > pgd_ctor() sets the kernel's pgd entries to user's during > > > > > fork(), which makes user processes share the same page tables > > > > > for the kernel ranges.A A When a call to ioremap() is made at > > > > > run-time that leads to allocate a new 2nd level table (pud in > > > > > 64-bit and pmd in PAE), user process needs to re-sync with the > > > > > updated kernel pgd entry with vmalloc_fault(). > > > > > > > > > > Following changes are made to vmalloc_fault().A A A A > > > > > > > > So what were the effects of this shortcoming? 
Were large page > > > > ioremap()s unusable? Was this harmless because no driver used this > > > > facility?A A > > > > > > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to > > > access the device memory a #PF and the call to vmalloc_fault would > > > eventually make the kernel treat device memory as if it was a > > > pagetable. > > > The results are illegal reads/writes on iomem and dereferencing > > > iomem content like it was a pointer to a lower level pagetable. > > > - #PF if you are lucky #PF -> vmalloc_fault -> oops > > > - funny modification of arbitrary memory possible > > > - can be abused with uio or regular userland ??A A A > > Looking over the code again i am not sure the last two are even > possible, it is just the pointer deref that can cause a #PF. > If the pointer turns out to "work" the code will just read and > eventually BUG(). The last two case are not possible. > > Ok, so this is a serious live bug exposed to drivers, that also > > requires a Cc: stable tag. Yes, the fix should go to stable as well. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932472AbcBHXHl (ORCPT ); Mon, 8 Feb 2016 18:07:41 -0500 Received: from g9t5009.houston.hp.com ([15.240.92.67]:50330 "EHLO g9t5009.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932113AbcBHXHj (ORCPT ); Mon, 8 Feb 2016 18:07:39 -0500 From: Toshi Kani To: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de Cc: henning.schild@siemens.com, linux-nvdimm@ml01.01.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Toshi Kani Subject: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages Date: Mon, 8 Feb 2016 17:00:38 -0700 Message-Id: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com> X-Mailer: git-send-email 2.5.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc range is limited to pte mappings. pgd_ctor() sets the kernel's pgd entries to user's during fork(), which makes user processes share the same page tables for the kernel ranges. When a call to ioremap() is made at run-time that leads to allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user process needs to re-sync with the updated kernel pgd entry with vmalloc_fault(). Following changes are made to vmalloc_fault(). 64-bit: - No change for the sync operation as set_pgd() takes care of huge pages as well. - Add pud_huge() and pmd_huge() to the validation code to handle huge pages. - Change pud_page_vaddr() to pud_pfn() since an ioremap range is not directly mapped (although the if-statement still works with a bogus addr). - Change pmd_page() to pmd_pfn() since an ioremap range is not backed by struct page table (although the if-statement still works with a bogus addr). 
PAE: - No change for the sync operation since the index3 pgd entry covers the entire vmalloc range, which is always valid. (A separate change will be needed if this assumption gets changed regardless of the page size.) - Add pmd_huge() to the validation code to handle huge pages. This is only for completeness since vmalloc_fault() won't happen for ioremap'd ranges as its pgd entry is always valid. (I was unable to test this part of the changes as a result.) Reported-by: Henning Schild Signed-off-by: Toshi Kani Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Borislav Petkov --- When this patch is accepted, please copy to stable up to 4.1. --- arch/x86/mm/fault.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index eef44d9..e830c71 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -287,6 +287,9 @@ static noinline int vmalloc_fault(unsigned long address) if (!pmd_k) return -1; + if (pmd_huge(*pmd_k)) + return 0; + pte_k = pte_offset_kernel(pmd_k, address); if (!pte_present(*pte_k)) return -1; @@ -360,8 +363,6 @@ void vmalloc_sync_all(void) * 64-bit: * * Handle a fault on the vmalloc area - * - * This assumes no large pages in there. 
*/ static noinline int vmalloc_fault(unsigned long address) { @@ -403,17 +404,23 @@ static noinline int vmalloc_fault(unsigned long address) if (pud_none(*pud_ref)) return -1; - if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref)) + if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref)) BUG(); + if (pud_huge(*pud)) + return 0; + pmd = pmd_offset(pud, address); pmd_ref = pmd_offset(pud_ref, address); if (pmd_none(*pmd_ref)) return -1; - if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref)) + if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref)) BUG(); + if (pmd_huge(*pmd)) + return 0; + pte_ref = pte_offset_kernel(pmd_ref, address); if (!pte_present(*pte_ref)) return -1; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756023AbcBIJKT (ORCPT ); Tue, 9 Feb 2016 04:10:19 -0500 Received: from mail-wm0-f65.google.com ([74.125.82.65]:34085 "EHLO mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755801AbcBIJKI (ORCPT ); Tue, 9 Feb 2016 04:10:08 -0500 Date: Tue, 9 Feb 2016 10:10:03 +0100 From: Ingo Molnar To: Toshi Kani Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de, henning.schild@siemens.com, linux-nvdimm@ml01.01.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages Message-ID: <20160209091003.GA10774@gmail.com> References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Toshi Kani wrote: > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE. 
> vmalloc_fault() however assumes that the vmalloc range is limited to pte
> mappings.
>
> pgd_ctor() sets the kernel's pgd entries to user's during fork(), which makes
> user processes share the same page tables for the kernel ranges. When a call to
> ioremap() is made at run-time that leads to allocate a new 2nd level table (pud
> in 64-bit and pmd in PAE), user process needs to re-sync with the updated kernel
> pgd entry with vmalloc_fault().
>
> Following changes are made to vmalloc_fault().

So what were the effects of this shortcoming? Were large page ioremap()s
unusable? Was this harmless because no driver used this facility?

If so then the changelog needs to spell this out clearly ...

Thanks,

	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756128AbcBIKPn (ORCPT ); Tue, 9 Feb 2016 05:15:43 -0500
Received: from goliath.siemens.de ([192.35.17.28]:45749 "EHLO
	goliath.siemens.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753375AbcBIKPl (ORCPT ); Tue, 9 Feb 2016 05:15:41 -0500
X-Greylist: delayed 1281 seconds by postgrey-1.27 at vger.kernel.org;
	Tue, 09 Feb 2016 05:15:40 EST
Date: Tue, 9 Feb 2016 10:53:25 +0100
From: Henning Schild
To: Ingo Molnar
Cc: Toshi Kani , , , , , , ,
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209105325.0ce9a104@md1em3qc>
In-Reply-To: <20160209091003.GA10774@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
	<20160209091003.GA10774@gmail.com>
X-Mailer: Claws Mail 3.13.1 (GTK+ 2.24.28; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 9 Feb 2016 10:10:03 +0100
Ingo Molnar wrote:

> * Toshi Kani wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > range is limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > which makes user processes share the same page tables for the
> > kernel ranges. When a call to ioremap() is made at run-time that
> > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > PAE), user process needs to re-sync with the updated kernel pgd
> > entry with vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page
> ioremap()s unusable? Was this harmless because no driver used this
> facility?

Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
access the device memory a #PF and the call to vmalloc_fault would
eventually make the kernel treat device memory as if it was a
pagetable.
The results are illegal reads/writes on iomem and dereferencing iomem
content like it was a pointer to a lower level pagetable.
- #PF if you are lucky
- funny modification of arbitrary memory possible
- can be abused with uio or regular userland ??

Henning

> If so then the changelog needs to spell this out clearly ...
>
> Thanks,
>
> 	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756585AbcBIKWn (ORCPT ); Tue, 9 Feb 2016 05:22:43 -0500
Received: from mail-wm0-f66.google.com ([74.125.82.66]:32988 "EHLO
	mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754852AbcBIKWj (ORCPT ); Tue, 9 Feb 2016 05:22:39 -0500
Date: Tue, 9 Feb 2016 11:22:35 +0100
From: Ingo Molnar
To: Henning Schild
Cc: Toshi Kani , tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	bp@alien8.de, linux-nvdimm@ml01.01.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209102235.GA9885@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
	<20160209091003.GA10774@gmail.com>
	<20160209105325.0ce9a104@md1em3qc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160209105325.0ce9a104@md1em3qc>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Henning Schild wrote:

> On Tue, 9 Feb 2016 10:10:03 +0100
> Ingo Molnar wrote:
>
> > * Toshi Kani wrote:
> >
> > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > range is limited to pte mappings.
> > >
> > > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > > which makes user processes share the same page tables for the
> > > kernel ranges. When a call to ioremap() is made at run-time that
> > > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > > PAE), user process needs to re-sync with the updated kernel pgd
> > > entry with vmalloc_fault().
> > >
> > > Following changes are made to vmalloc_fault().
> >
> > So what were the effects of this shortcoming? Were large page
> > ioremap()s unusable? Was this harmless because no driver used this
> > facility?
>
> Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> access the device memory a #PF and the call to vmalloc_fault would
> eventually make the kernel treat device memory as if it was a
> pagetable.
> The results are illegal reads/writes on iomem and dereferencing iomem
> content like it was a pointer to a lower level pagetable.
> - #PF if you are lucky
> - funny modification of arbitrary memory possible
> - can be abused with uio or regular userland ??

Ok, so this is a serious live bug exposed to drivers, that also requires a
Cc: stable tag.

All of this should have been in the changelog!

Thanks,

	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757224AbcBIM1d (ORCPT ); Tue, 9 Feb 2016 07:27:33 -0500
Received: from david.siemens.de ([192.35.17.14]:51408 "EHLO
	david.siemens.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756882AbcBIM13 (ORCPT ); Tue, 9 Feb 2016 07:27:29 -0500
Date: Tue, 9 Feb 2016 13:26:45 +0100
From: Henning Schild
To: Ingo Molnar
Cc: Toshi Kani , , , , , , ,
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
Message-ID: <20160209132645.55971eff@md1em3qc>
In-Reply-To: <20160209102235.GA9885@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
	<20160209091003.GA10774@gmail.com>
	<20160209105325.0ce9a104@md1em3qc>
	<20160209102235.GA9885@gmail.com>
X-Mailer: Claws Mail 3.13.1 (GTK+ 2.24.28; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 9 Feb 2016 11:22:35 +0100
Ingo Molnar wrote:

> * Henning Schild wrote:
>
> > On Tue, 9 Feb 2016 10:10:03 +0100
> > Ingo Molnar wrote:
> >
> > > * Toshi Kani wrote:
> > >
> > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > range is limited to pte mappings.
> > > >
> > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > fork(), which makes user processes share the same page tables
> > > > for the kernel ranges. When a call to ioremap() is made at
> > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > updated kernel pgd entry with vmalloc_fault().
> > > >
> > > > Following changes are made to vmalloc_fault().
> > >
> > > So what were the effects of this shortcoming? Were large page
> > > ioremap()s unusable? Was this harmless because no driver used this
> > > facility?
> >
> > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > access the device memory a #PF and the call to vmalloc_fault would
> > eventually make the kernel treat device memory as if it was a
> > pagetable.
> > The results are illegal reads/writes on iomem and dereferencing
> > iomem content like it was a pointer to a lower level pagetable.
> > - #PF if you are lucky
> > - funny modification of arbitrary memory possible
> > - can be abused with uio or regular userland ??

Looking over the code again i am not sure the last two are even
possible, it is just the pointer deref that can cause a #PF.
If the pointer turns out to "work" the code will just read and
eventually BUG().

> Ok, so this is a serious live bug exposed to drivers, that also
> requires a Cc: stable tag.
>
> All of this should have been in the changelog!
>
> Thanks,
>
> 	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755830AbcBIPKM (ORCPT ); Tue, 9 Feb 2016 10:10:12 -0500
Received: from g4t3426.houston.hp.com ([15.201.208.54]:27511 "EHLO
	g4t3426.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754316AbcBIPKK (ORCPT ); Tue, 9 Feb 2016 10:10:10 -0500
Message-ID: <1455033795.2925.74.camel@hpe.com>
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
From: Toshi Kani
To: Ingo Molnar
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de,
	henning.schild@siemens.com, linux-nvdimm@ml01.01.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Tue, 09 Feb 2016 09:03:15 -0700
In-Reply-To: <20160209091003.GA10774@gmail.com>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
	<20160209091003.GA10774@gmail.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.18.4 (3.18.4-1.fc23)
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2016-02-09 at 10:10 +0100, Ingo Molnar wrote:
> * Toshi Kani wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64
> > and PAE.  vmalloc_fault() however assumes that the vmalloc range is
> > limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
> > makes user processes share the same page tables for the kernel
> > ranges.  When a call to ioremap() is made at run-time that leads to
> > allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user
> > process needs to re-sync with the updated kernel pgd entry with
> > vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page ioremap()s
> unusable? Was this harmless because no driver used this facility?
>
> If so then the changelog needs to spell this out clearly ...

Large page support of ioremap() has been used for persistent memory
mappings for a while.

In order to hit this problem, i.e. causing a vmalloc fault, a large
amount of ioremap allocations at run-time is required.  The following
example repeats allocation of 16GB range.

# cat /proc/vmallocinfo | grep memremap
0xffffc90040000000-0xffffc90440001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc90480000000-0xffffc90880001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc908c0000000-0xffffc90cc0001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc90d00000000-0xffffc91100001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc91140000000-0xffffc91540001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
  :
0xffffc97300000000-0xffffc97700001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc97740000000-0xffffc97b40001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
0xffffc97b80000000-0xffffc97f80001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
0xffffc97fc0000000-0xffffc983c0001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap

The last ioremap call above crossed a 512GB boundary (0x8000000000), which
allocated a new pud table and updated the kernel pgd entry to point to it.
Because the user process's page table does not have this pgd entry update,
a read/write syscall request to the range will hit a vmalloc fault.  Since
vmalloc_fault() does not handle a large page properly, this causes an Oops
as follows.

 BUG: unable to handle kernel paging request at ffff880840000ff8
 IP: [] vmalloc_fault+0x1be/0x300
 PGD c7f03a067 PUD 0
 Oops: 0000 [#1] SM
   :
 Call Trace:
 [] __do_page_fault+0x285/0x3e0
 [] do_page_fault+0x2f/0x80
 [] ? put_prev_entity+0x35/0x7a0
 [] page_fault+0x28/0x30
 [] ? memcpy_erms+0x6/0x10
 [] ? schedule+0x35/0x80
 [] ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
 [] ? schedule_timeout+0x183/0x240
 [] btt_log_read+0x63/0x140 [nd_btt]
   :
 [] ? __symbol_put+0x60/0x60
 [] ? kernel_read+0x50/0x80
 [] SyS_finit_module+0xb9/0xf0
 [] entry_SYSCALL_64_fastpath+0x1a/0xa4

Note that this issue is limited to 64-bit.  32-bit only uses index 3 of the
pgd entry to cover the entire vmalloc range, which is always valid.

I will add this information to the change log.

Thanks,
-Toshi

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756774AbcBIPPq (ORCPT ); Tue, 9 Feb 2016 10:15:46 -0500
Received: from g4t3426.houston.hp.com ([15.201.208.54]:30578 "EHLO
	g4t3426.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752934AbcBIPPn (ORCPT ); Tue, 9 Feb 2016 10:15:43 -0500
Message-ID: <1455034131.2925.79.camel@hpe.com>
Subject: Re: [PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
From: Toshi Kani
To: Henning Schild , Ingo Molnar
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, bp@alien8.de,
	linux-nvdimm@ml01.01.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Date: Tue, 09 Feb 2016 09:08:51 -0700
In-Reply-To: <20160209132645.55971eff@md1em3qc>
References: <1454976038-22486-1-git-send-email-toshi.kani@hpe.com>
	<20160209091003.GA10774@gmail.com>
	<20160209105325.0ce9a104@md1em3qc>
	<20160209102235.GA9885@gmail.com>
	<20160209132645.55971eff@md1em3qc>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.18.4 (3.18.4-1.fc23)
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2016-02-09 at 13:26 +0100, Henning Schild wrote:
> On Tue, 9 Feb 2016 11:22:35 +0100
> Ingo Molnar wrote:
>
> > * Henning Schild wrote:
> >
> > > On Tue, 9 Feb 2016 10:10:03 +0100
> > > Ingo Molnar wrote:
> > >
> > > > * Toshi Kani wrote:
> > > >
> > > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > > range is limited to pte mappings.
> > > > >
> > > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > > fork(), which makes user processes share the same page tables
> > > > > for the kernel ranges.  When a call to ioremap() is made at
> > > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > > updated kernel pgd entry with vmalloc_fault().
> > > > >
> > > > > Following changes are made to vmalloc_fault().
> > > >
> > > > So what were the effects of this shortcoming? Were large page
> > > > ioremap()s unusable? Was this harmless because no driver used this
> > > > facility?
> > >
> > > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > > access the device memory a #PF and the call to vmalloc_fault would
> > > eventually make the kernel treat device memory as if it was a
> > > pagetable.
> > > The results are illegal reads/writes on iomem and dereferencing
> > > iomem content like it was a pointer to a lower level pagetable.
> > > - #PF if you are lucky

#PF -> vmalloc_fault -> oops

> > > - funny modification of arbitrary memory possible
> > > - can be abused with uio or regular userland ??
>
> Looking over the code again i am not sure the last two are even
> possible, it is just the pointer deref that can cause a #PF.
> If the pointer turns out to "work" the code will just read and
> eventually BUG().

The last two cases are not possible.

> > Ok, so this is a serious live bug exposed to drivers, that also
> > requires a Cc: stable tag.

Yes, the fix should go to stable as well.

Thanks,
-Toshi
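
The 512GB boundary crossing described earlier in the thread can be checked
with a little arithmetic. The following is only a user-space sketch, not
kernel code: it assumes x86_64 4-level paging, where one pgd entry covers
2^39 bytes (512GB) and a pgd has 512 entries. The helper names
pgd_index_of() and crosses_pgd_boundary() are made up for this
illustration; the addresses are taken from the /proc/vmallocinfo listing
quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 4-level paging layout: 512 pgd entries of 512GB each. */
#define SKETCH_PGDIR_SHIFT   39
#define SKETCH_PTRS_PER_PGD  512

/* Hypothetical helper: which pgd slot translates this virtual address. */
static unsigned int pgd_index_of(uint64_t addr)
{
	return (unsigned int)((addr >> SKETCH_PGDIR_SHIFT) &
			      (SKETCH_PTRS_PER_PGD - 1));
}

/*
 * Hypothetical helper: true if the range [start, end) touches more than
 * one pgd slot, i.e. it crosses a 512GB-aligned boundary. The end address
 * is exclusive, as in /proc/vmallocinfo, so the last byte is end - 1.
 */
static bool crosses_pgd_boundary(uint64_t start, uint64_t end)
{
	assert(start < end);
	return pgd_index_of(start) != pgd_index_of(end - 1);
}
```

With these helpers, the first listed range (starting at 0xffffc90040000000)
and the start of the last one (0xffffc97fc0000000) both fall in pgd slot
402, while the last byte of the last range (0xffffc983c0001000 - 1) falls
in slot 403. That is the slot whose entry a process created before the
ioremap() call may not have yet, which is what sends it into
vmalloc_fault().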