From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f176.google.com (mail-pf0-f176.google.com [209.85.192.176]) by kanga.kvack.org (Postfix) with ESMTP id 04D676B0255 for ; Thu, 10 Dec 2015 22:21:52 -0500 (EST) Received: by pfv76 with SMTP id 76so5866005pfv.2 for ; Thu, 10 Dec 2015 19:21:51 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id qi3si300843pac.30.2015.12.10.19.21.51 for ; Thu, 10 Dec 2015 19:21:51 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 0/6] mm, x86/vdso: Special IO mapping improvements Date: Thu, 10 Dec 2015 19:21:41 -0800 Message-Id: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski This applies on top of the earlier vdso pvclock series I sent out. Once that lands in -tip, this will apply to -tip. This series cleans up the hack that is our vvar mapping. We currently initialize the vvar mapping as a special mapping vma backed by nothing whatsoever and then we abuse remap_pfn_range to populate it. This cheats the mm core, probably breaks under various evil madvise workloads, and prevents handling faults in more interesting ways. To clean it up, this series: - Adds a special mapping .fault operation - Adds a vm_insert_pfn_prot helper - Uses the new .fault infrastructure in x86's vdso and vvar mappings - Hardens the HPET mapping, mitigating an HW attack surface that bothers me I'd appreciate some review from the mm folks. Also, akpm, if you're okay with this whole series going in through -tip, that would be great -- it will avoid splitting it across two releases. Andy Lutomirski (6): mm: Add a vm_special_mapping .fault method mm: Add vm_insert_pfn_prot x86/vdso: Track each mm's loaded vdso image as well as its base x86,vdso: Use .fault for the vdso text mapping x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping x86/vdso: Disallow vvar access to vclock IO for never-used vclocks arch/x86/entry/vdso/vdso2c.h | 7 -- arch/x86/entry/vdso/vma.c | 124 ++++++++++++++++++++------------ arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++- arch/x86/include/asm/clocksource.h | 9 +-- arch/x86/include/asm/mmu.h | 3 +- arch/x86/include/asm/vdso.h | 3 - arch/x86/include/asm/vgtod.h | 6 ++ include/linux/mm.h | 2 + include/linux/mm_types.h | 19 ++++- mm/memory.c | 25 ++++++- mm/mmap.c | 13 ++-- 11 files changed, 150 insertions(+), 70 deletions(-) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f178.google.com (mail-pf0-f178.google.com [209.85.192.178]) by kanga.kvack.org (Postfix) with ESMTP id 763B66B0256 for ; Thu, 10 Dec 2015 22:21:53 -0500 (EST) Received: by pfv76 with SMTP id 76so5866315pfv.2 for ; Thu, 10 Dec 2015 19:21:53 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id f79si159752pfj.38.2015.12.10.19.21.52 for ; Thu, 10 Dec 2015 19:21:52 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Date: Thu, 10 Dec 2015 19:21:42 -0800 Message-Id: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski , Andy Lutomirski From: Andy Lutomirski Requiring special mappings to give a list of struct pages is inflexible: it prevents sane use of IO memory in a special mapping, it's inefficient (it requires arch code to initialize a list of struct pages, and it requires the mm core to walk the entire list just to figure out how long it is), and it prevents arch code from doing anything fancy when a special mapping fault occurs. Add a .fault method as an alternative to filling in a .pages array. Signed-off-by: Andy Lutomirski --- include/linux/mm_types.h | 19 ++++++++++++++++++- mm/mmap.c | 13 +++++++++---- 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index f8d1492a114f..3d315d373daf 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) } #endif +struct vm_fault; + struct vm_special_mapping { - const char *name; + const char *name; /* The name, e.g. "[vdso]". */ + + /* + * If .fault is not provided, this is points to a + * NULL-terminated array of pages that back the special mapping. + * + * This must not be NULL unless .fault is provided. + */ struct page **pages; + + /* + * If non-NULL, then this is called to resolve page faults + * on the special mapping. If used, .pages is not checked. + */ + int (*fault)(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, + struct vm_fault *vmf); }; enum tlb_flush_reason { diff --git a/mm/mmap.c b/mm/mmap.c index 2ce04a649f6b..f717453b1a57 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma, pgoff_t pgoff; struct page **pages; - if (vma->vm_ops == &legacy_special_mapping_vmops) + if (vma->vm_ops == &legacy_special_mapping_vmops) { pages = vma->vm_private_data; - else - pages = ((struct vm_special_mapping *)vma->vm_private_data)-> - pages; + } else { + struct vm_special_mapping *sm = vma->vm_private_data; + + if (sm->fault) + return sm->fault(sm, vma, vmf); + + pages = sm->pages; + } for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) pgoff--; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id D5CAE6B0259 for ; Thu, 10 Dec 2015 22:21:54 -0500 (EST) Received: by pacdm15 with SMTP id dm15so57971557pac.3 for ; Thu, 10 Dec 2015 19:21:54 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id q5si157141pfq.13.2015.12.10.19.21.54 for ; Thu, 10 Dec 2015 19:21:54 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 2/6] mm: Add vm_insert_pfn_prot Date: Thu, 10 Dec 2015 19:21:43 -0800 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski The x86 vvar mapping contains pages with differing cacheability flags. This is currently only supported using (io_)remap_pfn_range, but those functions can't be used inside page faults. Add vm_insert_pfn_prot to support varying cacheability within the same non-COW VMA in a more sane manner. x86 needs this to avoid a CRIU-breaking and memory-wasting explosion of VMAs when supporting userspace access to the HPET. Signed-off-by: Andy Lutomirski --- include/linux/mm.h | 2 ++ mm/memory.c | 25 +++++++++++++++++++++++-- 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00bad7793788..87ef1d7730ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2080,6 +2080,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/mm/memory.c b/mm/memory.c index c387430f06c3..a29f0b90fc56 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1564,8 +1564,29 @@ out: int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { + return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot); +} +EXPORT_SYMBOL(vm_insert_pfn); + +/** + * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot + * @vma: user vma to map to + * @addr: target user address of this page + * @pfn: source kernel pfn + * @pgprot: pgprot flags for the inserted page + * + * This is exactly like vm_insert_pfn, except that it allows drivers to + * to override pgprot on a per-page basis. + * + * This only makes sense for IO mappings, and it makes no sense for + * cow mappings. In general, using multiple vmas is preferable; + * vm_insert_pfn_prot should only be used if using multiple VMAs is + * impractical. + */ +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot) +{ int ret; - pgprot_t pgprot = vma->vm_page_prot; /* * Technically, architectures with pte_special can avoid all these * restrictions (same for remap_pfn_range). However we would like @@ -1587,7 +1608,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, return ret; } -EXPORT_SYMBOL(vm_insert_pfn); +EXPORT_SYMBOL(vm_insert_pfn_prot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 1B0AE6B0259 for ; Thu, 10 Dec 2015 22:21:56 -0500 (EST) Received: by pacdm15 with SMTP id dm15so57971789pac.3 for ; Thu, 10 Dec 2015 19:21:55 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id g14si160802pfd.164.2015.12.10.19.21.55 for ; Thu, 10 Dec 2015 19:21:55 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 3/6] x86/vdso: Track each mm's loaded vdso image as well as its base Date: Thu, 10 Dec 2015 19:21:44 -0800 Message-Id: <09f0c1f952c071b86b29cb39532a08851096e4b4.1449803537.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski As we start to do more intelligent things with the vdso at runtime (as opposed to just at mm initialization time), we'll need to know which vdso is in use. In principle, we could guess based on the mm type, but that's over-complicated and error-prone. Instead, just track it in the mmu context. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 1 + arch/x86/include/asm/mmu.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index b8f69e264ac4..80b021067bd6 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -121,6 +121,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) text_start = addr - image->sym_vvar_start; current->mm->context.vdso = (void __user *)text_start; + current->mm->context.vdso_image = image; /* * MAYWRITE to allow gdb to COW and set breakpoints diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 55234d5e7160..1ea0baef1175 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -19,7 +19,8 @@ typedef struct { #endif struct mutex lock; - void __user *vdso; + void __user *vdso; /* vdso base address */ + const struct vdso_image *vdso_image; /* vdso image in use */ atomic_t perf_rdpmc_allowed; /* nonzero if rdpmc is allowed */ } mm_context_t; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f174.google.com (mail-pf0-f174.google.com [209.85.192.174]) by kanga.kvack.org (Postfix) with ESMTP id 981116B025A for ; Thu, 10 Dec 2015 22:21:57 -0500 (EST) Received: by pfbg73 with SMTP id g73so59629132pfb.1 for ; Thu, 10 Dec 2015 19:21:57 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id lo5si289877pab.188.2015.12.10.19.21.56 for ; Thu, 10 Dec 2015 19:21:56 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 4/6] x86,vdso: Use .fault for the vdso text mapping Date: Thu, 10 Dec 2015 19:21:45 -0800 Message-Id: <6057b142a12a9d08a1980d4e3be9d65319a20a59.1449803537.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski The old scheme for mapping the vdso text is rather complicated. vdso2c generates a struct vm_special_mapping and a blank .pages array of the correct size for each vdso image. Init code in vdso/vma.c populates the .pages array for each vdso image, and the mapping code selects the appropriate struct vm_special_mapping. With .fault, we can use a less roundabout approach: vdso_fault just returns the appropriate page for the selected vdso image. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vdso2c.h | 7 ------- arch/x86/entry/vdso/vma.c | 26 +++++++++++++++++++------- arch/x86/include/asm/vdso.h | 3 --- 3 files changed, 19 insertions(+), 17 deletions(-) diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h index 0224987556ce..abe961c7c71c 100644 --- a/arch/x86/entry/vdso/vdso2c.h +++ b/arch/x86/entry/vdso/vdso2c.h @@ -150,16 +150,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, } fprintf(outfile, "\n};\n\n"); - fprintf(outfile, "static struct page *pages[%lu];\n\n", - mapping_size / 4096); - fprintf(outfile, "const struct vdso_image %s = {\n", name); fprintf(outfile, "\t.data = raw_data,\n"); fprintf(outfile, "\t.size = %lu,\n", mapping_size); - fprintf(outfile, "\t.text_mapping = {\n"); - fprintf(outfile, "\t\t.name = \"[vdso]\",\n"); - fprintf(outfile, "\t\t.pages = pages,\n"); - fprintf(outfile, "\t},\n"); if (alt_sec) { fprintf(outfile, "\t.alt = %lu,\n", (unsigned long)GET_LE(&alt_sec->sh_offset)); diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 80b021067bd6..eb50d7c1f161 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -27,13 +27,7 @@ unsigned int __read_mostly vdso64_enabled = 1; void __init init_vdso_image(const struct vdso_image *image) { - int i; - int npages = (image->size) / PAGE_SIZE; - BUG_ON(image->size % PAGE_SIZE != 0); - for (i = 0; i < npages; i++) - image->text_mapping.pages[i] = - virt_to_page(image->data + i*PAGE_SIZE); apply_alternatives((struct alt_instr *)(image->data + image->alt), (struct alt_instr *)(image->data + image->alt + @@ -90,6 +84,24 @@ static unsigned long vdso_addr(unsigned long start, unsigned len) #endif } +static int vdso_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + const struct vdso_image *image = vma->vm_mm->context.vdso_image; + + if (!image || (vmf->pgoff << PAGE_SHIFT) >= image->size) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(image->data + (vmf->pgoff << PAGE_SHIFT)); + get_page(vmf->page); + return 0; +} + +static const struct vm_special_mapping text_mapping = { + .name = "[vdso]", + .fault = vdso_fault, +}; + static int map_vdso(const struct vdso_image *image, bool calculate_addr) { struct mm_struct *mm = current->mm; @@ -131,7 +143,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) image->size, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - &image->text_mapping); + &text_mapping); if (IS_ERR(vma)) { ret = PTR_ERR(vma); diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h index deabaf9759b6..43dc55be524e 100644 --- a/arch/x86/include/asm/vdso.h +++ b/arch/x86/include/asm/vdso.h @@ -13,9 +13,6 @@ struct vdso_image { void *data; unsigned long size; /* Always a multiple of PAGE_SIZE */ - /* text_mapping.pages is big enough for data/size page pointers */ - struct vm_special_mapping text_mapping; - unsigned long alt, alt_len; long sym_vvar_start; /* Negative offset to the vvar area */ -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id ED0556B025B for ; Thu, 10 Dec 2015 22:21:58 -0500 (EST) Received: by pacwq6 with SMTP id wq6so57763126pac.1 for ; Thu, 10 Dec 2015 19:21:58 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id q5si157539pfq.13.2015.12.10.19.21.58 for ; Thu, 10 Dec 2015 19:21:58 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 5/6] x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping Date: Thu, 10 Dec 2015 19:21:46 -0800 Message-Id: <1312c2af60551278d65e7e410b1cf845fefd7769.1449803537.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski This is IMO much less ugly, and it also opens the door to disallowing unprivileged userspace HPET access on systems with usable TSCs. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 97 ++++++++++++++++++++++++++++------------------- 1 file changed, 57 insertions(+), 40 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index eb50d7c1f161..02221e98b83f 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -102,18 +102,69 @@ static const struct vm_special_mapping text_mapping = { .fault = vdso_fault, }; +static int vvar_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + const struct vdso_image *image = vma->vm_mm->context.vdso_image; + long sym_offset; + int ret = -EFAULT; + + if (!image) + return VM_FAULT_SIGBUS; + + sym_offset = (long)(vmf->pgoff << PAGE_SHIFT) + + image->sym_vvar_start; + + /* + * Sanity check: a symbol offset of zero means that the page + * does not exist for this vdso image, not that the page is at + * offset zero relative to the text mapping. This should be + * impossible here, because sym_offset should only be zero for + * the page past the end of the vvar mapping. + */ + if (sym_offset == 0) + return VM_FAULT_SIGBUS; + + if (sym_offset == image->sym_vvar_page) { + ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, + __pa_symbol(&__vvar_page) >> PAGE_SHIFT); + } else if (sym_offset == image->sym_hpet_page) { +#ifdef CONFIG_HPET_TIMER + if (hpet_address) { + ret = vm_insert_pfn_prot( + vma, + (unsigned long)vmf->virtual_address, + hpet_address >> PAGE_SHIFT, + pgprot_noncached(PAGE_READONLY)); + } +#endif + } else if (sym_offset == image->sym_pvclock_page) { + struct pvclock_vsyscall_time_info *pvti = + pvclock_pvti_cpu0_va(); + if (pvti) { + ret = vm_insert_pfn( + vma, + (unsigned long)vmf->virtual_address, + __pa(pvti) >> PAGE_SHIFT); + } + } + + if (ret == 0) + return VM_FAULT_NOPAGE; + + return VM_FAULT_SIGBUS; +} + static int map_vdso(const struct vdso_image *image, bool calculate_addr) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned long addr, text_start; int ret = 0; - static struct page *no_pages[] = {NULL}; - static struct vm_special_mapping vvar_mapping = { + static const struct vm_special_mapping vvar_mapping = { .name = "[vvar]", - .pages = no_pages, + .fault = vvar_fault, }; - struct pvclock_vsyscall_time_info *pvti; if (calculate_addr) { addr = vdso_addr(current->mm->start_stack, @@ -153,7 +204,8 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) vma = _install_special_mapping(mm, addr, -image->sym_vvar_start, - VM_READ|VM_MAYREAD, + VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP| + VM_PFNMAP, &vvar_mapping); if (IS_ERR(vma)) { @@ -161,41 +213,6 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) goto up_fail; } - if (image->sym_vvar_page) - ret = remap_pfn_range(vma, - text_start + image->sym_vvar_page, - __pa_symbol(&__vvar_page) >> PAGE_SHIFT, - PAGE_SIZE, - PAGE_READONLY); - - if (ret) - goto up_fail; - -#ifdef CONFIG_HPET_TIMER - if (hpet_address && image->sym_hpet_page) { - ret = io_remap_pfn_range(vma, - text_start + image->sym_hpet_page, - hpet_address >> PAGE_SHIFT, - PAGE_SIZE, - pgprot_noncached(PAGE_READONLY)); - - if (ret) - goto up_fail; - } -#endif - - pvti = pvclock_pvti_cpu0_va(); - if (pvti && image->sym_pvclock_page) { - ret = remap_pfn_range(vma, - text_start + image->sym_pvclock_page, - __pa(pvti) >> PAGE_SHIFT, - PAGE_SIZE, - PAGE_READONLY); - - if (ret) - goto up_fail; - } - up_fail: if (ret) current->mm->context.vdso = NULL; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f175.google.com (mail-pf0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id 8195D6B025C for ; Thu, 10 Dec 2015 22:22:00 -0500 (EST) Received: by pfbu66 with SMTP id u66so14692508pfb.3 for ; Thu, 10 Dec 2015 19:22:00 -0800 (PST) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.136]) by mx.google.com with ESMTP id 85si161453pfn.11.2015.12.10.19.21.59 for ; Thu, 10 Dec 2015 19:21:59 -0800 (PST) From: Andy Lutomirski Subject: [PATCH 6/6] x86/vdso: Disallow vvar access to vclock IO for never-used vclocks Date: Thu, 10 Dec 2015 19:21:47 -0800 Message-Id: <3edcff8e6f4169cccbb227d74fc65bf061ded6e2.1449803537.git.luto@kernel.org> In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski It makes me uncomfortable that even modern systems grant every process direct read access to the HPET. While fixing this for real without regressing anything is a mess (unmapping the HPET is tricky because we don't adequately track all the mappings), we can do almost as well by tracking which vclocks have ever been used and only allowing pages associated with used vclocks to be faulted in. This will cause rogue programs that try to peek at the HPET to get SIGBUS instead on most systems. We can't restrict faults to vclock pages that are associated with the currently selected vclock due to a race: a process could start to access the HPET for the first time and race against a switch away from the HPET as the current clocksource. We can't segfault the process trying to peek at the HPET in this case, even though the process isn't going to do anything useful with the data. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 4 ++-- arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++++++++- arch/x86/include/asm/clocksource.h | 9 +++++---- arch/x86/include/asm/vgtod.h | 6 ++++++ 4 files changed, 21 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 02221e98b83f..aa4f5b99f1f3 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -130,7 +130,7 @@ static int vvar_fault(const struct vm_special_mapping *sm, __pa_symbol(&__vvar_page) >> PAGE_SHIFT); } else if (sym_offset == image->sym_hpet_page) { #ifdef CONFIG_HPET_TIMER - if (hpet_address) { + if (hpet_address && vclock_was_used(VCLOCK_HPET)) { ret = vm_insert_pfn_prot( vma, (unsigned long)vmf->virtual_address, @@ -141,7 +141,7 @@ static int vvar_fault(const struct vm_special_mapping *sm, } else if (sym_offset == image->sym_pvclock_page) { struct pvclock_vsyscall_time_info *pvti = pvclock_pvti_cpu0_va(); - if (pvti) { + if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) { ret = vm_insert_pfn( vma, (unsigned long)vmf->virtual_address, diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 51e330416995..0fb3a104ac62 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,8 @@ #include #include +int vclocks_used __read_mostly; + DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data); void update_vsyscall_tz(void) @@ -26,12 +28,17 @@ void update_vsyscall_tz(void) void update_vsyscall(struct timekeeper *tk) { + int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data; + /* Mark the new vclock used. */ + BUILD_BUG_ON(VCLOCK_MAX >= 32); + WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << vclock_mode)); + gtod_write_begin(vdata); /* copy vsyscall data */ - vdata->vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; + vdata->vclock_mode = vclock_mode; vdata->cycle_last = tk->tkr_mono.cycle_last; vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h index eda81dc0f4ae..d194266acb28 100644 --- a/arch/x86/include/asm/clocksource.h +++ b/arch/x86/include/asm/clocksource.h @@ -3,10 +3,11 @@ #ifndef _ASM_X86_CLOCKSOURCE_H #define _ASM_X86_CLOCKSOURCE_H -#define VCLOCK_NONE 0 /* No vDSO clock available. */ -#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ -#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */ -#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */ +#define VCLOCK_NONE 0 /* No vDSO clock available. */ +#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ +#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */ +#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */ +#define VCLOCK_MAX 3 struct arch_clocksource_data { int vclock_mode; diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index f556c4843aa1..e728699db774 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -37,6 +37,12 @@ struct vsyscall_gtod_data { }; extern struct vsyscall_gtod_data vsyscall_gtod_data; +extern int vclocks_used; +static inline bool vclock_was_used(int vclock) +{ + return READ_ONCE(vclocks_used) & (1 << vclock); +} + static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s) { unsigned ret; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 7FF9B6B0254 for ; Fri, 11 Dec 2015 17:28:16 -0500 (EST) Received: by pacdm15 with SMTP id dm15so71806010pac.3 for ; Fri, 11 Dec 2015 14:28:16 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id d5si3758219pas.82.2015.12.11.14.28.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:28:15 -0800 (PST) Date: Fri, 11 Dec 2015 14:28:14 -0800 From: Andrew Morton Subject: Re: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Message-Id: <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> In-Reply-To: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> References: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andy Lutomirski On Thu, 10 Dec 2015 19:21:42 -0800 Andy Lutomirski wrote: > From: Andy Lutomirski > > Requiring special mappings to give a list of struct pages is > inflexible: it prevents sane use of IO memory in a special mapping, > it's inefficient (it requires arch code to initialize a list of > struct pages, and it requires the mm core to walk the entire list > just to figure out how long it is), and it prevents arch code from > doing anything fancy when a special mapping fault occurs. > > Add a .fault method as an alternative to filling in a .pages array. > > ... > > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) > } > #endif > > +struct vm_fault; > + > struct vm_special_mapping > { We may as well fix the code layout while we're in there. > - const char *name; > + const char *name; /* The name, e.g. "[vdso]". */ > + > + /* > + * If .fault is not provided, this is points to a s/is// > + * NULL-terminated array of pages that back the special mapping. > + * > + * This must not be NULL unless .fault is provided. > + */ > struct page **pages; > + > + /* > + * If non-NULL, then this is called to resolve page faults > + * on the special mapping. If used, .pages is not checked. > + */ > + int (*fault)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma, > + struct vm_fault *vmf); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index 2ce04a649f6b..f717453b1a57 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma, > pgoff_t pgoff; > struct page **pages; > > - if (vma->vm_ops == &legacy_special_mapping_vmops) > + if (vma->vm_ops == &legacy_special_mapping_vmops) { > pages = vma->vm_private_data; > - else > - pages = ((struct vm_special_mapping *)vma->vm_private_data)-> > - pages; > + } else { > + struct vm_special_mapping *sm = vma->vm_private_data; > + > + if (sm->fault) > + return sm->fault(sm, vma, vmf); > + > + pages = sm->pages; > + } > > for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) > pgoff--; Otherwise looks OK. I'll assume this will be merged via an x86 tree. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id 82AC16B0253 for ; Fri, 11 Dec 2015 17:33:02 -0500 (EST) Received: by padhk6 with SMTP id hk6so31720644pad.2 for ; Fri, 11 Dec 2015 14:33:02 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id bf7si3784542pad.239.2015.12.11.14.33.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:33:01 -0800 (PST) Date: Fri, 11 Dec 2015 14:33:00 -0800 From: Andrew Morton Subject: Re: [PATCH 2/6] mm: Add vm_insert_pfn_prot Message-Id: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski wrote: > The x86 vvar mapping contains pages with differing cacheability > flags. This is currently only supported using (io_)remap_pfn_range, > but those functions can't be used inside page faults. Foggy. What does "support" mean here? > Add vm_insert_pfn_prot to support varying cacheability within the > same non-COW VMA in a more sane manner. Here, "support" presumably means "insertion of pfns". Can we spell all this out more completely please? > x86 needs this to avoid a CRIU-breaking and memory-wasting explosion > of VMAs when supporting userspace access to the HPET. > OtherwiseAck. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f44.google.com (mail-oi0-f44.google.com [209.85.218.44]) by kanga.kvack.org (Postfix) with ESMTP id 07B1F6B0253 for ; Fri, 11 Dec 2015 17:44:59 -0500 (EST) Received: by oihr132 with SMTP id r132so11216434oih.1 for ; Fri, 11 Dec 2015 14:44:58 -0800 (PST) Received: from mail-oi0-x22b.google.com (mail-oi0-x22b.google.com. [2607:f8b0:4003:c06::22b]) by mx.google.com with ESMTPS id k17si19125891oib.66.2015.12.11.14.44.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:44:58 -0800 (PST) Received: by oihr132 with SMTP id r132so11216324oih.1 for ; Fri, 11 Dec 2015 14:44:58 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> References: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:44:38 -0800 Message-ID: Subject: Re: [PATCH 2/6] mm: Add vm_insert_pfn_prot Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On Fri, Dec 11, 2015 at 2:33 PM, Andrew Morton wrote: > On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski wrote: > >> The x86 vvar mapping contains pages with differing cacheability >> flags. This is currently only supported using (io_)remap_pfn_range, >> but those functions can't be used inside page faults. > > Foggy. What does "support" mean here? We currently have a hack in which every x86 mm has a "vvar" vma that has a .fault handler that always fails (it's the vm_special_mapping fault handler backed by an empty pages array). To make everything work, at mm startup, the vdso code uses remap_pfn_range and io_remap_pfn_range to poke the pfns into the page tables. I'd much rather implement this using the new .fault mechanism, and the canonical way to implement .fault seems to be vm_insert_pfn, and vm_insert_pfn doesn't allow setting per-page cacheability. Unfortunately, one of the three x86 vvar pages needs to be uncacheable because it's a genuine IO page, so I can't use vm_insert_pfn. I suppose I could just call io_remap_pfn_range from .fault, but I think that's frowned upon. Admittedly, I wasn't really sure *why* that's frowned upon. This goes way back to 2007 (e0dc0d8f4a327d033bfb63d43f113d5f31d11b3c) when .fault got fancier. > >> Add vm_insert_pfn_prot to support varying cacheability within the >> same non-COW VMA in a more sane manner. > > Here, "support" presumably means "insertion of pfns". Can we spell all > this out more completely please? Yes, will fix. > >> x86 needs this to avoid a CRIU-breaking and memory-wasting explosion >> of VMAs when supporting userspace access to the HPET. >> > > OtherwiseAck. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f45.google.com (mail-wm0-f45.google.com [74.125.82.45]) by kanga.kvack.org (Postfix) with ESMTP id 235EA6B0038 for ; Mon, 14 Dec 2015 04:17:14 -0500 (EST) Received: by mail-wm0-f45.google.com with SMTP id p66so52149688wmp.1 for ; Mon, 14 Dec 2015 01:17:14 -0800 (PST) Received: from mail-wm0-x22a.google.com (mail-wm0-x22a.google.com. [2a00:1450:400c:c09::22a]) by mx.google.com with ESMTPS id w191si24098731wme.107.2015.12.14.01.17.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Dec 2015 01:17:13 -0800 (PST) Received: by mail-wm0-x22a.google.com with SMTP id n186so35767579wmn.0 for ; Mon, 14 Dec 2015 01:17:12 -0800 (PST) Date: Mon, 14 Dec 2015 10:17:09 +0100 From: Ingo Molnar Subject: Re: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Message-ID: <20151214091709.GA29878@gmail.com> References: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andy Lutomirski , Thomas Gleixner , "H. Peter Anvin" * Andrew Morton wrote: > > + } else { > > + struct vm_special_mapping *sm = vma->vm_private_data; > > + > > + if (sm->fault) > > + return sm->fault(sm, vma, vmf); > > + > > + pages = sm->pages; > > + } > > > > for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) > > pgoff--; > > Otherwise looks OK. I'll assume this will be merged via an x86 tree. Yeah, was hoping to be able to do that with your Acked-by. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754061AbbLKDVx (ORCPT ); Thu, 10 Dec 2015 22:21:53 -0500 Received: from mail.kernel.org ([198.145.29.136]:41007 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753945AbbLKDVv (ORCPT ); Thu, 10 Dec 2015 22:21:51 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 0/6] mm, x86/vdso: Special IO mapping improvements Date: Thu, 10 Dec 2015 19:21:41 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This applies on top of the earlier vdso pvclock series I sent out. Once that lands in -tip, this will apply to -tip. This series cleans up the hack that is our vvar mapping. We currently initialize the vvar mapping as a special mapping vma backed by nothing whatsoever and then we abuse remap_pfn_range to populate it. This cheats the mm core, probably breaks under various evil madvise workloads, and prevents handling faults in more interesting ways. To clean it up, this series: - Adds a special mapping .fault operation - Adds a vm_insert_pfn_prot helper - Uses the new .fault infrastructure in x86's vdso and vvar mappings - Hardens the HPET mapping, mitigating an HW attack surface that bothers me I'd appreciate some review from the mm folks. Also, akpm, if you're okay with this whole series going in through -tip, that would be great -- it will avoid splitting it across two releases. Andy Lutomirski (6): mm: Add a vm_special_mapping .fault method mm: Add vm_insert_pfn_prot x86/vdso: Track each mm's loaded vdso image as well as its base x86,vdso: Use .fault for the vdso text mapping x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping x86/vdso: Disallow vvar access to vclock IO for never-used vclocks arch/x86/entry/vdso/vdso2c.h | 7 -- arch/x86/entry/vdso/vma.c | 124 ++++++++++++++++++++------------ arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++- arch/x86/include/asm/clocksource.h | 9 +-- arch/x86/include/asm/mmu.h | 3 +- arch/x86/include/asm/vdso.h | 3 - arch/x86/include/asm/vgtod.h | 6 ++ include/linux/mm.h | 2 + include/linux/mm_types.h | 19 ++++- mm/memory.c | 25 ++++++- mm/mmap.c | 13 ++-- 11 files changed, 150 insertions(+), 70 deletions(-) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754149AbbLKDVz (ORCPT ); Thu, 10 Dec 2015 22:21:55 -0500 Received: from mail.kernel.org ([198.145.29.136]:41022 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753948AbbLKDVw (ORCPT ); Thu, 10 Dec 2015 22:21:52 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski , Andy Lutomirski Subject: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Date: Thu, 10 Dec 2015 19:21:42 -0800 Message-Id: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Andy Lutomirski Requiring special mappings to give a list of struct pages is inflexible: it prevents sane use of IO memory in a special mapping, it's inefficient (it requires arch code to initialize a list of struct pages, and it requires the mm core to walk the entire list just to figure out how long it is), and it prevents arch code from doing anything fancy when a special mapping fault occurs. Add a .fault method as an alternative to filling in a .pages array. Signed-off-by: Andy Lutomirski --- include/linux/mm_types.h | 19 ++++++++++++++++++- mm/mmap.c | 13 +++++++++---- 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index f8d1492a114f..3d315d373daf 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) } #endif +struct vm_fault; + struct vm_special_mapping { - const char *name; + const char *name; /* The name, e.g. "[vdso]". */ + + /* + * If .fault is not provided, this is points to a + * NULL-terminated array of pages that back the special mapping. + * + * This must not be NULL unless .fault is provided. + */ struct page **pages; + + /* + * If non-NULL, then this is called to resolve page faults + * on the special mapping. If used, .pages is not checked. + */ + int (*fault)(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, + struct vm_fault *vmf); }; enum tlb_flush_reason { diff --git a/mm/mmap.c b/mm/mmap.c index 2ce04a649f6b..f717453b1a57 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma, pgoff_t pgoff; struct page **pages; - if (vma->vm_ops == &legacy_special_mapping_vmops) + if (vma->vm_ops == &legacy_special_mapping_vmops) { pages = vma->vm_private_data; - else - pages = ((struct vm_special_mapping *)vma->vm_private_data)-> - pages; + } else { + struct vm_special_mapping *sm = vma->vm_private_data; + + if (sm->fault) + return sm->fault(sm, vma, vmf); + + pages = sm->pages; + } for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) pgoff--; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751890AbbLKDYO (ORCPT ); Thu, 10 Dec 2015 22:24:14 -0500 Received: from mail.kernel.org ([198.145.29.136]:41044 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753945AbbLKDVy (ORCPT ); Thu, 10 Dec 2015 22:21:54 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 2/6] mm: Add vm_insert_pfn_prot Date: Thu, 10 Dec 2015 19:21:43 -0800 Message-Id: X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The x86 vvar mapping contains pages with differing cacheability flags. This is currently only supported using (io_)remap_pfn_range, but those functions can't be used inside page faults. Add vm_insert_pfn_prot to support varying cacheability within the same non-COW VMA in a more sane manner. x86 needs this to avoid a CRIU-breaking and memory-wasting explosion of VMAs when supporting userspace access to the HPET. Signed-off-by: Andy Lutomirski --- include/linux/mm.h | 2 ++ mm/memory.c | 25 +++++++++++++++++++++++-- 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00bad7793788..87ef1d7730ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2080,6 +2080,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/mm/memory.c b/mm/memory.c index c387430f06c3..a29f0b90fc56 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1564,8 +1564,29 @@ out: int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { + return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot); +} +EXPORT_SYMBOL(vm_insert_pfn); + +/** + * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot + * @vma: user vma to map to + * @addr: target user address of this page + * @pfn: source kernel pfn + * @pgprot: pgprot flags for the inserted page + * + * This is exactly like vm_insert_pfn, except that it allows drivers to + * to override pgprot on a per-page basis. + * + * This only makes sense for IO mappings, and it makes no sense for + * cow mappings. In general, using multiple vmas is preferable; + * vm_insert_pfn_prot should only be used if using multiple VMAs is + * impractical. + */ +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot) +{ int ret; - pgprot_t pgprot = vma->vm_page_prot; /* * Technically, architectures with pte_special can avoid all these * restrictions (same for remap_pfn_range). However we would like @@ -1587,7 +1608,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, return ret; } -EXPORT_SYMBOL(vm_insert_pfn); +EXPORT_SYMBOL(vm_insert_pfn_prot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754224AbbLKDV5 (ORCPT ); Thu, 10 Dec 2015 22:21:57 -0500 Received: from mail.kernel.org ([198.145.29.136]:41064 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753948AbbLKDVz (ORCPT ); Thu, 10 Dec 2015 22:21:55 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 3/6] x86/vdso: Track each mm's loaded vdso image as well as its base Date: Thu, 10 Dec 2015 19:21:44 -0800 Message-Id: <09f0c1f952c071b86b29cb39532a08851096e4b4.1449803537.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org As we start to do more intelligent things with the vdso at runtime (as opposed to just at mm initialization time), we'll need to know which vdso is in use. In principle, we could guess based on the mm type, but that's over-complicated and error-prone. Instead, just track it in the mmu context. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 1 + arch/x86/include/asm/mmu.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index b8f69e264ac4..80b021067bd6 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -121,6 +121,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) text_start = addr - image->sym_vvar_start; current->mm->context.vdso = (void __user *)text_start; + current->mm->context.vdso_image = image; /* * MAYWRITE to allow gdb to COW and set breakpoints diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 55234d5e7160..1ea0baef1175 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -19,7 +19,8 @@ typedef struct { #endif struct mutex lock; - void __user *vdso; + void __user *vdso; /* vdso base address */ + const struct vdso_image *vdso_image; /* vdso image in use */ atomic_t perf_rdpmc_allowed; /* nonzero if rdpmc is allowed */ } mm_context_t; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754648AbbLKDW4 (ORCPT ); Thu, 10 Dec 2015 22:22:56 -0500 Received: from mail.kernel.org ([198.145.29.136]:41087 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754159AbbLKDV5 (ORCPT ); Thu, 10 Dec 2015 22:21:57 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 4/6] x86,vdso: Use .fault for the vdso text mapping Date: Thu, 10 Dec 2015 19:21:45 -0800 Message-Id: <6057b142a12a9d08a1980d4e3be9d65319a20a59.1449803537.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The old scheme for mapping the vdso text is rather complicated. vdso2c generates a struct vm_special_mapping and a blank .pages array of the correct size for each vdso image. Init code in vdso/vma.c populates the .pages array for each vdso image, and the mapping code selects the appropriate struct vm_special_mapping. With .fault, we can use a less roundabout approach: vdso_fault just returns the appropriate page for the selected vdso image. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vdso2c.h | 7 ------- arch/x86/entry/vdso/vma.c | 26 +++++++++++++++++++------- arch/x86/include/asm/vdso.h | 3 --- 3 files changed, 19 insertions(+), 17 deletions(-) diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h index 0224987556ce..abe961c7c71c 100644 --- a/arch/x86/entry/vdso/vdso2c.h +++ b/arch/x86/entry/vdso/vdso2c.h @@ -150,16 +150,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, } fprintf(outfile, "\n};\n\n"); - fprintf(outfile, "static struct page *pages[%lu];\n\n", - mapping_size / 4096); - fprintf(outfile, "const struct vdso_image %s = {\n", name); fprintf(outfile, "\t.data = raw_data,\n"); fprintf(outfile, "\t.size = %lu,\n", mapping_size); - fprintf(outfile, "\t.text_mapping = {\n"); - fprintf(outfile, "\t\t.name = \"[vdso]\",\n"); - fprintf(outfile, "\t\t.pages = pages,\n"); - fprintf(outfile, "\t},\n"); if (alt_sec) { fprintf(outfile, "\t.alt = %lu,\n", (unsigned long)GET_LE(&alt_sec->sh_offset)); diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 80b021067bd6..eb50d7c1f161 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -27,13 +27,7 @@ unsigned int __read_mostly vdso64_enabled = 1; void __init init_vdso_image(const struct vdso_image *image) { - int i; - int npages = (image->size) / PAGE_SIZE; - BUG_ON(image->size % PAGE_SIZE != 0); - for (i = 0; i < npages; i++) - image->text_mapping.pages[i] = - virt_to_page(image->data + i*PAGE_SIZE); apply_alternatives((struct alt_instr *)(image->data + image->alt), (struct alt_instr *)(image->data + image->alt + @@ -90,6 +84,24 @@ static unsigned long vdso_addr(unsigned long start, unsigned len) #endif } +static int vdso_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + const struct vdso_image *image = vma->vm_mm->context.vdso_image; + + if (!image || (vmf->pgoff << PAGE_SHIFT) >= image->size) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(image->data + (vmf->pgoff << PAGE_SHIFT)); + get_page(vmf->page); + return 0; +} + +static const struct vm_special_mapping text_mapping = { + .name = "[vdso]", + .fault = vdso_fault, +}; + static int map_vdso(const struct vdso_image *image, bool calculate_addr) { struct mm_struct *mm = current->mm; @@ -131,7 +143,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) image->size, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - &image->text_mapping); + &text_mapping); if (IS_ERR(vma)) { ret = PTR_ERR(vma); diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h index deabaf9759b6..43dc55be524e 100644 --- a/arch/x86/include/asm/vdso.h +++ b/arch/x86/include/asm/vdso.h @@ -13,9 +13,6 @@ struct vdso_image { void *data; unsigned long size; /* Always a multiple of PAGE_SIZE */ - /* text_mapping.pages is big enough for data/size page pointers */ - struct vm_special_mapping text_mapping; - unsigned long alt, alt_len; long sym_vvar_start; /* Negative offset to the vvar area */ -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754302AbbLKDWB (ORCPT ); Thu, 10 Dec 2015 22:22:01 -0500 Received: from mail.kernel.org ([198.145.29.136]:41108 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753948AbbLKDV6 (ORCPT ); Thu, 10 Dec 2015 22:21:58 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 5/6] x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping Date: Thu, 10 Dec 2015 19:21:46 -0800 Message-Id: <1312c2af60551278d65e7e410b1cf845fefd7769.1449803537.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is IMO much less ugly, and it also opens the door to disallowing unprivileged userspace HPET access on systems with usable TSCs. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 97 ++++++++++++++++++++++++++++------------------- 1 file changed, 57 insertions(+), 40 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index eb50d7c1f161..02221e98b83f 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -102,18 +102,69 @@ static const struct vm_special_mapping text_mapping = { .fault = vdso_fault, }; +static int vvar_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + const struct vdso_image *image = vma->vm_mm->context.vdso_image; + long sym_offset; + int ret = -EFAULT; + + if (!image) + return VM_FAULT_SIGBUS; + + sym_offset = (long)(vmf->pgoff << PAGE_SHIFT) + + image->sym_vvar_start; + + /* + * Sanity check: a symbol offset of zero means that the page + * does not exist for this vdso image, not that the page is at + * offset zero relative to the text mapping. This should be + * impossible here, because sym_offset should only be zero for + * the page past the end of the vvar mapping. + */ + if (sym_offset == 0) + return VM_FAULT_SIGBUS; + + if (sym_offset == image->sym_vvar_page) { + ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, + __pa_symbol(&__vvar_page) >> PAGE_SHIFT); + } else if (sym_offset == image->sym_hpet_page) { +#ifdef CONFIG_HPET_TIMER + if (hpet_address) { + ret = vm_insert_pfn_prot( + vma, + (unsigned long)vmf->virtual_address, + hpet_address >> PAGE_SHIFT, + pgprot_noncached(PAGE_READONLY)); + } +#endif + } else if (sym_offset == image->sym_pvclock_page) { + struct pvclock_vsyscall_time_info *pvti = + pvclock_pvti_cpu0_va(); + if (pvti) { + ret = vm_insert_pfn( + vma, + (unsigned long)vmf->virtual_address, + __pa(pvti) >> PAGE_SHIFT); + } + } + + if (ret == 0) + return VM_FAULT_NOPAGE; + + return VM_FAULT_SIGBUS; +} + static int map_vdso(const struct vdso_image *image, bool calculate_addr) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned long addr, text_start; int ret = 0; - static struct page *no_pages[] = {NULL}; - static struct vm_special_mapping vvar_mapping = { + static const struct vm_special_mapping vvar_mapping = { .name = "[vvar]", - .pages = no_pages, + .fault = vvar_fault, }; - struct pvclock_vsyscall_time_info *pvti; if (calculate_addr) { addr = vdso_addr(current->mm->start_stack, @@ -153,7 +204,8 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) vma = _install_special_mapping(mm, addr, -image->sym_vvar_start, - VM_READ|VM_MAYREAD, + VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP| + VM_PFNMAP, &vvar_mapping); if (IS_ERR(vma)) { @@ -161,41 +213,6 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) goto up_fail; } - if (image->sym_vvar_page) - ret = remap_pfn_range(vma, - text_start + image->sym_vvar_page, - __pa_symbol(&__vvar_page) >> PAGE_SHIFT, - PAGE_SIZE, - PAGE_READONLY); - - if (ret) - goto up_fail; - -#ifdef CONFIG_HPET_TIMER - if (hpet_address && image->sym_hpet_page) { - ret = io_remap_pfn_range(vma, - text_start + image->sym_hpet_page, - hpet_address >> PAGE_SHIFT, - PAGE_SIZE, - pgprot_noncached(PAGE_READONLY)); - - if (ret) - goto up_fail; - } -#endif - - pvti = pvclock_pvti_cpu0_va(); - if (pvti && image->sym_pvclock_page) { - ret = remap_pfn_range(vma, - text_start + image->sym_pvclock_page, - __pa(pvti) >> PAGE_SHIFT, - PAGE_SIZE, - PAGE_READONLY); - - if (ret) - goto up_fail; - } - up_fail: if (ret) current->mm->context.vdso = NULL; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754386AbbLKDWI (ORCPT ); Thu, 10 Dec 2015 22:22:08 -0500 Received: from mail.kernel.org ([198.145.29.136]:41138 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754269AbbLKDWA (ORCPT ); Thu, 10 Dec 2015 22:22:00 -0500 From: Andy Lutomirski To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Andy Lutomirski Subject: [PATCH 6/6] x86/vdso: Disallow vvar access to vclock IO for never-used vclocks Date: Thu, 10 Dec 2015 19:21:47 -0800 Message-Id: <3edcff8e6f4169cccbb227d74fc65bf061ded6e2.1449803537.git.luto@kernel.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It makes me uncomfortable that even modern systems grant every process direct read access to the HPET. While fixing this for real without regressing anything is a mess (unmapping the HPET is tricky because we don't adequately track all the mappings), we can do almost as well by tracking which vclocks have ever been used and only allowing pages associated with used vclocks to be faulted in. This will cause rogue programs that try to peek at the HPET to get SIGBUS instead on most systems. We can't restrict faults to vclock pages that are associated with the currently selected vclock due to a race: a process could start to access the HPET for the first time and race against a switch away from the HPET as the current clocksource. We can't segfault the process trying to peek at the HPET in this case, even though the process isn't going to do anything useful with the data. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vma.c | 4 ++-- arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++++++++- arch/x86/include/asm/clocksource.h | 9 +++++---- arch/x86/include/asm/vgtod.h | 6 ++++++ 4 files changed, 21 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 02221e98b83f..aa4f5b99f1f3 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -130,7 +130,7 @@ static int vvar_fault(const struct vm_special_mapping *sm, __pa_symbol(&__vvar_page) >> PAGE_SHIFT); } else if (sym_offset == image->sym_hpet_page) { #ifdef CONFIG_HPET_TIMER - if (hpet_address) { + if (hpet_address && vclock_was_used(VCLOCK_HPET)) { ret = vm_insert_pfn_prot( vma, (unsigned long)vmf->virtual_address, @@ -141,7 +141,7 @@ static int vvar_fault(const struct vm_special_mapping *sm, } else if (sym_offset == image->sym_pvclock_page) { struct pvclock_vsyscall_time_info *pvti = pvclock_pvti_cpu0_va(); - if (pvti) { + if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) { ret = vm_insert_pfn( vma, (unsigned long)vmf->virtual_address, diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 51e330416995..0fb3a104ac62 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,8 @@ #include #include +int vclocks_used __read_mostly; + DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data); void update_vsyscall_tz(void) @@ -26,12 +28,17 @@ void update_vsyscall_tz(void) void update_vsyscall(struct timekeeper *tk) { + int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data; + /* Mark the new vclock used. */ + BUILD_BUG_ON(VCLOCK_MAX >= 32); + WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << vclock_mode)); + gtod_write_begin(vdata); /* copy vsyscall data */ - vdata->vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; + vdata->vclock_mode = vclock_mode; vdata->cycle_last = tk->tkr_mono.cycle_last; vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h index eda81dc0f4ae..d194266acb28 100644 --- a/arch/x86/include/asm/clocksource.h +++ b/arch/x86/include/asm/clocksource.h @@ -3,10 +3,11 @@ #ifndef _ASM_X86_CLOCKSOURCE_H #define _ASM_X86_CLOCKSOURCE_H -#define VCLOCK_NONE 0 /* No vDSO clock available. */ -#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ -#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */ -#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */ +#define VCLOCK_NONE 0 /* No vDSO clock available. */ +#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ +#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */ +#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */ +#define VCLOCK_MAX 3 struct arch_clocksource_data { int vclock_mode; diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index f556c4843aa1..e728699db774 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -37,6 +37,12 @@ struct vsyscall_gtod_data { }; extern struct vsyscall_gtod_data vsyscall_gtod_data; +extern int vclocks_used; +static inline bool vclock_was_used(int vclock) +{ + return READ_ONCE(vclocks_used) & (1 << vclock); +} + static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s) { unsigned ret; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755102AbbLKW2R (ORCPT ); Fri, 11 Dec 2015 17:28:17 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:50959 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753315AbbLKW2Q (ORCPT ); Fri, 11 Dec 2015 17:28:16 -0500 Date: Fri, 11 Dec 2015 14:28:14 -0800 From: Andrew Morton To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andy Lutomirski Subject: Re: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Message-Id: <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> In-Reply-To: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> References: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 10 Dec 2015 19:21:42 -0800 Andy Lutomirski wrote: > From: Andy Lutomirski > > Requiring special mappings to give a list of struct pages is > inflexible: it prevents sane use of IO memory in a special mapping, > it's inefficient (it requires arch code to initialize a list of > struct pages, and it requires the mm core to walk the entire list > just to figure out how long it is), and it prevents arch code from > doing anything fancy when a special mapping fault occurs. > > Add a .fault method as an alternative to filling in a .pages array. > > ... > > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) > } > #endif > > +struct vm_fault; > + > struct vm_special_mapping > { We may as well fix the code layout while we're in there. > - const char *name; > + const char *name; /* The name, e.g. "[vdso]". */ > + > + /* > + * If .fault is not provided, this is points to a s/is// > + * NULL-terminated array of pages that back the special mapping. > + * > + * This must not be NULL unless .fault is provided. > + */ > struct page **pages; > + > + /* > + * If non-NULL, then this is called to resolve page faults > + * on the special mapping. If used, .pages is not checked. > + */ > + int (*fault)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma, > + struct vm_fault *vmf); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index 2ce04a649f6b..f717453b1a57 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma, > pgoff_t pgoff; > struct page **pages; > > - if (vma->vm_ops == &legacy_special_mapping_vmops) > + if (vma->vm_ops == &legacy_special_mapping_vmops) { > pages = vma->vm_private_data; > - else > - pages = ((struct vm_special_mapping *)vma->vm_private_data)-> > - pages; > + } else { > + struct vm_special_mapping *sm = vma->vm_private_data; > + > + if (sm->fault) > + return sm->fault(sm, vma, vmf); > + > + pages = sm->pages; > + } > > for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) > pgoff--; Otherwise looks OK. I'll assume this will be merged via an x86 tree. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754010AbbLKWdD (ORCPT ); Fri, 11 Dec 2015 17:33:03 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:51675 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753277AbbLKWdB (ORCPT ); Fri, 11 Dec 2015 17:33:01 -0500 Date: Fri, 11 Dec 2015 14:33:00 -0800 From: Andrew Morton To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 2/6] mm: Add vm_insert_pfn_prot Message-Id: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski wrote: > The x86 vvar mapping contains pages with differing cacheability > flags. This is currently only supported using (io_)remap_pfn_range, > but those functions can't be used inside page faults. Foggy. What does "support" mean here? > Add vm_insert_pfn_prot to support varying cacheability within the > same non-COW VMA in a more sane manner. Here, "support" presumably means "insertion of pfns". Can we spell all this out more completely please? > x86 needs this to avoid a CRIU-breaking and memory-wasting explosion > of VMAs when supporting userspace access to the HPET. > OtherwiseAck. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754847AbbLKWpA (ORCPT ); Fri, 11 Dec 2015 17:45:00 -0500 Received: from mail-oi0-f43.google.com ([209.85.218.43]:33753 "EHLO mail-oi0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754230AbbLKWo6 (ORCPT ); Fri, 11 Dec 2015 17:44:58 -0500 MIME-Version: 1.0 In-Reply-To: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> References: <20151211143300.0ac516fbd219a67954698f9a@linux-foundation.org> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:44:38 -0800 Message-ID: Subject: Re: [PATCH 2/6] mm: Add vm_insert_pfn_prot To: Andrew Morton Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 11, 2015 at 2:33 PM, Andrew Morton wrote: > On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski wrote: > >> The x86 vvar mapping contains pages with differing cacheability >> flags. This is currently only supported using (io_)remap_pfn_range, >> but those functions can't be used inside page faults. > > Foggy. What does "support" mean here? We currently have a hack in which every x86 mm has a "vvar" vma that has a .fault handler that always fails (it's the vm_special_mapping fault handler backed by an empty pages array). To make everything work, at mm startup, the vdso code uses remap_pfn_range and io_remap_pfn_range to poke the pfns into the page tables. I'd much rather implement this using the new .fault mechanism, and the canonical way to implement .fault seems to be vm_insert_pfn, and vm_insert_pfn doesn't allow setting per-page cacheability. Unfortunately, one of the three x86 vvar pages needs to be uncacheable because it's a genuine IO page, so I can't use vm_insert_pfn. I suppose I could just call io_remap_pfn_range from .fault, but I think that's frowned upon. Admittedly, I wasn't really sure *why* that's frowned upon. This goes way back to 2007 (e0dc0d8f4a327d033bfb63d43f113d5f31d11b3c) when .fault got fancier. > >> Add vm_insert_pfn_prot to support varying cacheability within the >> same non-COW VMA in a more sane manner. > > Here, "support" presumably means "insertion of pfns". Can we spell all > this out more completely please? Yes, will fix. > >> x86 needs this to avoid a CRIU-breaking and memory-wasting explosion >> of VMAs when supporting userspace access to the HPET. >> > > OtherwiseAck. --Andy -- Andy Lutomirski AMA Capital Management, LLC From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932127AbbLNJRQ (ORCPT ); Mon, 14 Dec 2015 04:17:16 -0500 Received: from mail-wm0-f41.google.com ([74.125.82.41]:38242 "EHLO mail-wm0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932075AbbLNJRO (ORCPT ); Mon, 14 Dec 2015 04:17:14 -0500 Date: Mon, 14 Dec 2015 10:17:09 +0100 From: Ingo Molnar To: Andrew Morton Cc: Andy Lutomirski , x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andy Lutomirski , Thomas Gleixner , "H. Peter Anvin" Subject: Re: [PATCH 1/6] mm: Add a vm_special_mapping .fault method Message-ID: <20151214091709.GA29878@gmail.com> References: <4e911d2752d3b9e52d7496e46b389fc630cdc3a8.1449803537.git.luto@kernel.org> <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151211142814.25cc806e3f5180d525ee807e@linux-foundation.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Andrew Morton wrote: > > + } else { > > + struct vm_special_mapping *sm = vma->vm_private_data; > > + > > + if (sm->fault) > > + return sm->fault(sm, vma, vmf); > > + > > + pages = sm->pages; > > + } > > > > for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) > > pgoff--; > > Otherwise looks OK. I'll assume this will be merged via an x86 tree. Yeah, was hoping to be able to do that with your Acked-by. Thanks, Ingo