From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks
From: Mika Penttilä <mika.penttila@nextfour.com>
To: Andy Lutomirski, linux-kernel@vger.kernel.org, Borislav Petkov
CC: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening@lists.openwall.com,
 Linus Torvalds, Josh Poimboeuf
Date: Thu, 16 Jun 2016 07:17:41 +0300
Message-ID: <57622865.2070701@nextfour.com>
In-Reply-To: <3f0299bde58d0161c1dad75e0b7f93f074a6cd12.1466036668.git.luto@kernel.org>
References: <3f0299bde58d0161c1dad75e0b7f93f074a6cd12.1466036668.git.luto@kernel.org>

Hi,

On 06/16/2016 03:28 AM, Andy Lutomirski wrote:
> This allows x86_64 kernels to enable vmapped stacks. There are a
> couple of interesting bits.
>
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area. This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
>
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
>
> I didn't enable it on x86_32. We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
>
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack. Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows. The next patch will
> improve that case.
>
> Signed-off-by: Andy Lutomirski
> ---
>  arch/x86/Kconfig                 |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
>  arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/mm/tlb.c                | 15 +++++++++++++++
>  4 files changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0a7b885964ba..b624b24d1dc1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  	select HAVE_EBPF_JIT if X86_64
> +	select HAVE_ARCH_VMAP_STACK if X86_64
>  	select HAVE_CC_STACKPROTECTOR
>  	select HAVE_CMPXCHG_DOUBLE
>  	select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>  		      struct tss_struct *tss);
>
> +/* This runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +				     struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> +	/*
> +	 * If we switch to a stack that has a top-level paging entry
> +	 * that is not present in the current mm, the resulting #PF
> +	 * will be promoted to a double-fault and we'll panic. Probe
> +	 * the new stack now so that vmalloc_fault can fix up the page
> +	 * tables if needed. This can only happen if we use a stack
> +	 * in vmap space.
> +	 *
> +	 * We assume that the stack is aligned so that it never spans
> +	 * more than one top-level paging entry.
> +	 *
> +	 * To minimize cache pollution, just follow the stack pointer.
> +	 */
> +	READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do { \
>  	 */ \
>  	unsigned long ebx, ecx, edx, esi, edi; \
>  	\
> +	prepare_switch_to(prev, next); \
> +	\
>  	asm volatile("pushl %%ebp\n\t" /* save EBP */ \
>  		     "movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
>  		     "movl %[next_sp],%%esp\n\t" /* restore ESP */ \
> @@ -103,7 +127,9 @@ do { \
>   * clean in kernel mode, with the possible exception of IOPL. Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last) \
> +	prepare_switch_to(prev, next); \
> +	\
>  	asm volatile(SAVE_CONTEXT \
>  		     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
>  		     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present)
>  DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment)
>  DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check)
>
> +#ifdef CONFIG_VMAP_STACK
> +static void __noreturn handle_stack_overflow(const char *message,
> +					     struct pt_regs *regs,
> +					     unsigned long fault_address)
> +{
> +	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
> +	       (void *)fault_address, current->stack,
> +	       (char *)current->stack + THREAD_SIZE - 1);
> +	die(message, regs, 0);
> +
> +	/* Be absolutely certain we don't return. */
> +	panic(message);
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_64
>  /* Runs on IST stack */
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  {
>  	static const char str[] = "double fault";
>  	struct task_struct *tsk = current;
> +#ifdef CONFIG_VMAP_STACK
> +	unsigned long cr2;
> +#endif
>
>  #ifdef CONFIG_X86_ESPFIX64
>  	extern unsigned char native_irq_return_iret[];
> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  	tsk->thread.error_code = error_code;
>  	tsk->thread.trap_nr = X86_TRAP_DF;
>
> +#ifdef CONFIG_VMAP_STACK
> +	/*
> +	 * If we overflow the stack into a guard page, the CPU will fail
> +	 * to deliver #PF and will send #DF instead. CR2 will contain
> +	 * the linear address of the second fault, which will be in the
> +	 * guard page below the bottom of the stack.
> +	 */
> +	cr2 = read_cr2();
> +	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
> +		handle_stack_overflow(
> +			"kernel stack overflow (double-fault)",
> +			regs, cr2);
> +#endif
> +
>  #ifdef CONFIG_DOUBLEFAULT
>  	df_debug(regs, error_code);
>  #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5643fd0b1a7d..fbf036ae72ac 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	unsigned cpu = smp_processor_id();
>
>  	if (likely(prev != next)) {
> +		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> +			/*
> +			 * If our current stack is in vmalloc space and isn't
> +			 * mapped in the new pgd, we'll double-fault. Forcibly
> +			 * map it.
> +			 */
> +			unsigned int stack_pgd_index =
> +				pgd_index(current_stack_pointer());

At this point the stack pointer is still the previous task's; current_stack_pointer()
returns that, not the next task's stack, which I guess was the intention. Things may
happen to work when both stacks fall under the same pgd entry, but at least the boot
CPU's init_task is special.
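
Untested, but roughly what I had in mind (just a sketch; I'm assuming the third
parameter of switch_mm_irqs_off(), tsk, not visible in the quoted hunk, is the task
being switched to when this is called from the scheduler, and that it may be NULL for
other callers, so fall back to the current stack in that case):

	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
		/*
		 * Sketch: pick the pgd slot from the *incoming* task's
		 * saved stack pointer, not from the stack we are still
		 * running on. Fall back to the current stack if no task
		 * was passed in.
		 */
		unsigned long stack = tsk ? tsk->thread.sp :
					    current_stack_pointer();
		unsigned int stack_pgd_index = pgd_index(stack);
		pgd_t *pgd = next->pgd + stack_pgd_index;

		if (unlikely(pgd_none(*pgd)))
			set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
	}

That way we would prefill the pgd entry covering the stack we are about to run on,
rather than the one we are on now.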

> +			pgd_t *pgd = next->pgd + stack_pgd_index;
> +
> +			if (unlikely(pgd_none(*pgd)))
> +				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> +		}
> +
>  #ifdef CONFIG_SMP
>  	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  	this_cpu_write(cpu_tlbstate.active_mm, next);
>  #endif
> +
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
>
>  	/*
>

--Mika