From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christian Borntraeger Subject: Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup Date: Wed, 13 Jul 2016 20:53:42 +0200 Message-ID: <57868E36.7050605@de.ibm.com> References: <578601B3.3050903@de.ibm.com> Reply-To: kernel-hardening@lists.openwall.com Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Return-path: List-Post: List-Help: List-Unsubscribe: List-Subscribe: In-Reply-To: To: Andy Lutomirski Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , linux-arch , Borislav Petkov , Nadav Amit , Kees Cook , Brian Gerst , "kernel-hardening@lists.openwall.com" , Linus Torvalds , Josh Poimboeuf , Jann Horn , Heiko Carstens , linux-s390 List-Id: linux-arch.vger.kernel.org On 07/13/2016 08:36 PM, Andy Lutomirski wrote: > On Wed, Jul 13, 2016 at 1:54 AM, Christian Borntraeger > wrote: >> On 07/11/2016 10:53 PM, Andy Lutomirski wrote: >>> Hi all- >>> >>> Since the dawn of time, a kernel stack overflow has been a real PITA >>> to debug, has caused nondeterministic crashes some time after the >>> actual overflow, and has generally been easy to exploit for root. >>> >>> With this series, arches can enable HAVE_ARCH_VMAP_STACK. Arches >>> that enable it (just x86 for now) get virtually mapped stacks with >>> guard pages. This causes reliable faults when the stack overflows. >>> >>> If the arch implements it well, we get a nice OOPS on stack overflow >>> (as opposed to panicing directly or otherwise exploding badly). On >>> x86, the OOPS is nice, has a usable call trace, and the overflowing >>> task is killed cleanly. >>> >>> This series (starting with v4) also extensively cleans up >>> thread_info. thread_info has been partially redundant with >>> thread_struct for a long time -- both are places for arch code to >>> add additional per-task variables. thread_struct is much cleaner: >>> it's always in task_struct, and there's nothing particularly magical >>> about it. So this series contains a bunch of cleanups on x86 to >>> move almost everything from thread_info to thread_struct (which, >>> even by itself, deletes more code than it adds) and to remove x86's >>> dependence on thread_info's position on the stack. Then it opts x86 >>> into a new config option THREAD_INFO_IN_TASK to get rid of >>> arch-specific thread_info entirely and simply embed a defanged >>> thread_info (containing only flags) and 'int cpu' into task_struct. >>> >>> Once thread_info stops being magical, there's another benefit: we >>> can free the thread stack as soon as the task is dead (without >>> waiting for RCU) and then, if vmapped stacks are in use, cache the >>> entire stack for reuse on the same cpu. >>> >>> This seems to be an overall speedup of about 0.5-1 µs per >>> pthread_create/join in a simple test -- a percpu cache of vmalloced >>> stacks appears to be a bit faster than a high-order stack >>> allocation, at least when the cache hits. (I expect that workloads >>> with a low cache hit rate are likely to be dominated by other >>> effects anyway.) >>> >>> This does not address interrupt stacks. >>> >>> It's worth noting that s390 has an arch-specific gcc feature that >>> detects stack overflows by adjusting function prologues. Arches >>> with features like that may wish to avoid using vmapped stacks to >>> minimize the performance hit. >> >> Yes, might not need this for stack overflow detection. What might >> be interesting is the thread_info/thread_struct change, if we can >> strip down thread_info.(CONFIG_THREAD_INFO_IN_TASK). Would it actually >> make sense to separate these two changes to see what performance >> impact CONFIG_THREAD_INFO_IN_TASK has on its own? >> > > They're already separated. > > CONFIG_THREAD_INFO_IN_TASK should have basically no performance impact > unless there are arch-dependent (percpu?) issues involved. It does > enable immediate thread stack deallocation, though, and it would be > straightforward to make CONFIG_THREAD_INFO_IN_TASK cache stacks even > if CONFIG_VMAP_STACK=n. That should be a moderate clone() speedup. Yes. My point was more of having two patch series in case the discussion goes on regarding virtual stack. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:32567 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751966AbcGMSyH (ORCPT ); Wed, 13 Jul 2016 14:54:07 -0400 Received: from pps.filterd (m0098414.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.11/8.16.0.11) with SMTP id u6DIrrDU008399 for ; Wed, 13 Jul 2016 14:53:56 -0400 Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109]) by mx0b-001b2d01.pphosted.com with ESMTP id 2455fda97v-1 (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT) for ; Wed, 13 Jul 2016 14:53:56 -0400 Received: from localhost by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 13 Jul 2016 19:53:54 +0100 Subject: Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup References: <578601B3.3050903@de.ibm.com> From: Christian Borntraeger Date: Wed, 13 Jul 2016 20:53:42 +0200 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Message-ID: <57868E36.7050605@de.ibm.com> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Andy Lutomirski Cc: Andy Lutomirski , X86 ML , "linux-kernel@vger.kernel.org" , linux-arch , Borislav Petkov , Nadav Amit , Kees Cook , Brian Gerst , "kernel-hardening@lists.openwall.com" , Linus Torvalds , Josh Poimboeuf , Jann Horn , Heiko Carstens , linux-s390 Message-ID: <20160713185342.0QlLKtAs1zgr87H900wzdCkBGVuDpbcFgJDbWroJaxI@z> On 07/13/2016 08:36 PM, Andy Lutomirski wrote: > On Wed, Jul 13, 2016 at 1:54 AM, Christian Borntraeger > wrote: >> On 07/11/2016 10:53 PM, Andy Lutomirski wrote: >>> Hi all- >>> >>> Since the dawn of time, a kernel stack overflow has been a real PITA >>> to debug, has caused nondeterministic crashes some time after the >>> actual overflow, and has generally been easy to exploit for root. >>> >>> With this series, arches can enable HAVE_ARCH_VMAP_STACK. Arches >>> that enable it (just x86 for now) get virtually mapped stacks with >>> guard pages. This causes reliable faults when the stack overflows. >>> >>> If the arch implements it well, we get a nice OOPS on stack overflow >>> (as opposed to panicing directly or otherwise exploding badly). On >>> x86, the OOPS is nice, has a usable call trace, and the overflowing >>> task is killed cleanly. >>> >>> This series (starting with v4) also extensively cleans up >>> thread_info. thread_info has been partially redundant with >>> thread_struct for a long time -- both are places for arch code to >>> add additional per-task variables. thread_struct is much cleaner: >>> it's always in task_struct, and there's nothing particularly magical >>> about it. So this series contains a bunch of cleanups on x86 to >>> move almost everything from thread_info to thread_struct (which, >>> even by itself, deletes more code than it adds) and to remove x86's >>> dependence on thread_info's position on the stack. Then it opts x86 >>> into a new config option THREAD_INFO_IN_TASK to get rid of >>> arch-specific thread_info entirely and simply embed a defanged >>> thread_info (containing only flags) and 'int cpu' into task_struct. >>> >>> Once thread_info stops being magical, there's another benefit: we >>> can free the thread stack as soon as the task is dead (without >>> waiting for RCU) and then, if vmapped stacks are in use, cache the >>> entire stack for reuse on the same cpu. >>> >>> This seems to be an overall speedup of about 0.5-1 µs per >>> pthread_create/join in a simple test -- a percpu cache of vmalloced >>> stacks appears to be a bit faster than a high-order stack >>> allocation, at least when the cache hits. (I expect that workloads >>> with a low cache hit rate are likely to be dominated by other >>> effects anyway.) >>> >>> This does not address interrupt stacks. >>> >>> It's worth noting that s390 has an arch-specific gcc feature that >>> detects stack overflows by adjusting function prologues. Arches >>> with features like that may wish to avoid using vmapped stacks to >>> minimize the performance hit. >> >> Yes, might not need this for stack overflow detection. What might >> be interesting is the thread_info/thread_struct change, if we can >> strip down thread_info.(CONFIG_THREAD_INFO_IN_TASK). Would it actually >> make sense to separate these two changes to see what performance >> impact CONFIG_THREAD_INFO_IN_TASK has on its own? >> > > They're already separated. > > CONFIG_THREAD_INFO_IN_TASK should have basically no performance impact > unless there are arch-dependent (percpu?) issues involved. It does > enable immediate thread stack deallocation, though, and it would be > straightforward to make CONFIG_THREAD_INFO_IN_TASK cache stacks even > if CONFIG_VMAP_STACK=n. That should be a moderate clone() speedup. Yes. My point was more of having two patch series in case the discussion goes on regarding virtual stack.