* [kernel-hardening] Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup [not found] <cover.1468270393.git.luto@kernel.org> @ 2016-07-13 8:54 ` Christian Borntraeger 2016-07-13 18:36 ` Andy Lutomirski 0 siblings, 1 reply; 3+ messages in thread From: Christian Borntraeger @ 2016-07-13 8:54 UTC (permalink / raw) To: Andy Lutomirski, x86, linux-kernel Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening@lists.openwall.com, Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens, linux-s390 On 07/11/2016 10:53 PM, Andy Lutomirski wrote: > Hi all- > > Since the dawn of time, a kernel stack overflow has been a real PITA > to debug, has caused nondeterministic crashes some time after the > actual overflow, and has generally been easy to exploit for root. > > With this series, arches can enable HAVE_ARCH_VMAP_STACK. Arches > that enable it (just x86 for now) get virtually mapped stacks with > guard pages. This causes reliable faults when the stack overflows. > > If the arch implements it well, we get a nice OOPS on stack overflow > (as opposed to panicing directly or otherwise exploding badly). On > x86, the OOPS is nice, has a usable call trace, and the overflowing > task is killed cleanly. > > This series (starting with v4) also extensively cleans up > thread_info. thread_info has been partially redundant with > thread_struct for a long time -- both are places for arch code to > add additional per-task variables. thread_struct is much cleaner: > it's always in task_struct, and there's nothing particularly magical > about it. So this series contains a bunch of cleanups on x86 to > move almost everything from thread_info to thread_struct (which, > even by itself, deletes more code than it adds) and to remove x86's > dependence on thread_info's position on the stack. Then it opts x86 > into a new config option THREAD_INFO_IN_TASK to get rid of > arch-specific thread_info entirely and simply embed a defanged > thread_info (containing only flags) and 'int cpu' into task_struct. > > Once thread_info stops being magical, there's another benefit: we > can free the thread stack as soon as the task is dead (without > waiting for RCU) and then, if vmapped stacks are in use, cache the > entire stack for reuse on the same cpu. > > This seems to be an overall speedup of about 0.5-1 µs per > pthread_create/join in a simple test -- a percpu cache of vmalloced > stacks appears to be a bit faster than a high-order stack > allocation, at least when the cache hits. (I expect that workloads > with a low cache hit rate are likely to be dominated by other > effects anyway.) > > This does not address interrupt stacks. > > It's worth noting that s390 has an arch-specific gcc feature that > detects stack overflows by adjusting function prologues. Arches > with features like that may wish to avoid using vmapped stacks to > minimize the performance hit. Yes, might not need this for stack overflow detection. What might be interesting is the thread_info/thread_struct change, if we can strip down thread_info.(CONFIG_THREAD_INFO_IN_TASK). Would it actually make sense to separate these two changes to see what performance impact CONFIG_THREAD_INFO_IN_TASK has on its own? ^ permalink raw reply [flat|nested] 3+ messages in thread
* [kernel-hardening] Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup 2016-07-13 8:54 ` [kernel-hardening] Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup Christian Borntraeger @ 2016-07-13 18:36 ` Andy Lutomirski 2016-07-13 18:53 ` Christian Borntraeger 0 siblings, 1 reply; 3+ messages in thread From: Andy Lutomirski @ 2016-07-13 18:36 UTC (permalink / raw) To: Christian Borntraeger Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening@lists.openwall.com, Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens, linux-s390 On Wed, Jul 13, 2016 at 1:54 AM, Christian Borntraeger <borntraeger@de.ibm.com> wrote: > On 07/11/2016 10:53 PM, Andy Lutomirski wrote: >> Hi all- >> >> Since the dawn of time, a kernel stack overflow has been a real PITA >> to debug, has caused nondeterministic crashes some time after the >> actual overflow, and has generally been easy to exploit for root. >> >> With this series, arches can enable HAVE_ARCH_VMAP_STACK. Arches >> that enable it (just x86 for now) get virtually mapped stacks with >> guard pages. This causes reliable faults when the stack overflows. >> >> If the arch implements it well, we get a nice OOPS on stack overflow >> (as opposed to panicing directly or otherwise exploding badly). On >> x86, the OOPS is nice, has a usable call trace, and the overflowing >> task is killed cleanly. >> >> This series (starting with v4) also extensively cleans up >> thread_info. thread_info has been partially redundant with >> thread_struct for a long time -- both are places for arch code to >> add additional per-task variables. thread_struct is much cleaner: >> it's always in task_struct, and there's nothing particularly magical >> about it. So this series contains a bunch of cleanups on x86 to >> move almost everything from thread_info to thread_struct (which, >> even by itself, deletes more code than it adds) and to remove x86's >> dependence on thread_info's position on the stack. Then it opts x86 >> into a new config option THREAD_INFO_IN_TASK to get rid of >> arch-specific thread_info entirely and simply embed a defanged >> thread_info (containing only flags) and 'int cpu' into task_struct. >> >> Once thread_info stops being magical, there's another benefit: we >> can free the thread stack as soon as the task is dead (without >> waiting for RCU) and then, if vmapped stacks are in use, cache the >> entire stack for reuse on the same cpu. >> >> This seems to be an overall speedup of about 0.5-1 µs per >> pthread_create/join in a simple test -- a percpu cache of vmalloced >> stacks appears to be a bit faster than a high-order stack >> allocation, at least when the cache hits. (I expect that workloads >> with a low cache hit rate are likely to be dominated by other >> effects anyway.) >> >> This does not address interrupt stacks. >> >> It's worth noting that s390 has an arch-specific gcc feature that >> detects stack overflows by adjusting function prologues. Arches >> with features like that may wish to avoid using vmapped stacks to >> minimize the performance hit. > > Yes, might not need this for stack overflow detection. What might > be interesting is the thread_info/thread_struct change, if we can > strip down thread_info.(CONFIG_THREAD_INFO_IN_TASK). Would it actually > make sense to separate these two changes to see what performance > impact CONFIG_THREAD_INFO_IN_TASK has on its own? > They're already separated. CONFIG_THREAD_INFO_IN_TASK should have basically no performance impact unless there are arch-dependent (percpu?) issues involved. It does enable immediate thread stack deallocation, though, and it would be straightforward to make CONFIG_THREAD_INFO_IN_TASK cache stacks even if CONFIG_VMAP_STACK=n. That should be a moderate clone() speedup. -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 3+ messages in thread
* [kernel-hardening] Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup 2016-07-13 18:36 ` Andy Lutomirski @ 2016-07-13 18:53 ` Christian Borntraeger 0 siblings, 0 replies; 3+ messages in thread From: Christian Borntraeger @ 2016-07-13 18:53 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening@lists.openwall.com, Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens, linux-s390 On 07/13/2016 08:36 PM, Andy Lutomirski wrote: > On Wed, Jul 13, 2016 at 1:54 AM, Christian Borntraeger > <borntraeger@de.ibm.com> wrote: >> On 07/11/2016 10:53 PM, Andy Lutomirski wrote: >>> Hi all- >>> >>> Since the dawn of time, a kernel stack overflow has been a real PITA >>> to debug, has caused nondeterministic crashes some time after the >>> actual overflow, and has generally been easy to exploit for root. >>> >>> With this series, arches can enable HAVE_ARCH_VMAP_STACK. Arches >>> that enable it (just x86 for now) get virtually mapped stacks with >>> guard pages. This causes reliable faults when the stack overflows. >>> >>> If the arch implements it well, we get a nice OOPS on stack overflow >>> (as opposed to panicing directly or otherwise exploding badly). On >>> x86, the OOPS is nice, has a usable call trace, and the overflowing >>> task is killed cleanly. >>> >>> This series (starting with v4) also extensively cleans up >>> thread_info. thread_info has been partially redundant with >>> thread_struct for a long time -- both are places for arch code to >>> add additional per-task variables. thread_struct is much cleaner: >>> it's always in task_struct, and there's nothing particularly magical >>> about it. So this series contains a bunch of cleanups on x86 to >>> move almost everything from thread_info to thread_struct (which, >>> even by itself, deletes more code than it adds) and to remove x86's >>> dependence on thread_info's position on the stack. Then it opts x86 >>> into a new config option THREAD_INFO_IN_TASK to get rid of >>> arch-specific thread_info entirely and simply embed a defanged >>> thread_info (containing only flags) and 'int cpu' into task_struct. >>> >>> Once thread_info stops being magical, there's another benefit: we >>> can free the thread stack as soon as the task is dead (without >>> waiting for RCU) and then, if vmapped stacks are in use, cache the >>> entire stack for reuse on the same cpu. >>> >>> This seems to be an overall speedup of about 0.5-1 µs per >>> pthread_create/join in a simple test -- a percpu cache of vmalloced >>> stacks appears to be a bit faster than a high-order stack >>> allocation, at least when the cache hits. (I expect that workloads >>> with a low cache hit rate are likely to be dominated by other >>> effects anyway.) >>> >>> This does not address interrupt stacks. >>> >>> It's worth noting that s390 has an arch-specific gcc feature that >>> detects stack overflows by adjusting function prologues. Arches >>> with features like that may wish to avoid using vmapped stacks to >>> minimize the performance hit. >> >> Yes, might not need this for stack overflow detection. What might >> be interesting is the thread_info/thread_struct change, if we can >> strip down thread_info.(CONFIG_THREAD_INFO_IN_TASK). Would it actually >> make sense to separate these two changes to see what performance >> impact CONFIG_THREAD_INFO_IN_TASK has on its own? >> > > They're already separated. > > CONFIG_THREAD_INFO_IN_TASK should have basically no performance impact > unless there are arch-dependent (percpu?) issues involved. It does > enable immediate thread stack deallocation, though, and it would be > straightforward to make CONFIG_THREAD_INFO_IN_TASK cache stacks even > if CONFIG_VMAP_STACK=n. That should be a moderate clone() speedup. Yes. My point was more of having two patch series in case the discussion goes on regarding virtual stack. ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2016-07-13 18:53 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <cover.1468270393.git.luto@kernel.org>
2016-07-13 8:54 ` [kernel-hardening] Re: [PATCH v5 00/32] virtually mapped stacks and thread_info cleanup Christian Borntraeger
2016-07-13 18:36 ` Andy Lutomirski
2016-07-13 18:53 ` Christian Borntraeger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).