* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-17 18:33 UTC
To: linux-arm-kernel

So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.

Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.

Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.

On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.

So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.

The series contains this addition to uaccess.h:

    #ifndef user_access_begin
    #define user_access_begin() do { } while (0)
    #define user_access_end() do { } while (0)
    #define unsafe_get_user(x, ptr) __get_user(x, ptr)
    #define unsafe_put_user(x, ptr) __put_user(x, ptr)
    #endif

so architectures that don't care or haven't implemented it yet don't need to worry about it. Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).

Any comments?

               Linus
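As a rough illustration of how a core library routine would use the interface described above - this is a sketch only, not the actual patch code; the real strncpy_from_user() works a word at a time and its length clamping is omitted here, and unsafe_get_user() is assumed to return 0 on success exactly like __get_user():

    static long copy_string_batched(char *dst, const char __user *src, long count)
    {
            long res = 0;

            if (!access_ok(VERIFY_READ, src, count))
                    return -EFAULT;

            user_access_begin();            /* e.g. a single stac on x86 */
            while (res < count) {
                    char c;

                    if (unsafe_get_user(c, src + res))
                            goto efault;
                    dst[res++] = c;
                    if (!c)
                            break;
            }
            user_access_end();              /* e.g. a single clac on x86 */
            return res;

    efault:
            user_access_end();
            return -EFAULT;
    }

The point of the interface is visible in the loop: the permission toggle is paid once per call rather than once per byte or word accessed.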
* [PATCH 0/3] Batched user access support
From: Ingo Molnar @ 2015-12-18 9:44 UTC
To: linux-arm-kernel

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.
>
> Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.
>
> Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.
>
> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.

Just curious: by how much did the overhead shift after your patches?

> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.
>
> The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.
>
> The series contains this addition to uaccess.h:
>
>     #ifndef user_access_begin
>     #define user_access_begin() do { } while (0)
>     #define user_access_end() do { } while (0)
>     #define unsafe_get_user(x, ptr) __get_user(x, ptr)
>     #define unsafe_put_user(x, ptr) __put_user(x, ptr)
>     #endif
>
> so architectures that don't care or haven't implemented it yet don't need to worry about it. Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).
>
> Any comments?

I like this interface, especially the method of naming it 'unsafe'. This draws review focus onto these functions - and we had a number of security holes with the __get/put_user() 'fast' variants before, which are named in a too benign fashion.

Btw., still today, when I use the get_user()/put_user() APIs after a few weeks of not having written such code, I have to look up their argument order in the headers ... while I don't have to look up memcpy() arguments.
In hindsight, if we could go back 20 years, I'd organize our user memory access APIs in such a fashion:

    memcpy_begin()

    memcpy_user_kernel(userptr, kernelvar)
    memcpy_kernel_user(kernelvar, userptr)

    memcpy_user_kernel(userptr, kernelptr, size)
    memcpy_kernel_user(kernelptr, userptr, size)

    memcpy(kernelptr1, kernelptr2, size)

    memcpy_end()

and I'd use the regular memcpy ordering. GCC macro magic would make a distinction between our 2-argument auto-sizing optimized APIs:

    memcpy_user_kernel(userptr, kernelvar)

and the regular 3-argument sized memcpy:

    memcpy_user_kernel(userptr, kernelptr, size)

i.e. the generated code would still be exactly the same as today, but the argument order is streamlined, and it matches the naming of the API: 'user_kernel' fits 'userptr, kernelvar'.

It's also easy to read at a glance: every 'user_kernel' API conceptually maps to a regular C assignment:

    uservar = kernelvar;

In fact we might want to harmonize the APIs a bit more and make the 2-argument APIs all pointer based:

    memcpy_user_kernel(userptr, kernelptr)
    memcpy_kernel_user(kernelptr, userptr)

The pointers still have to be type compatible, so I think it's still just as robust.

There would be various other advantages to such a family of APIs as well:

 - while get/put matches the read/write IO logic, in reality it's very often mixed with memcpy and C variable assignment code, and the mixed and conflicting semantics are IMHO lethal to robustness.

 - with this it's all a single well-known pattern, throughout all major user memory access APIs.

 - subsequently it's also harder to mis-review what such code does and what security implications it has.

 - the naming space is easily extensible: does any code need memcpy_user_user() perhaps? Or memcpy_user_iomem()?

 - another dimension for intuitive extensions would be _nocache() variants for non-local copies. Variants of such APIs do exist, but the naming is not very well organized.

But there are of course the disadvantages:

 - I guess we don't want to change thousands of get_user()/put_user() API usages ... We could automate 99.9% of it, and we could make it all graceful (i.e. keep the old APIs as well), but it would still be a big change.

 - [ I might have missed some advantage of the get/put idiom that my suggestion regresses. ]

Changing this is something I'd love to volunteer for though, so I thought I'd throw in the offer ;-)

Thanks,

    Ingo
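To make the proposed ordering concrete, here is a purely hypothetical sketch - none of these memcpy_*() helpers exist in the kernel, and the struct and function names are invented for illustration. The wrappers simply forward to today's primitives, so the generated code would be unchanged; only the naming and the memcpy-style destination-first argument order differ:

    /* Hypothetical wrappers -- illustration only, not proposed code. */
    #define memcpy_kernel_user(kptr, uptr)  get_user(*(kptr), (uptr))
    #define memcpy_user_kernel(uptr, kptr)  put_user(*(kptr), (uptr))

    struct example_arg {
            u32 len;
            u32 result;
    };

    static int example_copy(struct example_arg __user *uarg)
    {
            u32 len, result = 0;

            /* today: get_user(len, &uarg->len); */
            if (memcpy_kernel_user(&len, &uarg->len))
                    return -EFAULT;

            /* ... compute result from len ... */

            /* today: put_user(result, &uarg->result); */
            if (memcpy_user_kernel(&uarg->result, &result))
                    return -EFAULT;

            return 0;
    }

With the all-pointer 2-argument form suggested at the end, both calls read like memcpy(dst, src): the destination always comes first, and the user/kernel direction is spelled out in the name.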
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 17:06 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 1:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.
>
> Just curious: by how much did the overhead shift after your patches?

So just in case you want to reproduce this, here's what I do:

 (a) configure a maximal kernel build:

        make allmodconfig

 (b) build the kernel fully (not timing this part):

        make -j16

 (c) profile the now-empty kernel re-build:

        perf record -e cycles:pp make -j

 (d) look at just the kernel part of the profile, by zooming into the kernel DSO when doing:

        perf report --sort=symbol

That empty kernel rebuild is one of my ways to check for somewhat real VFS path-lookup performance. It's a real load, and fairly relevant: most of my kernel builds are reasonably empty (ie when I pull from people, I always rebuild, but if it's a small pull, it's not necessarily rebuilding very much).

Anyway, on that profile, what you *should* normally see is that the path walking is the hottest part by far, because most of the cost there is "make" doing a lot of "stat()" calls to get the timestamps of all the source and object files. It has a long tail - "make" does other things too - but the top kernel entry should basically be "__d_lookup_rcu()", with "link_path_walk()" and "selinux_inode_permission()" being up there too.

Those are basically the three hottest path lookup functions (and yes, it's sad how high selinux_inode_permission() is, but it's largely because it's the first place where we start touching the actual inode data, so you can see in the instruction profiles how much of it is those first accesses).

With SMAP and the stac/clac on every access, in my profiles I see strncpy_from_user() being neck-and-neck with link_path_walk() and selinux_inode_permission(). And it definitely shouldn't be there. Don't get me wrong: it's a function that I expect to see in the profiles - copying the pathname from user space really is a noticeable part of pathname lookup - but it shouldn't be in the top three.

So for me, without that patch series, strncpy_from_user() was the third-hottest function (ok, I looked at my numbers again, and it wasn't 1.5%, it was 1.2% of all time). And looking at the instruction profile, the overhead of _just_ the stac/clac instructions (using "cycles:pp" in the profile) was 60% of that 1.2%.

Put another way: just those two instructions used to be 0.7% of the whole CPU cost of that empty "make -j".

Now, as I'm sure you have seen many many times, instruction-level profiling isn't "real performance". It shifts around, and with out-of-order CPUs, even with the nice precise PEBS profiling, the fact that 0.7% of all time is *accounted* to those instructions doesn't mean that you really spend that much on them. But it's still a pretty big sign.

Anyway, *with* the patches, strncpy_from_user() falls from third place and 1.2% of the cost down to ninth place and 0.52% of the total cost. And the stac/clac instructions go from 60% of that cost to 33%. So overall, those two stac/clac instructions went from 0.7% of the whole build cost down to 0.17%.
So the whole strncpy_from_user() function sped up by between a factor of two and three, and that's because the cost of just the stac/clac was cut by almost a factor of five.

All the numbers fluctuate, so take the above with a pinch of salt. But they are repeatable enough to be a reasonable ballpark, even if they do fluctuate a bit.

I think you have a Skylake machine too, so you can redo the above.

               Linus
* [PATCH 0/3] Batched user access support
From: Will Deacon @ 2015-12-18 11:13 UTC
To: linux-arm-kernel

Hi Linus,

Thanks for Cc'ing us ARM people.

On Thu, Dec 17, 2015 at 10:33:21AM -0800, Linus Torvalds wrote:
>
> So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.
>
> Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.
>
> Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.
>
> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.
>
> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

Changing the PAN state on arm64 is a "self-synchronising" operation (i.e. no explicit barriers are required to ensure that the updated PAN state is applied to subsequent memory accesses), so there will certainly be some overhead involved in that. Unfortunately, we don't currently have silicon on which we can benchmark the PAN feature (we developed the code using simulation), so it's difficult to provide concrete numbers.

Adding an isb instruction (forcing instruction synchronisation) to the PAN-swizzling locations appears to yield sub-1% overhead in some basic tests, but that's not to say we shouldn't avoid turning it on and off all the time for back-to-back userspace accesses.

> The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.
>
> The series contains this addition to uaccess.h:
>
>     #ifndef user_access_begin
>     #define user_access_begin() do { } while (0)
>     #define user_access_end() do { } while (0)
>     #define unsafe_get_user(x, ptr) __get_user(x, ptr)
>     #define unsafe_put_user(x, ptr) __put_user(x, ptr)
>     #endif
>
> so architectures that don't care or haven't implemented it yet don't need to worry about it.
> Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).
>
> Any comments?

From an implementation and performance point of view, this can certainly be used by arm64. My only concern is that we increase the region where PAN is disabled (that is, user accesses are permitted). Currently, that's carefully restricted to the single userspace access, but now it could easily include accesses to the kernel stack, perhaps even generated as a result of compiler spills.

I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.

We *could* hide this behind a kconfig option, like we do for things such as stack protector and kasan, where we have a "strong" mode that has worse performance but really isolates the PAN switch to a single access, while the default case could do what you suggest above. In both cases the same APIs could be used.

Expanding the set of kernel configurations is rarely popular, but this seems to be the usual performance vs security trade-off and we should provide some way to choose.

Will
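For illustration, a minimal sketch of what an arm64 PAN-based implementation of the hooks could look like - this is not from any posted patch; the helper names are invented, it assumes an assembler that accepts the ARMv8.1 PAN pstate name, and it ignores the alternatives patching the real kernel would need for CPUs without PAN:

    /* Hypothetical helpers -- illustrative only, not actual arm64 kernel code. */
    static inline void __pan_allow_user_access(void)
    {
            /* PAN = 0: privileged accesses to user mappings are permitted */
            asm volatile("msr pan, #0" ::: "memory");
    }

    static inline void __pan_deny_user_access(void)
    {
            /* PAN = 1: privileged accesses to user mappings fault again */
            asm volatile("msr pan, #1" ::: "memory");
    }

    #define user_access_begin()     __pan_allow_user_access()
    #define user_access_end()       __pan_deny_user_access()

    /*
     * unsafe_{get,put}_user() would then be raw ldr/str accessors with
     * exception-table entries but no per-access PAN toggle, so a
     * strncpy_from_user()-style loop pays the PAN switch (and its implicit
     * synchronisation cost) once per batch instead of once per access.
     */

Will's concern above is precisely about the window between those two calls: everything the compiler emits in between, including stack spills, runs with PAN clear.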
* [PATCH 0/3] Batched user access support
From: H. Peter Anvin @ 2015-12-18 18:33 UTC
To: linux-arm-kernel

On 12/18/15 03:13, Will Deacon wrote:
>
> From an implementation and performance point of view, this can certainly be used by arm64. My only concern is that we increase the region where PAN is disabled (that is, user accesses are permitted). Currently, that's carefully restricted to the single userspace access, but now it could easily include accesses to the kernel stack, perhaps even generated as a result of compiler spills.
>
> I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.

It is, but it is a tradeoff. It is way better than opening it up for the entire kernel.

In the end the only real way to avoid this is compiler support, which I *have* discussed for x86 with the gcc people. gcc could avoid the back-to-back on and off, and even batch accesses by moving them into registers.

    -hpa
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 18:43 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 10:33 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 12/18/15 03:13, Will Deacon wrote:
>>
>> I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.
>
> It is, but it is a tradeoff. It is way better than opening it up for the entire kernel.

So I wouldn't worry about the security part as long as we *only* do this in core library routines, and never really expose it as some kind of direct generic interface for random users.

It's one of the reasons I named those functions "unsafe". They really aren't any more unsafe than the double-underscore functions (which I really dislike outside of core kernel code too), but it's just a bad, bad idea to use these in random drivers etc.

In fact, my first version of the patch restricted the code to non-module builds only, but I ended up not posting that because the extra #ifndef MODULE ... #endif just made it a bit uglier. But I suspect we should do that in the long run. In fact, I would want to do that for the __{get,put}_user() functions too, because there really is no valid reason for drivers to use them.

So I would very strongly advocate that we only ever use these new functions for very core functions. Exactly things like "strncpy_from_user()" and friends.

I'd also possibly see it for things like "cp_new_stat()", which right now copies to a temporary structure on the kernel stack, and then uses a "copy_to_user()" to copy the temporary to user space. It's another very hot function on similar stat-heavy workloads, although in profiles it ends up not showing quite as hot because the cost is split between the "set up on the kernel stack" part and the actual user copy.

               Linus
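For illustration, a batched cp_new_stat()-style copy-out might look roughly like the sketch below. This is not the actual fs/stat.c code - the real function fills a temporary struct stat and copy_to_user()s it, and has per-field overflow checks that are omitted here; only a couple of representative fields are shown:

    static int cp_new_stat_batched(struct kstat *stat, struct stat __user *statbuf)
    {
            if (!access_ok(VERIFY_WRITE, statbuf, sizeof(*statbuf)))
                    return -EFAULT;

            user_access_begin();            /* one stac on x86 */
            if (unsafe_put_user(stat->ino, &statbuf->st_ino))
                    goto efault;
            if (unsafe_put_user(stat->size, &statbuf->st_size))
                    goto efault;
            /* ... remaining fields ... */
            user_access_end();              /* one clac on x86 */
            return 0;

    efault:
            user_access_end();
            return -EFAULT;
    }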
* [PATCH 0/3] Batched user access support
From: Russell King - ARM Linux @ 2015-12-18 19:56 UTC
To: linux-arm-kernel

On Thu, Dec 17, 2015 at 10:33:21AM -0800, Linus Torvalds wrote:
> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

I think it may be of use on 32-bit ARM, so we don't have to keep switching the domain register between allowing and denying userspace access across a batch of userspace operations.

The area where I'm specifically thinking it'd be useful is the ARM signal handling code, where we already have our special "__get_user_error", which tries to be as efficient as possible, though that may be a false optimisation. Since the SW PAN stuff, I've been debating about converting that to use copy_to_user()/copy_from_user() instead, but haven't yet found the time to do any analysis.

See restore_sigframe() and setup_sigframe() in arch/arm/kernel/signal.c.

I'm not sure I'll have any time to look into this before the new year though.

-- 
FTTC broadband for 0.8 mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
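As a rough illustration of what Russell is describing, a restore_sigframe()-style register restore could batch the user-access window with the new interface along these lines. This is a hypothetical sketch, not the actual arch/arm/kernel/signal.c code; only a few registers are shown and unsafe_get_user() is assumed to return 0 on success like __get_user():

    static int restore_regs_batched(struct pt_regs *regs,
                                    struct sigcontext __user *sc)
    {
            int err = 0;

            user_access_begin();    /* e.g. open the user domain / SW PAN once */
            err |= unsafe_get_user(regs->ARM_r0, &sc->arm_r0);
            err |= unsafe_get_user(regs->ARM_r1, &sc->arm_r1);
            err |= unsafe_get_user(regs->ARM_r2, &sc->arm_r2);
            /* ... r3..lr, pc, cpsr ... */
            err |= unsafe_get_user(regs->ARM_pc, &sc->arm_pc);
            user_access_end();

            return err;
    }

The domain register (or SW PAN state) is switched once around the whole block instead of once per register load.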
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 20:18 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 11:56 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
> The area where I'm specifically thinking it'd be useful is the ARM signal handling code, where we already have our special "__get_user_error", which tries to be as efficient as possible, though that may be a false optimisation. Since the SW PAN stuff, I've been debating about converting that to use copy_to_user()/copy_from_user() instead, but haven't yet found the time to do any analysis.

Yes, on x86 we have something very similar. There's a separate exception handling setup for exactly that code, which (with my patches) also now batches up the accesses and does the STAC/CLAC only once around the whole thing. It also checks the exception status only once.

It turns out that with SMAP, the exception status optimization is completely worthless. Checking the exception status is a couple of extra instructions, but they are basically "free" instructions. So that optimization was largely worthless, but avoiding the STAC/CLAC on every access is much more noticeable.

               Linus
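The "separate exception handling setup" referred to here is, as far as can be told, the x86 signal code's put_user_try/put_user_ex/put_user_catch helpers of that era, which defer the fault check to the end of the block. A rough illustrative fragment - the frame layout and function name are invented, not taken from arch/x86/kernel/signal.c:

    /* Illustrative only -- hypothetical frame layout, not the real sigframe. */
    struct example_frame {
            int sig;
            unsigned long handler;
    };

    static int copy_frame_fields(struct example_frame __user *frame,
                                 int sig, unsigned long handler)
    {
            int err = 0;

            put_user_try {
                    put_user_ex(sig, &frame->sig);
                    put_user_ex(handler, &frame->handler);
            } put_user_catch(err);          /* exception status checked once here */

            return err;
    }

Linus's point is that batching the err check like this buys almost nothing, whereas batching the STAC/CLAC around the same block is where the measurable win is.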