* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-17 18:33 UTC
To: linux-arm-kernel

So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.

Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.

Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.

On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.

So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.

The series contains this addition to uaccess.h:

    #ifndef user_access_begin
    #define user_access_begin() do { } while (0)
    #define user_access_end() do { } while (0)
    #define unsafe_get_user(x, ptr) __get_user(x, ptr)
    #define unsafe_put_user(x, ptr) __put_user(x, ptr)
    #endif

so architectures that don't care or haven't implemented it yet don't need to worry about it. Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).

Any comments?

               Linus
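As a rough illustration of how a core library routine would use the interface described above - this is a sketch only, not the actual patch code; the real strncpy_from_user() works a word at a time and its length clamping is omitted here, and unsafe_get_user() is assumed to return 0 on success exactly like __get_user():

    static long copy_string_batched(char *dst, const char __user *src, long count)
    {
            long res = 0;

            if (!access_ok(VERIFY_READ, src, count))
                    return -EFAULT;

            user_access_begin();            /* e.g. a single stac on x86 */
            while (res < count) {
                    char c;

                    if (unsafe_get_user(c, src + res))
                            goto efault;
                    dst[res++] = c;
                    if (!c)
                            break;
            }
            user_access_end();              /* e.g. a single clac on x86 */
            return res;

    efault:
            user_access_end();
            return -EFAULT;
    }

The point of the interface is visible in the loop: the permission toggle is paid once per call rather than once per byte or word accessed.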
* [PATCH 0/3] Batched user access support
From: Ingo Molnar @ 2015-12-18 9:44 UTC
To: linux-arm-kernel

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.
>
> Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.
>
> Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.
>
> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.

Just curious: by how much did the overhead shift after your patches?

> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.
>
> The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.
>
> The series contains this addition to uaccess.h:
>
>     #ifndef user_access_begin
>     #define user_access_begin() do { } while (0)
>     #define user_access_end() do { } while (0)
>     #define unsafe_get_user(x, ptr) __get_user(x, ptr)
>     #define unsafe_put_user(x, ptr) __put_user(x, ptr)
>     #endif
>
> so architectures that don't care or haven't implemented it yet don't need to worry about it. Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).
>
> Any comments?

I like this interface, especially the method of naming it 'unsafe'. This draws review focus onto these functions - and we had a number of security holes with the __get/put_user() 'fast' variants before, which are named in a too benign fashion.

Btw., still today, when I use the get_user()/put_user() APIs after a few weeks of not having written such code, I have to look up their argument order in the headers ... while I don't have to look up memcpy() arguments.
In hindsight, if we could go back 20 years, I'd organize our user memory access APIs in such a fashion:

    memcpy_begin()

    memcpy_user_kernel(userptr, kernelvar)
    memcpy_kernel_user(kernelvar, userptr)

    memcpy_user_kernel(userptr, kernelptr, size)
    memcpy_kernel_user(kernelptr, userptr, size)

    memcpy(kernelptr1, kernelptr2, size)

    memcpy_end()

and I'd use the regular memcpy ordering. GCC macro magic would make a distinction between our 2-argument auto-sizing optimized APIs:

    memcpy_user_kernel(userptr, kernelvar)

and the regular 3-argument sized memcpy:

    memcpy_user_kernel(userptr, kernelptr, size)

i.e. the generated code would still be exactly the same as today, but the argument order is streamlined, and it matches the naming of the API: 'user_kernel' fits 'userptr, kernelvar'.

It's also easy to read at a glance: every 'user_kernel' API conceptually maps to a regular C assignment:

    uservar = kernelvar;

In fact we might want to harmonize the APIs a bit more and make the 2-argument APIs all pointer based:

    memcpy_user_kernel(userptr, kernelptr)
    memcpy_kernel_user(kernelptr, userptr)

The pointers still have to be type compatible, so I think it's still just as robust.

There would be various other advantages to such a family of APIs as well:

 - while get/put matches the read/write IO logic, in reality it's very often mixed with memcpy and C variable assignment code, and the mixed and conflicting semantics are IMHO lethal to robustness.

 - with this it's all a single well-known pattern, throughout all major user memory access APIs.

 - subsequently it's also harder to mis-review what such code does and what security implications it has.

 - the naming space is easily extensible: does any code need memcpy_user_user() perhaps? Or memcpy_user_iomem()?

 - another dimension for intuitive extensions would be _nocache() variants for non-local copies. Variants of such APIs do exist, but the naming is not very well organized.

But there are of course the disadvantages:

 - I guess we don't want to change thousands of get_user()/put_user() API usages ... We could automate 99.9% of it, and we could make it all graceful (i.e. keep the old APIs as well), but it would still be a big change.

 - [ I might have missed some advantage of the get/put idiom that my suggestion regresses. ]

Changing this is something I'd love to volunteer for though, so I thought I'd throw in the offer ;-)

Thanks,

    Ingo
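To make the proposed ordering concrete, here is a purely hypothetical sketch - none of these memcpy_*() helpers exist in the kernel, and the struct and function names are invented for illustration. The wrappers simply forward to today's primitives, so the generated code would be unchanged; only the naming and the memcpy-style destination-first argument order differ:

    /* Hypothetical wrappers -- illustration only, not proposed code. */
    #define memcpy_kernel_user(kptr, uptr)  get_user(*(kptr), (uptr))
    #define memcpy_user_kernel(uptr, kptr)  put_user(*(kptr), (uptr))

    struct example_arg {
            u32 len;
            u32 result;
    };

    static int example_copy(struct example_arg __user *uarg)
    {
            u32 len, result = 0;

            /* today: get_user(len, &uarg->len); */
            if (memcpy_kernel_user(&len, &uarg->len))
                    return -EFAULT;

            /* ... compute result from len ... */

            /* today: put_user(result, &uarg->result); */
            if (memcpy_user_kernel(&uarg->result, &result))
                    return -EFAULT;

            return 0;
    }

With the all-pointer 2-argument form suggested at the end, both calls read like memcpy(dst, src): the destination always comes first, and the user/kernel direction is spelled out in the name.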
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 17:06 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 1:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.
>
> Just curious: by how much did the overhead shift after your patches?

So just in case you want to reproduce this, here's what I do:

 (a) configure a maximal kernel build:

        make allmodconfig

 (b) build the kernel fully (not timing this part):

        make -j16

 (c) profile the now-empty kernel re-build:

        perf record -e cycles:pp make -j

 (d) look at just the kernel part of the profile, by zooming into the kernel DSO when doing:

        perf report --sort=symbol

That empty kernel rebuild is one of my ways to check for somewhat real VFS path-lookup performance. It's a real load, and fairly relevant: most of my kernel builds are reasonably empty (ie when I pull from people, I always rebuild, but if it's a small pull, it's not necessarily rebuilding very much).

Anyway, on that profile, what you *should* normally see is that the path walking is the hottest part by far, because most of the cost there is "make" doing a lot of "stat()" calls to get the timestamps of all the source and object files. It has a long tail - "make" does other things too - but the top kernel entry should basically be "__d_lookup_rcu()", with "link_path_walk()" and "selinux_inode_permission()" being up there too.

Those are basically the three hottest path lookup functions (and yes, it's sad how high selinux_inode_permission() is, but it's largely because it's the first place where we start touching the actual inode data, so you can see in the instruction profiles how much of it is those first accesses).

With SMAP and the stac/clac on every access, in my profiles I see strncpy_from_user() being neck-and-neck with link_path_walk() and selinux_inode_permission(). And it definitely shouldn't be there. Don't get me wrong: it's a function that I expect to see in the profiles - copying the pathname from user space really is a noticeable part of pathname lookup - but it shouldn't be in the top three.

So for me, without that patch series, strncpy_from_user() was the third-hottest function (ok, I looked at my numbers again, and it wasn't 1.5%, it was 1.2% of all time). And looking at the instruction profile, the overhead of _just_ the stac/clac instructions (using "cycles:pp" in the profile) was 60% of that 1.2%.

Put another way: just those two instructions used to be 0.7% of the whole CPU cost of that empty "make -j".

Now, as I'm sure you have seen many many times, instruction-level profiling isn't "real performance". It shifts around, and with out-of-order CPUs, even with the nice precise PEBS profiling, the fact that 0.7% of all time is *accounted* to those instructions doesn't mean that you really spend that much on them. But it's still a pretty big sign.

Anyway, *with* the patches, strncpy_from_user() falls from third place and 1.2% of the cost down to ninth place and 0.52% of the total cost. And the stac/clac instructions go from 60% of that cost to 33%. So overall, those two stac/clac instructions went from 0.7% of the whole build cost down to 0.17%.
So the whole strncpy_from_user() function sped up by between a factor of two and three, and that's because the cost of just the stac/clac was cut by almost a factor of five.

All the numbers fluctuate, so take the above with a pinch of salt. But they are repeatable enough to be a reasonable ballpark, even if they do fluctuate a bit.

I think you have a Skylake machine too, so you can redo the above.

               Linus
* [PATCH 0/3] Batched user access support
From: Will Deacon @ 2015-12-18 11:13 UTC
To: linux-arm-kernel

Hi Linus,

Thanks for Cc'ing us ARM people.

On Thu, Dec 17, 2015 at 10:33:21AM -0800, Linus Torvalds wrote:
>
> So I already sent the end result of these three patches to the x86 people, but since I *think* it may be an arm64 issue too, I'm including the arm64 people for information.
>
> Background for the arm64 people: I upgraded my main desktop to Skylake, and did my usual build performance tests, including a perf run to check that everything looks fine. Yes, the machine is 20% faster than my old one, but the profile also shows that now that I have a CPU that supports SMAP, the overhead of that on the user string handling functions was horrendous.
>
> Normally, that probably isn't really noticeable, but on loads that do a ton of pathname handling (like a "make -j" on the fully built kernel, or doing "git diff" etc - both of which spend most of their time just doing 'lstat()' on all the files they care about), the user space string accesses really are pretty hot.
>
> On the 'make -j' test on a fully built kernel, strncpy_from_user() was about 1.5% of all CPU time. And almost two thirds of that was just the SMAP overhead.
>
> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

Changing the PAN state on arm64 is a "self-synchronising" operation (i.e. no explicit barriers are required to ensure that the updated PAN state is applied to subsequent memory accesses), so there will certainly be some overhead involved in that. Unfortunately, we don't currently have silicon on which we can benchmark the PAN feature (we developed the code using simulation), so it's difficult to provide concrete numbers.

Adding an isb instruction (forcing instruction synchronisation) to the PAN-swizzling locations appears to yield sub-1% overhead in some basic tests, but that's not to say we shouldn't avoid turning it on and off all the time for back-to-back userspace accesses.

> The new interface is very simple: new "unsafe_{get,put}_user()" functions that have exactly the same semantics as the old unsafe ones (that weren't called "unsafe", but have the two underscores). The only difference is that you have to use "user_access_{begin,end}()" around them, which allows the architecture to hoist the user access permission wrapper outside the loop and then batch the raw accesses.
>
> The series contains this addition to uaccess.h:
>
>     #ifndef user_access_begin
>     #define user_access_begin() do { } while (0)
>     #define user_access_end() do { } while (0)
>     #define unsafe_get_user(x, ptr) __get_user(x, ptr)
>     #define unsafe_put_user(x, ptr) __put_user(x, ptr)
>     #endif
>
> so architectures that don't care or haven't implemented it yet don't need to worry about it.
> Architectures that _do_ care just need to implement their own versions, and make sure that user_access_begin is a macro (it can obviously be an inline function wrapped in an additional self-defining macro).
>
> Any comments?

From an implementation and performance point of view, this can certainly be used by arm64. My only concern is that we increase the region where PAN is disabled (that is, user accesses are permitted). Currently, that's carefully restricted to the single userspace access, but now it could easily include accesses to the kernel stack, perhaps even generated as a result of compiler spills.

I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.

We *could* hide this behind a kconfig option, like we do for things such as stack protector and kasan, where we have a "strong" mode that has worse performance but really isolates the PAN switch to a single access, while the default case could do what you suggest above. In both cases the same APIs could be used.

Expanding the set of kernel configurations is rarely popular, but this seems to be the usual performance vs security trade-off and we should provide some way to choose.

Will
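For illustration, a minimal sketch of what an arm64 PAN-based implementation of the hooks could look like - this is not from any posted patch; the helper names are invented, it assumes an assembler that accepts the ARMv8.1 PAN pstate name, and it ignores the alternatives patching the real kernel would need for CPUs without PAN:

    /* Hypothetical helpers -- illustrative only, not actual arm64 kernel code. */
    static inline void __pan_allow_user_access(void)
    {
            /* PAN = 0: privileged accesses to user mappings are permitted */
            asm volatile("msr pan, #0" ::: "memory");
    }

    static inline void __pan_deny_user_access(void)
    {
            /* PAN = 1: privileged accesses to user mappings fault again */
            asm volatile("msr pan, #1" ::: "memory");
    }

    #define user_access_begin()     __pan_allow_user_access()
    #define user_access_end()       __pan_deny_user_access()

    /*
     * unsafe_{get,put}_user() would then be raw ldr/str accessors with
     * exception-table entries but no per-access PAN toggle, so a
     * strncpy_from_user()-style loop pays the PAN switch (and its implicit
     * synchronisation cost) once per batch instead of once per access.
     */

Will's concern above is precisely about the window between those two calls: everything the compiler emits in between, including stack spills, runs with PAN clear.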
* [PATCH 0/3] Batched user access support
From: H. Peter Anvin @ 2015-12-18 18:33 UTC
To: linux-arm-kernel

On 12/18/15 03:13, Will Deacon wrote:
>
> From an implementation and performance point of view, this can certainly be used by arm64. My only concern is that we increase the region where PAN is disabled (that is, user accesses are permitted). Currently, that's carefully restricted to the single userspace access, but now it could easily include accesses to the kernel stack, perhaps even generated as a result of compiler spills.
>
> I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.

It is, but it is a tradeoff. It is way better than opening it up for the entire kernel.

In the end the only real way to avoid this is compiler support, which I *have* discussed for x86 with the gcc people. gcc could avoid the back-to-back on and off, and even batch accesses by moving them into registers.

    -hpa
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 18:43 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 10:33 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 12/18/15 03:13, Will Deacon wrote:
>>
>> I'm pretty unimaginative when it comes to security exploits, but that does sound worse than the current implementation from a security perspective.
>
> It is, but it is a tradeoff. It is way better than opening it up for the entire kernel.

So I wouldn't worry about the security part as long as we *only* do this in core library routines, and never really expose it as some kind of direct generic interface for random users.

It's one of the reasons I named those functions "unsafe". They really aren't any more unsafe than the double-underscore functions (which I really dislike outside of core kernel code too), but it's just a bad, bad idea to use these in random drivers etc.

In fact, my first version of the patch restricted the code to non-module builds only, but I ended up not posting that because the extra #ifndef MODULE ... #endif just made it a bit uglier. But I suspect we should do that in the long run. In fact, I would want to do that for the __{get,put}_user() functions too, because there really is no valid reason for drivers to use them.

So I would very strongly advocate that we only ever use these new functions for very core functions. Exactly things like "strncpy_from_user()" and friends.

I'd also possibly see it for things like "cp_new_stat()", which right now copies to a temporary structure on the kernel stack, and then uses a "copy_to_user()" to copy the temporary to user space. It's another very hot function on similar stat-heavy workloads, although in profiles it ends up not showing quite as hot because the cost is split between the "set up on the kernel stack" part and the actual user copy.

               Linus
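For illustration, a batched cp_new_stat()-style copy-out might look roughly like the sketch below. This is not the actual fs/stat.c code - the real function fills a temporary struct stat and copy_to_user()s it, and has per-field overflow checks that are omitted here; only a couple of representative fields are shown:

    static int cp_new_stat_batched(struct kstat *stat, struct stat __user *statbuf)
    {
            if (!access_ok(VERIFY_WRITE, statbuf, sizeof(*statbuf)))
                    return -EFAULT;

            user_access_begin();            /* one stac on x86 */
            if (unsafe_put_user(stat->ino, &statbuf->st_ino))
                    goto efault;
            if (unsafe_put_user(stat->size, &statbuf->st_size))
                    goto efault;
            /* ... remaining fields ... */
            user_access_end();              /* one clac on x86 */
            return 0;

    efault:
            user_access_end();
            return -EFAULT;
    }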
* [PATCH 0/3] Batched user access support
From: Russell King - ARM Linux @ 2015-12-18 19:56 UTC
To: linux-arm-kernel

On Thu, Dec 17, 2015 at 10:33:21AM -0800, Linus Torvalds wrote:
> So this patch series introduces a model for batching that SMAP overhead on x86, and the reason the ARM people are involved is that the same _may_ be true of the PAN overhead. I don't know - for all I know, the pstate "set pan" instruction may be so cheap on ARM64 that it doesn't really matter.

I think it may be of use on 32-bit ARM, so we don't have to keep switching the domain register between allowing and denying userspace access across a batch of userspace operations.

The area where I'm specifically thinking it'd be useful is the ARM signal handling code, where we already have our special "__get_user_error", which tries to be as efficient as possible, though that may be a false optimisation. Since the SW PAN stuff, I've been debating about converting that to use copy_to_user()/copy_from_user() instead, but haven't yet found the time to do any analysis.

See restore_sigframe() and setup_sigframe() in arch/arm/kernel/signal.c.

I'm not sure I'll have any time to look into this before the new year though.

-- 
FTTC broadband for 0.8 mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
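As a rough illustration of what Russell is describing, a restore_sigframe()-style register restore could batch the user-access window with the new interface along these lines. This is a hypothetical sketch, not the actual arch/arm/kernel/signal.c code; only a few registers are shown and unsafe_get_user() is assumed to return 0 on success like __get_user():

    static int restore_regs_batched(struct pt_regs *regs,
                                    struct sigcontext __user *sc)
    {
            int err = 0;

            user_access_begin();    /* e.g. open the user domain / SW PAN once */
            err |= unsafe_get_user(regs->ARM_r0, &sc->arm_r0);
            err |= unsafe_get_user(regs->ARM_r1, &sc->arm_r1);
            err |= unsafe_get_user(regs->ARM_r2, &sc->arm_r2);
            /* ... r3..lr, pc, cpsr ... */
            err |= unsafe_get_user(regs->ARM_pc, &sc->arm_pc);
            user_access_end();

            return err;
    }

The domain register (or SW PAN state) is switched once around the whole block instead of once per register load.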
* [PATCH 0/3] Batched user access support
From: Linus Torvalds @ 2015-12-18 20:18 UTC
To: linux-arm-kernel

On Fri, Dec 18, 2015 at 11:56 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
> The area where I'm specifically thinking it'd be useful is the ARM signal handling code, where we already have our special "__get_user_error", which tries to be as efficient as possible, though that may be a false optimisation. Since the SW PAN stuff, I've been debating about converting that to use copy_to_user()/copy_from_user() instead, but haven't yet found the time to do any analysis.

Yes, on x86 we have something very similar. There's a separate exception handling setup for exactly that code, which (with my patches) also now batches up the accesses and does the STAC/CLAC only once around the whole thing. It also checks the exception status only once.

It turns out that with SMAP, the exception status optimization is completely worthless. Checking the exception status is a couple of extra instructions, but they are basically "free" instructions. So that optimization was largely worthless, but avoiding the STAC/CLAC on every access is much more noticeable.

               Linus
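The "separate exception handling setup" referred to here is, as far as can be told, the x86 signal code's put_user_try/put_user_ex/put_user_catch helpers of that era, which defer the fault check to the end of the block. A rough illustrative fragment - the frame layout and function name are invented, not taken from arch/x86/kernel/signal.c:

    /* Illustrative only -- hypothetical frame layout, not the real sigframe. */
    struct example_frame {
            int sig;
            unsigned long handler;
    };

    static int copy_frame_fields(struct example_frame __user *frame,
                                 int sig, unsigned long handler)
    {
            int err = 0;

            put_user_try {
                    put_user_ex(sig, &frame->sig);
                    put_user_ex(handler, &frame->handler);
            } put_user_catch(err);          /* exception status checked once here */

            return err;
    }

Linus's point is that batching the err check like this buys almost nothing, whereas batching the STAC/CLAC around the same block is where the measurable win is.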