linux-arch.vger.kernel.org archive mirror
From: Vineet Gupta <Vineet.Gupta1@synopsys.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Richard Henderson <rth@twiddle.net>,
	Russell King <linux@armlinux.org.uk>,
	Will Deacon <will.deacon@arm.com>,
	Haavard Skinnemoen <hskinnemoen@gmail.com>,
	Steven Miao <realmz6@gmail.com>,
	Jesper Nilsson <jesper.nilsson@axis.com>,
	Mark Salter <msalter@redhat.com>,
	Yoshinori Sato <ysato@users.sourceforge.jp>,
	Richard Kuo <rkuo@codeaurora.org>,
	Tony Luck <tony.luck@intel.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	James Hogan <james.hogan@imgtec.com>,
	Michal Simek <monstr@monstr.eu>,
	David Howells <dhowells@redhat.com>,
	Ley Foon Tan <lftan@altera.com>, Jonas Bonn <Jonas.Nilsson@syn>
Subject: Re: [RFC][CFT][PATCHSET v1] uaccess unification
Date: Thu, 30 Mar 2017 13:40:31 -0700
Message-ID: <efb7aaa4-7d25-0c68-ebf8-cdd7eb1297dc@synopsys.com>
In-Reply-To: <CA+55aFyGwYwdk8i7-GbXV7NLTn38e-bow3VD-hHcQmTr9ebAjw@mail.gmail.com>

On 03/29/2017 05:27 PM, Linus Torvalds wrote:
> On Wed, Mar 29, 2017 at 5:02 PM, Vineet Gupta
> <Vineet.Gupta1@synopsys.com> wrote:
>>
>> I guess I can in next day or two - but mind you the inline version for ARC is kind
>> of special vs. other arches. We have this "manual" constant propagation to elide
>> the unrolled LD/ST for 1-15 byte stragglers, when @sz is constant.
> 
> I don't think that's special. We do that on x86 too, and I suspect ARC
> copied it from there (or from somebody else who did it).

No, I (re)wrote that code and, AFAIKR, didn't copy it from anyone; AFAICS it is
certainly different from the others, if not special. If you look closely at
arc:access.h, it is not the trivial check for the 1-2-4 conversion as in the
commit you referred to. It actually tries to eliminate hunks of the inline
assembly at compile time for constant @sz (so it is designed purely for the
inlined variants; whether that matters or not is a different story). The thing
is, from the hardware POV, 4 LD/ST in flight is good (at least for ARC700
cores), so we wrap them up in a zero delay loop. That takes care of multiples
of 16 bytes; the last 15 bytes are the killer, requiring a bunch of
conditionals, which is what I try to eliminate.
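
A minimal user-space sketch of that idea, not the actual ARC code: the names
sketch_copy() and copy16() are hypothetical, and memcpy() stands in for the
LD/ST inline assembly. The point it illustrates is that when @sz is a
compile-time constant, each tail test folds away and only the needed 8/4/2/1
byte hunks are emitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one iteration of the 4x LD/ST loop (16 bytes). */
static inline void copy16(unsigned char *d, const unsigned char *s)
{
	memcpy(d, s, 16);
}

/*
 * sketch_copy() is an illustrative name, not a kernel API.
 * Forced inline so that a constant sz at the call site is visible here.
 */
static inline __attribute__((always_inline))
void sketch_copy(void *dst, const void *src, size_t sz)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t chunks = sz / 16;	/* handled by the main loop */
	size_t tail = sz % 16;		/* the 1-15 byte stragglers */

	while (chunks--) {
		copy16(d, s);
		d += 16;
		s += 16;
	}

	if (__builtin_constant_p(sz)) {
		/*
		 * Constant size: every test below is decided at compile
		 * time, so unused hunks vanish from the generated code.
		 */
		if (tail & 8) { memcpy(d, s, 8); d += 8; s += 8; }
		if (tail & 4) { memcpy(d, s, 4); d += 4; s += 4; }
		if (tail & 2) { memcpy(d, s, 2); d += 2; s += 2; }
		if (tail & 1) { *d = *s; }
	} else {
		/* Variable size: generic byte-wise tail with runtime checks. */
		while (tail--)
			*d++ = *s++;
	}
}
```

With a call such as sketch_copy(dst, src, 23), an optimizing GCC or Clang
build can reduce the tail to the 4, 2 and 1 byte hunks with no conditionals
left; without optimization the generic path is taken and the result is the
same.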

FWIW, I experimented with uaccess inlining on ARC, using three kernels:
1. pristine 4.11-rc1 (all inline)
2. Inline + disabling the "smart" const propagation
3. Out-of-line-only variants (which already existed and are the default on ARC
for -Os, but hacked to work with the current -O3)

The numbers are for LMBench FS latency (run off tmpfs to avoid any
device-related perturbation). Note that LMBench already runs each test several
times itself, and each row below is obviously from a fresh reboot since the
kernels were different.

So it seems 0k file create/del gets worse without the smart inline, while 10k
gets better. mmap (16k) got worse as well. With out of line, some got better
while some got worse.


   File & VM system latencies in microseconds - smaller is better
   -------------------------------------------------------------------------------
   Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                           Create Delete Create Delete Latency Fault  Fault  selct
   --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
   170329-v4 Linux 4.11.0-  124.3   75.3  734.2  147.8  2200.0 6.205    10.9  87.6
   170330-v4 Linux 4.11.0-  154.9   88.3  709.2  131.2  2494.0 4.056    11.0  91.1
   170330-v4 Linux 4.11.0-  157.7   69.8  622.7  140.8  2168.0 5.654    10.8  91.0
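
To put the table above in relative terms, here is a small sketch that computes
the percentage change of a few columns against the first row. It assumes the
three rows appear in the same order as the three configurations listed earlier
(the table itself does not label them); pct_change() and print_deltas() are
illustrative names.

```c
#include <stdio.h>

/* Relative change of val against base, in percent (positive = slower). */
static double pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}

/* Columns taken from the table: 0K create, 10K create, mmap latency (us). */
static const char *const col_name[3] = { "0K create", "10K create", "mmap" };
static const double lat_us[3][3] = {
	{ 124.3, 734.2, 2200.0 },	/* 1. pristine (assumed row order) */
	{ 154.9, 709.2, 2494.0 },	/* 2. inline, no const propagation */
	{ 157.7, 622.7, 2168.0 },	/* 3. out of line only */
};

/* Print each non-baseline row as a percentage change against row 1. */
static void print_deltas(void)
{
	for (int row = 1; row < 3; row++)
		for (int col = 0; col < 3; col++)
			printf("config %d, %-10s: %+6.1f%%\n", row + 1,
			       col_name[col],
			       pct_change(lat_us[0][col], lat_us[row][col]));
}
```

Under that row-order assumption, configuration 2 comes out roughly 25% slower
on 0K create and roughly 13% slower on mmap, while 10K create is about 3%
faster.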

Compare that to the data for:

1. pristine 4.11-rc1 (all inline)
2. Al's series + ARC forced inline
3. Al's series + ARC forced NOT inline

   File & VM system latencies in microseconds - smaller is better
   -------------------------------------------------------------------------------
   Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                           Create Delete Create Delete Latency Fault  Fault  selct
   --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
   170329-v4 Linux 4.11.0-  124.3   75.3  734.2  147.8  2200.0 6.205    10.9  87.6
   170329-v4 Linux 4.11.0-  141.2   63.4  629.7  130.0  2172.0 5.796    10.8  90.0
   170329-v4 Linux 4.11.0-  154.9   89.2  691.6  147.7  2323.0 4.922    10.8  92.3

So it's a mixed bag really. Maybe we need a better directed test to really
drill down into it.


> But at least on x86 it is limited entirely to the "__" versions, and
> it's almost entirely pointless. We actually removed some of that kind
> of code because it was *so* pointless, and it had just been copied
> around into the "atomic" versions too.
> 
> See for example commit bd28b14591b9 ("x86: remove more uaccess_32.h
> complexity"), which did that.
> 
> The basic "__" versions still do that constant-size thing, but they
> really are questionable. 

Perhaps because the scope of constant usage was pretty narrow: it would only
benefit if *copy_from_user() were called with a size of 1, 2 or 4, which is
relatively unlikely as we have __get_user and friends for that already.
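
For contrast, the "trivial" 1-2-4 constant-size check looks roughly like the
sketch below. This is a user-space illustration, not the x86 code:
sketch_copy_small() and sketch_copy_generic() are hypothetical names, and
fixed-size memcpy() calls stand in for the single load/store an optimizing
compiler would typically emit.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical out-of-line fallback. Returns the number of bytes NOT
 * copied, mirroring the kernel's copy_{to,from}_user() convention.
 */
static unsigned long sketch_copy_generic(void *to, const void *from, size_t n)
{
	memcpy(to, from, n);
	return 0;
}

/*
 * Illustrative only: special-case compile-time constant sizes of 1, 2
 * and 4 bytes, and hand everything else to the generic routine.
 */
static inline __attribute__((always_inline))
unsigned long sketch_copy_small(void *to, const void *from, size_t n)
{
	if (__builtin_constant_p(n)) {
		switch (n) {
		case 1:
			memcpy(to, from, 1);	/* typically one byte load/store */
			return 0;
		case 2:
			memcpy(to, from, 2);	/* typically one halfword load/store */
			return 0;
		case 4:
			memcpy(to, from, 4);	/* typically one word load/store */
			return 0;
		}
	}
	return sketch_copy_generic(to, from, n);
}
```

Only callers passing a literal 1, 2 or 4 hit the fast path, which is why the
benefit is narrow compared with eliminating tail hunks for any constant size.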

> Exactly because it's just the "__" versions -
> the *regular* "copy_to/from_user()" is an unconditional function call,
> because inlining it isn't just the access operations, it's the size
> check, and on modern x86 it's also the "set AC to mark the user access
> as safe".

So what you are saying is that it is relatively costly on x86 because of SMAP,
which may not be true for arches w/o hardware support.
Note that I'm not arguing for/against inlining per se; it seems it doesn't matter.

-Vineet
