From: afzal mohammed <afzal.mohd.ma@gmail.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Nicolas Pitre <nico@fluxnic.net>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Linus Walleij <linus.walleij@linaro.org>,
	Russell King - ARM Linux admin <linux@armlinux.org.uk>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, Will Deacon <will@kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic()
Date: Fri, 12 Jun 2020 19:25:38 +0530
Message-ID: <20200612135538.GA13399@afzalpc>
In-Reply-To: <CAK8P3a1XUJHC0kG_Qwh4D4AoxTgCL5ggHd=45yNSmzaYWLUWXw@mail.gmail.com>

Hi,

On Fri, Jun 12, 2020 at 02:02:13PM +0200, Arnd Bergmann wrote:
> On Fri, Jun 12, 2020 at 12:18 PM afzal mohammed <afzal.mohd.ma@gmail.com> wrote:

> > Roughly a one-third drop in performance. Disabling highmem improves
> > performance only slightly.

> There are probably some things that can be done to optimize it,
> but I guess most of the overhead is from the page table operations
> and cannot be avoided.

Ingo's series did a follow_page() first and invoked get_user_pages()
only as a fallback; I will try that way as well.

Yes, I too feel the get_user_pages_fast() path is the most time-consuming;
I will instrument & check.

> What was the exact 'dd' command you used, in particular the block size?
> Note that by default, 'dd' will request 512 bytes at a time, so you usually
> only access a single page. It would be interesting to see the overhead with
> other typical or extreme block sizes, e.g. '1', '64', '4K', '64K' or '1M'.

It was the default (512); more test results follow (in MB/s, by block size):

                512     1K      4K      16K     32K     64K     1M

w/o series      30      46      89      95      90      85      65
w/ series       22      36      72      79      78      75      61
perf drop       26%     21%     19%     16%     13%     12%     6%

Hmm, results ain't that bad :)
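
For reference, the "perf drop" row can be recomputed from the two
throughput rows. The helper below is only a sketch; its name is
hypothetical and not part of the series. It returns the unrounded
percentage, so it will not match the table digit for digit (30 -> 22 MB/s
gives about 26.7%, for example).

```c
#include <assert.h>

/*
 * Percentage throughput drop between a baseline and a patched run.
 * Both arguments are throughputs in the same unit (MB/s in the table).
 * Hypothetical helper, for illustration only.
 */
static double perf_drop_percent(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) * 100.0 / base;
}
```

Calling perf_drop_percent(65, 61) for the 1M column gives a little over
6%, which agrees with the table.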

> If you want to drill down into where exactly the overhead is (i.e.
> get_user_pages or kmap_atomic, or something different), using
> 'perf record dd ..', and 'perf report' may be helpful.

Let me dig deeper & try to find out where the major overhead is and
figure out ways to reduce it.

One reason for testing with highmem disabled (results mentioned earlier)
was to make kmap_atomic() very lightweight; that did not make much
difference, only around 3%.

> > +static int copy_chunk_from_user(unsigned long from, int len, void *arg)
> > +{
> > +       unsigned long *to_ptr = arg, to = *to_ptr;
> > +
> > +       memcpy((void *) to, (void *) from, len);
> > +       *to_ptr += len;
> > +       return 0;
> > +}
> > +
> > +static int copy_chunk_to_user(unsigned long to, int len, void *arg)
> > +{
> > +       unsigned long *from_ptr = arg, from = *from_ptr;
> > +
> > +       memcpy((void *) to, (void *) from, len);
> > +       *from_ptr += len;
> > +       return 0;
> > +}
> 
> Will gcc optimize away the indirect function call and inline everything?
> If not, that would be a small part of the overhead.

I think not, based on objdump. I will make these inline, along with any
other places where that is possible, & see the difference.
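
A user-space sketch of the idea, under stated assumptions: walk_chunks()
and copy_in_chunks() are hypothetical names, and the walker here simply
splits a buffer into fixed-size chunks rather than walking pinned pages.
The point is only that when the walker itself is forced inline, each call
site passes the callback as a compile-time constant, which gives the
compiler the chance to replace the indirect call with a direct, inlinable
one. Whether a given gcc version actually does so still needs checking
with objdump.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*chunk_fn)(unsigned long addr, int len, void *arg);

/* Same shape as copy_chunk_from_user() in the patch, marked inline. */
static inline int copy_chunk(unsigned long from, int len, void *arg)
{
	unsigned long *to_ptr = arg;

	memcpy((void *)*to_ptr, (const void *)from, len);
	*to_ptr += len;
	return 0;
}

/*
 * Hypothetical walker: hands [addr, addr + n) to fn in pieces of at most
 * `chunk` bytes. Because it is always_inline, fn is a known constant at
 * every call site after inlining.
 */
static inline __attribute__((always_inline))
int walk_chunks(unsigned long addr, size_t n, size_t chunk,
		chunk_fn fn, void *arg)
{
	while (n) {
		size_t len = n < chunk ? n : chunk;
		int ret = fn(addr, (int)len, arg);

		if (ret)
			return ret;
		addr += len;
		n -= len;
	}
	return 0;
}

/* Call site: fn is the constant copy_chunk here. */
static int copy_in_chunks(void *to, const void *from, size_t n, size_t chunk)
{
	unsigned long dst = (unsigned long)to;

	assert(chunk > 0);
	return walk_chunks((unsigned long)from, n, chunk, copy_chunk, &dst);
}
```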

> > +       num_pages = DIV_ROUND_UP((unsigned long)from + n, PAGE_SIZE) -
> > +                                (unsigned long)from / PAGE_SIZE;
> 
> Make sure this doesn't turn into actual division operations but uses shifts.
> It might even be clearer here to open-code the shift operation so readers
> can see what this is meant to compile into.

Okay
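
An open-coded version could look roughly like the sketch below. It is
written as stand-alone C, so the page constants are local stand-ins
(4 KiB pages assumed) and span_pages() is a hypothetical name. Since
PAGE_SIZE is an unsigned power of two, the compiler is expected to emit
shifts for the original expression as well; the shifts here just make
that explicit to the reader.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)

/*
 * Number of pages touched by the byte range [from, from + n).
 * Equivalent to DIV_ROUND_UP(from + n, PAGE_SIZE) - from / PAGE_SIZE.
 */
static unsigned long span_pages(unsigned long from, unsigned long n)
{
	unsigned long first = from >> SKETCH_PAGE_SHIFT;
	/* index one past the last page touched */
	unsigned long end = (from + n + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;

	assert(n > 0);
	return end - first;
}
```

For example, a 2-byte access starting at offset 4095 straddles a page
boundary, so span_pages(4095, 2) is 2, while a 512-byte access at
offset 100 stays within one page.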

> 
> > +       pages = kmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
> > +       if (!pages)
> > +               goto end;
> 
> Another micro-optimization may be to avoid the kmalloc for the common case,
> e.g. anything with "num_pages <= 64", using an array on the stack.

Okay
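
A rough user-space sketch of that pattern, with calloc()/free() standing
in for kmalloc_array(..., __GFP_ZERO)/kfree(). The function name and the
used_heap out-parameter are hypothetical and exist only so the behaviour
can be observed. The threshold of 64 is the one suggested above; 64
pointers is 256 bytes of stack on 32-bit ARM, which may or may not be
acceptable for every caller.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_STACK_PAGES	64	/* threshold from the review comment */

struct page;			/* opaque here; only pointers are stored */

/*
 * Use an on-stack array for the common small case and fall back to the
 * heap for larger requests. Returns 0 on success, -1 if the heap
 * allocation fails. *used_heap reports which path was taken.
 */
static int with_page_array(unsigned long num_pages, int *used_heap)
{
	struct page *stack_pages[NR_STACK_PAGES] = { NULL };
	struct page **pages = stack_pages;

	assert(num_pages > 0);
	*used_heap = 0;

	if (num_pages > NR_STACK_PAGES) {
		/* stand-in for kmalloc_array(num_pages, ..., __GFP_ZERO) */
		pages = calloc(num_pages, sizeof(*pages));
		if (!pages)
			return -1;
		*used_heap = 1;
	}

	/* get_user_pages_fast() and the copy loop would use pages[] here. */
	pages[num_pages - 1] = NULL;

	if (pages != stack_pages)
		free(pages);
	return 0;
}
```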

> > +       ret = get_user_pages_fast((unsigned long)from, num_pages, 0, pages);
> > +       if (ret < 0)
> > +               goto free_pages;
> > +
> > +       if (ret != num_pages) {
> > +               num_pages = ret;
> > +               goto put_pages;
> > +       }
> 
> I think this is technically incorrect: if get_user_pages_fast() only
> gets some of the pages, you should continue with the short buffer and
> return the number of remaining bytes rather than not copying anything.
> I think you did that correctly for a failed kmap_atomic(), but this
> has to use the same logic.

yes, will fix that.
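
The arithmetic for the short case could be sketched as follows. This is
stand-alone C with local page constants (4 KiB pages assumed), and
bytes_not_copied() is a hypothetical helper, not code from the series.
It computes what copy_{from,to}_user() would have to return when only
the first `pinned` pages of the range were obtained: the bytes covered
by those pages get copied, the rest are reported as remaining.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)

/*
 * Bytes left uncopied for the range [from, from + n) when only the first
 * `pinned` pages could be pinned. Returns 0 when the pinned pages cover
 * the whole range.
 */
static unsigned long bytes_not_copied(unsigned long from, unsigned long n,
				      unsigned long pinned)
{
	unsigned long offset = from & (SKETCH_PAGE_SIZE - 1);
	unsigned long reachable;

	assert(n > 0);
	if (pinned == 0)
		return n;

	/* bytes from `from` up to the end of the last pinned page */
	reachable = (pinned << SKETCH_PAGE_SHIFT) - offset;
	return reachable >= n ? 0 : n - reachable;
}
```

For a 0x300-byte copy starting at 0x1f00 with only one page pinned,
0x100 bytes fit before the page boundary, so 0x200 bytes remain.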


Regards
afzal
