From: afzal mohammed <afzal.mohd.ma@gmail.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Nicolas Pitre <nico@fluxnic.net>,
Catalin Marinas <catalin.marinas@arm.com>,
Linus Walleij <linus.walleij@linaro.org>,
Russell King - ARM Linux admin <linux@armlinux.org.uk>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>, Will Deacon <will@kernel.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic()
Date: Fri, 12 Jun 2020 19:25:38 +0530
Message-ID: <20200612135538.GA13399@afzalpc>
In-Reply-To: <CAK8P3a1XUJHC0kG_Qwh4D4AoxTgCL5ggHd=45yNSmzaYWLUWXw@mail.gmail.com>
Hi,
On Fri, Jun 12, 2020 at 02:02:13PM +0200, Arnd Bergmann wrote:
> On Fri, Jun 12, 2020 at 12:18 PM afzal mohammed <afzal.mohd.ma@gmail.com> wrote:
> > Roughly a one-third drop in performance. Disabling highmem improves
> > performance only slightly.
> There are probably some things that can be done to optimize it,
> but I guess most of the overhead is from the page table operations
> and cannot be avoided.
Ingo's series did a follow_page() first and invoked get_user_pages()
only as a fallback; I will try that approach as well.
Yes, I too feel the get_user_pages_fast() path is the most time-consuming
part; I will instrument & check.
> What was the exact 'dd' command you used, in particular the block size?
> Note that by default, 'dd' will request 512 bytes at a time, so you usually
> only access a single page. It would be interesting to see the overhead with
> other typical or extreme block sizes, e.g. '1', '64', '4K', '64K' or '1M'.
It was the default (512). More test results follow (in MB/s):
              512   1K   4K  16K  32K  64K   1M
w/o series     30   46   89   95   90   85   65
w/ series      22   36   72   79   78   75   61
perf drop     26%  21%  19%  16%  13%  12%   6%
Hmm, results ain't that bad :)
> If you want to drill down into where exactly the overhead is (i.e.
> get_user_pages or kmap_atomic, or something different), using
> 'perf record dd ..', and 'perf report' may be helpful.
Let me dig deeper to find out where the major overhead is and try
to figure out ways to reduce it.
One reason for testing with highmem disabled (results mentioned earlier)
was to make kmap_atomic() very lightweight; that did not make much
difference, only around 3%.
> > +static int copy_chunk_from_user(unsigned long from, int len, void *arg)
> > +{
> > +	unsigned long *to_ptr = arg, to = *to_ptr;
> > +
> > +	memcpy((void *) to, (void *) from, len);
> > +	*to_ptr += len;
> > +	return 0;
> > +}
> > +
> > +static int copy_chunk_to_user(unsigned long to, int len, void *arg)
> > +{
> > +	unsigned long *from_ptr = arg, from = *from_ptr;
> > +
> > +	memcpy((void *) to, (void *) from, len);
> > +	*from_ptr += len;
> > +	return 0;
> > +}
>
> Will gcc optimize away the indirect function call and inline everything?
> If not, that would be a small part of the overhead.
I think not, based on objdump. I will make these inline, along with
other places where possible, & see the difference.
> > + num_pages = DIV_ROUND_UP((unsigned long)from + n, PAGE_SIZE) -
> > + (unsigned long)from / PAGE_SIZE;
>
> Make sure this doesn't turn into actual division operations but uses shifts.
> It might even be clearer here to open-code the shift operation so readers
> can see what this is meant to compile into.
Okay
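A minimal userspace sketch of what the open-coded shift might look like. The PAGE_SHIFT value of 12 and the helper names span_pages_shift() and span_pages_div() are assumptions for illustration and are not part of the patch; the second helper mirrors the quoted DIV_ROUND_UP expression so the two can be compared. Both assume n > 0.

```c
#include <stddef.h>

/* Assumed for illustration: 4 KiB pages. The kernel defines these per arch. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Pages touched by the byte range [addr, addr + n), n > 0, using shifts only. */
static unsigned long span_pages_shift(unsigned long addr, size_t n)
{
	unsigned long first = addr >> PAGE_SHIFT;
	unsigned long last = (addr + n - 1) >> PAGE_SHIFT;

	return last - first + 1;
}

/* Same count written like the quoted patch: DIV_ROUND_UP(addr + n) - addr / PAGE_SIZE. */
static unsigned long span_pages_div(unsigned long addr, size_t n)
{
	return (addr + n + PAGE_SIZE - 1) / PAGE_SIZE - addr / PAGE_SIZE;
}
```

With PAGE_SIZE a compile-time power of two and unsigned operands, a compiler would normally reduce the division form to shifts anyway; the shift form just makes that visible to the reader.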
>
> > +	pages = kmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
> > +	if (!pages)
> > +		goto end;
>
> Another micro-optimization may be to avoid the kmalloc for the common case,
> e.g. anything with "num_pages <= 64", using an array on the stack.
Okay
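The stack-array suggestion could be sketched roughly as below, in userspace C with calloc() standing in for kmalloc_array(..., __GFP_ZERO). The STACK_PAGES threshold of 64 comes from the suggestion above; the function name with_page_array() and the used_heap out-parameter are hypothetical and exist only so the path taken can be observed.

```c
#include <stdlib.h>
#include <string.h>

#define STACK_PAGES	64	/* threshold taken from the review suggestion */

struct page;			/* opaque here; only pointers are stored */

/*
 * Use an on-stack pointer array for small requests and fall back to the
 * heap for larger ones. Returns 0 on success, -1 on allocation failure.
 * *used_heap reports which path was taken (illustration only).
 */
static int with_page_array(unsigned long num_pages, int *used_heap)
{
	struct page *on_stack[STACK_PAGES];
	struct page **pages = on_stack;

	*used_heap = 0;
	if (num_pages > STACK_PAGES) {
		pages = calloc(num_pages, sizeof(*pages));
		if (!pages)
			return -1;
		*used_heap = 1;
	} else {
		memset(on_stack, 0, sizeof(on_stack));
	}

	/* ... pin pages into pages[0..num_pages-1] and copy here ... */

	if (pages != on_stack)
		free(pages);
	return 0;
}
```

On a 32-bit target, 64 pointers take 256 bytes of stack, which is the trade-off to weigh against the allocation cost.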
> > +	ret = get_user_pages_fast((unsigned long)from, num_pages, 0, pages);
> > +	if (ret < 0)
> > +		goto free_pages;
> > +
> > +	if (ret != num_pages) {
> > +		num_pages = ret;
> > +		goto put_pages;
> > +	}
>
> I think this is technically incorrect: if get_user_pages_fast() only gets
> some of the pages, you should continue with the short buffer and return the
> number of remaining bytes rather than not copying anything. I think you did
> that correctly for a failed kmap_atomic(), but this has to use the same logic.
yes, will fix that.
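One way the short-pin case might be handled is to compute how many bytes the pinned pages actually cover, copy that much, and return the rest as "bytes not copied", which is what copy_{from,to}_user() callers expect. This is a userspace sketch only: PAGE_SHIFT of 12 and the helper names bytes_covered() and bytes_not_copied() are assumptions, not code from the series.

```c
#include <stddef.h>

/* Assumed for illustration: 4 KiB pages. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/*
 * Bytes of [addr, addr + n) that lie within the first `pinned` pages,
 * counting from the page containing addr. A zero or negative `pinned`
 * (nothing pinned, or an error code) covers no bytes.
 */
static size_t bytes_covered(unsigned long addr, size_t n, long pinned)
{
	size_t avail;

	if (pinned <= 0)
		return 0;
	/* The first page only contributes the part from addr to its end. */
	avail = (size_t)pinned * PAGE_SIZE - (addr & (PAGE_SIZE - 1));
	return avail < n ? avail : n;
}

/* Value to return to the caller: the bytes that could not be copied. */
static size_t bytes_not_copied(unsigned long addr, size_t n, long pinned)
{
	return n - bytes_covered(addr, n, pinned);
}
```

For example, a request of 8192 bytes starting at offset 4000 spans three pages; if only one page is pinned, 96 bytes can be copied and 8096 are reported as remaining.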
Regards
afzal