From: "Lu, Aaron" <aaron.lu@intel.com>
To: "vbabka@suse.cz" <vbabka@suse.cz>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tim.c.chen@linux.intel.com" <tim.c.chen@linux.intel.com>,
"khandual@linux.vnet.ibm.com" <khandual@linux.vnet.ibm.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"ak@linux.intel.com" <ak@linux.intel.com>,
"Wang, Kemi" <kemi.wang@intel.com>,
"Hansen, Dave" <dave.hansen@intel.com>,
"Huang, Ying" <ying.huang@intel.com>
Subject: Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline
Date: Wed, 18 Oct 2017 01:53:47 +0000
Message-ID: <1508291629.14336.14.camel@intel.com>
In-Reply-To: <7304b3a4-d6cb-63fa-743d-ea8e7b126e32@suse.cz>
On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote:
> On 10/13/2017 08:31 AM, Aaron Lu wrote:
> > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
> > __rmqueue_cma_fallback() are all in page allocator's hot path and
> > better be finished as soon as possible. One way to make them faster
> > is by making them inline. But as Andrew Morton and Andi Kleen pointed
> > out:
> > https://lkml.org/lkml/2017/10/10/1252
> > https://lkml.org/lkml/2017/10/10/1279
> > To make sure they are inlined, we should use __always_inline for them.
> >
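The change itself is mechanical; for each of the four functions it has
roughly this shape (an illustrative hunk, not the verbatim diff):

-static inline struct page *
+static __always_inline struct page *
 __rmqueue(struct zone *zone, unsigned int order, int migratetype)
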
> > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
> > processes to stress buddy, the results for will-it-scale.processes with
> > and without the patch are:
> >
> > On a 2-socket Intel Skylake machine:
> >
> > compiler base head change
> > gcc-4.4.7 6496131 6911823 +6.4%
> > gcc-4.9.4 7225110 7731072 +7.0%
> > gcc-5.4.1 7054224 7688146 +9.0%
> > gcc-6.2.0 7059794 7651675 +8.4%
> >
> > On a 4-socket Intel Skylake machine:
> >
> > compiler base head change
> > gcc-4.4.7 13162890 13508193 +2.6%
> > gcc-4.9.4 14997463 15484353 +3.2%
> > gcc-5.4.1 14708711 15449805 +5.0%
> > gcc-6.2.0 14574099 15349204 +5.3%
> >
> > The above 4 compilers are used because I've done the tests through Intel's
> > Linux Kernel Performance (LKP) infrastructure and they are the compilers
> > available there.
> >
> > The benefit is smaller on the 4-socket machine because the lock contention
> > there (perf-profile/native_queued_spin_lock_slowpath = 81%) is less severe
> > than on the 2-socket machine (85%).
> >
> > What the benchmark does is: it forks nr_cpu processes and then each
> > process does the following in a loop (sketched below):
> > 1 mmap() 128M of anonymous space;
> > 2 write to each page there to trigger actual page allocation;
> > 3 munmap() it.
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
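For reference, here is a minimal sketch of that per-process loop (a
simplified paraphrase of the linked page_fault1.c, not the verbatim
source):

#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128 * 1024 * 1024)	/* the 128M mmap'ed above */

static void testcase(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);

	for (;;) {
		/* 1: map 128M of anonymous memory */
		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (c == MAP_FAILED)
			break;
		/* 2: write to every page, each write faults in one page */
		for (long i = 0; i < MEMSIZE; i += pagesize)
			c[i] = 0;
		/* 3: unmap, returning all pages to the buddy allocator */
		munmap(c, MEMSIZE);
	}
}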
>
> Are transparent hugepages enabled? If yes, __rmqueue() is called from
> rmqueue(), and there's only one page fault (and __rmqueue()) per 512
> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk()
> in bursts once pcplists are depleted. I guess it's the latter, otherwise
> I wouldn't expect a function call to have such visible overhead.
THP is disabled. I should have mentioned this in the changelog, sorry
about that.
>
> I guess what would help much more would be a bulk __rmqueue_smallest()
> to grab multiple pages from the freelists. But can't argue with your
Do I understand you correctly that you are suggesting a bulk
__rmqueue_smallest(), say __rmqueue_smallest_bulk()? With that, instead
of looping pcp->batch times in rmqueue_bulk(), a single call to
__rmqueue_smallest_bulk() would be enough, and __rmqueue_smallest_bulk()
would loop pcp->batch times internally to get those pages?
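I.e. something like this rough sketch (hypothetical, an
__rmqueue_smallest_bulk() does not exist today; order-0 only to keep it
simple):

/*
 * Hypothetical: grab 'count' order-0 pages from the free lists in one
 * call, instead of one __rmqueue() call per page.
 */
static unsigned int __rmqueue_smallest_bulk(struct zone *zone,
			unsigned int count, struct list_head *list,
			int migratetype)
{
	unsigned int allocated = 0;

	while (allocated < count) {
		struct page *page = __rmqueue_smallest(zone, 0, migratetype);

		if (!page)
			break;
		list_add_tail(&page->lru, list);
		allocated++;
	}
	/* the pcp->batch loop has simply moved in here */
	return allocated;
}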
If so, then it feels like __rmqueue_smallest_bulk() has just become
rmqueue_bulk(), or am I missing something?
> numbers against this patch.
>
> > Binary size wise, I have locally built them with different compilers:
> >
> > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
> > text data bss dec hex filename
> > 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o
> > 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o
> > 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o
> > 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o
> > 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o
> > 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o
> > 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o
> > 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o
> >
> > Text size increased by about 800 bytes for mm/page_alloc.o.
>
> BTW, do you know about ./scripts/bloat-o-meter? :)
NO!!! Thanks for bringing this up :)
> With gcc 7.2.1:
> > ./scripts/bloat-o-meter base.o mm/page_alloc.o
>
> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844)
Nice, it clearly shows the 844 bytes of bloat.
> function old new delta
> get_page_from_freelist 2898 4937 +2039
> steal_suitable_fallback - 365 +365
> find_suitable_fallback 31 120 +89
> find_suitable_fallback.part 115 - -115
> __rmqueue 1534 - -1534
>
>
> > [aaron@aaronlu obj]$ size */*/vmlinux
> > text data bss dec hex filename
> > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux
> > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux
> > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux
> > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux
> > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux
> > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux
> > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux
> > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux
> >
> > Text size for vmlinux has no change though, probably due to function
> > alignment.
>
> Yep that's useless to show. These differences do add up though, until
> they eventually cross the alignment boundary.
Agreed.
But since this is the hot path, the performance improvement might be
worth it.