All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Lu, Aaron" <aaron.lu@intel.com>
To: "vbabka@suse.cz" <vbabka@suse.cz>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"tim.c.chen@linux.intel.com" <tim.c.chen@linux.intel.com>,
	"khandual@linux.vnet.ibm.com" <khandual@linux.vnet.ibm.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"Wang, Kemi" <kemi.wang@intel.com>,
	"Hansen, Dave" <dave.hansen@intel.com>,
	"Huang, Ying" <ying.huang@intel.com>
Subject: Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline
Date: Wed, 18 Oct 2017 01:53:47 +0000	[thread overview]
Message-ID: <1508291629.14336.14.camel@intel.com> (raw)
In-Reply-To: <7304b3a4-d6cb-63fa-743d-ea8e7b126e32@suse.cz>

On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote:
> On 10/13/2017 08:31 AM, Aaron Lu wrote:
> > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
> > __rmqueue_cma_fallback() are all in page allocator's hot path and
> > better be finished as soon as possible. One way to make them faster
> > is by making them inline. But as Andrew Morton and Andi Kleen pointed
> > out:
> > https://lkml.org/lkml/2017/10/10/1252
> > https://lkml.org/lkml/2017/10/10/1279
> > To make sure they are inlined, we should use __always_inline for them.
> > 
> > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
> > processes to stress buddy, the results for will-it-scale.processes with
> > and without the patch are:
> > 
> > On a 2-sockets Intel-Skylake machine:
> > 
> >  compiler          base        head
> > gcc-4.4.7       6496131     6911823 +6.4%
> > gcc-4.9.4       7225110     7731072 +7.0%
> > gcc-5.4.1       7054224     7688146 +9.0%
> > gcc-6.2.0       7059794     7651675 +8.4%
> > 
> > On a 4-sockets Intel-Skylake machine:
> > 
> >  compiler          base        head
> > gcc-4.4.7      13162890    13508193 +2.6%
> > gcc-4.9.4      14997463    15484353 +3.2%
> > gcc-5.4.1      14708711    15449805 +5.0%
> > gcc-6.2.0      14574099    15349204 +5.3%
> > 
> > The above 4 compilers are used becuase I've done the tests through Intel's
> > Linux Kernel Performance(LKP) infrastructure and they are the available
> > compilers there.
> > 
> > The benefit being less on 4 sockets machine is due to the lock contention
> > there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe
> > than on the 2 sockets machine(85%).
> > 
> > What the benchmark does is: it forks nr_cpu processes and then each
> > process does the following:
> >     1 mmap() 128M anonymous space;
> >     2 writes to each page there to trigger actual page allocation;
> >     3 munmap() it.
> > in a loop.
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Are transparent hugepages enabled? If yes, __rmqueue() is called from
> rmqueue(), and there's only one page fault (and __rmqueue()) per 512
> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk()
> in bursts once pcplists are depleted. I guess it's the latter, otherwise
> I wouldn't expect a function call to have such visible overhead.

THP is disabled. I should have mentioned this in the changelog, sorry
about that.

> 
> I guess what would help much more would be a bulk __rmqueue_smallest()
> to grab multiple pages from the freelists. But can't argue with your

Do I understand you correctly that you suggest to use a bulk
__rmqueue_smallest(), say __rmqueue_smallest_bulk(). With that, instead
of looping pcp->batch times in rmqueue_bulk(), a single call to
__rmqueue_smallest_bulk() is enough and __rmqueue_smallest_bulk() will
loop pcp->batch times to get those pages?

Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(),
or do I miss something?

> numbers against this patch.
> 
> > Binary size wise, I have locally built them with different compilers:
> > 
> > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
> >    text    data     bss     dec     hex filename
> >   37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
> >   38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
> >   37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
> >   38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
> >   37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
> >   38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
> >   36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
> >   37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
> > 
> > Text size increased about 800 bytes for mm/page_alloc.o.
> 
> BTW, do you know about ./scripts/bloat-o-meter? :)

NO!!! Thanks for bringing this up :)

> With gcc 7.2.1:
> > ./scripts/bloat-o-meter base.o mm/page_alloc.o
> 
> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844)

Nice, it clearly showed 844 bytes bloat.

> function                                     old     new   delta
> get_page_from_freelist                      2898    4937   +2039
> steal_suitable_fallback                        -     365    +365
> find_suitable_fallback                        31     120     +89
> find_suitable_fallback.part                  115       -    -115
> __rmqueue                                   1534       -   -1534
> 
> 
> > [aaron@aaronlu obj]$ size */*/vmlinux
> >    text    data     bss     dec       hex     filename
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
> > 10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
> > 10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
> > 10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
> > 10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
> > 
> > Text size for vmlinux has no change though, probably due to function
> > alignment.
> 
> Yep that's useless to show. These differences do add up though, until
> they eventually cross the alignment boundary.

Agreed.
But you know, it is the hot path, the performance improvement might be
worth it.

  reply	other threads:[~2017-10-18  1:54 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-09  5:44 [PATCH] page_alloc.c: inline __rmqueue() Aaron Lu
2017-10-09  5:44 ` Aaron Lu
2017-10-09  7:37 ` Anshuman Khandual
2017-10-09  7:37   ` Anshuman Khandual
2017-10-09  7:53   ` Aaron Lu
2017-10-09  7:53     ` Aaron Lu
2017-10-09 20:23 ` Dave Hansen
2017-10-09 20:23   ` Dave Hansen
2017-10-10  2:51   ` Aaron Lu
2017-10-10  2:51     ` Aaron Lu
2017-10-10  2:56     ` [PATCH v2] mm/page_alloc.c: " Aaron Lu
2017-10-10  2:56       ` Aaron Lu
2017-10-10  5:19       ` Dave Hansen
2017-10-10  5:19         ` Dave Hansen
2017-10-10  5:43         ` Aaron Lu
2017-10-10  5:43           ` Aaron Lu
2017-10-10 21:45           ` Andrew Morton
2017-10-10 21:45             ` Andrew Morton
2017-10-10 22:27             ` Andi Kleen
2017-10-10 22:27               ` Andi Kleen
2017-10-11  2:34             ` Aaron Lu
2017-10-11  2:34               ` Aaron Lu
2017-10-13  6:31               ` [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline Aaron Lu
2017-10-13  6:31                 ` Aaron Lu
2017-10-17 11:32                 ` Vlastimil Babka
2017-10-17 11:32                   ` Vlastimil Babka
2017-10-18  1:53                   ` Lu, Aaron [this message]
2017-10-18  6:28                     ` Vlastimil Babka
2017-10-18  6:28                       ` Vlastimil Babka
2017-10-18  8:57                       ` Aaron Lu
2017-10-18  8:57                         ` Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1508291629.14336.14.camel@intel.com \
    --to=aaron.lu@intel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=kemi.wang@intel.com \
    --cc=khandual@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.