* [PATCH] page_alloc.c: inline __rmqueue() @ 2017-10-09 5:44 Aaron Lu 2017-10-09 7:37 ` Anshuman Khandual 2017-10-09 20:23 ` Dave Hansen 0 siblings, 2 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-09 5:44 UTC (permalink / raw) To: linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock and that lock can be heavily contended with memory intensive applications. Since __rmqueue() is a small function, inline it can save us some time. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy: On a 2 sockets Intel-Skylake machine: base %change head 77342 +6.3% 82203 will-it-scale.per_process_ops On a 4 sockets Intel-Skylake machine: base %change head 75746 +4.6% 79248 will-it-scale.per_process_ops This patch adds inline to __rmqueue(). Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..c9605c7ebaf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2291,7 +2291,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, +static inline struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 5:44 [PATCH] page_alloc.c: inline __rmqueue() Aaron Lu @ 2017-10-09 7:37 ` Anshuman Khandual 2017-10-09 7:53 ` Aaron Lu 2017-10-09 20:23 ` Dave Hansen 1 sibling, 1 reply; 16+ messages in thread From: Anshuman Khandual @ 2017-10-09 7:37 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang On 10/09/2017 11:14 AM, Aaron Lu wrote: > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > and that lock can be heavily contended with memory intensive applications. > > Since __rmqueue() is a small function, inline it can save us some time. > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy: > > On a 2 sockets Intel-Skylake machine: > base %change head > 77342 +6.3% 82203 will-it-scale.per_process_ops > > On a 4 sockets Intel-Skylake machine: > base %change head > 75746 +4.6% 79248 will-it-scale.per_process_ops > > This patch adds inline to __rmqueue(). > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> Ran it through kernel bench and ebizzy micro benchmarks. Results were comparable with and without the patch. May be these are not the appropriate tests for this inlining improvement. Anyways it does not have any performance degradation either. Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 7:37 ` Anshuman Khandual @ 2017-10-09 7:53 ` Aaron Lu 0 siblings, 0 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-09 7:53 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang On Mon, Oct 09, 2017 at 01:07:36PM +0530, Anshuman Khandual wrote: > On 10/09/2017 11:14 AM, Aaron Lu wrote: > > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > > and that lock can be heavily contended with memory intensive applications. > > > > Since __rmqueue() is a small function, inline it can save us some time. > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > > processes to stress buddy: > > > > On a 2 sockets Intel-Skylake machine: > > base %change head > > 77342 +6.3% 82203 will-it-scale.per_process_ops > > > > On a 4 sockets Intel-Skylake machine: > > base %change head > > 75746 +4.6% 79248 will-it-scale.per_process_ops > > > > This patch adds inline to __rmqueue(). > > > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > > Ran it through kernel bench and ebizzy micro benchmarks. Results > were comparable with and without the patch. May be these are not > the appropriate tests for this inlining improvement. Anyways it I think so. The benefit only appears when the lock contention is huge enough, e.g. perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath is as high as 80% with the workload I have used. > does not have any performance degradation either. > > Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> > Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 5:44 [PATCH] page_alloc.c: inline __rmqueue() Aaron Lu 2017-10-09 7:37 ` Anshuman Khandual @ 2017-10-09 20:23 ` Dave Hansen 2017-10-10 2:51 ` Aaron Lu 1 sibling, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-10-09 20:23 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang On 10/08/2017 10:44 PM, Aaron Lu wrote: > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > and that lock can be heavily contended with memory intensive applications. What does "memory intensive" mean? I'd probably just say: "The two __rmqueue() call sites are in very hot page allocator paths." > Since __rmqueue() is a small function, inline it can save us some time. > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy: Please include a description of the test and a link to the source. > On a 2 sockets Intel-Skylake machine: > base %change head > 77342 +6.3% 82203 will-it-scale.per_process_ops What's the unit here? That seems ridiculously low for page_fault1. It's usually in the millions. > On a 4 sockets Intel-Skylake machine: > base %change head > 75746 +4.6% 79248 will-it-scale.per_process_ops It's probably worth noting the reason that this is _less_ beneficial on a larger system. I'd also just put this in text rather than wasting space in tables like that. It took me a few minutes to figure out what the table was trying top say. This is one of those places where LKP output is harmful. Why not just say: This patch improved the benchmark by 6.3% on a 2-socket system and 4.6% on a 4-socket system. > This patch adds inline to __rmqueue(). How much text bloat does this cost? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 20:23 ` Dave Hansen @ 2017-10-10 2:51 ` Aaron Lu 2017-10-10 2:56 ` [PATCH v2] mm/page_alloc.c: " Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 2:51 UTC (permalink / raw) To: Dave Hansen Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang On Mon, Oct 09, 2017 at 01:23:34PM -0700, Dave Hansen wrote: > On 10/08/2017 10:44 PM, Aaron Lu wrote: > > On a 2 sockets Intel-Skylake machine: > > base %change head > > 77342 +6.3% 82203 will-it-scale.per_process_ops > > What's the unit here? That seems ridiculously low for page_fault1. > It's usually in the millions. per_process_ops = processes/nr_process since nr_process here is nr_cpu, so on the 2 sockets machine with 104 CPUs, processes are 8043568(base) and 8549112(head), which are in the millions as you correctly pointed out. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
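To make the relationship concrete, the figures above can be checked directly against the table in the original changelog: the 2-socket machine has 104 CPUs, so

    will-it-scale.per_process_ops = processes / nr_process
    base: 8043568 / 104 = 77342
    head: 8549112 / 104 = 82203

which are exactly the per_process_ops numbers quoted in the v1 changelog.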
* [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 2:51 ` Aaron Lu @ 2017-10-10 2:56 ` Aaron Lu 2017-10-10 5:19 ` Dave Hansen 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 2:56 UTC (permalink / raw) To: linux-mm, lkml Cc: Dave Hansen, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock and the two __rmqueue() call sites are in very hot page allocator paths. Since __rmqueue() is a small function, inlining it can save us some time. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy, this patch improved the benchmark by 6.3% on a 2-socket Intel-Skylake system and 4.6% on a 4-socket Intel-Skylake system. The benefit is smaller on the 4-socket machine because the lock contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe than on the 2-socket machine (84%). What the benchmark does is: it forks nr_cpu processes and then each process does the following in a loop: 1 mmap() 128M of anonymous space; 2 write to each page there to trigger actual page allocation; 3 munmap() it. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c This patch adds inline to __rmqueue() and vmlinux's size doesn't change after this patch according to size(1). without this patch: text data bss dec hex filename 9968576 5793372 17715200 33477148 1fed21c vmlinux with this patch: text data bss dec hex filename 9968576 5793372 17715200 33477148 1fed21c vmlinux Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- v2: change commit message according to Dave Hansen's suggestion. mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..c9605c7ebaf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2291,7 +2291,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, +static inline struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
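For readers unfamiliar with will-it-scale, the per-process loop described in the changelog above boils down to something like the following sketch. This is an illustration based on that description, not the actual page_fault1.c source (the real harness also handles iteration counting, timing and result reporting), and the function name is made up here:

    #include <unistd.h>
    #include <sys/mman.h>

    #define MAPSIZE (128UL * 1024 * 1024)   /* step 1: 128M of anonymous space */

    static void stress_buddy_once(void)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return;
            /* step 2: write to each page to force actual page allocation */
            for (unsigned long off = 0; off < MAPSIZE; off += pagesize)
                    p[off] = 1;
            /* step 3: unmap, returning the pages to the allocator */
            munmap(p, MAPSIZE);
    }

Each of the nr_cpu forked processes runs such an iteration in a loop, so with 4K pages every pass drives roughly 32k page allocations and frees through the buddy allocator, which is what makes zone->lock so hot in this test.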
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 2:56 ` [PATCH v2] mm/page_alloc.c: " Aaron Lu @ 2017-10-10 5:19 ` Dave Hansen 2017-10-10 5:43 ` Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-10-10 5:19 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On 10/09/2017 07:56 PM, Aaron Lu wrote: > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > change after this patch according to size(1). > > without this patch: > text data bss dec hex filename > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > with this patch: > text data bss dec hex filename > 9968576 5793372 17715200 33477148 1fed21c vmlinux This is unexpected. Could you double-check this, please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 5:19 ` Dave Hansen @ 2017-10-10 5:43 ` Aaron Lu 2017-10-10 21:45 ` Andrew Morton 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 5:43 UTC (permalink / raw) To: Dave Hansen Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > change after this patch according to size(1). > > > > without this patch: > > text data bss dec hex filename > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > with this patch: > > text data bss dec hex filename > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > This is unexpected. Could you double-check this, please? mm/page_alloc.o has size changes: Without this patch: $ size mm/page_alloc.o text data bss dec hex filename 36695 9792 8396 54883 d663 mm/page_alloc.o With this patch: $ size mm/page_alloc.o text data bss dec hex filename 37511 9792 8396 55699 d993 mm/page_alloc.o But vmlinux doesn't. It's not clear to me what happened, do you want to me dig this out? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 5:43 ` Aaron Lu @ 2017-10-10 21:45 ` Andrew Morton 2017-10-10 22:27 ` Andi Kleen 2017-10-11 2:34 ` Aaron Lu 0 siblings, 2 replies; 16+ messages in thread From: Andrew Morton @ 2017-10-10 21:45 UTC (permalink / raw) To: Aaron Lu Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Tue, 10 Oct 2017 13:43:43 +0800 Aaron Lu <aaron.lu@intel.com> wrote: > On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > > change after this patch according to size(1). > > > > > > without this patch: > > > text data bss dec hex filename > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > with this patch: > > > text data bss dec hex filename > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > This is unexpected. Could you double-check this, please? > > mm/page_alloc.o has size changes: > > Without this patch: > $ size mm/page_alloc.o > text data bss dec hex filename > 36695 9792 8396 54883 d663 mm/page_alloc.o > > With this patch: > $ size mm/page_alloc.o > text data bss dec hex filename > 37511 9792 8396 55699 d993 mm/page_alloc.o > > But vmlinux doesn't. > > It's not clear to me what happened, do you want to me dig this out? There's weird stuff going on. With x86_64 gcc-4.8.4 Patch not applied: akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue 0000000000002a00 t __rmqueue Patch applied: akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue 000000000000039f t __rmqueue_fallback 0000000000001220 t __rmqueue_smallest So inlining __rmqueue has caused the compiler to decide to uninline __rmqueue_fallback and __rmqueue_smallest, which largely undoes the effect of your patch. `inline' is basically advisory (or ignored) in modern gcc's. So gcc has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest because gcc thinks it knows best. That's why we created __always_inline, to grab gcc by the scruff of its neck. So... I think this patch could do with quite a bit more care, tuning and testing with various gcc versions. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
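The distinction Andrew draws can be seen outside the kernel too. Below is a small, self-contained userspace illustration (a hypothetical demo file, not kernel code): plain inline is only a hint that gcc is free to ignore, while the always_inline function attribute, which is essentially what the kernel's __always_inline macro adds, forces the call to be expanded at the call site. Compare the generated code, e.g. with objdump -d, at -O0 where gcc normally does no inlining on its own:

    /* inline_demo.c: build with "gcc -O0 -c inline_demo.c" */
    static inline int advisory(int x)
    {
            /* plain inline: only advisory, gcc may still emit a real call */
            return x * 41 + 7;
    }

    static inline __attribute__((always_inline)) int forced(int x)
    {
            /* always_inline: gcc expands this at every call site,
             * or errors out if it cannot */
            return x * 13 + 3;
    }

    int caller(int x)
    {
            return advisory(x) + forced(x);
    }

At -O0 the call to advisory() typically remains an out-of-line call while forced() is folded into caller(); at -O2 gcc usually inlines both. That freedom on the compiler's part is exactly why the plain-inline v2 patch behaved differently between builds.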
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 21:45 ` Andrew Morton @ 2017-10-10 22:27 ` Andi Kleen 2017-10-11 2:34 ` Aaron Lu 1 sibling, 0 replies; 16+ messages in thread From: Andi Kleen @ 2017-10-10 22:27 UTC (permalink / raw) To: Andrew Morton Cc: Aaron Lu, Dave Hansen, linux-mm, lkml, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual > `inline' is basically advisory (or ignored) in modern gcc's. So gcc > has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest > because gcc thinks it knows best. That's why we created > __always_inline, to grab gcc by the scruff of its neck. > > So... I think this patch could do with quite a bit more care, tuning > and testing with various gcc versions. We should just everything in the hot path mark __always_inline. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 21:45 ` Andrew Morton 2017-10-10 22:27 ` Andi Kleen @ 2017-10-11 2:34 ` Aaron Lu 2017-10-13 6:31 ` [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline Aaron Lu 1 sibling, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-11 2:34 UTC (permalink / raw) To: Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Tue, Oct 10, 2017 at 02:45:45PM -0700, Andrew Morton wrote: > On Tue, 10 Oct 2017 13:43:43 +0800 Aaron Lu <aaron.lu@intel.com> wrote: > > > On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > > > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > > > change after this patch according to size(1). > > > > > > > > without this patch: > > > > text data bss dec hex filename > > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > > > with this patch: > > > > text data bss dec hex filename > > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > This is unexpected. Could you double-check this, please? > > > > mm/page_alloc.o has size changes: > > > > Without this patch: > > $ size mm/page_alloc.o > > text data bss dec hex filename > > 36695 9792 8396 54883 d663 mm/page_alloc.o > > > > With this patch: > > $ size mm/page_alloc.o > > text data bss dec hex filename > > 37511 9792 8396 55699 d993 mm/page_alloc.o > > > > But vmlinux doesn't. > > > > It's not clear to me what happened, do you want to me dig this out? > > There's weird stuff going on. > > With x86_64 gcc-4.8.4 > > Patch not applied: > > akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue > 0000000000002a00 t __rmqueue > > Patch applied: > > akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue > 000000000000039f t __rmqueue_fallback > 0000000000001220 t __rmqueue_smallest > > So inlining __rmqueue has caused the compiler to decide to uninline > __rmqueue_fallback and __rmqueue_smallest, which largely undoes the > effect of your patch. > > `inline' is basically advisory (or ignored) in modern gcc's. So gcc > has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest > because gcc thinks it knows best. That's why we created > __always_inline, to grab gcc by the scruff of its neck. This is a good point and I agree with Andi to use always_inline for those functions that we really want to inline. > > So... I think this patch could do with quite a bit more care, tuning > and testing with various gcc versions. I did some more testing. With x86_64 gcc-4.6.3 available from kernel.org crosstool: Patch not applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 00000000000023f0 t __rmqueue 00000000000027c0 t __rmqueue_pcplist.isra.95 Patch applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 0000000000002950 t __rmqueue_pcplist.isra.95 Works expected. With self built x86_64 gcc-4.8.4: Patch not applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 0000000000001f20 t __rmqueue Patch applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue Works expected.(conflicts with your result though). I also tested gcc-4.9.4, gcc-5.3.1, gcc-6.4.0 and gcc-7.2.1, all have the same output as the above gcc-4.8.4. Then I realized CONFIG_OPTIMIZE_INLINING which I always disabled as suggested by the help message(If unsure, say N). Turnining that config on indeed caused gcc-4.8.4 to emit __rmqueue_fallback here. 
I think I'll just mark those functions always_inline. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
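The CONFIG_OPTIMIZE_INLINING observation explains the conflicting nm results above. In kernels of that era, when the option is disabled, the compiler header redefines plain inline so that it already carries the always_inline attribute; roughly along these lines (a simplified recollection of the 4.x include/linux/compiler-gcc.h, which also adds notrace and checks CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING):

    #ifndef CONFIG_OPTIMIZE_INLINING
    #define inline inline __attribute__((always_inline))
    #endif

So with CONFIG_OPTIMIZE_INLINING=n the v2 patch's plain inline was already being forced, and only with the option enabled does gcc regain the freedom to keep __rmqueue_fallback() and friends out of line.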
* [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-11 2:34 ` Aaron Lu @ 2017-10-13 6:31 ` Aaron Lu 2017-10-17 11:32 ` Vlastimil Babka 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-13 6:31 UTC (permalink / raw) To: Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and __rmqueue_cma_fallback() are all in the page allocator's hot path and should be finished as soon as possible. One way to make them faster is by making them inline. But as Andrew Morton and Andi Kleen pointed out: https://lkml.org/lkml/2017/10/10/1252 https://lkml.org/lkml/2017/10/10/1279 To make sure they are inlined, we should use __always_inline for them. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy, the results for will-it-scale.processes with and without the patch are: On a 2-socket Intel-Skylake machine: compiler base head gcc-4.4.7 6496131 6911823 +6.4% gcc-4.9.4 7225110 7731072 +7.0% gcc-5.4.1 7054224 7688146 +9.0% gcc-6.2.0 7059794 7651675 +8.4% On a 4-socket Intel-Skylake machine: compiler base head gcc-4.4.7 13162890 13508193 +2.6% gcc-4.9.4 14997463 15484353 +3.2% gcc-5.4.1 14708711 15449805 +5.0% gcc-6.2.0 14574099 15349204 +5.3% The above 4 compilers are used because I've done the tests through Intel's Linux Kernel Performance (LKP) infrastructure and they are the available compilers there. The benefit is smaller on the 4-socket machine because the lock contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe than on the 2-socket machine (85%). What the benchmark does is: it forks nr_cpu processes and then each process does the following in a loop: 1 mmap() 128M of anonymous space; 2 write to each page there to trigger actual page allocation; 3 munmap() it. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Binary size wise, I have locally built them with different compilers: [aaron@aaronlu obj]$ size */*/mm/page_alloc.o text data bss dec hex filename 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o Text size increased by about 800 bytes for mm/page_alloc.o. [aaron@aaronlu obj]$ size */*/vmlinux text data bss dec hex filename 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux Text size for vmlinux has no change though, probably due to function alignment.
Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- mm/page_alloc.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..0fe3e2095268 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1794,7 +1794,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags * Go through the free lists for the given migratetype and remove * the smallest available page from the freelists */ -static inline +static __always_inline struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, int migratetype) { @@ -1838,7 +1838,7 @@ static int fallbacks[MIGRATE_TYPES][4] = { }; #ifdef CONFIG_CMA -static struct page *__rmqueue_cma_fallback(struct zone *zone, +static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone, unsigned int order) { return __rmqueue_smallest(zone, order, MIGRATE_CMA); @@ -2219,7 +2219,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * deviation from the rest of this file, to make the for loop * condition simpler. */ -static inline bool +static __always_inline bool __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) { struct free_area *area; @@ -2291,8 +2291,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, - int migratetype) +static __always_inline struct page * +__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-13 6:31 ` [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline Aaron Lu @ 2017-10-17 11:32 ` Vlastimil Babka 2017-10-18 1:53 ` Lu, Aaron 0 siblings, 1 reply; 16+ messages in thread From: Vlastimil Babka @ 2017-10-17 11:32 UTC (permalink / raw) To: Aaron Lu, Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On 10/13/2017 08:31 AM, Aaron Lu wrote: > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and > __rmqueue_cma_fallback() are all in page allocator's hot path and > better be finished as soon as possible. One way to make them faster > is by making them inline. But as Andrew Morton and Andi Kleen pointed > out: > https://lkml.org/lkml/2017/10/10/1252 > https://lkml.org/lkml/2017/10/10/1279 > To make sure they are inlined, we should use __always_inline for them. > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy, the results for will-it-scale.processes with > and without the patch are: > > On a 2-sockets Intel-Skylake machine: > > compiler base head > gcc-4.4.7 6496131 6911823 +6.4% > gcc-4.9.4 7225110 7731072 +7.0% > gcc-5.4.1 7054224 7688146 +9.0% > gcc-6.2.0 7059794 7651675 +8.4% > > On a 4-sockets Intel-Skylake machine: > > compiler base head > gcc-4.4.7 13162890 13508193 +2.6% > gcc-4.9.4 14997463 15484353 +3.2% > gcc-5.4.1 14708711 15449805 +5.0% > gcc-6.2.0 14574099 15349204 +5.3% > > The above 4 compilers are used becuase I've done the tests through Intel's > Linux Kernel Performance(LKP) infrastructure and they are the available > compilers there. > > The benefit being less on 4 sockets machine is due to the lock contention > there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe > than on the 2 sockets machine(85%). > > What the benchmark does is: it forks nr_cpu processes and then each > process does the following: > 1 mmap() 128M anonymous space; > 2 writes to each page there to trigger actual page allocation; > 3 munmap() it. > in a loop. > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Are transparent hugepages enabled? If yes, __rmqueue() is called from rmqueue(), and there's only one page fault (and __rmqueue()) per 512 "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() in bursts once pcplists are depleted. I guess it's the latter, otherwise I wouldn't expect a function call to have such visible overhead. I guess what would help much more would be a bulk __rmqueue_smallest() to grab multiple pages from the freelists. But can't argue with your numbers against this patch. > Binary size wise, I have locally built them with different compilers: > > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o > text data bss dec hex filename > 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o > 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o > 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o > 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o > 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o > 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o > 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o > 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o > > Text size increased about 800 bytes for mm/page_alloc.o. BTW, do you know about ./scripts/bloat-o-meter? 
:) With gcc 7.2.1: > ./scripts/bloat-o-meter base.o mm/page_alloc.o add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) function old new delta get_page_from_freelist 2898 4937 +2039 steal_suitable_fallback - 365 +365 find_suitable_fallback 31 120 +89 find_suitable_fallback.part 115 - -115 __rmqueue 1534 - -1534 > [aaron@aaronlu obj]$ size */*/vmlinux > text data bss dec hex filename > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > > Text size for vmlinux has no change though, probably due to function > alignment. Yep that's useless to show. These differences do add up though, until they eventually cross the alignment boundary. Thanks, Vlastimil > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > --- > mm/page_alloc.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e309ce4a44a..0fe3e2095268 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1794,7 +1794,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags > * Go through the free lists for the given migratetype and remove > * the smallest available page from the freelists > */ > -static inline > +static __always_inline > struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > int migratetype) > { > @@ -1838,7 +1838,7 @@ static int fallbacks[MIGRATE_TYPES][4] = { > }; > > #ifdef CONFIG_CMA > -static struct page *__rmqueue_cma_fallback(struct zone *zone, > +static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone, > unsigned int order) > { > return __rmqueue_smallest(zone, order, MIGRATE_CMA); > @@ -2219,7 +2219,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, > * deviation from the rest of this file, to make the for loop > * condition simpler. > */ > -static inline bool > +static __always_inline bool > __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > { > struct free_area *area; > @@ -2291,8 +2291,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > -static struct page *__rmqueue(struct zone *zone, unsigned int order, > - int migratetype) > +static __always_inline struct page * > +__rmqueue(struct zone *zone, unsigned int order, int migratetype) > { > struct page *page; > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-17 11:32 ` Vlastimil Babka @ 2017-10-18 1:53 ` Lu, Aaron 2017-10-18 6:28 ` Vlastimil Babka 0 siblings, 1 reply; 16+ messages in thread From: Lu, Aaron @ 2017-10-18 1:53 UTC (permalink / raw) To: vbabka@suse.cz, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: > On 10/13/2017 08:31 AM, Aaron Lu wrote: > > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and > > __rmqueue_cma_fallback() are all in page allocator's hot path and > > better be finished as soon as possible. One way to make them faster > > is by making them inline. But as Andrew Morton and Andi Kleen pointed > > out: > > https://lkml.org/lkml/2017/10/10/1252 > > https://lkml.org/lkml/2017/10/10/1279 > > To make sure they are inlined, we should use __always_inline for them. > > > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > > processes to stress buddy, the results for will-it-scale.processes with > > and without the patch are: > > > > On a 2-sockets Intel-Skylake machine: > > > > compiler base head > > gcc-4.4.7 6496131 6911823 +6.4% > > gcc-4.9.4 7225110 7731072 +7.0% > > gcc-5.4.1 7054224 7688146 +9.0% > > gcc-6.2.0 7059794 7651675 +8.4% > > > > On a 4-sockets Intel-Skylake machine: > > > > compiler base head > > gcc-4.4.7 13162890 13508193 +2.6% > > gcc-4.9.4 14997463 15484353 +3.2% > > gcc-5.4.1 14708711 15449805 +5.0% > > gcc-6.2.0 14574099 15349204 +5.3% > > > > The above 4 compilers are used becuase I've done the tests through Intel's > > Linux Kernel Performance(LKP) infrastructure and they are the available > > compilers there. > > > > The benefit being less on 4 sockets machine is due to the lock contention > > there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe > > than on the 2 sockets machine(85%). > > > > What the benchmark does is: it forks nr_cpu processes and then each > > process does the following: > > 1 mmap() 128M anonymous space; > > 2 writes to each page there to trigger actual page allocation; > > 3 munmap() it. > > in a loop. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > Are transparent hugepages enabled? If yes, __rmqueue() is called from > rmqueue(), and there's only one page fault (and __rmqueue()) per 512 > "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() > in bursts once pcplists are depleted. I guess it's the latter, otherwise > I wouldn't expect a function call to have such visible overhead. THP is disabled. I should have mentioned this in the changelog, sorry about that. > > I guess what would help much more would be a bulk __rmqueue_smallest() > to grab multiple pages from the freelists. But can't argue with your Do I understand you correctly that you suggest to use a bulk __rmqueue_smallest(), say __rmqueue_smallest_bulk(). With that, instead of looping pcp->batch times in rmqueue_bulk(), a single call to __rmqueue_smallest_bulk() is enough and __rmqueue_smallest_bulk() will loop pcp->batch times to get those pages? Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(), or do I miss something? > numbers against this patch. 
> > > Binary size wise, I have locally built them with different compilers: > > > > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o > > text data bss dec hex filename > > 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o > > 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o > > 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o > > 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o > > 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o > > 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o > > 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o > > 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o > > > > Text size increased about 800 bytes for mm/page_alloc.o. > > BTW, do you know about ./scripts/bloat-o-meter? :) NO!!! Thanks for bringing this up :) > With gcc 7.2.1: > > ./scripts/bloat-o-meter base.o mm/page_alloc.o > > add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) Nice, it clearly showed 844 bytes bloat. > function old new delta > get_page_from_freelist 2898 4937 +2039 > steal_suitable_fallback - 365 +365 > find_suitable_fallback 31 120 +89 > find_suitable_fallback.part 115 - -115 > __rmqueue 1534 - -1534 > > > > [aaron@aaronlu obj]$ size */*/vmlinux > > text data bss dec hex filename > > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > > > > Text size for vmlinux has no change though, probably due to function > > alignment. > > Yep that's useless to show. These differences do add up though, until > they eventually cross the alignment boundary. Agreed. But you know, it is the hot path, the performance improvement might be worth it. ^ permalink raw reply [flat|nested] 16+ messages in thread
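For concreteness, the bulk helper being discussed would presumably look something like the sketch below. This is a purely hypothetical illustration of the idea, reusing the __rmqueue_smallest() signature from the patch; it ignores the fallback migratetypes, per-migratetype accounting and statistics updates that the real rmqueue_bulk() has to handle:

    /* hypothetical: with zone->lock already held, grab up to "count"
     * pages of the given order and queue them on "list" */
    static unsigned int __rmqueue_smallest_bulk(struct zone *zone,
                            unsigned int order, int migratetype,
                            unsigned int count, struct list_head *list)
    {
            unsigned int allocated;

            for (allocated = 0; allocated < count; allocated++) {
                    struct page *page;

                    page = __rmqueue_smallest(zone, order, migratetype);
                    if (!page)
                            break;
                    list_add_tail(&page->lru, list);
            }
            return allocated;
    }

As Aaron notes above, once __rmqueue() is inlined this is essentially what the existing rmqueue_bulk() loop already compiles down to, which is why the idea does not promise much additional gain.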
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-18 1:53 ` Lu, Aaron @ 2017-10-18 6:28 ` Vlastimil Babka 2017-10-18 8:57 ` Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Vlastimil Babka @ 2017-10-18 6:28 UTC (permalink / raw) To: Lu, Aaron, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On 10/18/2017 03:53 AM, Lu, Aaron wrote: > On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: >> >> Are transparent hugepages enabled? If yes, __rmqueue() is called from >> rmqueue(), and there's only one page fault (and __rmqueue()) per 512 >> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() >> in bursts once pcplists are depleted. I guess it's the latter, otherwise >> I wouldn't expect a function call to have such visible overhead. > > THP is disabled. I should have mentioned this in the changelog, sorry > about that. OK, then it makes sense! >> >> I guess what would help much more would be a bulk __rmqueue_smallest() >> to grab multiple pages from the freelists. But can't argue with your > > Do I understand you correctly that you suggest to use a bulk > __rmqueue_smallest(), say __rmqueue_smallest_bulk(). With that, instead > of looping pcp->batch times in rmqueue_bulk(), a single call to > __rmqueue_smallest_bulk() is enough and __rmqueue_smallest_bulk() will > loop pcp->batch times to get those pages? Yeah, but I looked at it more closely, and maybe there's not much to gain after all. E.g., there seem to be no atomic counter updates that would benefit from batching, or expensive setup/cleanup in __rmqueue_smallest(). > Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(), > or do I miss something? Right, looks like thanks to inlining, the compiler can already achieve most of the potential gains. >> With gcc 7.2.1: >>> ./scripts/bloat-o-meter base.o mm/page_alloc.o >> >> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) > > Nice, it clearly showed 844 bytes bloat. > >> function old new delta >> get_page_from_freelist 2898 4937 +2039 >> steal_suitable_fallback - 365 +365 >> find_suitable_fallback 31 120 +89 >> find_suitable_fallback.part 115 - -115 >> __rmqueue 1534 - -1534 It also shows that steal_suitable_fallback() is no longer inlined. Which is fine, because that should ideally be rarely executed. >> >>> [aaron@aaronlu obj]$ size */*/vmlinux >>> text data bss dec hex filename >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux >>> >>> Text size for vmlinux has no change though, probably due to function >>> alignment. >> >> Yep that's useless to show. These differences do add up though, until >> they eventually cross the alignment boundary. > > Agreed. > But you know, it is the hot path, the performance improvement might be > worth it. 
I'd agree, so you can add Acked-by: Vlastimil Babka <vbabka@suse.cz> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-18 6:28 ` Vlastimil Babka @ 2017-10-18 8:57 ` Aaron Lu 0 siblings, 0 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-18 8:57 UTC (permalink / raw) To: Vlastimil Babka Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On Wed, Oct 18, 2017 at 08:28:56AM +0200, Vlastimil Babka wrote: > On 10/18/2017 03:53 AM, Lu, Aaron wrote: > > On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: > >> With gcc 7.2.1: > >>> ./scripts/bloat-o-meter base.o mm/page_alloc.o > >> > >> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) > > > > Nice, it clearly showed 844 bytes bloat. > > > >> function old new delta > >> get_page_from_freelist 2898 4937 +2039 > >> steal_suitable_fallback - 365 +365 > >> find_suitable_fallback 31 120 +89 > >> find_suitable_fallback.part 115 - -115 > >> __rmqueue 1534 - -1534 > > It also shows that steal_suitable_fallback() is no longer inlined. Which > is fine, because that should ideally be rarely executed. Ah right, so this script is really good for analysing inline changes. > > >> > >>> [aaron@aaronlu obj]$ size */*/vmlinux > >>> text data bss dec hex filename > >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > >>> > >>> Text size for vmlinux has no change though, probably due to function > >>> alignment. > >> > >> Yep that's useless to show. These differences do add up though, until > >> they eventually cross the alignment boundary. > > > > Agreed. > > But you know, it is the hot path, the performance improvement might be > > worth it. > > I'd agree, so you can add > > Acked-by: Vlastimil Babka <vbabka@suse.cz> Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread