Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Hillf Danton <hillf.zj@alibaba-inc.com>,
	brouer@redhat.com
Subject: Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
Date: Wed, 11 Jan 2017 14:27:12 +0100
Message-ID: <20170111142712.5fd8bea8@redhat.com>
In-Reply-To: <20170111134420.368efb9e@redhat.com>

On Wed, 11 Jan 2017 13:44:20 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon,  9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>  
> > The following is results from a page allocator micro-benchmark. Only
> > order-0 is interesting as higher orders do not use the per-cpu allocator  
> 
> Micro-benchmarked with [1] page_bench02:
>  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>   rmmod page_bench02 ; dmesg --notime | tail -n 4
> 
> Compared to baseline: 213 cycles(tsc) 53.417 ns
>  - against this     : 184 cycles(tsc) 46.056 ns
>  - Saving           : -29 cycles
>  - Very close to expected 27 cycles saving [see below [2]]
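As a quick cross-check of the quoted numbers, the measured and expected savings can be put side by side. This is only a sketch: the constant names are made up for illustration, and the values are the cycle counts reported in this thread.

```python
# Costs in cycles, taken from the time_bench output quoted in this thread.
LOCAL_IRQ_SAVE_RESTORE = 38   # removed by the patch
PREEMPT_DISABLE_ENABLE = 11   # added by the patch

expected = LOCAL_IRQ_SAVE_RESTORE - PREEMPT_DISABLE_ENABLE
measured = 213 - 184          # baseline minus patched, page_bench02 order-0

print(f"expected {expected} cycles, measured {measured} cycles, "
      f"difference {measured - expected}")
```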

When profiling with perf, I noticed that the summed "children" overhead
under alloc_pages_current() is 65.05%, compared to a summed 28.28% for
the free path, i.e. the calls under __free_pages().

This is caused by CONFIG_NUMA=y, as the call path is longer with NUMA
(and other helpers are also non-inlined calls):

 alloc_pages
  -> alloc_pages_current
      -> __alloc_pages_nodemask
          -> get_page_from_freelist

Without NUMA the call levels get compacted by inlining to:

 __alloc_pages_nodemask
  -> get_page_from_freelist

After disabling NUMA, the split between the alloc (48.80%) and free
(42.67%) sides is more balanced.

Saving from disabling CONFIG_NUMA:
 - CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
 - CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
 - Saving        :  41 cycles (approx 22%)
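
The percentage can be re-derived from the two cycle counts. A minimal sketch; the helper name `saving` is hypothetical and not part of any benchmark module:

```python
def saving(before_cycles, after_cycles):
    """Return (cycles saved, saving as a percentage of the 'before' cost)."""
    saved = before_cycles - after_cycles
    return saved, 100.0 * saved / before_cycles

# Cycle counts reported for page_bench02, order-0, with and without NUMA.
numa_on, numa_off = 184, 143
cycles, percent = saving(numa_on, numa_off)
print(f"CONFIG_NUMA=n saves {cycles} cycles ({percent:.1f}%)")
```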

I would conclude that there is room for improvement in the CONFIG_NUMA
code path. Let's follow up on that in a later patch series...


> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>  
> 
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [2] Expected saving comes from Mel removing a local_irq_{save,restore}
> and adding a preempt_{disable,enable} instead.
> 
> Micro benchmarking via time_bench_sample[3], we get the cost of these
> operations:
> 
>  time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
>  time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
>  time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
>  time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
>  time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
>  time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
>  time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
>  [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
>  time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
>  [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
>  time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
>  time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
>  time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
> 
> Thus, expected improvement is: 38-11 = 27 cycles.
> 
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> 
> CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
> 
> Config options of interest:
>  CONFIG_NUMA=y
>  CONFIG_DEBUG_LIST=n
>  CONFIG_VM_EVENT_COUNTERS=y
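
One sanity check on the cycles(tsc) and ns columns: dividing cycles by nanoseconds should give roughly the clock rate of the CPU quoted above (4.00GHz). A small sketch, assuming the TSC ticks at the nominal frequency; the function name `implied_ghz` is made up for illustration:

```python
def implied_ghz(cycles, nanoseconds):
    """Clock rate implied by one measurement: cycles per nanosecond equals GHz."""
    return cycles / nanoseconds

# (cycles, ns) pairs as reported by page_bench02 in this thread.
measurements = {
    "baseline": (213, 53.417),
    "patched, CONFIG_NUMA=y": (184, 46.056),
    "patched, CONFIG_NUMA=n": (143, 35.913),
}
for name, (cycles, ns) in measurements.items():
    print(f"{name}: {implied_ghz(cycles, ns):.2f} GHz")
```

Each result comes out within a few hundredths of 4 GHz; the cycle counts are printed as integers, which accounts for the small spread.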



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
