From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Hillf Danton <hillf.zj@alibaba-inc.com>,
	brouer@redhat.com
Subject: Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
Date: Wed, 11 Jan 2017 14:27:12 +0100	[thread overview]
Message-ID: <20170111142712.5fd8bea8@redhat.com> (raw)
In-Reply-To: <20170111134420.368efb9e@redhat.com>

On Wed, 11 Jan 2017 13:44:20 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon,  9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>  
> > The following is results from a page allocator micro-benchmark. Only
> > order-0 is interesting as higher orders do not use the per-cpu allocator  
> 
> Micro-benchmarked with [1] page_bench02:
>  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>   rmmod page_bench02 ; dmesg --notime | tail -n 4
> 
> Compared to baseline: 213 cycles(tsc) 53.417 ns
>  - against this     : 184 cycles(tsc) 46.056 ns
>  - Saving           : -29 cycles
>  - Very close to expected 27 cycles saving [see below [2]]
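(For context on what is being timed: page_bench02's inner loop is
essentially an alloc_pages()/__free_pages() pair per iteration. A
paraphrased sketch, based on the module in [1], not the verbatim code:

 struct page *page;
 int i;

 for (i = 0; i < loops; i++) {
 	page = alloc_pages(GFP_KERNEL, page_order);
 	if (unlikely(!page))
 		break;
 	__free_pages(page, page_order);
 }

The "cycles(tsc)" and "ns" numbers are the measured per-iteration cost
of that alloc+free pair.)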

While perf benchmarking, I noticed that the summed "children" perf
overhead of calls under alloc_pages_current() is 65.05%, compared to a
summed 28.28% for the free path under __free_pages().

This is caused by CONFIG_NUMA=y, as the call path is longer with NUMA
(and several helpers also end up as non-inlined calls):

 alloc_pages
  -> alloc_pages_current
      -> __alloc_pages_nodemask
          -> get_page_from_freelist

Without NUMA, the call levels get compacted by inlining to:

 __alloc_pages_nodemask
  -> get_page_from_freelist
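
For reference, this is roughly how the entry point differs; a
paraphrased sketch of include/linux/gfp.h around this kernel version,
illustrative rather than exact:

 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

 static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
 	/* Out-of-line hop through the mempolicy code */
 	return alloc_pages_current(gfp_mask, order);
 }
 #else
 /* !NUMA: reaches __alloc_pages_nodemask() through inlines only */
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #endif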

After disabling NUMA, the split between the alloc (48.80%) and free
(42.67%) sides is more balanced.

Saving from disabling CONFIG_NUMA:
 - CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
 - CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
 - Saving        :  41 cycles (approx 22%)

I would conclude that there is room for improvement in the CONFIG_NUMA
code path. Let's follow up on that in a later patch series...


> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>  
> 
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [2] Expected saving comes from Mel removing a local_irq_{save,restore}
> and adding a preempt_{disable,enable} instead.
> 
> Micro benchmarking via time_bench_sample[3], we get the cost of these
> operations:
> 
>  time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
>  time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
>  time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
>  time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
>  time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
>  time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
>  time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
>  [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
>  time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
>  [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
>  time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
>  time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
>  time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
> 
> Thus, expected improvement is: 38-11 = 27 cycles.
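
To make that concrete: the structural change in the order-0 fast path
is the pattern below (simplified sketch of the idea, not Mel's exact
diff):

 /* Before: per-cpu lists usable from any context */
 local_irq_save(flags);
 /* ... grab page from this CPU's pcp free list ... */
 local_irq_restore(flags);

 /* After: per-cpu lists only serve irq-safe requests */
 preempt_disable();
 /* ... grab page from this CPU's pcp free list ... */
 preempt_enable();

Per the numbers above, that trades a 38 cycle local_irq_save/restore
pair for an 11 cycle preempt_disable/enable pair, hence the expected
27 cycle win.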
> 
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> 
> CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
> 
> Config options of interest:
>  CONFIG_NUMA=y
>  CONFIG_DEBUG_LIST=n
>  CONFIG_VM_EVENT_COUNTERS=y



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Thread overview: 24+ messages
2017-01-09 16:35 [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7 Mel Gorman
2017-01-09 16:35 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-11 12:31   ` Jesper Dangaard Brouer
2017-01-12  3:09   ` Hillf Danton
2017-01-09 16:35 ` [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask Mel Gorman
2017-01-11 12:32   ` Jesper Dangaard Brouer
2017-01-12  3:11   ` Hillf Danton
2017-01-09 16:35 ` [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests Mel Gorman
2017-01-11 12:44   ` Jesper Dangaard Brouer
2017-01-11 13:27     ` Jesper Dangaard Brouer [this message]
2017-01-12 10:47       ` Mel Gorman
2017-01-09 16:35 ` [PATCH 4/4] mm, page_alloc: Add a bulk page allocator Mel Gorman
2017-01-10  4:00   ` Hillf Danton
2017-01-10  8:34     ` Mel Gorman
2017-01-16 14:25   ` Jesper Dangaard Brouer
2017-01-16 15:01     ` Mel Gorman
2017-01-29  4:00 ` [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7 Andy Lutomirski
  -- strict thread matches above, loose matches on Subject: below --
2017-01-04 11:10 [RFC PATCH 0/4] Fast noirq bulk page allocator Mel Gorman
2017-01-04 11:10 ` [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests Mel Gorman
2017-01-04 14:20   ` Jesper Dangaard Brouer
2017-01-06  3:26   ` Hillf Danton
2017-01-06 10:15     ` Mel Gorman
2017-01-09  3:14       ` Hillf Danton
2017-01-09  9:48         ` Mel Gorman
2017-01-09  9:55           ` Hillf Danton
