Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Christoph Lameter <cl@linux.com>,
	akpm@linuxfoundation.org, Steven Rostedt <rostedt@goodmis.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Pekka Enberg <penberg@kernel.org>,
	brouer@redhat.com
Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
Date: Wed, 17 Dec 2014 13:08:41 +0100	[thread overview]
Message-ID: <20141217130841.100dac71@redhat.com> (raw)
In-Reply-To: <CAAmzW4NCpx5aJyW36fgOfu3EaDj6=uv6MUiBC+a0ggePWPXndQ@mail.gmail.com>

On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@gmail.com> wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
> 
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with below condition loop.
> 
> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
> 
> 14.5 ns -> 13.8 ns

Hi Kim,

I've tested you patch.  Full report below patch.

Summary, I'm seeing 18.599 ns -> 17.523 ns (-1.076ns better).

For network overload tests:

Dropping packets in iptables raw, which is hitting the slub fast-path.
Here I'm seeing an improvement of 3ns.

For IP-forward, which is also invoking the slub slower path, I'm seeing
an improvement of 6ns (I were not expecting to see any improvement
here, the kmem_cache_alloc code is 24bytes smaller, so perhaps it's
saving some icache).

Full report below patch...
 
> See following patch.
> 
> Thanks.
> 
> ----------->8-------------
> diff --git a/mm/slub.c b/mm/slub.c
> index 95d2142..e537af5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2399,8 +2399,10 @@ redo:
>          * on a different processor between the determination of the pointer
>          * and the retrieval of the tid.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         /*
>          * The transaction ids are globally unique per cpu and per operation on
> @@ -2408,8 +2410,6 @@ redo:
>          * occurs on the right processor and that there was no operation on the
>          * linked list in between.
>          */
> -       tid = c->tid;
> -       preempt_enable();
> 
>         object = c->freelist;
>         page = c->page;
> @@ -2655,11 +2655,10 @@ redo:
>          * data is retrieved via this pointer. If we are on the same cpu
>          * during the cmpxchg then the free will succedd.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> -
> -       tid = c->tid;
> -       preempt_enable();
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         if (likely(page == c->page)) {
>                 set_freepointer(s, object, c->freelist);

SLUB evaluation 03
==================

Testing patch from Joonsoo Kim <iamjoonsoo.kim@lge.com> slub fast-path
preempt_{disable,enable} avoidance.

Kernel
======
Compiler: GCC 4.9.1

Kernel config ::

 $ grep PREEMPT .config
 CONFIG_PREEMPT_RCU=y
 CONFIG_PREEMPT_NOTIFIERS=y
 # CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
 CONFIG_PREEMPT=y
 CONFIG_PREEMPT_COUNT=y
 # CONFIG_DEBUG_PREEMPT is not set

 $ egrep -e "SLUB|SLAB" .config
 # CONFIG_SLUB_DEBUG is not set
 # CONFIG_SLAB is not set
 CONFIG_SLUB=y
 # CONFIG_SLUB_CPU_PARTIAL is not set
 # CONFIG_SLUB_STATS is not set

On top of::

 commit f96fe225677b3efb74346ebd56fafe3997b02afa
 Merge: 5543798 eea3e8f
 Author: Linus Torvalds <torvalds@linux-foundation.org>
 Date:   Fri Dec 12 16:11:12 2014 -0800

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net


Setup
=====

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4  # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30
sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5


# sudo tuned-adm active
Current active profile: latency-performance

Drop in raw
-----------
alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Generator
---------
./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64


Patch by Joonsoo Kim to avoid preempt in slub
=============================================

baseline: without patch
-----------------------

baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567

Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
 - (measurement period time:1.859917529 sec time_interval:1859917529)
 - (invoke count:100000000 tsc_interval:4649791431)

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
 - (measurement period time:1.025993290 sec time_interval:1025993290)
 - (invoke count:25600000 tsc_interval:2564981743)

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
    (instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
 * Drop in RAW (slab fast-path test)
   - instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
     (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)

Christoph's slab_test, baseline kernel (at commit f96fe22567)::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
ffffffff81162cb0 000000000000013b T kmem_cache_free


with patch
----------

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
    (instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
 * compare against baseline:
  - 1174222-1165928 = +8294pps
  - (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns

 * Drop in RAW (slab fast-path test)
  - instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
 * compare against baseline:
  - 3277737-3245325 = +32412 pps
  - (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns

SLUB fast-path test: time_bench_kmem_cache1
 * modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c

Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
 - (measurement period time:1.752338378 sec time_interval:1752338378)
 - (invoke count:100000000 tsc_interval:4380843588)
  * difference: 17.523 - 18.599 = -1.076ns

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
 - (measurement period time:1.033447112 sec time_interval:1033447112)
 - (invoke count:25600000 tsc_interval:2583616203)
    * difference: 40.369 - 40.077 = +0.292ns


Christoph's slab_test::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
ffffffff81162cb0 0000000000000133 T kmem_cache_free



Kernel size change
------------------

 $ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
 add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
 function                                     old     new   delta
 kmem_cache_free                              315     307      -8
 kmem_cache_alloc_node                        268     248     -20
 kmem_cache_alloc                             225     201     -24
 kfree                                        274     250     -24
 __kmalloc_node_track_caller                  356     324     -32
 __kmalloc_node                               340     308     -32
 __kmalloc                                    324     273     -51
 __kmalloc_track_caller                       343     286     -57


Qmempool notes:
---------------

On baseline kernel:

Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
 - (measurement period time:0.398628965 sec time_interval:398628965)
 - (invoke count:30000000 tsc_interval:996571541)

Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
 - (measurement period time:0.575425927 sec time_interval:575425927)
 - (invoke count:30000000 tsc_interval:1438563781)

qmempool_bench: N-pattern with 256 elements

Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
 - (measurement period time:0.638871008 sec time_interval:638871008)
 - (invoke count:25600000 tsc_interval:1597176303)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Christoph Lameter <cl@linux.com>,
	akpm@linuxfoundation.org, Steven Rostedt <rostedt@goodmis.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Pekka Enberg <penberg@kernel.org>,
	brouer@redhat.com
Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
Date: Wed, 17 Dec 2014 13:08:41 +0100	[thread overview]
Message-ID: <20141217130841.100dac71@redhat.com> (raw)
In-Reply-To: <CAAmzW4NCpx5aJyW36fgOfu3EaDj6=uv6MUiBC+a0ggePWPXndQ@mail.gmail.com>

On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@gmail.com> wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
> 
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with below condition loop.
> 
> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
> 
> 14.5 ns -> 13.8 ns

Hi Kim,

I've tested you patch.  Full report below patch.

Summary, I'm seeing 18.599 ns -> 17.523 ns (-1.076ns better).

For network overload tests:

Dropping packets in iptables raw, which is hitting the slub fast-path.
Here I'm seeing an improvement of 3ns.

For IP-forward, which is also invoking the slub slower path, I'm seeing
an improvement of 6ns (I were not expecting to see any improvement
here, the kmem_cache_alloc code is 24bytes smaller, so perhaps it's
saving some icache).

Full report below patch...
 
> See following patch.
> 
> Thanks.
> 
> ----------->8-------------
> diff --git a/mm/slub.c b/mm/slub.c
> index 95d2142..e537af5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2399,8 +2399,10 @@ redo:
>          * on a different processor between the determination of the pointer
>          * and the retrieval of the tid.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         /*
>          * The transaction ids are globally unique per cpu and per operation on
> @@ -2408,8 +2410,6 @@ redo:
>          * occurs on the right processor and that there was no operation on the
>          * linked list in between.
>          */
> -       tid = c->tid;
> -       preempt_enable();
> 
>         object = c->freelist;
>         page = c->page;
> @@ -2655,11 +2655,10 @@ redo:
>          * data is retrieved via this pointer. If we are on the same cpu
>          * during the cmpxchg then the free will succedd.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> -
> -       tid = c->tid;
> -       preempt_enable();
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         if (likely(page == c->page)) {
>                 set_freepointer(s, object, c->freelist);

SLUB evaluation 03
==================

Testing patch from Joonsoo Kim <iamjoonsoo.kim@lge.com> slub fast-path
preempt_{disable,enable} avoidance.

Kernel
======
Compiler: GCC 4.9.1

Kernel config ::

 $ grep PREEMPT .config
 CONFIG_PREEMPT_RCU=y
 CONFIG_PREEMPT_NOTIFIERS=y
 # CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
 CONFIG_PREEMPT=y
 CONFIG_PREEMPT_COUNT=y
 # CONFIG_DEBUG_PREEMPT is not set

 $ egrep -e "SLUB|SLAB" .config
 # CONFIG_SLUB_DEBUG is not set
 # CONFIG_SLAB is not set
 CONFIG_SLUB=y
 # CONFIG_SLUB_CPU_PARTIAL is not set
 # CONFIG_SLUB_STATS is not set

On top of::

 commit f96fe225677b3efb74346ebd56fafe3997b02afa
 Merge: 5543798 eea3e8f
 Author: Linus Torvalds <torvalds@linux-foundation.org>
 Date:   Fri Dec 12 16:11:12 2014 -0800

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net


Setup
=====

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4  # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30
sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5


# sudo tuned-adm active
Current active profile: latency-performance

Drop in raw
-----------
alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Generator
---------
./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64


Patch by Joonsoo Kim to avoid preempt in slub
=============================================

baseline: without patch
-----------------------

baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567

Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
 - (measurement period time:1.859917529 sec time_interval:1859917529)
 - (invoke count:100000000 tsc_interval:4649791431)

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
 - (measurement period time:1.025993290 sec time_interval:1025993290)
 - (invoke count:25600000 tsc_interval:2564981743)

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
    (instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
 * Drop in RAW (slab fast-path test)
   - instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
     (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)

Christoph's slab_test, baseline kernel (at commit f96fe22567)::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
ffffffff81162cb0 000000000000013b T kmem_cache_free


with patch
----------

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
    (instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
 * compare against baseline:
  - 1174222-1165928 = +8294pps
  - (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns

 * Drop in RAW (slab fast-path test)
  - instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
 * compare against baseline:
  - 3277737-3245325 = +32412 pps
  - (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns

SLUB fast-path test: time_bench_kmem_cache1
 * modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c

Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
 - (measurement period time:1.752338378 sec time_interval:1752338378)
 - (invoke count:100000000 tsc_interval:4380843588)
  * difference: 17.523 - 18.599 = -1.076ns

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
 - (measurement period time:1.033447112 sec time_interval:1033447112)
 - (invoke count:25600000 tsc_interval:2583616203)
    * difference: 40.369 - 40.077 = +0.292ns


Christoph's slab_test::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
ffffffff81162cb0 0000000000000133 T kmem_cache_free



Kernel size change
------------------

 $ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
 add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
 function                                     old     new   delta
 kmem_cache_free                              315     307      -8
 kmem_cache_alloc_node                        268     248     -20
 kmem_cache_alloc                             225     201     -24
 kfree                                        274     250     -24
 __kmalloc_node_track_caller                  356     324     -32
 __kmalloc_node                               340     308     -32
 __kmalloc                                    324     273     -51
 __kmalloc_track_caller                       343     286     -57


Qmempool notes:
---------------

On baseline kernel:

Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
 - (measurement period time:0.398628965 sec time_interval:398628965)
 - (invoke count:30000000 tsc_interval:996571541)

Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
 - (measurement period time:0.575425927 sec time_interval:575425927)
 - (invoke count:30000000 tsc_interval:1438563781)

qmempool_bench: N-pattern with 256 elements

Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
 - (measurement period time:0.638871008 sec time_interval:638871008)
 - (invoke count:25600000 tsc_interval:1597176303)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

next prev parent reply	other threads:[~2014-12-17 12:08 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
2014-12-10 16:30 ` Christoph Lameter
2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 16:39   ` Pekka Enberg
2014-12-10 16:39     ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 16:45   ` Pekka Enberg
2014-12-10 16:45     ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 16:54   ` Pekka Enberg
2014-12-10 16:54     ` Pekka Enberg
2014-12-10 17:08     ` Christoph Lameter
2014-12-10 17:08       ` Christoph Lameter
2014-12-10 17:32       ` Pekka Enberg
2014-12-10 17:32         ` Pekka Enberg
2014-12-10 17:37         ` Christoph Lameter
2014-12-10 17:37           ` Christoph Lameter
2014-12-11 13:19           ` Jesper Dangaard Brouer
2014-12-11 13:19             ` Jesper Dangaard Brouer
2014-12-11 15:01             ` Christoph Lameter
2014-12-11 15:01               ` Christoph Lameter
2014-12-15  8:03   ` Joonsoo Kim
2014-12-15  8:03     ` Joonsoo Kim
2014-12-15 14:16     ` Christoph Lameter
2014-12-15 14:16       ` Christoph Lameter
2014-12-16  2:42       ` Joonsoo Kim
2014-12-16  2:42         ` Joonsoo Kim
2014-12-16  7:54         ` Andrey Ryabinin
2014-12-16  7:54           ` Andrey Ryabinin
2014-12-16  8:25           ` Joonsoo Kim
2014-12-16  8:25             ` Joonsoo Kim
2014-12-16 14:53             ` Christoph Lameter
2014-12-16 14:53               ` Christoph Lameter
2014-12-16 15:15               ` Jesper Dangaard Brouer
2014-12-16 15:15                 ` Jesper Dangaard Brouer
2014-12-16 15:34                 ` Andrey Ryabinin
2014-12-16 15:34                   ` Andrey Ryabinin
2014-12-16 15:48                 ` Christoph Lameter
2014-12-16 15:48                   ` Christoph Lameter
2014-12-17  7:15                   ` Joonsoo Kim
2014-12-17  7:15                     ` Joonsoo Kim
2014-12-16 15:33               ` Andrey Ryabinin
2014-12-16 15:33                 ` Andrey Ryabinin
2014-12-16 14:05           ` Jesper Dangaard Brouer
2014-12-16 14:05             ` Jesper Dangaard Brouer
2014-12-10 16:30 ` [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 16:56   ` Pekka Enberg
2014-12-10 16:56     ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 16:59   ` Pekka Enberg
2014-12-10 16:59     ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-10 17:29   ` Pekka Enberg
2014-12-10 17:29     ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 7/7] slub: Remove preemption disable/enable from fastpath Christoph Lameter
2014-12-10 16:30   ` Christoph Lameter
2014-12-11 13:35 ` [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Jesper Dangaard Brouer
2014-12-11 13:35   ` Jesper Dangaard Brouer
2014-12-11 15:03   ` Christoph Lameter
2014-12-11 15:03     ` Christoph Lameter
2014-12-11 16:50     ` Jesper Dangaard Brouer
2014-12-11 16:50       ` Jesper Dangaard Brouer
2014-12-11 17:18       ` Christoph Lameter
2014-12-11 17:18         ` Christoph Lameter
2014-12-11 18:11         ` Jesper Dangaard Brouer
2014-12-11 18:11           ` Jesper Dangaard Brouer
2014-12-11 17:37 ` Jesper Dangaard Brouer
2014-12-11 17:37   ` Jesper Dangaard Brouer
2014-12-12 10:39   ` Jesper Dangaard Brouer
2014-12-12 10:39     ` Jesper Dangaard Brouer
2014-12-12 18:31     ` Christoph Lameter
2014-12-12 18:31       ` Christoph Lameter
2014-12-15  7:59 ` Joonsoo Kim
2014-12-15  7:59   ` Joonsoo Kim
2014-12-17  7:13   ` Joonsoo Kim
2014-12-17  7:13     ` Joonsoo Kim
2014-12-17 12:08     ` Jesper Dangaard Brouer [this message]
2014-12-17 12:08       ` Jesper Dangaard Brouer
2014-12-18 14:34       ` Joonsoo Kim
2014-12-18 14:34         ` Joonsoo Kim
2014-12-17 15:36     ` Christoph Lameter
2014-12-17 15:36       ` Christoph Lameter
2014-12-18 14:38       ` Joonsoo Kim
2014-12-18 14:38         ` Joonsoo Kim
2014-12-18 14:57         ` Christoph Lameter
2014-12-18 14:57           ` Christoph Lameter
2014-12-18 15:08           ` Joonsoo Kim
2014-12-18 15:08             ` Joonsoo Kim
2014-12-17 16:10     ` Christoph Lameter
2014-12-17 16:10       ` Christoph Lameter
2014-12-17 19:44       ` Christoph Lameter
2014-12-17 19:44         ` Christoph Lameter
2014-12-18 14:41       ` Joonsoo Kim
2014-12-18 14:41         ` Joonsoo Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141217130841.100dac71@redhat.com \
    --to=brouer@redhat.com \
    --cc=akpm@linuxfoundation.org \
    --cc=cl@linux.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=js1304@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=penberg@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.