* Created benchmarks modules for page_pool
@ 2020-01-21 16:09 Jesper Dangaard Brouer
  2020-01-22 10:42 ` Ilias Apalodimas
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-21 16:09 UTC
To: Ilias Apalodimas, Lorenzo Bianconi
Cc: brouer, Saeed Mahameed, Matteo Croce, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org

Hi Ilias and Lorenzo, (Cc others + netdev)

I've created two benchmark modules for page_pool.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

I think we/you could actually use this as part of your presentation[3]?

The first benchmark[1] illustrates/measures what happens when page_pool
alloc and free/return happen on the same CPU. Here there are three modes
of operation with different performance characteristics.

Fast_path NAPI recycle (XDP_DROP use-case)
 - cost per elem: 15 cycles(tsc) 4.437 ns

Recycle via ptr_ring
 - cost per elem: 48 cycles(tsc) 13.439 ns

Failed recycle, return to page-allocator
 - cost per elem: 256 cycles(tsc) 71.169 ns

The second benchmark[2] measures what happens cross-CPU. It is
primarily the concurrent return path that I want to capture, as this
is page_pool's weak spot, whose performance we/I need to improve.
Hint: when SKBs use page_pool return, this will happen more often.
It is a little trickier to get a proper measurement, as we want to
observe the case where the return path isn't stalling/waiting on pages
to return.

 - 1 CPU returning,   cost per elem:  110 cycles(tsc)  30.709 ns
 - 2 concurrent CPUs, cost per elem:  989 cycles(tsc) 274.861 ns
 - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
 - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns

[3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

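Editorial sketch: each result above reports both TSC cycles and nanoseconds, so the two columns can be cross-checked against each other. The userspace C fragment below is a minimal illustration, not part of the benchmark modules. It copies the three same-CPU figures, derives the implied TSC frequency (cycles divided by ns), and expresses each mode's cost relative to the fast path. The names `bench_result`, `same_cpu`, `implied_ghz` and `relative_cost` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* One reported line: "cost per elem: N cycles(tsc) M ns". */
struct bench_result {
	const char *name;
	unsigned int cycles;
	double ns;
};

/* Figures copied from the same-CPU benchmark results in this message. */
const struct bench_result same_cpu[] = {
	{ "fast-path NAPI recycle",     15,  4.437 },
	{ "recycle via ptr_ring",       48, 13.439 },
	{ "return to page-allocator",  256, 71.169 },
};
#define NR_SAME_CPU (sizeof(same_cpu) / sizeof(same_cpu[0]))

/* Implied TSC frequency in GHz, i.e. cycles per nanosecond. */
double implied_ghz(const struct bench_result *r)
{
	assert(r->ns > 0.0);
	return (double)r->cycles / r->ns;
}

/* Cost of r relative to base, computed from the ns column. */
double relative_cost(const struct bench_result *r,
		     const struct bench_result *base)
{
	assert(base->ns > 0.0);
	return r->ns / base->ns;
}

/* Print one summary line per mode. */
void print_same_cpu_summary(void)
{
	for (unsigned int i = 0; i < NR_SAME_CPU; i++)
		printf("%-26s %4u cycles %8.3f ns  ~%.2f GHz  %.1fx fast path\n",
		       same_cpu[i].name, same_cpu[i].cycles, same_cpu[i].ns,
		       implied_ghz(&same_cpu[i]),
		       relative_cost(&same_cpu[i], &same_cpu[0]));
}
```

Calling `print_same_cpu_summary()` from a `main()` prints the summary. The larger measurements imply a TSC of roughly 3.6 GHz, and a failed recycle costs about 16 times the fast path. The ratio is less stable for the smallest cycle counts, since cycles are reported as integers.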
* Re: Created benchmarks modules for page_pool
  2020-01-21 16:09 Created benchmarks modules for page_pool Jesper Dangaard Brouer
@ 2020-01-22 10:42 ` Ilias Apalodimas
  2020-01-22 12:09   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-22 10:42 UTC
To: Jesper Dangaard Brouer
Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org

Hi Jesper,

On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> Hi Ilias and Lorenzo, (Cc others + netdev)
>
> I've created two benchmark modules for page_pool.
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>
> I think we/you could actually use this as part of your presentation[3]?

I think we can mention this as part of the improvements we can offer,
alongside native SKB recycling.

>
> The first benchmark[1] illustrates/measures what happens when page_pool
> alloc and free/return happen on the same CPU. Here there are three modes
> of operation with different performance characteristics.
>
> Fast_path NAPI recycle (XDP_DROP use-case)
>  - cost per elem: 15 cycles(tsc) 4.437 ns
>
> Recycle via ptr_ring
>  - cost per elem: 48 cycles(tsc) 13.439 ns
>
> Failed recycle, return to page-allocator
>  - cost per elem: 256 cycles(tsc) 71.169 ns
>
>
> The second benchmark[2] measures what happens cross-CPU. It is
> primarily the concurrent return path that I want to capture, as this
> is page_pool's weak spot, whose performance we/I need to improve.
> Hint: when SKBs use page_pool return, this will happen more often.
> It is a little trickier to get a proper measurement, as we want to
> observe the case where the return path isn't stalling/waiting on pages
> to return.
>
>  - 1 CPU returning,   cost per elem:  110 cycles(tsc)  30.709 ns
>  - 2 concurrent CPUs, cost per elem:  989 cycles(tsc) 274.861 ns
>  - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
>  - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns

Interesting, I'll try to have a look at the code and maybe run them on
my armv8 board.

Thanks!
/Ilias

>
> [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

* Re: Created benchmarks modules for page_pool
  2020-01-22 10:42 ` Ilias Apalodimas
@ 2020-01-22 12:09   ` Jesper Dangaard Brouer
  2020-01-28 16:22     ` Matteo Croce
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-22 12:09 UTC
To: Ilias Apalodimas
Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org,
    brouer

On Wed, 22 Jan 2020 12:42:05 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> Hi Jesper,
>
> On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> > Hi Ilias and Lorenzo, (Cc others + netdev)
> >
> > I've created two benchmark modules for page_pool.
> >
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> >
> > I think we/you could actually use this as part of your presentation[3]?
>
> I think we can mention this as part of the improvements we can offer,
> alongside native SKB recycling.

Yes, but you should notice that the cross-CPU return benchmark test
shows that we/page_pool is too slow...

> >
> > The first benchmark[1] illustrates/measures what happens when page_pool
> > alloc and free/return happen on the same CPU. Here there are three
> > modes of operation with different performance characteristics.
> >
> > Fast_path NAPI recycle (XDP_DROP use-case)
> >  - cost per elem: 15 cycles(tsc) 4.437 ns
> >
> > Recycle via ptr_ring
> >  - cost per elem: 48 cycles(tsc) 13.439 ns
> >
> > Failed recycle, return to page-allocator
> >  - cost per elem: 256 cycles(tsc) 71.169 ns
> >
> >
> > The second benchmark[2] measures what happens cross-CPU. It is
> > primarily the concurrent return path that I want to capture, as this
> > is page_pool's weak spot, whose performance we/I need to improve.
> > Hint: when SKBs use page_pool return, this will happen more often.
> > It is a little trickier to get a proper measurement, as we want to
> > observe the case where the return path isn't stalling/waiting on pages
> > to return.
> >
> >  - 1 CPU returning,   cost per elem:  110 cycles(tsc)  30.709 ns
> >  - 2 concurrent CPUs, cost per elem:  989 cycles(tsc) 274.861 ns
> >  - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
> >  - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns

Had a small bug, thus a re-run of the cross_cpu bench numbers:

 - 2 concurrent CPUs, cost per elem:  462 cycles(tsc) 128.502 ns
 - 3 concurrent CPUs, cost per elem: 1992 cycles(tsc) 553.507 ns
 - 4 concurrent CPUs, cost per elem: 2323 cycles(tsc) 645.389 ns

> Interesting, I'll try to have a look at the code and maybe run them on
> my armv8 board.

That will be great, but we/you have to fix up the Intel-specific ASM
instructions in time_bench.c (which we already discussed on IRC).

> >
> > [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

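Editorial sketch: the re-run figures show how quickly the return path degrades once more than one CPU returns pages concurrently. The userspace C fragment below is a minimal illustration, not taken from bench_page_pool_cross_cpu.c. It computes the slowdown of each case relative to a single returning CPU. It assumes the 1-CPU figure from the original run (30.709 ns) was unaffected by the bug, since the re-run lists only the 2, 3 and 4 CPU cases. The names `cross_cpu_ns`, `return_slowdown` and `print_return_scaling` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Cross-CPU return cost in ns per element, indexed by (CPUs - 1).
 * Entry 0 comes from the original run (ASSUMED unaffected by the bug);
 * entries 1..3 are the re-run figures from this message.
 */
const double cross_cpu_ns[] = { 30.709, 128.502, 553.507, 645.389 };
#define NR_CROSS_CPU (sizeof(cross_cpu_ns) / sizeof(cross_cpu_ns[0]))

/* Slowdown of the n-CPU case relative to one returning CPU. */
double return_slowdown(unsigned int cpus)
{
	assert(cpus >= 1 && cpus <= NR_CROSS_CPU);
	return cross_cpu_ns[cpus - 1] / cross_cpu_ns[0];
}

/* Print the per-element cost and slowdown for each CPU count. */
void print_return_scaling(void)
{
	for (unsigned int cpus = 1; cpus <= NR_CROSS_CPU; cpus++)
		printf("%u CPU(s): %8.3f ns/elem, %5.1fx the 1-CPU cost\n",
		       cpus, cross_cpu_ns[cpus - 1], return_slowdown(cpus));
}
```

Under that assumption, two concurrent CPUs cost roughly 4 times the single-CPU case per element, and three or four CPUs roughly 18 to 21 times.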
* Re: Created benchmarks modules for page_pool
  2020-01-22 12:09   ` Jesper Dangaard Brouer
@ 2020-01-28 16:22     ` Matteo Croce
  2020-01-28 18:41       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Matteo Croce @ 2020-01-28 16:22 UTC
To: Jesper Dangaard Brouer
Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org

On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 22 Jan 2020 12:42:05 +0200
> > Interesting, I'll try to have a look at the code and maybe run them on
> > my armv8 board.
>
> That will be great, but we/you have to fix up the Intel-specific ASM
> instructions in time_bench.c (which we already discussed on IRC).
>

What does it need to work on arm64? Replace RDPMC with something generic?

--
Matteo Croce
per aspera ad upstream

* Re: Created benchmarks modules for page_pool
  2020-01-28 16:22     ` Matteo Croce
@ 2020-01-28 18:41       ` Jesper Dangaard Brouer
  2020-01-29  9:07         ` Ilias Apalodimas
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-28 18:41 UTC
To: Matteo Croce
Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org,
    brouer

On Tue, 28 Jan 2020 17:22:47 +0100
Matteo Croce <mcroce@redhat.com> wrote:

> On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Wed, 22 Jan 2020 12:42:05 +0200
> > > Interesting, I'll try to have a look at the code and maybe run them on
> > > my armv8 board.
> >
> > That will be great, but we/you have to fix up the Intel-specific ASM
> > instructions in time_bench.c (which we already discussed on IRC).
> >
>
> What does it need to work on arm64? Replace RDPMC with something generic?

Replacing the RDTSC. Hoping Ilias will fix it for ARM ;-)

You can also fix it yourself by using get_cycles() (include
<linux/timex.h>). If the arch doesn't have support, it will just
return 0.

Have you tried it out on your normal x86/Intel box?
Hint:
 https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

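Editorial sketch: the suggestion above is to replace the x86-only RDTSC read with an architecture-neutral counter that returns 0 where no counter is available. The fragment below is a userspace analogue of that idea, not the time_bench.c code and not the kernel's `get_cycles()` implementation. The names `read_cycles`, `bench_cost`, `measure_cost` and `sample_work` are hypothetical. It assumes GCC or Clang on Linux. On arm64 it assumes the virtual counter `cntvct_el0` is readable from userspace, and that counter ticks at a fixed frequency, not at the core clock. Wall-clock time from `clock_gettime()` is taken alongside so a nanosecond figure is available even where the cycle counter reads 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Cycle-like counter; returns 0 on architectures not handled here. */
uint64_t read_cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#elif defined(__aarch64__)
	uint64_t v;

	/* ASSUMPTION: EL0 access to the virtual counter is enabled. */
	__asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
	return v;
#else
	return 0;
#endif
}

struct bench_cost {
	uint64_t cycles_per_elem; /* 0 if read_cycles() is unsupported */
	double ns_per_elem;
};

/* Run fn() 'loops' times and report the average cost per call. */
struct bench_cost measure_cost(void (*fn)(void), uint64_t loops)
{
	struct timespec t0, t1;
	struct bench_cost cost;
	uint64_t c0, c1;
	double ns;

	assert(fn != NULL && loops > 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	c0 = read_cycles();
	for (uint64_t i = 0; i < loops; i++)
		fn();
	c1 = read_cycles();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	cost.cycles_per_elem = (c1 - c0) / loops;
	cost.ns_per_elem = ns / (double)loops;
	return cost;
}

/* Trivial stand-in workload; volatile keeps the loop from being elided. */
volatile unsigned int sample_sink;
void sample_work(void)
{
	sample_sink++;
}
```

A caller would use something like `measure_cost(sample_work, 1000000)` and print both fields. In-kernel code would call `get_cycles()` directly, as suggested in the message above.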
* Re: Created benchmarks modules for page_pool
  2020-01-28 18:41       ` Jesper Dangaard Brouer
@ 2020-01-29  9:07         ` Ilias Apalodimas
  0 siblings, 0 replies; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-29 9:07 UTC
To: Jesper Dangaard Brouer
Cc: Matteo Croce, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
    Toke Høiland-Jørgensen, Jonathan Lemon, netdev@vger.kernel.org

On Tue, Jan 28, 2020 at 07:41:36PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 28 Jan 2020 17:22:47 +0100
> Matteo Croce <mcroce@redhat.com> wrote:
>
> > On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > On Wed, 22 Jan 2020 12:42:05 +0200
> > > > Interesting, I'll try to have a look at the code and maybe run them on
> > > > my armv8 board.
> > >
> > > That will be great, but we/you have to fix up the Intel-specific ASM
> > > instructions in time_bench.c (which we already discussed on IRC).
> > >
> >
> > What does it need to work on arm64? Replace RDPMC with something generic?
>
> Replacing the RDTSC. Hoping Ilias will fix it for ARM ;-)

I'll have a look today and run it on my armv8 box.

Cheers
/Ilias
