From: Thomas Monjalon <thomas@monjalon.net>
To: "Li, Xiaoyun" <xiaoyun.li@intel.com>,
"Richardson, Bruce" <bruce.richardson@intel.com>
Cc: dev@dpdk.org,
"Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
"Lu, Wenzhuo" <wenzhuo.lu@intel.com>,
"Zhang, Helin" <helin.zhang@intel.com>,
"ophirmu@mellanox.com" <ophirmu@mellanox.com>
Subject: Re: [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
Date: Wed, 25 Oct 2017 09:25:21 +0200
Message-ID: <8258746.M4EytlN248@xps>
In-Reply-To: <B9E724F4CB7543449049E7AE7669D82F4816C5@SHSMSX101.ccr.corp.intel.com>
25/10/2017 08:55, Li, Xiaoyun:
> From: Li, Xiaoyun
> > From: Richardson, Bruce
> > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > > From: Thomas Monjalon
> > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > Hi
> > > > > > > > > > > >
> > > > > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > > > > function pointer for packet size > 128
> > > > > > > > > > > > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > >
> > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > I ran memcpy_perf_test which would show the time cost
> > > > > > > > > > > of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8,
> > > > > > > > > > > 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512,
> > > > > > > > > > > 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> > > > > > > > > > > 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > > > > > 7680, 8192.
> > > > > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > > > > copy time indicates the perf.) From the trend picture,
> > > > > > > > > > > > when the size is smaller than 128 bytes, the perf drops
> > > > > > > > > > > > a lot, almost 50%. And above 128 bytes, it approaches
> > > > > > > > > > > > the original dpdk.
> > > > > > > > > > > > I computed it right now, it shows that when greater
> > > > > > > > > > > > than 128 bytes and smaller than 1024 bytes, the perf
> > > > > > > > > > > > drops about 15%. When above 1024 bytes, the perf drops
> > > > > > > > > > > > about 4%.
> > > > > > > > > > >
> > > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > > txonly with AVX2.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't
> > > > > > > > > > see performance degradation.
> > > > > > > > > >
> > > > > > > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd
> > > > > > > > > > -c 0xf -n 4 -- -I" and set fwd txonly.
> > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it
> > > > > > > > > > again.
> > > > > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > > > > results are similar and no drop.
> > > > > > > > > >
> > > > > > > > > > Did I miss something?
> > > > > > > >
> > > > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > > > buffers between
> > > > > > > > 128 and 1024 bytes.
> > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > >
> > > > > > > Yes. The drop is using test.
> > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > The results are the cycles that memory copy costs.
> > > > > > > > But I just use it to show the trend because I heard that it's
> > > > > > > > not recommended to use micro benchmarks like test_memcpy_perf for
> > > > > > > > memcpy performance report as they aren't likely able to reflect
> > > > > > > > performance of real world applications.
> > > > > >
> > > > > > > Yes, real applications can hide the memcpy cost.
> > > > > > > Sometimes, the cost appears for real :)
> > > > > >
> > > > > > > Details can be seen at
> > > > > > > > https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
> > > > > > >
> > > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > > because not a lot memcpy calls.
> > > > > >
> > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > I think 15% in micro-benchmark is too much.
> > > > > > What can we do? Raise the threshold?
> > > > > >
> > > > > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> > > > > > but not sure.
> > > > > > But I didn't reproduce the 15% drop on mellanox and not sure how
> > > > > > to verify it.
> > > >
> > > > I think we should focus on micro-benchmark and find a reasonnable
> > > > threshold for a reasonnable drop tradeoff.
> > > >
> > > Sadly, it may not be that simple. What shows best performance for
> > > micro- benchmarks may not show the same effect in a real application.
> > >
> > > /Bruce
> >
> > Then how to measure the performance?
> >
> > And I cannot reproduce 15% drop on mellanox.
> > Could the person who tested 15% drop help to do test again with 1024
> > threshold and see if there is any improvement?
>
> As Bruce said, best performance on micro-benchmark may not show the same effect in real applications.
Yes, real applications may hide the impact.
You keep saying that it is a reason to allow degrading memcpy raw perf.
But can you see better performance with buffers of 256 bytes with
any application thanks to your patch?
I am not sure there is a benefit in keeping code which implies
a significant drop in micro-benchmarks.
> And I cannot reproduce the 15% drop.
> And I don't know if raising the threshold can improve the perf or not.
> Could the person who tested 15% drop help to do test again with 1024 threshold and see if there is any improvement?
We will test an increased threshold today.
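
To make the trade-off under discussion concrete, here is a minimal sketch
of threshold-based run-time dispatch. It is only an illustration and not
the code from the patch: MEMCPY_THRESH stands in for
RTE_X86_MEMCPY_THRESH, and copy_generic(), copy_wide(),
select_copy_impl(), dispatch_memcpy() and large_path_calls are
hypothetical names. Both "implementations" simply wrap memcpy() so the
example stays portable.

```c
#include <stddef.h>
#include <string.h>

/* Assumed threshold; the thread discusses 128 and a possible raise to 1024. */
#define MEMCPY_THRESH 128

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/* Counts trips through the function pointer (illustration only). */
static unsigned long large_path_calls;

/* Hypothetical stand-ins for ISA-specific copy routines. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
	large_path_calls++;
	return memcpy(dst, src, n);
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
	large_path_calls++;
	return memcpy(dst, src, n);
}

/* Chosen once at startup; every large copy then pays an indirect call. */
static memcpy_fn copy_large = copy_generic;

static void select_copy_impl(int has_wide_simd)
{
	copy_large = has_wide_simd ? copy_wide : copy_generic;
}

static inline void *dispatch_memcpy(void *dst, const void *src, size_t n)
{
	if (n <= MEMCPY_THRESH)
		return memcpy(dst, src, n);	/* small copies stay inline */
	return copy_large(dst, src, n);		/* larger copies go through the pointer */
}
```

Raising the threshold in such a scheme keeps more copy sizes on the
inline path, at the price of fewer sizes benefiting from the
implementation selected at run time.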