From: "Sekhar, Ashwin" <Ashwin.Sekhar@cavium.com>
To: "tomasz.kantecki@intel.com" <tomasz.kantecki@intel.com>,
"Jacob, Jerin" <Jerin.JacobKollanukkaran@cavium.com>,
"jianbo.liu@linaro.org" <jianbo.liu@linaro.org>,
"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd
Date: Wed, 10 May 2017 15:00:19 +0000 [thread overview]
Message-ID: <1494428417.2713.55.camel@caviumnetworks.com> (raw)
In-Reply-To: <1494383419-9677-6-git-send-email-jianbo.liu@linaro.org>
Hi Jianbo,
Thanks for version v2. Addition of the prefetch instructions is
definitely helping performance on ThunderX. But still performance is
slightly less than that of scalar.
I tried few small tweaks which helped improve performance on my
Thunderx setup. For details see comments inline.
On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
> Use ARM NEON intrinsics to accelerate l3 fowarding.
>
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
> examples/l3fwd/l3fwd_em.c | 4 +-
> examples/l3fwd/l3fwd_em_hlm.h | 19 ++-
> examples/l3fwd/l3fwd_em_hlm_neon.h | 74 ++++++++++
> examples/l3fwd/l3fwd_em_sequential.h | 20 ++-
> examples/l3fwd/l3fwd_lpm.c | 4 +-
> examples/l3fwd/l3fwd_lpm_neon.h | 165 ++++++++++++++++++++++
> examples/l3fwd/l3fwd_neon.h | 259
> +++++++++++++++++++++++++++++++++++
> 7 files changed, 539 insertions(+), 6 deletions(-)
> create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
> create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
> create mode 100644 examples/l3fwd/l3fwd_neon.h
>
> [...]
> diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> b/examples/l3fwd/l3fwd_em_hlm.h
> index 636dea4..4ec600a 100644
> --- a/examples/l3fwd/l3fwd_em_hlm.h
> +++ b/examples/l3fwd/l3fwd_em_hlm.h
> @@ -35,8 +35,13 @@
> #ifndef __L3FWD_EM_HLM_H__
> #define __L3FWD_EM_HLM_H__
>
> +#if defined(__SSE4_1__)
> #include "l3fwd_sse.h"
> #include "l3fwd_em_hlm_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#include "l3fwd_em_hlm_neon.h"
> +#endif
>
> static inline __attribute__((always_inline)) void
> em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> *m[8],
> @@ -238,7 +243,7 @@ static inline __attribute__((always_inline))
> uint16_t
> l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> uint8_t portid, struct lcore_conf *qconf)
> {
> - int32_t j;
> + int32_t i, j, pos;
> uint16_t dst_port[MAX_PKT_BURST];
>
> /*
> @@ -247,6 +252,12 @@ static inline __attribute__((always_inline))
> uint16_t
> */
> int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
>
> + for (j = 0; j < 8 && j < nb_rx; j++) {
> + rte_prefetch0(pkts_burst[j]);
The above prefetch of rte_mbuf struct is unnecessary. With this we wont
see any performance improvement as the contents of rte_mbuf (buf_addr
and data_off) is used in right next instruction. Removing the above
prefetch and similar prefetches at multiple places was improving
performance on my ThunderX setup.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> + struct ether_hdr *) +
> 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1. In
process_packet in l3fwd_neon.h, eth_header is accessed.
> + }
> +
> for (j = 0; j < n; j += 8) {
>
> uint32_t pkt_type =
> @@ -263,6 +274,12 @@ static inline __attribute__((always_inline))
> uint16_t
> uint32_t tcp_or_udp = pkt_type &
> (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
>
> + for (i = 0, pos = j + 8; i < 8 && pos < nb_rx; i++,
> pos++) {
> + rte_prefetch0(pkts_burst[pos]);
The above prefetch of rte_mbuf struct is unnecessary.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[po
> s],
> + struct
> ether_hdr *) + 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> + }
> +
> if (tcp_or_udp && (l3_type == RTE_PTYPE_L3_IPV4)) {
>
> em_get_dst_port_ipv4x8(qconf,
> &pkts_burst[j], portid,
>
> [...]
> diff --git a/examples/l3fwd/l3fwd_em_sequential.h
> b/examples/l3fwd/l3fwd_em_sequential.h
> index c0a9725..c3df473 100644
> --- a/examples/l3fwd/l3fwd_em_sequential.h
> +++ b/examples/l3fwd/l3fwd_em_sequential.h
> @@ -43,7 +43,11 @@
> * compilation time.
> */
>
> +#if defined(__SSE4_1__)
> #include "l3fwd_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#endif
>
> static inline __attribute__((always_inline)) uint16_t
> em_get_dst_port(const struct lcore_conf *qconf, struct rte_mbuf
> *pkt,
> @@ -101,11 +105,23 @@ static inline __attribute__((always_inline))
> uint16_t
> l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> uint8_t portid, struct lcore_conf *qconf)
> {
> - int32_t j;
> + int32_t i, j;
> uint16_t dst_port[MAX_PKT_BURST];
>
> - for (j = 0; j < nb_rx; j++)
> + if (nb_rx > 0) {
> + rte_prefetch0(pkts_burst[0]);
The above prefetch of rte_mbuf struct is unnecessary.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
> + struct ether_hdr *) +
> 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> + }
> +
> + for (i = 1, j = 0; j < nb_rx; i++, j++) {
> + if (i < nb_rx) {
> + rte_prefetch0(pkts_burst[i]);
The above prefetch of rte_mbuf struct is unnecessary.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i]
> ,
> + struct
> ether_hdr *) + 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> + }
> dst_port[j] = em_get_dst_port(qconf, pkts_burst[j],
> portid);
> + }
>
> send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> }
> [...]
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h
> b/examples/l3fwd/l3fwd_lpm_neon.h
> new file mode 100644
> index 0000000..2f047b3
> --- /dev/null
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
>
> [...]
> +/*
> + * Buffer optimized handling of packets, invoked
> + * from main_loop.
> + */
> +static inline void
> +l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> + uint8_t portid, struct lcore_conf *qconf)
> +{
> + int32_t i, j, pos;
> + uint16_t dst_port[MAX_PKT_BURST];
> + int32x4_t dip[MAX_PKT_BURST / FWDSTEP];
If you see carefully, we dont need an array of dip. We just need a
single element. dip value is calculated in processx4_step1 and consumed
in processx4_step2, and thats it. No need to save it in an array.
> + uint32_t ipv4_flag[MAX_PKT_BURST / FWDSTEP];
Same as dip. We dont need an array of ipv4_flag.
> + const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> +
> + for (j = 0; j < FWDSTEP && j < nb_rx; j++) {
> + rte_prefetch0(pkts_burst[j]);
The above prefetch of rte_mbuf struct is unnecessary.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> + struct ether_hdr *) +
> 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> + }
> +
> + for (j = 0; j != k; j += FWDSTEP) {
> + for (i = 0, pos = j + FWDSTEP; i < FWDSTEP && pos <
> nb_rx;
> + i++, pos++) {
> + rte_prefetch0(pkts_burst[pos]);
The above prefetch of rte_mbuf struct is unnecessary.
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[po
> s],
> + struct
> ether_hdr *) + 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> + }
> + processx4_step1(&pkts_burst[j], &dip[j / FWDSTEP],
> + &ipv4_flag[j / FWDSTEP]);
> +
> + processx4_step2(qconf, dip[j / FWDSTEP],
> + ipv4_flag[j / FWDSTEP], portid,
> &pkts_burst[j],
> + &dst_port[j]);
> + }
> +
> + /* Classify last up to 3 packets one by one */
> + switch (nb_rx % FWDSTEP) {
> + case 3:
> + dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j],
> portid);
> + j++;
> + /* fallthrough */
> + case 2:
> + dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j],
> portid);
> + j++;
> + /* fallthrough */
> + case 1:
> + dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j],
> portid);
> + j++;
> + }
> +
> + send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> +}
> +
> +#endif /* __L3FWD_LPM_NEON_H__ */
> diff --git a/examples/l3fwd/l3fwd_neon.h
> b/examples/l3fwd/l3fwd_neon.h
> new file mode 100644
> index 0000000..75c8976
> --- /dev/null
> +++ b/examples/l3fwd/l3fwd_neon.h
> [...]
> +
> +/**
> + * Process one packet:
> + * Update source and destination MAC addresses in the ethernet
> header.
> + * Perform RFC1812 checks and updates for IPV4 packets.
> + */
> +static inline void
> +process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
> +{
> + struct ether_hdr *eth_hdr;
> + uint32x4_t te, ve;
> +
> + eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
eth_hdr accessed here. Hence the earlier comments about prefetching at
eth header.
> +
> + te = vld1q_u32((uint32_t *)eth_hdr);
> + ve = vreinterpretq_u32_s32(val_eth[dst_port[0]]);
> +
> +
> + rfc1812_process((struct ipv4_hdr *)(eth_hdr + 1), dst_port,
> + pkt->packet_type);
> +
> + ve = vsetq_lane_u32(vgetq_lane_u32(te, 3), ve, 3);
Use vcopyq_laneq_u32 for easily doing the above.
> + vst1q_u32((uint32_t *)eth_hdr, ve);
> +}
> +
> [...]
> +#endif /* _L3FWD_NEON_H_ */
Combining all the above comments, I made some changes on top of your
patch. These changes are giving 3-4% improvement over your version.
You may find the changes at
https://gist.github.com/ashwinyes/34cbdd999784402c859c71613587fafc
Please check it out and let me know your comments.
Thanks
Ashwin
next prev parent reply other threads:[~2017-05-10 15:00 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-05-02 7:14 [PATCH 1/5] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-02 7:14 ` [PATCH 2/5] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_single.h Jianbo Liu
2017-05-02 9:40 ` Sekhar, Ashwin
2017-05-02 7:14 ` [PATCH 3/5] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-02 7:14 ` [PATCH 4/5] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-02 7:14 ` [PATCH 5/5] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-02 11:20 ` Sekhar, Ashwin
2017-05-02 11:47 ` Sekhar, Ashwin
2017-05-03 5:24 ` Jianbo Liu
2017-05-04 8:42 ` Jianbo Liu
2017-05-05 4:24 ` Sekhar, Ashwin
2017-05-05 5:43 ` Jianbo Liu
2017-05-09 8:10 ` Sekhar, Ashwin
2017-05-10 2:39 ` Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 0/7] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 1/7] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 2/7] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 3/7] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 4/7] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-10 15:00 ` Sekhar, Ashwin [this message]
2017-05-11 3:16 ` Jianbo Liu
2017-05-11 4:14 ` Sekhar, Ashwin
2017-05-11 4:27 ` Sekhar, Ashwin
2017-05-11 6:11 ` Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 6/7] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-10 2:30 ` [PATCH v2 7/7] examples/l3fwd: change the guard micro name for header file Jianbo Liu
2017-05-10 11:57 ` Sekhar, Ashwin
2017-05-11 9:25 ` [PATCH v3 0/7] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 1/7] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 2/7] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 3/7] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 4/7] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 5/7] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-11 9:49 ` Sekhar, Ashwin
2017-05-11 10:01 ` Jianbo Liu
2017-05-11 10:27 ` Sekhar, Ashwin
2017-05-12 2:40 ` Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 6/7] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-11 9:25 ` [PATCH v3 7/7] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 1/8] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 2/8] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 3/8] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 4/8] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 5/8] arch/arm: add vcopyq_laneq_u32 for old version of gcc Jianbo Liu
2017-05-15 4:01 ` Jerin Jacob
2017-05-15 3:34 ` [PATCH v4 6/8] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-15 5:22 ` Sekhar, Ashwin
2017-05-15 3:34 ` [PATCH v4 7/8] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-15 3:34 ` [PATCH v4 8/8] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-07-03 21:02 ` [PATCH v4 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Thomas Monjalon
2017-07-04 10:23 ` [PATCH v5 " Jianbo Liu
2017-07-04 10:23 ` [PATCH v5 1/8] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-07-04 10:23 ` [PATCH v5 2/8] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 3/8] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 4/8] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 5/8] arch/arm: add vcopyq_laneq_u32 for old version of gcc Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 6/8] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 7/8] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-07-04 10:24 ` [PATCH v5 8/8] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-07-04 15:11 ` [PATCH v5 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Thomas Monjalon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1494428417.2713.55.camel@caviumnetworks.com \
--to=ashwin.sekhar@cavium.com \
--cc=Jerin.JacobKollanukkaran@cavium.com \
--cc=dev@dpdk.org \
--cc=jianbo.liu@linaro.org \
--cc=tomasz.kantecki@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.