From: "Wiles, Keith"
To: Shailja Pandey
Cc: "dev@dpdk.org"
Subject: Re: Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?
Date: Wed, 18 Apr 2018 18:36:34 +0000

> On Apr 18, 2018, at 11:43 AM, Shailja Pandey wrote:
>
> Hello,
>
> I am doing packet replication and I need to change the Ethernet and IP header fields for each replicated packet. I did it in two different ways:
>
> 1. Share the payload from the original packet using rte_mbuf_refcnt_update()
>    and allocate a new mbuf for the L2-L4 headers.
> 2. memcpy() the payload from the original packet into a newly created mbuf
>    and prepend the L2-L4 headers to that mbuf.
>
> I performed experiments with varying replication factors as well as varying packet sizes and found that memcpy() performs far better than rte_mbuf_refcnt_update(). But I am not sure why this happens, or what makes rte_mbuf_refcnt_update() even worse than memcpy().
>
> Here is the sample code for both implementations:

The two code fragments are doing this in two different ways: the first uses a loop to create possibly more than one replica and the second one does not, correct? The loop can cause a performance hit, but it should be small.

The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause that cacheline miss. When the packet goes beyond 64 bytes you then hit the second cacheline; are you starting to see the problem here? Every time you touch a new cacheline performance will drop unless the cacheline is prefetched first, but in this case that really cannot be done easily. Count the cachelines you are touching and make sure the number is the same in each case.

On Intel x86 systems 64 bytes is the cacheline size; other arches have different sizes.

>
> 1. Using rte_mbuf_refcnt_update():
>
>     struct rte_mbuf *pkt = original packet;
>
>     rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr)));
>     rte_pktmbuf_refcnt_update(pkt, replication_factor);
>     for (int i = 0; i < replication_factor; i++) {
>         struct rte_mbuf *hdr;
>         if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>             printf("Failed while cloning $$$\n");
>             return NULL;
>         }
>         hdr->next = pkt;
>         hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>         hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>         /* Update more metadata fields */
>
>         rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>         /* Modify L2 fields */
>
>         rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>         /* Modify L3 fields */
>         ...
>     }
>
> 2. Using memcpy():
>
>     struct rte_mbuf *pkt = original packet;
>     struct rte_mbuf *hdr;
>
>     if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>         printf("Failed while cloning $$$\n");
>         return NULL;
>     }
>
>     /* prepend new header */
>     char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>     if (eth_hdr == NULL) {
>         printf("panic\n");
>     }
>     char *b = rte_pktmbuf_mtod((struct rte_mbuf *)pkt, char *);
>     memcpy(eth_hdr, b, pkt->pkt_len);
>     /* Change L2-L4 header fields in the new packet */
>
> The throughput roughly halves when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). The throughput remains more or less the same as packet size increases when replication is done using memcpy().

Why did you use memcpy() and not rte_memcpy() here? rte_memcpy() should be faster.

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

>
> Any help would be appreciated.
>
> --
>
> Thanks,
> Shailja

Regards,
Keith
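
Not part of the original thread: a minimal sketch, assuming a DPDK build environment, that prints which cacheline of struct rte_mbuf a few fields fall on. It only illustrates the layout point above (hdr->next sits on the second cacheline, while the fields the memcpy-based path touches stay on the first); exact offsets depend on the DPDK version and build configuration.

    /* cacheline_layout.c - illustrative only */
    #include <stdio.h>
    #include <stddef.h>

    #include <rte_common.h>
    #include <rte_mbuf.h>

    #define CACHELINE_OF(field) \
        (offsetof(struct rte_mbuf, field) / RTE_CACHE_LINE_MIN_SIZE)

    int main(void)
    {
        /* Fields touched when chaining segments with hdr->next. */
        printf("next     is on cacheline %zu\n", CACHELINE_OF(next));
        printf("pool     is on cacheline %zu\n", CACHELINE_OF(pool));

        /* Fields touched by both approaches; these normally stay on the
         * first (hot) cacheline. */
        printf("data_off is on cacheline %zu\n", CACHELINE_OF(data_off));
        printf("refcnt   is on cacheline %zu\n", CACHELINE_OF(refcnt));
        printf("pkt_len  is on cacheline %zu\n", CACHELINE_OF(pkt_len));
        printf("data_len is on cacheline %zu\n", CACHELINE_OF(data_len));

        return 0;
    }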
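
Likewise an illustrative sketch, not the poster's code, of how the memcpy-style replication might be combined with the rte_pktmbuf_alloc_bulk() and rte_memcpy() suggestions above. replicate_pkt() is a made-up helper; header_pool and replication_factor are carried over from the original fragments, and rte_pktmbuf_append() is used instead of rte_pktmbuf_prepend() so the copy is not limited by the mbuf headroom.

    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    /* Hypothetical helper: copy 'pkt' into 'replication_factor' fresh mbufs
     * taken from 'header_pool' with a single bulk allocation.  Returns 0 on
     * success and fills 'out'; the caller then rewrites the L2-L4 headers of
     * each replica. */
    static int
    replicate_pkt(struct rte_mempool *header_pool, struct rte_mbuf *pkt,
                  struct rte_mbuf **out, unsigned int replication_factor)
    {
        const char *src = rte_pktmbuf_mtod(pkt, const char *);
        unsigned int i;

        /* One mempool round trip instead of one rte_pktmbuf_alloc() per replica. */
        if (rte_pktmbuf_alloc_bulk(header_pool, out, replication_factor) != 0)
            return -1;

        for (i = 0; i < replication_factor; i++) {
            /* Reserve room at the tail of the fresh mbuf for the whole frame. */
            char *dst = rte_pktmbuf_append(out[i], (uint16_t)pkt->pkt_len);

            if (dst == NULL) {
                /* Not enough tailroom; give everything back. */
                unsigned int j;

                for (j = 0; j < replication_factor; j++)
                    rte_pktmbuf_free(out[j]);
                return -1;
            }

            /* rte_memcpy() is typically faster than libc memcpy() for
             * packet-sized copies. */
            rte_memcpy(dst, src, pkt->pkt_len);

            /* ... modify the Ethernet/IP header fields of out[i] here ... */
        }
        return 0;
    }

Allocating all replicas in one bulk call keeps the mempool overhead down when the replication factor is known up front.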