From: "Wiles, Keith"
To: Shailja Pandey
Cc: "dev@dpdk.org"
Subject: Re: Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?
Date: Wed, 18 Apr 2018 18:36:34 +0000

> On Apr 18, 2018, at 11:43 AM, Shailja Pandey wrote:
>
> Hello,
>
> I am doing packet replication and I need to change the Ethernet and IP header fields for each replicated packet. I did it in two different ways:
>
> 1. Share the payload from the original packet using rte_mbuf_refcnt_update()
>    and allocate a new mbuf for the L2-L4 headers.
> 2. memcpy() the payload from the original packet into a newly created mbuf
>    and prepend the L2-L4 headers to that mbuf.
>
> I performed experiments with varying replication factors as well as varying packet sizes and found that memcpy() performs far better than rte_mbuf_refcnt_update(). But I am not sure why this happens, or what makes rte_mbuf_refcnt_update() even worse than memcpy().
>
> Here is the sample code for both implementations:

The two code fragments are doing this in two different ways: the first uses a loop to create possibly more than one replica and the second one does not, correct? The loop can cause a performance hit, but it should be small.

The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause that cacheline miss. When the packet goes beyond 64 bytes you then hit the second cacheline; are you starting to see the problem here? Every time you touch a new cacheline performance will drop unless the cacheline is prefetched first, but in this case that really cannot be done easily. Count the cachelines you are touching and make sure the number is the same in each case.

On Intel x86 systems 64 bytes is the cacheline size; other arches have different sizes.

>
> 1. Using rte_mbuf_refcnt_update():
>
>     struct rte_mbuf *pkt = original packet;
>
>     rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr)));
>     rte_pktmbuf_refcnt_update(pkt, replication_factor);
>     for (int i = 0; i < replication_factor; i++) {
>         struct rte_mbuf *hdr;
>         if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>             printf("Failed while cloning $$$\n");
>             return NULL;
>         }
>         hdr->next = pkt;
>         hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>         hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>         /* Update more metadata fields */
>
>         rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>         /* Modify L2 fields */
>
>         rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>         /* Modify L3 fields */
>         ...
>     }
>
> 2. Using memcpy():
>
>     struct rte_mbuf *pkt = original packet;
>     struct rte_mbuf *hdr;
>
>     if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>         printf("Failed while cloning $$$\n");
>         return NULL;
>     }
>
>     /* prepend new header */
>     char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>     if (eth_hdr == NULL) {
>         printf("panic\n");
>     }
>     char *b = rte_pktmbuf_mtod((struct rte_mbuf *)pkt, char *);
>     memcpy(eth_hdr, b, pkt->pkt_len);
>     /* Change L2-L4 header fields in the new packet */
>
> The throughput roughly halves when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). The throughput remains more or less the same as packet size increases when replication is done using memcpy().

Why did you use memcpy() and not rte_memcpy() here? rte_memcpy() should be faster.

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

>
> Any help would be appreciated.
>
> --
>
> Thanks,
> Shailja

Regards,
Keith
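
Not part of the original thread: a minimal sketch, assuming a DPDK build environment, that prints which cacheline of struct rte_mbuf a few fields fall on. It only illustrates the layout point above (hdr->next sits on the second cacheline, while the fields the memcpy-based path touches stay on the first); exact offsets depend on the DPDK version and build configuration.

    /* cacheline_layout.c - illustrative only */
    #include <stdio.h>
    #include <stddef.h>

    #include <rte_common.h>
    #include <rte_mbuf.h>

    #define CACHELINE_OF(field) \
        (offsetof(struct rte_mbuf, field) / RTE_CACHE_LINE_MIN_SIZE)

    int main(void)
    {
        /* Fields touched when chaining segments with hdr->next. */
        printf("next     is on cacheline %zu\n", CACHELINE_OF(next));
        printf("pool     is on cacheline %zu\n", CACHELINE_OF(pool));

        /* Fields touched by both approaches; these normally stay on the
         * first (hot) cacheline. */
        printf("data_off is on cacheline %zu\n", CACHELINE_OF(data_off));
        printf("refcnt   is on cacheline %zu\n", CACHELINE_OF(refcnt));
        printf("pkt_len  is on cacheline %zu\n", CACHELINE_OF(pkt_len));
        printf("data_len is on cacheline %zu\n", CACHELINE_OF(data_len));

        return 0;
    }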
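
Likewise an illustrative sketch, not the poster's code, of how the memcpy-style replication might be combined with the rte_pktmbuf_alloc_bulk() and rte_memcpy() suggestions above. replicate_pkt() is a made-up helper; header_pool and replication_factor are carried over from the original fragments, and rte_pktmbuf_append() is used instead of rte_pktmbuf_prepend() so the copy is not limited by the mbuf headroom.

    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    /* Hypothetical helper: copy 'pkt' into 'replication_factor' fresh mbufs
     * taken from 'header_pool' with a single bulk allocation.  Returns 0 on
     * success and fills 'out'; the caller then rewrites the L2-L4 headers of
     * each replica. */
    static int
    replicate_pkt(struct rte_mempool *header_pool, struct rte_mbuf *pkt,
                  struct rte_mbuf **out, unsigned int replication_factor)
    {
        const char *src = rte_pktmbuf_mtod(pkt, const char *);
        unsigned int i;

        /* One mempool round trip instead of one rte_pktmbuf_alloc() per replica. */
        if (rte_pktmbuf_alloc_bulk(header_pool, out, replication_factor) != 0)
            return -1;

        for (i = 0; i < replication_factor; i++) {
            /* Reserve room at the tail of the fresh mbuf for the whole frame. */
            char *dst = rte_pktmbuf_append(out[i], (uint16_t)pkt->pkt_len);

            if (dst == NULL) {
                /* Not enough tailroom; give everything back. */
                unsigned int j;

                for (j = 0; j < replication_factor; j++)
                    rte_pktmbuf_free(out[j]);
                return -1;
            }

            /* rte_memcpy() is typically faster than libc memcpy() for
             * packet-sized copies. */
            rte_memcpy(dst, src, pkt->pkt_len);

            /* ... modify the Ethernet/IP header fields of out[i] here ... */
        }
        return 0;
    }

Allocating all replicas in one bulk call keeps the mempool overhead down when the replication factor is known up front.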