From: Thomas Monjalon
Subject: Re: [PATCH v2] eal: optimize aligned rte_memcpy
Date: Tue, 17 Jan 2017 16:08:42 +0100
Message-ID: <1597948.LxUmgnGZos@xps13>
References: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com>
 <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
 <20161208021843.GM31182@yliu-dev.sh.intel.com>
In-Reply-To: <20161208021843.GM31182@yliu-dev.sh.intel.com>
To: Zhihong Wang
Cc: Yuanhan Liu, dev@dpdk.org, lei.a.yao@intel.com

2016-12-08 10:18, Yuanhan Liu:
> On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> > This patch optimizes rte_memcpy for well-aligned cases, where both
> > the dst and src addresses are aligned to the maximum MOV width. It
> > introduces a dedicated function called rte_memcpy_aligned to handle
> > the aligned cases with a simplified instruction stream. The existing
> > rte_memcpy is renamed to rte_memcpy_generic. The selection between
> > the two is done at the entry of rte_memcpy.
> >
> > The existing rte_memcpy is for generic cases: it handles unaligned
> > copies and makes stores aligned, and it even makes loads aligned for
> > microarchitectures like Ivy Bridge. However, alignment handling
> > comes at a price: it adds extra load/store instructions, which can
> > sometimes cause complications.
> >
> > Take DPDK Vhost memcpy with the Mergeable Rx Buffer feature as an
> > example: the copy is aligned and remote, and there is a header write
> > alongside it which is also remote. In this case the memcpy
> > instruction stream should be simplified to reduce extra loads and
> > stores, and therefore reduce the probability of pipeline stalls
> > caused by full load/store buffers, so that the actual memcpy
> > instructions are issued and the H/W prefetcher goes to work as early
> > as possible.
> >
> > This patch was tested on Ivy Bridge, Haswell and Skylake. It provides
> > up to 20% gain for Virtio Vhost PVP traffic, with packet sizes
> > ranging from 64 to 1500 bytes.
> >
> > The test can also be conducted without a NIC, by setting up loopback
> > traffic between Virtio and Vhost. For example, set the macro
> > TXONLY_DEF_PACKET_LEN in testpmd.h to the requested packet size,
> > rebuild and start testpmd in both host and guest, then run "start"
> > on one side and "start tx_first 32" on the other.
> >
> >
> > Signed-off-by: Zhihong Wang
> > Reviewed-by: Yuanhan Liu

Applied, thanks
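
For readers of the archive, the entry-point selection described in the
quoted commit message can be sketched roughly as below. This is only an
illustration under stated assumptions, not the code from the patch: the
sketch_* names, the 32-byte mask (one possible "maximum MOV width") and
the sketch_last_path tracker are all hypothetical, and both copy paths
simply fall back to memcpy instead of a tuned instruction stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the widest MOV is 32 bytes, so the low 5 address bits
 * must be zero for a pointer to count as aligned. Hypothetical name. */
#define SKETCH_ALIGN_MASK ((uintptr_t)0x1F)

/* Illustration only: records which path the last copy took. */
enum sketch_path { SKETCH_PATH_NONE, SKETCH_PATH_ALIGNED, SKETCH_PATH_GENERIC };
static enum sketch_path sketch_last_path = SKETCH_PATH_NONE;

/* Stand-in for the simplified aligned path; a real implementation
 * would use aligned wide loads/stores with no fix-up instructions. */
static void *sketch_memcpy_aligned(void *dst, const void *src, size_t n)
{
	/* Precondition of this path: both pointers are aligned. */
	assert((((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGN_MASK) == 0);
	sketch_last_path = SKETCH_PATH_ALIGNED;
	return memcpy(dst, src, n);
}

/* Stand-in for the generic path that copes with any alignment. */
static void *sketch_memcpy_generic(void *dst, const void *src, size_t n)
{
	sketch_last_path = SKETCH_PATH_GENERIC;
	return memcpy(dst, src, n);
}

/* Entry point: OR-ing the two addresses lets a single mask test check
 * that both dst and src are aligned before picking the path. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
	if ((((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGN_MASK) == 0)
		return sketch_memcpy_aligned(dst, src, n);
	return sketch_memcpy_generic(dst, src, n);
}
```

The check costs one OR, one AND and one branch per call, which is the
price paid at the entry for skipping the alignment fix-up in the
aligned case.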