All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yuanhan Liu <yuanhan.liu@linux.intel.com>
To: Zhihong Wang <zhihong.wang@intel.com>
Cc: dev@dpdk.org
Subject: Re: [PATCH] vhost: optimize vhost memcpy
Date: Mon, 5 Dec 2016 16:27:14 +0800	[thread overview]
Message-ID: <20161205082714.GJ24403@yliu-dev.sh.intel.com> (raw)
In-Reply-To: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com>

On Thu, Dec 01, 2016 at 08:19:42PM -0500, Zhihong Wang wrote:
> This patch optimizes Vhost performance for large packets when the
> Mergeable Rx buffer feature is enabled. It introduces a dedicated
> memcpy function for vhost enqueue/dequeue to replace rte_memcpy.
> 
> The reason is that rte_memcpy is for general cases, it handles
> unaligned copies and make store aligned, it even makes load aligned
> for micro architectures like Ivy Bridge. However alignment handling
> comes at a price: It introduces extra load/store instructions.
> 
> Vhost memcpy is rather special: The copy is aligned, and remote,
> and there is header write along which is also remote. In this case
> the memcpy instruction stream should be simplified, to reduce extra
> load/store, therefore reduce the probability of load/store buffer
> full caused pipeline stall, to let the actual memcpy instructions
> be issued and let H/W prefetcher goes to work as early as possible.
...
>  
> +/**
> + * This function is used to for vhost memcpy, to replace rte_memcpy.
> + * The reason is that rte_memcpy is for general cases, where vhost
> + * memcpy is a rather special case: The copy is aligned, and remote,
> + * and there is header write along which is also remote. In this case
> + * the memcpy instruction stream should be simplified to reduce extra
> + * load/store, therefore reduce the probability of load/store buffer
> + * full caused pipeline stall, to let the actual memcpy instructions
> + * be issued and let H/W prefetcher goes to work as early as possible.
> + */
> +static inline void __attribute__((always_inline))
> +vhost_memcpy(void *dst, const void *src, size_t n)

I like this function a lot, since it's really simple and straightforward!
Moreover, it performs better.

But, I don't quite like how this function is proposed:

- rte_movX are more like internal help functions that should be used only
  in corresponding rte_memcpy.h file.

- It's a good optimization, however, it will not benefit for other use
  cases, though vhost is the most typical case here.

- The optimization proves to be good for X86, but think there is no
  guarantee it may behave well for other platforms, say ARM.

I still would suggest you to go this way: move this function into x86's
rte_memcpy.h and call it when the data is well aligned.

	--yliu
> +{
> +	/* Copy size <= 16 bytes */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +		return;
> +	}
> +
> +	/* Copy 16 <= size <= 32 bytes */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 32 < size <= 64 bytes */
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				(const uint8_t *)src - 32 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 64 bytes blocks */
> +	for (; n >= 64; n -= 64) {
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (const uint8_t *)src + 64;
> +	}
> +
> +	/* Copy whatever left */
> +	rte_mov64((uint8_t *)dst - 64 + n,
> +			(const uint8_t *)src - 64 + n);
> +}

  parent reply	other threads:[~2016-12-05  8:26 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-02  1:19 [PATCH] vhost: optimize vhost memcpy Zhihong Wang
2016-12-02 10:30 ` Thomas Monjalon
2016-12-05  8:27 ` Yuanhan Liu [this message]
2016-12-05 10:27   ` Wang, Zhihong
2016-12-05 10:37     ` Yuanhan Liu
2016-12-07  6:11       ` Wang, Zhihong
2016-12-07  1:31 ` [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
2016-12-08  0:55   ` Yao, Lei A
2016-12-08  2:18   ` Yuanhan Liu
2017-01-17 15:08     ` Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161205082714.GJ24403@yliu-dev.sh.intel.com \
    --to=yuanhan.liu@linux.intel.com \
    --cc=dev@dpdk.org \
    --cc=zhihong.wang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.