public inbox for intel-gfx@lists.freedesktop.org
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Akash Goel <akash.goel@intel.com>,
	Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH v5] drm/i915: Use SSE4.1 movntdqa to accelerate reads from WC memory
Date: Mon, 18 Jul 2016 10:31:05 +0100
Message-ID: <578CA1D9.2020004@linux.intel.com>
In-Reply-To: <1468683858-28383-1-git-send-email-chris@chris-wilson.co.uk>


On 16/07/16 16:44, Chris Wilson wrote:
> This patch provides the infrastructure for performing a 16-byte aligned
> read from WC memory using non-temporal instructions introduced with SSE4.1.
> Using movntdqa we can bypass the CPU caches and read directly from memory,
> ignoring the page attributes set on the CPU PTE, i.e. negating the
> impact of an otherwise UC access. Copying using movntdqa from WC is almost
> as fast as reading from WB memory, modulo the possibility of both hitting
> the CPU cache or leaving the data in the CPU cache for the next consumer.
> (The CPU cache itself may be flushed for the region of the movntdqa and on
> later access the movntdqa reads from a separate internal buffer for the
> cacheline.) The write back to the memory is however cached.
>
> This will be used in later patches to accelerate accessing WC memory.
>
> v2: Report whether the accelerated copy is successful/possible.
> v3: Function alignment override was only necessary when using the
> function target("sse4.1") - which is not necessary for emitting movntdqa
> from __asm__.
> v4: Improve notes on CPU cache behaviour vs non-temporal stores.
> v5: Fix byte offsets for unrolled moves.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Akash Goel <akash.goel@intel.com>
> Cc: Damien Lespiau <damien.lespiau@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/Makefile      |  3 ++
>   drivers/gpu/drm/i915/i915_drv.c    |  2 +
>   drivers/gpu/drm/i915/i915_drv.h    |  3 ++
>   drivers/gpu/drm/i915/i915_memcpy.c | 75 ++++++++++++++++++++++++++++++++++++++
>   4 files changed, 83 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/i915_memcpy.c
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 75318ebb8d25..a53853daa998 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -3,12 +3,15 @@
>   # Direct Rendering Infrastructure (DRI) in XFree86 4.1.0 and higher.
>
>   subdir-ccflags-$(CONFIG_DRM_I915_WERROR) := -Werror
> +subdir-ccflags-y += \
> +	$(call as-instr,movntdqa (%eax)$(comma)%xmm0,-DCONFIG_AS_MOVNTDQA)
>
>   # Please keep these build lists sorted!
>
>   # core driver code
>   i915-y := i915_drv.o \
>   	  i915_irq.o \
> +	  i915_memcpy.o \
>   	  i915_params.o \
>   	  i915_pci.o \
>             i915_suspend.o \
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c5b7b8e0678a..16612542496a 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -848,6 +848,8 @@ static int i915_driver_init_early(struct drm_i915_private *dev_priv,
>   	mutex_init(&dev_priv->wm.wm_mutex);
>   	mutex_init(&dev_priv->pps_mutex);
>
> +	i915_memcpy_init_early(dev_priv);
> +
>   	ret = i915_workqueues_init(dev_priv);
>   	if (ret < 0)
>   		return ret;
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 27d9b2c374b3..3c266e7866ba 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -4070,4 +4070,7 @@ static inline bool __i915_request_irq_complete(struct drm_i915_gem_request *req)
>   	return false;
>   }
>
> +void i915_memcpy_init_early(struct drm_i915_private *dev_priv);
> +bool i915_memcpy_from_wc(void *dst, const void *src, unsigned long len);
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/i915_memcpy.c b/drivers/gpu/drm/i915/i915_memcpy.c
> new file mode 100644
> index 000000000000..4ed4d3bb2f3e
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_memcpy.c
> @@ -0,0 +1,75 @@
> +#include "i915_drv.h"
> +
> +DEFINE_STATIC_KEY_FALSE(has_movntqa);
> +
> +#ifdef CONFIG_AS_MOVNTDQA
> +static void __movntqda(void *dst, const void *src, unsigned long len)
> +{
> +	len >>= 4;
> +	while (len >= 4) {
> +		__asm__ __volatile__(
> +		"movntdqa (%0), %%xmm0\n"
> +		"movntdqa 16(%0), %%xmm1\n"
> +		"movntdqa 32(%0), %%xmm2\n"
> +		"movntdqa 48(%0), %%xmm3\n"
> +		"movaps %%xmm0, (%1)\n"
> +		"movaps %%xmm1, 16(%1)\n"
> +		"movaps %%xmm2, 32(%1)\n"
> +		"movaps %%xmm3, 48(%1)\n"
> +		: : "r" (src), "r" (dst) : "memory");
> +		src += 64;
> +		dst += 64;
> +		len -= 4;
> +	}
> +	while (len--) {
> +		__asm__ __volatile__(
> +		"movntdqa (%0), %%xmm0\n"
> +		"movaps %%xmm0, (%1)\n"
> +		: : "r" (src), "r" (dst) : "memory");
> +		src += 16;
> +		dst += 16;
> +	}
> +}
> +#endif

Is it okay nowadays to just use these registers in the kernel?

Many years ago, when I last looked into this, using the FPU and MMX
registers in the kernel was discouraged and needed an explicit
kernel_fpu_begin()/kernel_fpu_end() pair around the block, since the
kernel did not save/restore them itself and clobbering them would
corrupt the userspace context.

Perhaps these new registers are different, or things have generally
changed since then?
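
To make the question concrete, here is a rough user-space sketch of the
bracketing I have in mind. It is only an illustration under assumptions:
the mock_* helpers and copy_from_wc_bracketed() are hypothetical names
that do not exist in the patch, the mocks merely track nesting instead of
saving any register state, and a plain memcpy() stands in for the
movntdqa loop. In the kernel the real calls would be kernel_fpu_begin()
and kernel_fpu_end().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space stand-ins. The real kernel_fpu_begin() and
 * kernel_fpu_end() save and restore the FPU/SIMD state; these mocks
 * only count nesting so that the bracketing can be checked.
 */
static int fpu_depth;

static void mock_kernel_fpu_begin(void) { fpu_depth++; }
static void mock_kernel_fpu_end(void)   { fpu_depth--; }

/* Stand-in for the movntdqa loop; memcpy() replaces the SSE4.1 moves. */
static void mock_movntdqa_copy(void *dst, const void *src, size_t len)
{
	/* The xmm registers may only be touched inside the bracket. */
	assert(fpu_depth > 0);
	memcpy(dst, src, len);
}

/* Hypothetical wrapper showing where the bracket would go. */
static bool copy_from_wc_bracketed(void *dst, const void *src, size_t len)
{
	if (len & 15)
		return false;	/* same precondition as the patch */

	mock_kernel_fpu_begin();
	mock_movntdqa_copy(dst, src, len);
	mock_kernel_fpu_end();
	return true;
}
```

Whether the bracket belongs inside the copy helper or in its callers is
a separate question; the sketch only shows the pairing.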

Regards,

Tvrtko

> +
> +/**
> + * i915_memcpy_from_wc: perform an accelerated *aligned* read from WC
> + * @dst: destination pointer
> + * @src: source pointer
> + * @len: how many bytes to copy
> + *
> + * i915_memcpy_from_wc copies @len bytes from @src to @dst using
> + * non-temporal instructions where available. Note that all arguments
> + * (@src, @dst) must be aligned to 16 bytes and @len must be a multiple
> + * of 16.
> + *
> + * To test whether accelerated reads from WC are supported, use
> + * i915_memcpy_from_wc(NULL, NULL, 0);
> + *
> + * Returns true if the copy was successful, false if the preconditions
> + * are not met.
> + */
> +bool i915_memcpy_from_wc(void *dst, const void *src, unsigned long len)
> +{
> +	GEM_BUG_ON((unsigned long)dst & 15);
> +	GEM_BUG_ON((unsigned long)src & 15);
> +
> +	if (unlikely(len & 15))
> +		return false;
> +
> +#ifdef CONFIG_AS_MOVNTDQA
> +	if (static_branch_likely(&has_movntqa)) {
> +		if (len)
> +			__movntqda(dst, src, len);
> +		return true;
> +	}
> +#endif
> +
> +	return false;
> +}
> +
> +void i915_memcpy_init_early(struct drm_i915_private *dev_priv)
> +{
> +	if (static_cpu_has(X86_FEATURE_XMM4_1))
> +		static_branch_enable(&has_movntqa);
> +}
>
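
On the interface itself: as I read the kerneldoc above, a false return
means the caller has to copy some other way, and a zero-length call
doubles as a capability probe. A minimal user-space sketch of that
calling pattern follows. All names in it are hypothetical:
mock_has_movntdqa stands in for the static key, mock_memcpy_from_wc()
only mirrors the documented contract, and read_from_wc() is an assumed
caller that is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical switch standing in for the has_movntqa static key. */
static bool mock_has_movntdqa;

/*
 * User-space stand-in with the same contract as i915_memcpy_from_wc():
 * true if the copy was done, false if the caller must copy another way.
 */
static bool mock_memcpy_from_wc(void *dst, const void *src, unsigned long len)
{
	/* Both pointers must be 16-byte aligned, as in the patch. */
	assert(((uintptr_t)dst & 15) == 0);
	assert(((uintptr_t)src & 15) == 0);

	if (len & 15)
		return false;
	if (!mock_has_movntdqa)
		return false;
	if (len)
		memcpy(dst, src, len);	/* the real code uses movntdqa here */
	return true;
}

/* Hypothetical caller: try the accelerated path, else plain memcpy(). */
static void read_from_wc(void *dst, const void *src, unsigned long len)
{
	if (!mock_memcpy_from_wc(dst, src, len))
		memcpy(dst, src, len);
}
```

A caller that wants to pick a strategy up front could evaluate
mock_memcpy_from_wc(NULL, NULL, 0) once and remember the result.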
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

