public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Dave Gordon <david.s.gordon@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org,
	Akash Goel <akash.goel@intel.com>,
	Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH] drm/i915: Use SSE4.1 movntdqa to accelerate reads from WC memory
Date: Mon, 18 Jul 2016 17:05:29 +0100	[thread overview]
Message-ID: <578CFE49.70008@intel.com> (raw)
In-Reply-To: <578CF070.50300@linux.intel.com>

On 18/07/16 16:06, Tvrtko Ursulin wrote:
>
> On 18/07/16 14:46, Tvrtko Ursulin wrote:
>
> [snip]
>
>> This version generates the smallest code:
>>
>> static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src, unsigned long len)
>> {
>> 	unsigned long l4;
>>
>> 	kernel_fpu_begin();
>>
>> 	l4 = len / 4;
>> 	while (l4) {
>> 		asm("movntdqa   (%0), %%xmm0" :: "r" (src), "m" (src[0]));
>> 		asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));
>> 		asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
>> 		asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
>> 		asm("movaps %%xmm0,   (%1)" : "=m" (dst[0]) : "r" (dst));
>> 		asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
>> 		asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
>> 		asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
>> 		src += 4;
>> 		dst += 4;
>> 		l4--;
>> 	}
>>
>> 	len %= 4;
>> 	while (len) {
>> 		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
>> 		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
>> 		src++;
>> 		dst++;
>> 		len--;
>> 	}
>>
>> 	kernel_fpu_end();
>> }
>>
>> Although I still haven't figured out a way to convince it to use
>> the same registers for src and dest between the two loops.
>
> I remembered one famous interview question, along the lines of, "what
> is the code below doing". Translated to this example:
>
> static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src, unsigned long len)
> {
> 	unsigned long n;
>
> 	kernel_fpu_begin();
>
> 	n = (len + 3) / 4;
> 	switch (len % 4) {
> 	case 0: do { asm("movntdqa %1, %%xmm0\n"
> 			  "movaps %%xmm0, %0\n" : "=m" (*dst): "m" (*src));
> 		     src++; dst++;
> 	case 3:	     asm("movntdqa %1, %%xmm1\n"
> 			  "movaps %%xmm1, %0\n" : "=m" (*dst): "m" (*src));
> 		     src++; dst++;
> 	case 2:	     asm("movntdqa %1, %%xmm2\n"
> 			  "movaps %%xmm2, %0\n" : "=m" (*dst): "m" (*src));
> 		     src++; dst++;
> 	case 1:	     asm("movntdqa %1, %%xmm3\n"
> 			  "movaps %%xmm3, %0\n" : "=m" (*dst): "m" (*src));
> 		     src++; dst++;
> 		} while (--n > 0);
> 	}
>
> 	kernel_fpu_end();
> }
>
> :D
>
> No idea if loads/stores can run async in this case.
>
> Regards,
> Tvrtko

Here's yet another variant, just to document other ways of writing it:

#include "asm/fpu/api.h"

/* This is the datatype of an xmm register */
typedef double xmmd_t __attribute__ ((vector_size (16)));

__attribute__((target("sse4.1")))
void __memcpy_ntdqa(xmmd_t *dst, const xmmd_t *src, unsigned long len)
{
	xmmd_t tmp0, tmp1, tmp2, tmp3;
	unsigned long l64;

	kernel_fpu_begin();

	/* Whole 64-byte blocks as 4*16 bytes */
	for (l64 = len/64; l64--; ) {
		asm("movntdqa %1, %0" : "=x" (tmp0) : "m" (*src++));
		asm("movntdqa %1, %0" : "=x" (tmp1) : "m" (*src++));
		asm("movntdqa %1, %0" : "=x" (tmp2) : "m" (*src++));
		asm("movntdqa %1, %0" : "=x" (tmp3) : "m" (*src++));
		asm("movaps   %1, %0" : "=m" (*dst++) : "x" (tmp0));
		asm("movaps   %1, %0" : "=m" (*dst++) : "x" (tmp1));
		asm("movaps   %1, %0" : "=m" (*dst++) : "x" (tmp2));
		asm("movaps   %1, %0" : "=m" (*dst++) : "x" (tmp3));
	}

	/* Remaining up-to-3 16-byte chunks */
	for (len &= 63, len >>= 4; len--; ) {
		asm("movntdqa %1, %0" : "=x" (tmp0) : "m" (*src++));
		asm("movaps   %1, %0" : "=m" (*dst++) : "x" (tmp0));
	}

	kernel_fpu_end();
}

I wondered whether we could get GCC to unroll the loops automatically 
i.e. just write the one loop and say we wanted it unrolled four times, 
leaving the compiler to deal with the remainder; but I didn't find a way 
to specify "unroll 4 times" as opposed to just "unroll this some".

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2016-07-18 16:05 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-16 12:07 [PATCH] drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory Chris Wilson
2016-07-16 12:32 ` ✗ Ro.CI.BAT: failure for " Patchwork
2016-07-16 12:33 ` [PATCH] " Chris Wilson
2016-07-16 12:45 ` [PATCH v2] " Chris Wilson
2016-07-16 12:53   ` [PATCH v3] " Chris Wilson
2016-07-16 15:44     ` [PATCH v5] drm/i915: Use SSE4.1 movntdqa " Chris Wilson
2016-07-17  3:21       ` Matt Turner
2016-07-17  8:12         ` Chris Wilson
2016-07-18  9:31       ` Tvrtko Ursulin
2016-07-18 10:01         ` Chris Wilson
2016-07-18 10:07           ` [PATCH] " Chris Wilson
2016-07-18 10:29             ` Chris Wilson
2016-07-18 11:15             ` Tvrtko Ursulin
2016-07-18 11:35               ` Chris Wilson
2016-07-18 11:57                 ` Dave Gordon
2016-07-18 12:56                   ` Tvrtko Ursulin
2016-07-18 13:46                     ` Tvrtko Ursulin
2016-07-18 15:06                       ` Tvrtko Ursulin
2016-07-18 16:05                         ` Dave Gordon [this message]
2016-07-19 10:26                         ` Tvrtko Ursulin
2016-07-18 13:48                     ` Dave Gordon
2016-07-18 12:07                 ` Tvrtko Ursulin
2016-07-19  6:50             ` Daniel Vetter
2016-07-16 13:12 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev2) Patchwork
2016-07-16 13:34 ` ✓ Ro.CI.BAT: success for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev3) Patchwork
2016-07-16 16:12 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev4) Patchwork
2016-07-18 10:59 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev5) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=578CFE49.70008@intel.com \
    --to=david.s.gordon@intel.com \
    --cc=akash.goel@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=mika.kuoppala@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox