From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Dave Gordon <david.s.gordon@intel.com>,
Chris Wilson <chris@chris-wilson.co.uk>,
intel-gfx@lists.freedesktop.org,
Akash Goel <akash.goel@intel.com>,
Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH] drm/i915: Use SSE4.1 movntdqa to accelerate reads from WC memory
Date: Tue, 19 Jul 2016 11:26:25 +0100 [thread overview]
Message-ID: <578E0051.2000303@linux.intel.com> (raw)
In-Reply-To: <578CF070.50300@linux.intel.com>
On 18/07/16 16:06, Tvrtko Ursulin wrote:
> On 18/07/16 14:46, Tvrtko Ursulin wrote:
>
> [snip]
>
>> This version generates the smallest code:
>>
>> static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src, unsigned long len)
>> {
>> unsigned long l4;
>>
>> kernel_fpu_begin();
>>
>> l4 = len / 4;
>> while (l4) {
>> asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
>> asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));
>> asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
>> asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
>> asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
>> asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
>> asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
>> asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
>> src += 4;
>> dst += 4;
>> l4--;
>> }
>>
>> len %= 4;
>> while (len) {
>> asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
>> asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
>> src++;
>> dst++;
>> len--;
>> }
>>
>> kernel_fpu_end();
>> }
>>
>> Although I still haven't figured out a way to convince it to use
>> the same registers for src and dest between the two loops.
>
> I remembered one famous interview question, along the lines of, "what
> is the code below doing". Translated to this example:
>
> static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src, unsigned long len)
> {
> unsigned long n;
>
> kernel_fpu_begin();
>
Bugfix here:
+ len /= 16;
> n = (len + 3) / 4;
> switch (len % 4) {
> case 0: do { asm("movntdqa %1, %%xmm0\n"
> "movaps %%xmm0, %0\n" : "=m" (*dst): "m" (*src));
> src++; dst++;
> case 3: asm("movntdqa %1, %%xmm1\n"
> "movaps %%xmm1, %0\n" : "=m" (*dst): "m" (*src));
> src++; dst++;
> case 2: asm("movntdqa %1, %%xmm2\n"
> "movaps %%xmm2, %0\n" : "=m" (*dst): "m" (*src));
> src++; dst++;
> case 1: asm("movntdqa %1, %%xmm3\n"
> "movaps %%xmm3, %0\n" : "=m" (*dst): "m" (*src));
> src++; dst++;
> } while (--n > 0);
> }
>
> kernel_fpu_end();
> }
>
> :D
>
> No idea if loads/stores can run async in this case.
It seems to be equally fast as the unrolled loop. And it generates the
smallest code. :)
I would also suggest a "likely (len)" in the original patch since zero
length copies are not expected and that also manages to shrink the code
a bit.
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2016-07-19 10:26 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-07-16 12:07 [PATCH] drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory Chris Wilson
2016-07-16 12:32 ` ✗ Ro.CI.BAT: failure for " Patchwork
2016-07-16 12:33 ` [PATCH] " Chris Wilson
2016-07-16 12:45 ` [PATCH v2] " Chris Wilson
2016-07-16 12:53 ` [PATCH v3] " Chris Wilson
2016-07-16 15:44 ` [PATCH v5] drm/i915: Use SSE4.1 movntdqa " Chris Wilson
2016-07-17 3:21 ` Matt Turner
2016-07-17 8:12 ` Chris Wilson
2016-07-18 9:31 ` Tvrtko Ursulin
2016-07-18 10:01 ` Chris Wilson
2016-07-18 10:07 ` [PATCH] " Chris Wilson
2016-07-18 10:29 ` Chris Wilson
2016-07-18 11:15 ` Tvrtko Ursulin
2016-07-18 11:35 ` Chris Wilson
2016-07-18 11:57 ` Dave Gordon
2016-07-18 12:56 ` Tvrtko Ursulin
2016-07-18 13:46 ` Tvrtko Ursulin
2016-07-18 15:06 ` Tvrtko Ursulin
2016-07-18 16:05 ` Dave Gordon
2016-07-19 10:26 ` Tvrtko Ursulin [this message]
2016-07-18 13:48 ` Dave Gordon
2016-07-18 12:07 ` Tvrtko Ursulin
2016-07-19 6:50 ` Daniel Vetter
2016-07-16 13:12 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev2) Patchwork
2016-07-16 13:34 ` ✓ Ro.CI.BAT: success for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev3) Patchwork
2016-07-16 16:12 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev4) Patchwork
2016-07-18 10:59 ` ✗ Ro.CI.BAT: failure for drm/i915: Use SSE4.1 movntqda to accelerate reads from WC memory (rev5) Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=578E0051.2000303@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=akash.goel@intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=david.s.gordon@intel.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=mika.kuoppala@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.