From: Doug Ledford <dledford@redhat.com>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Ingo Molnar <mingo@kernel.org>,
Eric Dumazet <eric.dumazet@gmail.com>,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
David Laight <David.Laight@ACULAB.COM>
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Date: Wed, 30 Oct 2013 09:35:05 -0400
Message-ID: <52710B09.6090302@redhat.com>
In-Reply-To: <20131030110214.GA10220@localhost.localdomain>
On 10/30/2013 07:02 AM, Neil Horman wrote:
> That does make sense, but it then begs the question, what's the advantage of
> having multiple alu's at all?
There are lots of ALU operations that don't touch the flags or other
shared state, and those can be run in parallel.
> If they're just going to serialize on the
> updating of the condition register, there doesn't seem to be much advantage in
> having multiple alu's at all, especially if a common use case (parallelizing an
> operation on a large linear dataset) resulted in lower performance.
>
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
>
> would prevent pipeline stalls. That would be interesting data, and (I think)
> support your theory, Doug. I'll give that a try
Just to avoid spending too much time on various combinations, here are
the methods I've tried:
    Original code
    2 chains doing interleaved memory accesses
    2 chains doing serial memory accesses (as above)
    4 chains doing serial memory accesses
    4 chains using 32bit values in 64bit registers so you can always use add
      instead of adc and never need the carry flag
And I've done all of the above with simple prefetch and smart prefetch.
In all cases, the result is basically that the add method doesn't matter
much in the grand scheme of things, but the prefetch does, and smart
prefetch always beat simple prefetch.
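To make the last variant in the list concrete, here is a minimal userspace sketch, not the kernel code. Each 32-bit word is zero-extended into one of four independent 64-bit accumulators, so plain adds are enough and the carry flag is never consulted. The names csum_add4 and fold64 are hypothetical, and the sketch assumes len is a multiple of 16 and small enough that the 64-bit accumulators cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit accumulator down to a 16-bit ones-complement sum. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* now fits in 32 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* at most 17 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* now fits in 16 bits */
    return (uint16_t)sum;
}

/* Hypothetical helper: sums len bytes (assumed to be a multiple of 16)
 * using four accumulators that never depend on each other or on a carry. */
static uint16_t csum_add4(const unsigned char *buf, size_t len)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 16 <= len; i += 16) {
        uint32_t w[4];

        memcpy(w, buf + i, sizeof(w));  /* unaligned-safe 32-bit loads */
        s0 += w[0];                     /* plain add, no adc needed */
        s1 += w[1];
        s2 += w[2];
        s3 += w[3];
    }
    /* Each accumulator gains less than 2^32 per iteration, so the
     * combined total cannot wrap for any realistic packet length. */
    return fold64(s0 + s1 + s2 + s3);
}
```

The result is in native byte order, the same convention a 16-bit end-around-carry sum over the same bytes would use.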
My simple prefetch was to just go into the main while() loop for the
csum operation and always prefetch 5*64 bytes ahead of the current position.
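As a rough userspace illustration of that simple scheme (a sketch, not the code that was benchmarked): every 64-byte block issues one prefetch a fixed 5*64 bytes ahead. __builtin_prefetch is the GCC/Clang builtin standing in for the kernel's prefetch(), and sum_with_simple_prefetch and PREFETCH_AHEAD are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PREFETCH_AHEAD (5 * 64)  /* five 64-byte cache lines */

/* Hypothetical helper: adds up 32-bit words in 64-byte blocks and returns
 * the raw, unfolded 64-bit total. Trailing bytes beyond the last full
 * block are ignored to keep the sketch short. */
static uint64_t sum_with_simple_prefetch(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    size_t i, j;

    for (i = 0; i + 64 <= len; i += 64) {
        /* Always look a fixed distance ahead, even if that address is
         * past the end of the buffer; a prefetch hint does not fault. */
        __builtin_prefetch((const void *)((uintptr_t)(buf + i) + PREFETCH_AHEAD));

        for (j = 0; j < 64; j += 4) {
            uint32_t w;

            memcpy(&w, buf + i + j, sizeof(w));
            sum += w;
        }
    }
    return sum;
}
```

Note that nothing here stops the prefetches at the end of the buffer, and the first five lines are never prefetched at all.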
My smart prefetch looks like this:
static inline void prefetch_line(unsigned long *cur_line,
				 unsigned long *end_line,
				 size_t size)
{
	size_t fetched = 0;

	while (*cur_line <= *end_line && fetched < size) {
		prefetch((void *)*cur_line);
		*cur_line += cache_line_size();
		fetched += cache_line_size();
	}
}
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	...
	unsigned long cur_line = (unsigned long)buff &
				 ~(cache_line_size() - 1);
	unsigned long end_line = ((unsigned long)buff + len) &
				 ~(cache_line_size() - 1);
	...
	/* Don't bother to prefetch the first line, we'll end up stalling on
	 * it anyway, but go ahead and start the prefetch on the next 3 */
	cur_line += cache_line_size();
	prefetch_line(&cur_line, &end_line, cache_line_size() * 3);
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
	...
		count >>= 1;		/* nr of 32-bit words.. */
		/* prefetch line #4 ahead of main loop */
		prefetch_line(&cur_line, &end_line, cache_line_size());
		if (count) {
	...
			while (count64) {
				/* we are now prefetching line #5 ahead of
				 * where we are starting, and will stay 5
				 * ahead throughout the loop, at least until
				 * we get to the end line and then we'll stop
				 * prefetching */
				prefetch_line(&cur_line, &end_line, 64);
				ADDL_64;
				buff += 64;
				count64--;
			}
			ADDL_64_FINISH;
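The cursor arithmetic in that fragment can be exercised on its own. The following is a userspace sketch under stated assumptions: CACHE_LINE is a fixed 64 standing in for cache_line_size(), __builtin_prefetch stands in for prefetch(), end_line is passed by value, and the hypothetical prefetch_lines returns how many lines it touched purely so the behaviour can be checked.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: stands in for cache_line_size() */

/* Round an address down to the start of its cache line. */
static uintptr_t line_of(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Prefetch whole lines starting at *cur_line until either `size` bytes
 * have been requested or the cursor has moved past end_line.
 * Returns the number of lines prefetched (for illustration only). */
static unsigned prefetch_lines(uintptr_t *cur_line, uintptr_t end_line,
                               size_t size)
{
    size_t fetched = 0;
    unsigned lines = 0;

    while (*cur_line <= end_line && fetched < size) {
        __builtin_prefetch((const void *)*cur_line);
        *cur_line += CACHE_LINE;
        fetched += CACHE_LINE;
        lines++;
    }
    return lines;
}
```

For a buffer that starts 10 bytes into a line and is 200 bytes long, skipping the first line and asking for three more consumes every remaining line, so later calls from the main loop prefetch nothing. That is the "stop at the end line" behaviour the comment in the loop describes.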
I was going to tinker today and tomorrow with this function once I get a
toolchain that will compile it (I reinstalled all my rhel6 hosts as f20
and I'm hoping that does the trick, if not I need to do more work):
/* NB: xorq clears CF and OF so both carry chains start clean, but as
 * written it also zeroes res1, discarding the value of result1 on entry */
#define ADCXQ_64					\
	asm("xorq %[res1],%[res1]\n\t"			\
	    "adcxq 0*8(%[src]),%[res1]\n\t"		\
	    "adoxq 1*8(%[src]),%[res2]\n\t"		\
	    "adcxq 2*8(%[src]),%[res1]\n\t"		\
	    "adoxq 3*8(%[src]),%[res2]\n\t"		\
	    "adcxq 4*8(%[src]),%[res1]\n\t"		\
	    "adoxq 5*8(%[src]),%[res2]\n\t"		\
	    "adcxq 6*8(%[src]),%[res1]\n\t"		\
	    "adoxq 7*8(%[src]),%[res2]\n\t"		\
	    "adcxq %[zero],%[res1]\n\t"			\
	    "adoxq %[zero],%[res2]\n\t"			\
	    : [res1] "=r" (result1),			\
	      [res2] "=r" (result2)			\
	    : [src] "r" (buff), [zero] "r" (zero),	\
	      "[res1]" (result1), "[res2]" (result2))
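The data flow of that macro can be modelled in portable C. This is only an emulation sketch: the hypothetical adc64 helper plays the role of one add-with-carry step, and two separate carry variables stand in for CF (used by adcx) and OF (used by adox). It shows that the two chains never read each other's carry, not that the hardware runs them in parallel. Unlike the macro, it starts with both carries cleared without touching res1.

```c
#include <stdint.h>

/* One add-with-carry step: *acc += x + *carry, carry-out left in *carry. */
static void adc64(uint64_t *acc, uint64_t x, unsigned *carry)
{
    uint64_t t = *acc + x;
    unsigned c = t < x;      /* carry out of acc + x */

    t += *carry;
    c |= (t < *carry);       /* carry out of adding the carry-in */
    *acc = t;
    *carry = c;
}

/* Hypothetical model of the macro: even quadwords go through one carry
 * chain into *res1, odd quadwords through a second chain into *res2. */
static void sum8_two_chains(const uint64_t src[8],
                            uint64_t *res1, uint64_t *res2)
{
    unsigned cf = 0, of = 0;  /* both "flags" start cleared */
    int i;

    for (i = 0; i < 8; i += 2) {
        adc64(res1, src[i], &cf);      /* adcx: reads and writes only CF */
        adc64(res2, src[i + 1], &of);  /* adox: reads and writes only OF */
    }
    adc64(res1, 0, &cf);  /* add each chain's final carry back in */
    adc64(res2, 0, &of);
}
```

The caller would still need to combine res1 and res2 with an end-around carry and fold the result, as with a single accumulator.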
and then I also wanted to try using both xmm and ymm registers and doing
64bit adds with 32bit numbers across multiple xmm/ymm registers, as that
should parallelize nicely. David, you mentioned you've tried this; how did
your experiment turn out, and what was your method? I was planning on
doing regular full-size loads into one xmm/ymm register, then using
pshufd/vpshufd to move the data into two different registers, then
summing into a fourth register, and possibly running two of those
pipelines in parallel.
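A rough sketch of the xmm idea, with one substitution stated plainly: instead of pshufd, it widens the 32-bit words by unpacking against a zero register (_mm_unpacklo_epi32 / _mm_unpackhi_epi32), which is a different way to get the same zero-extended 64-bit lanes. The function name sum32_wide is hypothetical, len is assumed to be a multiple of 16, and a scalar fallback is included so the code also builds where SSE2 is unavailable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum len bytes (assumed a multiple of 16) as 32-bit words and return
 * the raw 64-bit total; folding to 16 bits is left to the caller. */
static uint64_t sum32_wide(const unsigned char *buf, size_t len)
{
    size_t i;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  /* two independent 64-bit lanes */
    uint64_t lanes[2];

    for (i = 0; i + 16 <= len; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i lo = _mm_unpacklo_epi32(v, zero);  /* words 0,1 as 64-bit */
        __m128i hi = _mm_unpackhi_epi32(v, zero);  /* words 2,3 as 64-bit */

        acc = _mm_add_epi64(acc, lo);  /* 64-bit adds, no carry flag */
        acc = _mm_add_epi64(acc, hi);
    }
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1];
#else
    uint64_t sum = 0;

    for (i = 0; i + 16 <= len; i += 16) {
        uint32_t w[4];

        memcpy(w, buf + i, sizeof(w));
        sum += (uint64_t)w[0] + w[1] + w[2] + w[3];
    }
    return sum;
#endif
}
```

Running two such pipelines in parallel would amount to keeping two accumulator registers and alternating loads between them; that is not shown here.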