All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Bob Pearson" <rpearson@systemfabricworks.com>
To: "'Joakim Tjernlund'" <joakim.tjernlund@transmode.se>
Cc: <akpm@linux-foundation.org>, <fzago@systemfabricworks.com>,
	"'George Spelvin'" <linux@horizon.com>,
	<linux-kernel@vger.kernel.org>
Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff
Date: Wed, 10 Aug 2011 10:13:00 -0500	[thread overview]
Message-ID: <002601cc576f$ff062180$fd126480$@systemfabricworks.com> (raw)
In-Reply-To: <OF3719A120.AE2CD7D2-ONC12578E8.00342C83-C12578E8.00402A3E@transmode.se>

OK. Can you post your current version of crc32.c? I'll try to merge them
together.

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Wednesday, August 10, 2011 6:41 AM
> To: Bob Pearson
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; George
> Spelvin; linux-kernel@vger.kernel.org
> Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff
> 
> "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/10
> 01:05:56:
> >
> > >
> > > Doing this adds one insn to ppc. What is good for x86 isn't for
> > > pcc32 (and I guess any modern arch).
> >
> > I looked at the assembly output from the original code and I see two
> > compares at the end of the first loop, one for --len and one for (buf &
3)
> > which it has to compute. You are better off just computing buf & 3 once
> and
> > comparing with len once to compute init_len then in each iteration in
the
> > loop you only have to compare one thing.
> >
> > As proposed:
> >         init_len = min((-((uintptr_t)buf)) & 3, len);
> >         ...
> >
> >         /* process unaligned initial bytes */
> >         for (i = 0; i < init_len; i++)
> >                 DO_CRC(*buf++);
> >
> >
> > crc32x_le:
> >         <.....>
> >         negq    %rcx            # (-buf)
> >         andl    $3, %ecx            # (-buf) & 3
> >         cmpq    %rdx, %rcx            # min()
> >         cmova   %rdx, %rcx            # init_len =
> >         xorl    %edi, %edi            # i = 0
> >         <.....>
> >         jmp     .L2
> > .L3:
> > # crc = tab[0][(crc ^ (*buf++)) & 255] ^ (crc >> 8)
> >         movb    (%rsi,%rdi), %dl         # buf[i]
> >         incq    %rdi               # buf++
> >         xorl    %eax, %edx            # crc ^ *buf++
> >         shrl    $8, %eax            # crc >> 8
> >         movzbl  %dl, %edx            # & 255
> >         xorl    crc32table_le(,%rdx,4), %eax                 # crc =
> > tab[...] ^ (crc >> 8)
> > .L2:
> >         cmpq    %rcx, %rdi            # compare i with
> > init_len
> >         jb      .L3
> >         <.....>
> >
> > As was originally:
> >         if (unlikely((long)buf & 3 && len)) {
> >                 do {
> >                         DO_CRC(*buf++);
> >                 } while ((--len) && ((long)buf)&3);
> >         }
> >
> > crc32x_le:
> >         pushq   %rbp
> >         movl    %edi, %eax
> >         pushq   %rbx
> >         testb   $3, %sil
> >         je      .L16
> >         testq   %rdx, %rdx
> > .L25:
> >         je      .L16
> >         movb    (%rsi), %cl            # *p
> >         incq    %rsi               # p++
> >         xorl    %eax, %ecx            # crc ^ *p
> >         shrl    $8, %eax            # crc >> 8
> >         movzbl  %cl, %ecx            # ... & 255
> >         xorl    crc32table_le(,%rcx,4), %eax      # tab[...] ^ (crc >>
> > 8)
> >         decq    %rdx            # len--
> >         je      .L16               # if len != 0
> > continue
> >         testb   $3, %sil            # buf & 3
> >         jmp     .L25               # if buf & 3
> > continue
> > .L16:
> >
> > The first loop has 8 instructions second one has 11 instructions. First
loop
> > has slightly more setup to compute init_len, second loop has a couple of
> > extra branches to compute the if() outside of the loop.
> >
> > I get better performance with the new form of the loop. How, does PPC
> get
> > better results with two tests?
> 
> You are right, even ppc gets less insns in the byte loop. However I tested
> your latest patch series and your version is a tiny bit slower than my
best
> version(numbers below)
> Probably due to that your version has a higher startup cost and uses more
> registers(3 regs are pushed to the stack vs. 2 regs in mine)
> 
> >
> > >
> > > > >
> > > > > -/* implements slicing-by-4 or slicing-by-8 algorithm */
> > > > > -static inline u32
> > > > > -crc32_body(u32 crc, unsigned char const *buf, size_t len, const
u32
> > > > > (*tab)[256])
> > > >
> > > > After careful measurements and looking at asm code I figured out
that
> > > there
> > > > was no penalty for using 2D array. That allowed me to go back to the
> > > > original form.
> > >
> > > What about my suggestion to assign each table to a ptr? This was good
> > > for ppc so if it doesn't slow x86 it should be added.
> >
> > Gcc is very clever and inlined crc32_body() into crc32_le() and
crc32_be()
> > and then replaced references to tab[x][y] with references to
crc32table_le
> > and crc32table_be plus a constant offset which depended on the first
index
> > ([x]). So in effect it completely unrolled back to 1D arrays as
> > (crc32table_le + offset)[y], which is equivalent to the t0_le[y] code I
had
> > before.
> 
> So x86 optimizes the table accesses well then. PPC32 doesn't and adding
> my small optimization makes a huge difference:
> org:
> crc32: CRC_LE_BITS = 8, CRC_BE BITS = 8
> crc32: self tests passed, processed 225944 bytes in 2257673 nsec
> 
> my t0, t1, t2, t3 ptrs
> crc32: CRC_LE_BITS = 8, CRC_BE BITS = 8
> crc32: self tests passed, processed 225944 bytes in 1949869 nsec
> 
> Numbers for 32 bits with your patch.
> 
> v3 up to 6 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 2362044 nsec
> 
> v3 up to 7 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 2149972 nsec
> 
> my t0, t1, t2, t3 ptrs on v3 7 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 1953290 nsec
> 
> Finally, 64 bits on PPC is broken in v3!
> 
>  Jocke



  reply	other threads:[~2011-08-10 15:13 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-09  5:29 [patch v3 7/7] crc32: final-cleanup.diff Bob Pearson
2011-08-09  6:21 ` Bob Pearson
2011-08-09 17:39   ` Joakim Tjernlund
2011-08-09 23:05     ` Bob Pearson
2011-08-10 11:40       ` Joakim Tjernlund
2011-08-10 15:13         ` Bob Pearson [this message]
2011-08-10 15:48           ` Joakim Tjernlund
2011-08-10 21:57             ` Bob Pearson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='002601cc576f$ff062180$fd126480$@systemfabricworks.com' \
    --to=rpearson@systemfabricworks.com \
    --cc=akpm@linux-foundation.org \
    --cc=fzago@systemfabricworks.com \
    --cc=joakim.tjernlund@transmode.se \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@horizon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.