RE: [patch v3 7/7] crc32: final-cleanup.diff

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Bob Pearson" <rpearson@systemfabricworks.com>
To: "'Joakim Tjernlund'" <joakim.tjernlund@transmode.se>
Cc: <akpm@linux-foundation.org>, <fzago@systemfabricworks.com>,
	"'George Spelvin'" <linux@horizon.com>,
	<linux-kernel@vger.kernel.org>
Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff
Date: Wed, 10 Aug 2011 10:13:00 -0500	[thread overview]
Message-ID: <002601cc576f$ff062180$fd126480$@systemfabricworks.com> (raw)
In-Reply-To: <OF3719A120.AE2CD7D2-ONC12578E8.00342C83-C12578E8.00402A3E@transmode.se>

OK. Can you post your current version of crc32.c? I'll try to merge them
together.

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Wednesday, August 10, 2011 6:41 AM
> To: Bob Pearson
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; George
> Spelvin; linux-kernel@vger.kernel.org
> Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff
> 
> "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/10
> 01:05:56:
> >
> > >
> > > Doing this adds one insn to ppc. What is good for x86 isn't for
> > > pcc32 (and I guess any modern arch).
> >
> > I looked at the assembly output from the original code and I see two
> > compares at the end of the first loop, one for --len and one for (buf &
3)
> > which it has to compute. You are better off just computing buf & 3 once
> and
> > comparing with len once to compute init_len then in each iteration in
the
> > loop you only have to compare one thing.
> >
> > As proposed:
> >         init_len = min((-((uintptr_t)buf)) & 3, len);
> >         ...
> >
> >         /* process unaligned initial bytes */
> >         for (i = 0; i < init_len; i++)
> >                 DO_CRC(*buf++);
> >
> >
> > crc32x_le:
> >         <.....>
> >         negq    %rcx            # (-buf)
> >         andl    $3, %ecx            # (-buf) & 3
> >         cmpq    %rdx, %rcx            # min()
> >         cmova   %rdx, %rcx            # init_len =
> >         xorl    %edi, %edi            # i = 0
> >         <.....>
> >         jmp     .L2
> > .L3:
> > # crc = tab[0][(crc ^ (*buf++)) & 255] ^ (crc >> 8)
> >         movb    (%rsi,%rdi), %dl         # buf[i]
> >         incq    %rdi               # buf++
> >         xorl    %eax, %edx            # crc ^ *buf++
> >         shrl    $8, %eax            # crc >> 8
> >         movzbl  %dl, %edx            # & 255
> >         xorl    crc32table_le(,%rdx,4), %eax                 # crc =
> > tab[...] ^ (crc >> 8)
> > .L2:
> >         cmpq    %rcx, %rdi            # compare i with
> > init_len
> >         jb      .L3
> >         <.....>
> >
> > As was originally:
> >         if (unlikely((long)buf & 3 && len)) {
> >                 do {
> >                         DO_CRC(*buf++);
> >                 } while ((--len) && ((long)buf)&3);
> >         }
> >
> > crc32x_le:
> >         pushq   %rbp
> >         movl    %edi, %eax
> >         pushq   %rbx
> >         testb   $3, %sil
> >         je      .L16
> >         testq   %rdx, %rdx
> > .L25:
> >         je      .L16
> >         movb    (%rsi), %cl            # *p
> >         incq    %rsi               # p++
> >         xorl    %eax, %ecx            # crc ^ *p
> >         shrl    $8, %eax            # crc >> 8
> >         movzbl  %cl, %ecx            # ... & 255
> >         xorl    crc32table_le(,%rcx,4), %eax      # tab[...] ^ (crc >>
> > 8)
> >         decq    %rdx            # len--
> >         je      .L16               # if len != 0
> > continue
> >         testb   $3, %sil            # buf & 3
> >         jmp     .L25               # if buf & 3
> > continue
> > .L16:
> >
> > The first loop has 8 instructions second one has 11 instructions. First
loop
> > has slightly more setup to compute init_len, second loop has a couple of
> > extra branches to compute the if() outside of the loop.
> >
> > I get better performance with the new form of the loop. How, does PPC
> get
> > better results with two tests?
> 
> You are right, even ppc gets less insns in the byte loop. However I tested
> your latest patch series and your version is a tiny bit slower than my
best
> version(numbers below)
> Probably due to that your version has a higher startup cost and uses more
> registers(3 regs are pushed to the stack vs. 2 regs in mine)
> 
> >
> > >
> > > > >
> > > > > -/* implements slicing-by-4 or slicing-by-8 algorithm */
> > > > > -static inline u32
> > > > > -crc32_body(u32 crc, unsigned char const *buf, size_t len, const
u32
> > > > > (*tab)[256])
> > > >
> > > > After careful measurements and looking at asm code I figured out
that
> > > there
> > > > was no penalty for using 2D array. That allowed me to go back to the
> > > > original form.
> > >
> > > What about my suggestion to assign each table to a ptr? This was good
> > > for ppc so if it doesn't slow x86 it should be added.
> >
> > Gcc is very clever and inlined crc32_body() into crc32_le() and
crc32_be()
> > and then replaced references to tab[x][y] with references to
crc32table_le
> > and crc32table_be plus a constant offset which depended on the first
index
> > ([x]). So in effect it completely unrolled back to 1D arrays as
> > (crc32table_le + offset)[y], which is equivalent to the t0_le[y] code I
had
> > before.
> 
> So x86 optimizes the table accesses well then. PPC32 doesn't and adding
> my small optimization makes a huge difference:
> org:
> crc32: CRC_LE_BITS = 8, CRC_BE BITS = 8
> crc32: self tests passed, processed 225944 bytes in 2257673 nsec
> 
> my t0, t1, t2, t3 ptrs
> crc32: CRC_LE_BITS = 8, CRC_BE BITS = 8
> crc32: self tests passed, processed 225944 bytes in 1949869 nsec
> 
> Numbers for 32 bits with your patch.
> 
> v3 up to 6 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 2362044 nsec
> 
> v3 up to 7 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 2149972 nsec
> 
> my t0, t1, t2, t3 ptrs on v3 7 of 7
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 1953290 nsec
> 
> Finally, 64 bits on PPC is broken in v3!
> 
>  Jocke

next prev parent reply	other threads:[~2011-08-10 15:13 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-09  5:29 [patch v3 7/7] crc32: final-cleanup.diff Bob Pearson
2011-08-09  6:21 ` Bob Pearson
2011-08-09 17:39   ` Joakim Tjernlund
2011-08-09 23:05     ` Bob Pearson
2011-08-10 11:40       ` Joakim Tjernlund
2011-08-10 15:13         ` Bob Pearson [this message]
2011-08-10 15:48           ` Joakim Tjernlund
2011-08-10 21:57             ` Bob Pearson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='002601cc576f$ff062180$fd126480$@systemfabricworks.com' \
    --to=rpearson@systemfabricworks.com \
    --cc=akpm@linux-foundation.org \
    --cc=fzago@systemfabricworks.com \
    --cc=joakim.tjernlund@transmode.se \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@horizon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox