public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Charlie Jenkins <charlie@rivosinc.com>
To: David Laight <David.Laight@aculab.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>,
	Conor Dooley <conor@kernel.org>,
	Samuel Holland <samuel.holland@sifive.com>,
	"linux-riscv@lists.infradead.org"
	<linux-riscv@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Albert Ou <aou@eecs.berkeley.edu>
Subject: Re: [PATCH v4 2/5] riscv: Add checksum library
Date: Wed, 13 Sep 2023 19:18:18 -0400	[thread overview]
Message-ID: <ZQJDOj/6lUumg7WP@ghost> (raw)
In-Reply-To: <b503e341f2334d6b9b902a6e621ed7c3@AcuMS.aculab.com>

On Wed, Sep 13, 2023 at 08:47:49AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 13 September 2023 04:10
> > 
> > On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> > > From: Charlie Jenkins
> > > > Sent: 11 September 2023 23:57
> > > >
> > > > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > > > will load from the buffer in groups of 32 bits, and when compiled for
> > > > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > > > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > > > the riscv generated code in QEMU, discovered that summing in a
> > > > tree-like structure is about 4% faster than doing 64-bit reads.
> > > >
> > > ...
> > > > +	sum   = saddr->s6_addr32[0];
> > > > +	sum  += saddr->s6_addr32[1];
> > > > +	sum1  = saddr->s6_addr32[2];
> > > > +	sum1 += saddr->s6_addr32[3];
> > > > +
> > > > +	sum2  = daddr->s6_addr32[0];
> > > > +	sum2 += daddr->s6_addr32[1];
> > > > +	sum3  = daddr->s6_addr32[2];
> > > > +	sum3 += daddr->s6_addr32[3];
> > > > +
> > > > +	sum4  = csum;
> > > > +	sum4 += ulen;
> > > > +	sum4 += uproto;
> > > > +
> > > > +	sum  += sum1;
> > > > +	sum2 += sum3;
> > > > +
> > > > +	sum += sum2;
> > > > +	sum += sum4;
> > >
> > > Have you got gcc to compile that as-is?
> > >
> > > Whenever I've tried to get a 'tree add' compiled so that the
> > > early adds can be executed in parallel gcc always pessimises
> > > it to a linear sequence of adds.
> > >
> > > But I agree that adding 32bit values to a 64bit register
> > > may be no slower than trying to do an 'add carry' sequence
> > > that is guaranteed to only do one add/clock.
> > > (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
> > >
> > > IIRC RISCV doesn't have a carry flag, so the adc sequence
> > > is hard - probably takes two extra instructions per value.
> > > Although with parallel execute it may not matter.
> > > Consider:
> > > 	val = buf[offset];
> > > 	sum += val;
> > > 	carry += sum < val;
> > > 	val = buf[offset1];
> > > 	sum += val;
> > > 	...
> > > the compare and 'carry +=' can be executed at the same time
> > > as the following two instructions.
> > > You do then a final sum += carry; sum += sum < carry;
> > >
> > > Assuming all instructions are 1 clock and any read delays
> > > get filled with other instructions (by source or hardware
> > > instruction re-ordering) even without parallel execute
> > > that is 4 clocks for 64 bits, which is much the same as the
> > > 2 clocks for 32 bits.
> > >
> > > Remember that all the 32bit values can summed first as
> > > they won't overflow.
> > >
> > > 	David
> 
> > Yeah it does seem like the tree-add does just do a linear add. All three
> > of them were pretty much the same on riscv so I used the version that
> > did best on x86 with the knowledge that my QEMU setup does not
> > accurately represent real hardware.
> 
> The problem there is that any measurement on x86 has pretty much
> no relevance to what any RISCV cpu might do.
> The multiple execution units and out of order execution on x86
> are far different from anything any RISCV cpu is likely to have
> for many years.
> You might get nearer running on one of the Atom cpu - but it won't
> really match.
> There are too many fundamental differences between the architectures.
> 
> All you can do is to find and read the instruction timings for
> a target physical cpu and look for things like:
> - Whether arithmetic results are available next clock.
>   (It probably is)
> - How many clocks it takes for read data to be available.
>   I suspect the cpu will stall if the data is needed.
>   A block of sequential reads is one way to avoid the stall.
>   On x86 the instruction that needs the data is just deferred
>   until it is available, the following instructions execute
>   (provided their input are all available).
> - Clock delays for taken/not taken predicted/not predicted branches.
>   
> > I don't quite understand how doing the carry in the middle of each
> > stage, even though it can be executed at the same time, would be faster
> > than just doing a single overflow check at the end.
> 
> You need to do half as many reads and adds.
> 
I missed that you were talking about 64-bit reads. I was talking to
somebody about this a couple weeks ago and they mentioned a counter
example that showed that adding the carry after was not the same as
adding it in the middle. Even though addition is commutative, I wasn't
sure if the overflow checking was. I can't rememeber what the counter
example was, but I have a feeling it was flawed.
> > I can just revert
> > back to the non-tree add version since there is no improvement on riscv.
> 
> The 'tree' version is only likely to be faster on cpu (like x86)
> that can (at least sometimes) do two memory reads in one clock
> and can do two adds and two read in the same clock.
> Even then, without out of order execution, it is hard to get right.
> 
> Oh, you might want to try getting the default csum_fold() to
> be the faster 'arc' version rather than adding your own version.
> 
> 	David
> 
> > I can also revert back to the default version that uses carry += sum < val
> > as well.
> > 
> > - Charlie
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 

  reply	other threads:[~2023-09-13 23:18 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-11 22:57 [PATCH v4 0/5] riscv: Add fine-tuned checksum functions Charlie Jenkins
2023-09-11 22:57 ` [PATCH v4 1/5] riscv: Checksum header Charlie Jenkins
2023-09-12 10:24   ` Emil Renner Berthing
2023-09-13  2:38     ` Charlie Jenkins
2023-09-13  9:19       ` Emil Renner Berthing
2023-09-11 22:57 ` [PATCH v4 2/5] riscv: Add checksum library Charlie Jenkins
2023-09-12  8:45   ` David Laight
2023-09-13  3:09     ` Charlie Jenkins
2023-09-13  8:47       ` David Laight
2023-09-13 23:18         ` Charlie Jenkins [this message]
2023-09-14  0:41           ` Charlie Jenkins
2023-09-14 12:25   ` Conor Dooley
2023-09-14 17:58     ` Charlie Jenkins
2023-09-14 18:02       ` Conor Dooley
2023-09-14 23:30         ` Charlie Jenkins
2023-09-11 22:57 ` [PATCH v4 3/5] riscv: Vector checksum header Charlie Jenkins
2023-09-11 22:57 ` [PATCH v4 4/5] riscv: Vector checksum library Charlie Jenkins
2023-09-14 12:46   ` Conor Dooley
2023-09-14 16:14     ` Charlie Jenkins
2023-09-14 16:29       ` Conor Dooley
2023-09-14 17:29         ` Charlie Jenkins
2023-09-14 17:36           ` Conor Dooley
2023-09-14 20:59             ` Charlie Jenkins
2023-09-11 22:57 ` [PATCH v4 5/5] riscv: Test checksum functions Charlie Jenkins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZQJDOj/6lUumg7WP@ghost \
    --to=charlie@rivosinc.com \
    --cc=David.Laight@aculab.com \
    --cc=aou@eecs.berkeley.edu \
    --cc=conor@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    --cc=samuel.holland@sifive.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox