From: Charlie Jenkins <charlie@rivosinc.com>
To: David Laight <David.Laight@aculab.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>,
	Conor Dooley <conor@kernel.org>,
	Samuel Holland <samuel.holland@sifive.com>,
	"linux-riscv@lists.infradead.org"
	<linux-riscv@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Albert Ou <aou@eecs.berkeley.edu>, Arnd Bergmann <arnd@arndb.de>
Subject: Re: [PATCH v6 3/4] riscv: Add checksum library
Date: Mon, 18 Sep 2023 22:58:17 -0400
Message-ID: <ZQkOSf1b66lHzjaf@ghost>
In-Reply-To: <0357e092c05043fba13eccad77ba799f@AcuMS.aculab.com>

On Sat, Sep 16, 2023 at 09:32:40AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 15 September 2023 18:01
> > 
> > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > will load from the buffer in groups of 32 bits, and when compiled for
> > 64-bit will load in groups of 64 bits.
> > 
> ...
> > +	/*
> > +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> > +	 * faster than doing 32-bit reads on architectures that support larger
> > +	 * reads.
> > +	 */
> > +	while (len > 0) {
> > +		csum += data;
> > +		csum += csum < data;
> > +		len -= sizeof(unsigned long);
> > +		ptr += 1;
> > +		data = *ptr;
> > +	}
> 
> I think you'd be better adding the 'carry' bits in a separate
> variable.
> It reduces the register dependency chain length in the loop.
> (Helps if the cpu can execute two instructions in one clock.)
> 
> The masked misaligned data values are max 24 bits
> (if 
> 
> You'll also almost certainly remove at least one instruction
> from the loop by comparing against the end address rather than
> changing 'len'.
> 
> So ending up with (something like):
> 	end = buff + length;
> 	...
> 	while (++ptr < end) {
> 		csum += data;
> 		carry += csum < data;
> 		data = ptr[-1];
> 	}
> (Although a do-while loop tends to generate better code
> and gcc will pretty much always make that transformation.)
> 
> I think that is 4 instructions per word (load, add, cmp+set, add).
> In principle they could be completely pipelined and all
> execute (for different loop iterations) in the same clock.
> (But that is pretty unlikely to happen - even x86 isn't that good.)
> But taking two clocks is quite plausible.
> Plus 2 instructions per loop (inc, cmp+jmp).
> They might execute in parallel, but unrolling once
> may be required.
> 
It looks like GCC actually ends up generating 7 total instructions:
ffffffff808d2acc:	97b6                	add	a5,a5,a3
ffffffff808d2ace:	00d7b533          	sltu	a0,a5,a3
ffffffff808d2ad2:	0721                	add	a4,a4,8
ffffffff808d2ad4:	86be                	mv	a3,a5
ffffffff808d2ad6:	962a                	add	a2,a2,a0
ffffffff808d2ad8:	ff873783          	ld	a5,-8(a4)
ffffffff808d2adc:	feb768e3          	bltu	a4,a1,ffffffff808d2acc <do_csum+0x34>

This mv instruction could be avoided if the registers were shuffled
around, but perhaps this way reduces some dependency chains.
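
For reference, the split-carry idea can be sketched in plain userspace C
as below. This is only an illustration under stated assumptions, not the
code from the patch: the function name sum_words_split_carry() is
hypothetical, the buffer is assumed to be aligned and a whole number of
64-bit words long, and the misaligned head/tail handling is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the patch's do_csum()): sum nwords aligned
 * 64-bit words with end-around carry. Carries are counted in a separate
 * accumulator so the add chain on csum does not also depend on the
 * carry add from the previous iteration.
 */
static uint64_t sum_words_split_carry(const uint64_t *ptr, size_t nwords)
{
	const uint64_t *end = ptr + nwords;	/* compare against end, not len */
	uint64_t csum = 0;
	uint64_t carry = 0;

	while (ptr < end) {
		uint64_t data = *ptr++;

		csum += data;
		carry += csum < data;	/* 1 if the addition wrapped */
	}

	/* Fold the counted carries back in, with one more end-around carry. */
	csum += carry;
	csum += csum < carry;
	return csum;
}
```

Summing { UINT64_MAX, 1 } this way wraps once in the loop, and the
wrapped bit is added back at the end, giving 1.
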
> ...
> > +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> > +	    riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
> ...
> > +		}
> > +end:
> > +		return csum >> 16;
> > +	}
> 
> Is it really worth doing all that to save (I think) 4 instructions?
> (shift, shift, or with rotate twice).
> There is much more to be gained by careful inspection
> of the loop (even leaving it in C).
> 

The main benefit was from using rev8 to replace swab32. However, now
that I am looking at the assembly generated in the kernel build, it does
not match what I get from an out-of-kernel test case, so rev8 might not
be beneficial. I will have to look at this more to figure out what is
happening.

> > +
> > +#ifndef CONFIG_32BIT
> > +	csum += (csum >> 32) | (csum << 32);
> > +	csum >>= 32;
> > +#endif
> > +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
> 
> Use ror64() and ror32().
> 
> 	David
> 

Good idea.
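
A rough userspace sketch of the fold written with rotates is below. The
ror64_sketch()/ror32_sketch() helpers are stand-ins I am assuming for the
kernel's ror64()/ror32() from <linux/bitops.h>, and fold64_to_16() is a
hypothetical name. It covers only the 64-bit case and leaves out the
odd-offset byte swap.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's ror64()/ror32() (assumed names). */
static inline uint64_t ror64_sketch(uint64_t word, unsigned int shift)
{
	return (word >> (shift & 63)) | (word << ((-shift) & 63));
}

static inline uint32_t ror32_sketch(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Hypothetical helper: fold a 64-bit ones' complement sum to 16 bits. */
static uint16_t fold64_to_16(uint64_t csum)
{
	uint32_t sum32;

	/*
	 * Adding the half-swapped value leaves hi + lo plus the
	 * end-around carry from the low half in the upper 32 bits.
	 */
	csum += ror64_sketch(csum, 32);
	sum32 = (uint32_t)(csum >> 32);

	/* Same trick at 32 bits: the upper 16 bits hold the folded sum. */
	sum32 += ror32_sketch(sum32, 16);
	return (uint16_t)(sum32 >> 16);
}
```

For example, 0x0001000200030004 folds to 0x000a, the sum of its four
16-bit words.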

- Charlie

> > +	if (offset & 1)
> > +		return (unsigned short)swab32(csum);
> > +	return csum >> 16;
> > +}
> > 
> > --
> > 2.42.0
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


