netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
@ 2014-02-22 13:01 Daniel Borkmann
  2014-02-24 10:17 ` David Laight
  2014-02-25  0:00 ` David Miller
  0 siblings, 2 replies; 11+ messages in thread
From: Daniel Borkmann @ 2014-02-22 13:01 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp

Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
hardware crc32c checksumming support for the SCTP protocol.
Currently, NETIF_F_SCTP_CSUM flag is available in igb,
ixgbe, i40e/i40evf drivers and for vlan devices.

If we don't have NETIF_F_SCTP_CSUM then crc32c is done
through CPU instructions, invoked from crypto layer, or
if not available as slow-path fallback in software.

Currently, loopback device propagates checksum offloading
feature flags in dev->features, but is missing SCTP checksum
offloading. Therefore, account for NETIF_F_SCTP_CSUM as
well.

Before patch:

./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

4194304 4194304   4096    10.00    4683.50

After patch:

./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

4194304 4194304   4096    10.00    15348.26

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/loopback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 771c9bf..282effe 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -176,6 +176,7 @@ static void loopback_setup(struct net_device *dev)
 		| NETIF_F_UFO
 		| NETIF_F_HW_CSUM
 		| NETIF_F_RXCSUM
+		| NETIF_F_SCTP_CSUM
 		| NETIF_F_HIGHDMA
 		| NETIF_F_LLTX
 		| NETIF_F_NETNS_LOCAL
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* RE: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-22 13:01 [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features Daniel Borkmann
@ 2014-02-24 10:17 ` David Laight
  2014-02-24 10:31   ` Daniel Borkmann
  2014-02-25  0:00 ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: David Laight @ 2014-02-24 10:17 UTC (permalink / raw)
  To: 'Daniel Borkmann', davem@davemloft.net
  Cc: netdev@vger.kernel.org, linux-sctp@vger.kernel.org

From: Daniel Borkmann
> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> hardware crc32c checksumming support for the SCTP protocol.
> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> ixgbe, i40e/i40evf drivers and for vlan devices.
> 
> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> through CPU instructions, invoked from crypto layer, or
> if not available as slow-path fallback in software.
> 
> Currently, loopback device propagates checksum offloading
> feature flags in dev->features, but is missing SCTP checksum
> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> well.
> 
> Before patch:
> 
> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 4194304 4194304   4096    10.00    4683.50
> 
> After patch:
...
> 4194304 4194304   4096    10.00    15348.26

That seems a much larger increase than you'd expect from removing
a software CRC of the data chunks.
Are you sure that some other difference in the data flows wasn't
also triggered.

I'm also not sure that 4096 is a representative message size for SCTP.
I'm not sure what else it is used for, but SCTP is used to carry various
SS7 protocols and interfaces over IP (SIGTRAN: eg M3UA). For SS7 the maximum
message size is around 270 bytes, and the average size nearer 100.

It is also difficult to model, but the typical traffic for SS7 is neither
bulk data, nor command-response. Rather it is single packets (maybe received
from one of many 64k links) that need transmitting within a reasonable time
interval, a delay of 1-2ms might be acceptable.
This basically means that Nagle has to be disabled (because you never want
the 'timeout waiting for an ack') and almost every data chunk ends up in
its own (short) ethernet frame.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 10:17 ` David Laight
@ 2014-02-24 10:31   ` Daniel Borkmann
  2014-02-24 10:42     ` David Laight
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2014-02-24 10:31 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

On 02/24/2014 11:17 AM, David Laight wrote:
> From: Daniel Borkmann
>> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
>> hardware crc32c checksumming support for the SCTP protocol.
>> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
>> ixgbe, i40e/i40evf drivers and for vlan devices.
>>
>> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
>> through CPU instructions, invoked from crypto layer, or
>> if not available as slow-path fallback in software.
>>
>> Currently, loopback device propagates checksum offloading
>> feature flags in dev->features, but is missing SCTP checksum
>> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
>> well.
>>
>> Before patch:
>>
>> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
>> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
>> Recv   Send    Send
>> Socket Socket  Message  Elapsed
>> Size   Size    Size     Time     Throughput
>> bytes  bytes   bytes    secs.    10^6bits/sec
>>
>> 4194304 4194304   4096    10.00    4683.50
>>
>> After patch:
> ...
>> 4194304 4194304   4096    10.00    15348.26
>
> That seems a much larger increase than you'd expect from removing
> a software CRC of the data chunks.
> Are you sure that some other difference in the data flows wasn't
> also triggered.

Yes, I run this multiple times with similar results and I double-checked
it with perf. Current code triggers crc32c implementation in software
fallback on my machine which is very expensive.

> I'm also not sure that 4096 is a representative message size for SCTP.

I used netperf default in this case.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 10:31   ` Daniel Borkmann
@ 2014-02-24 10:42     ` David Laight
  2014-02-24 12:02       ` Daniel Borkmann
  0 siblings, 1 reply; 11+ messages in thread
From: David Laight @ 2014-02-24 10:42 UTC (permalink / raw)
  To: 'Daniel Borkmann'
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

From: Daniel Borkmann 
> On 02/24/2014 11:17 AM, David Laight wrote:
> > From: Daniel Borkmann
> >> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> >> hardware crc32c checksumming support for the SCTP protocol.
> >> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> >> ixgbe, i40e/i40evf drivers and for vlan devices.
> >>
> >> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> >> through CPU instructions, invoked from crypto layer, or
> >> if not available as slow-path fallback in software.
> >>
> >> Currently, loopback device propagates checksum offloading
> >> feature flags in dev->features, but is missing SCTP checksum
> >> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> >> well.
> >>
> >> Before patch:
> >>
> >> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
> >> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
> >> Recv   Send    Send
> >> Socket Socket  Message  Elapsed
> >> Size   Size    Size     Time     Throughput
> >> bytes  bytes   bytes    secs.    10^6bits/sec
> >>
> >> 4194304 4194304   4096    10.00    4683.50
> >>
> >> After patch:
> > ...
> >> 4194304 4194304   4096    10.00    15348.26
> >
> > That seems a much larger increase than you'd expect from removing
> > a software CRC of the data chunks.
> > Are you sure that some other difference in the data flows wasn't
> > also triggered.
> 
> Yes, I run this multiple times with similar results and I double-checked
> it with perf. Current code triggers crc32c implementation in software
> fallback on my machine which is very expensive.

I'm sure it shouldn't be that expensive, you are implying that it spent
about 70% of the time doing crc32.
The loop should be dominated by the per-byte lookup in a 256 word table.
With 4k data the table will soon be in the data cache.
Unless it is (stupidly) generating the table on each call, or trying
to use a crc32 instruction, faulting, and emulating it, I wouldn't
really have expected more than a few % improvement.

> > I'm also not sure that 4096 is a representative message size for SCTP.
> 
> I used netperf default in this case.

I was just pointing out that the defaults may not be appropriate here.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 10:42     ` David Laight
@ 2014-02-24 12:02       ` Daniel Borkmann
  2014-02-24 13:24         ` David Laight
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2014-02-24 12:02 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

On 02/24/2014 11:42 AM, David Laight wrote:
...
> I'm sure it shouldn't be that expensive, you are implying that it spent
> about 70% of the time doing crc32.

In this scenario, the following perf log I get that shows where cycles
are being spent on my machine:

  65.95%       netperf  [kernel.kallsyms]  [k] __crc32c_le
   3.79%       netperf  [kernel.kallsyms]  [k] memcpy
   2.38%       netperf  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
   0.62%       netperf  [sctp]             [k] sctp_datamsg_from_user
   0.62%       netperf  [sctp]             [k] sctp_sendmsg
   0.55%       netperf  [kernel.kallsyms]  [k] __slab_free
   0.52%       netperf  [sctp]             [k] sctp_outq_flush
   0.50%       netperf  [kernel.kallsyms]  [k] kfree
   0.49%       netperf  [kernel.kallsyms]  [k] cmpxchg_double_slab.isra.52
   0.48%       netperf  [kernel.kallsyms]  [k] kmem_cache_alloc
   0.43%       netperf  [kernel.kallsyms]  [k] __slab_alloc
   0.42%       netperf  [kernel.kallsyms]  [k] __copy_skb_header
   0.41%       netperf  [kernel.kallsyms]  [k] __alloc_skb

> The loop should be dominated by the per-byte lookup in a 256 word table.
> With 4k data the table will soon be in the data cache.
> Unless it is (stupidly) generating the table on each call, or trying
> to use a crc32 instruction, faulting, and emulating it, I wouldn't
> really have expected more than a few % improvement.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 12:02       ` Daniel Borkmann
@ 2014-02-24 13:24         ` David Laight
  2014-02-24 13:36           ` Daniel Borkmann
  2014-02-24 14:06           ` Daniel Borkmann
  0 siblings, 2 replies; 11+ messages in thread
From: David Laight @ 2014-02-24 13:24 UTC (permalink / raw)
  To: 'Daniel Borkmann'
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

From: Daniel Borkmann
> On 02/24/2014 11:42 AM, David Laight wrote:
> ...
> > I'm sure it shouldn't be that expensive, you are implying that it spent
> > about 70% of the time doing crc32.
> 
> In this scenario, the following perf log I get that shows where cycles
> are being spent on my machine:
> 
>   65.95%       netperf  [kernel.kallsyms]  [k] __crc32c_le

WTF is that function doing!
Even doing the crc naively bit by bit shouldn't manage to use 65%.

Maybe the _le has something to do with it.
Could it be bit-reversing the crc and data bytes all the time?
The packet will (one would hope) want the crc in the same bit-order
as the data, so no bit reversal is needed - just the correct logic
and lookup table.

Which architecture and which version of crc32_le() does your kernel use?
(I'd guess there are several lurking).
Whichever function you are using wants killing.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 13:24         ` David Laight
@ 2014-02-24 13:36           ` Daniel Borkmann
  2014-02-24 14:07             ` David Laight
  2014-02-24 14:06           ` Daniel Borkmann
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2014-02-24 13:36 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

On 02/24/2014 02:24 PM, David Laight wrote:
...
> Which architecture and which version of crc32_le() does your kernel use?

It's using slice-by-8 algorithm, and my machine is [only]
a core i7 (x86_64). Apparently, it is not using the crypto
version that is accelerated.

> (I'd guess there are several lurking).
> Whichever function you are using wants killing.
>
> 	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 13:24         ` David Laight
  2014-02-24 13:36           ` Daniel Borkmann
@ 2014-02-24 14:06           ` Daniel Borkmann
  2014-02-24 14:28             ` David Laight
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2014-02-24 14:06 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

On 02/24/2014 02:24 PM, David Laight wrote:
...
> Maybe the _le has something to do with it.
> Could it be bit-reversing the crc and data bytes all the time?
> The packet will (one would hope) want the crc in the same bit-order
> as the data, so no bit reversal is needed - just the correct logic
> and lookup table.

Although it's a bit off-topic for this patch, but I'll have a look
into __crc32c_le(); maybe something is just wrong with that code.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 13:36           ` Daniel Borkmann
@ 2014-02-24 14:07             ` David Laight
  0 siblings, 0 replies; 11+ messages in thread
From: David Laight @ 2014-02-24 14:07 UTC (permalink / raw)
  To: 'Daniel Borkmann'
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

From: Daniel Borkmann
> > Which architecture and which version of crc32_le() does your kernel use?
> 
> It's using slice-by-8 algorithm, and my machine is [only]
> a core i7 (x86_64). Apparently, it is not using the crypto
> version that is accelerated.

I still can't imagine it being that slow.
The sctp code isn't known for its fast paths, although these are
per-message, not per byte I'd have thought they'd still be significant.

Having looked at the code, try compiling with CRC_LE_BITS == 8.
I suspect the much smaller data cache footprint will help, and
the number of instructions won't be that many less (due to all
the shifts).

What may help is to have a 'double shift' table and compute
the crc's of the odd and even bytes separately, then xor them
together at the end. That might improve the dependency chains
and let all the cpu execute more instructions in parallel.

Another improvement is to read the next byte from the buffer
before processing the previous one (possibly with a slight
loop unroll).

As has been noted elsewhere, using a 16 entry lookup table
can be faster than the 256 entry one.
What I don't know is whether the crc32 can be analysed down do
a small number of shifts and xors (crc16 reduces quite nicely)
that might easily execute faster than the lookup table.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-24 14:06           ` Daniel Borkmann
@ 2014-02-24 14:28             ` David Laight
  0 siblings, 0 replies; 11+ messages in thread
From: David Laight @ 2014-02-24 14:28 UTC (permalink / raw)
  To: 'Daniel Borkmann'
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-sctp@vger.kernel.org

From: Daniel Borkmann 
> > Maybe the _le has something to do with it.
> > Could it be bit-reversing the crc and data bytes all the time?
> > The packet will (one would hope) want the crc in the same bit-order
> > as the data, so no bit reversal is needed - just the correct logic
> > and lookup table.
> 
> Although it's a bit off-topic for this patch, but I'll have a look
> into __crc32c_le(); maybe something is just wrong with that code.

Yes it generates a nice dependency chain of:
	xor	table_b(,%reg,4),%eax
using 8 tables (256*4 bytes each).

I bet that with a little bit of pipelining a simple byte by byte
loop should manage 1 byte every two clocks (ie 1 memory cycle per
clock).

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features
  2014-02-22 13:01 [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features Daniel Borkmann
  2014-02-24 10:17 ` David Laight
@ 2014-02-25  0:00 ` David Miller
  1 sibling, 0 replies; 11+ messages in thread
From: David Miller @ 2014-02-25  0:00 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, linux-sctp

From: Daniel Borkmann <dborkman@redhat.com>
Date: Sat, 22 Feb 2014 14:01:53 +0100

> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> hardware crc32c checksumming support for the SCTP protocol.
> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> ixgbe, i40e/i40evf drivers and for vlan devices.
> 
> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> through CPU instructions, invoked from crypto layer, or
> if not available as slow-path fallback in software.
> 
> Currently, loopback device propagates checksum offloading
> feature flags in dev->features, but is missing SCTP checksum
> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> well.
 ...
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied, thanks a lot Daniel.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-02-25  0:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-22 13:01 [PATCH net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features Daniel Borkmann
2014-02-24 10:17 ` David Laight
2014-02-24 10:31   ` Daniel Borkmann
2014-02-24 10:42     ` David Laight
2014-02-24 12:02       ` Daniel Borkmann
2014-02-24 13:24         ` David Laight
2014-02-24 13:36           ` Daniel Borkmann
2014-02-24 14:07             ` David Laight
2014-02-24 14:06           ` Daniel Borkmann
2014-02-24 14:28             ` David Laight
2014-02-25  0:00 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).