public inbox for linux-rdma@vger.kernel.org
* ib_ipoib: CSUM support in connected mode
@ 2014-09-14 18:46 Yuval Shaia
  2014-09-15 14:47 ` Or Gerlitz
  2014-10-01 11:55 ` Yuval Shaia
  0 siblings, 2 replies; 14+ messages in thread
From: Yuval Shaia @ 2014-09-14 18:46 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA

Hi,
Lately I have been working on fixing an issue with the IPoIB driver and I'd like to share the
details with you.

By default, the IPoIB-CM driver uses a 64k MTU. A larger MTU gives better performance.
This MTU plus overhead puts the memory allocation for IP-based packets at 32 4k pages
(order 5), which have to be contiguous. When system memory is under pressure, we
observed that allocating 128k of contiguous physical memory is difficult and causes serious
errors (such as the system becoming unusable).
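To make the arithmetic concrete, here is a minimal userspace sketch (not driver code) of how a buffer size maps to a page order. order_for_size() is a made-up helper modeled on the kernel's get_order(), and the 4k page size is taken from the text above; any overhead value you feed it is your own assumption.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u	/* assumption: 4k pages, as in the text */

/*
 * Smallest order such that (PAGE_SIZE_BYTES << order) >= size.
 * Userspace model of the idea behind get_order(); hypothetical helper.
 */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = PAGE_SIZE_BYTES;

	assert(size > 0);
	while (span < size) {
		span <<= 1;	/* each order doubles the contiguous span */
		order++;
	}
	return order;
}
```

With 4k pages, order_for_size(65536) gives 4, while any size just above 64k (for example 65536 plus a placeholder 64 bytes of overhead) gives 5, i.e. a 128k contiguous allocation.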
This enhancement resolves the issue by removing the physically contiguous memory requirement,
using the Scatter/Gather feature that exists in the Linux stack. In order to use Scatter/Gather
in the Linux IPoIB-CM driver, the driver must support the IP checksum offload feature
(a requirement of the current Linux networking implementation). But IB HCA hardware does not
support this feature, and hence IPoIB cannot support it either.
The IPoIB Connected Mode driver uses RC (Reliable Connection), which guarantees corruption-free
delivery of the packet. InfiniBand uses a 32-bit CRC, which provides stronger data integrity
protection than the 16-bit IP checksum. So the IP checksum adds no value
in the IB world. The proposal is to tell the network stack that IPoIB-CM supports IP
checksum offload. This enables the Linux IPoIB-CM driver to use Scatter/Gather. The network stack
sends the IP packet without adding the IP checksum to the header. On the receive side, the IPoIB
driver again tells the network stack that the IP checksum is good for incoming packets, and
the network stack skips the IP checksum calculation.
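For reference, this is roughly the per-byte work that is skipped when the stack does not have to compute or verify the checksum in software. It is a plain reference loop for the 16-bit one's-complement Internet checksum described in RFC 1071, not the kernel's optimized code; inet_checksum() is a name chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's-complement Internet checksum (RFC 1071) over a byte buffer.
 * Reference loop only; hypothetical helper name.
 */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);

	/* Add up the data as big-endian 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

	/* An odd trailing byte is padded with a zero low byte. */
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Every byte of the payload is touched once, so on a 64k packet the cost is proportional to the full packet size.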
During connection establishment the driver determines whether the other end supports IB CRC
as checksum. This is done so that the driver can calculate the checksum before transmitting
the packet in case the other end does not support this feature.
Support for fragmented skbs is added to the transmit path.

Please note that a "very welcome side-effect" of this feature is a high degree of performance
improvement as a result of the removal of the csum calculation.
I will be happy to share results with you if needed.

At present I have a patch ready which was tested with an old kernel, and I'm working to port
it to the latest.

Please review and comment.

Yuval
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
  2014-09-14 18:46 ib_ipoib: CSUM support in connected mode Yuval Shaia
@ 2014-09-15 14:47 ` Or Gerlitz
       [not found]   ` <CAJ3xEMhEzdyzcAufQU--VbM7aoAzsw7wV2i_i=kjcS9PbdC0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-10-01 11:55 ` Yuval Shaia
  1 sibling, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2014-09-15 14:47 UTC (permalink / raw)
  To: Yuval Shaia; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Sun, Sep 14, 2014 at 9:46 PM, Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
> This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages
> (order 5),

So if we make sure that the advertised netdevice MTU is 64K minus that
overhead, we're back to an order-four
allocation and the problem is solved? Note that RFC 4755 makes sure that
the MTU is negotiated in both directions,
so it can have any value, specifically 64K minus that epsilon, which will
hopefully make you happy.


> [...] The proposal is to tell to network stack that IPoIB-CM supports IP
> Checksum offload. This enables Linux IPoIB-CM driver to use Scatter/Gather feature. Network
> sends the IP packet without adding the IP Checksum to the header.

AFAIK, on the TX side, Linux will always compute the IP checksum, but
with this suggestion,
not the TCP checksum, which is assumed to be computed by the card... so
we will have a TCP
packet on the wire without a checksum. And if this packet goes through
a gateway it will be dropped at
some point, agree?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]   ` <CAJ3xEMhEzdyzcAufQU--VbM7aoAzsw7wV2i_i=kjcS9PbdC0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-15 16:58     ` Jason Gunthorpe
       [not found]       ` <20140915165820.GB12397-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-09-15 18:55     ` Yuval Shaia
  1 sibling, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2014-09-15 16:58 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Yuval Shaia, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:

> > [...] The proposal is to tell to network stack that IPoIB-CM supports IP
> > Checksum offload. This enables Linux IPoIB-CM driver to use Scatter/Gather feature. Network
> > sends the IP packet without adding the IP Checksum to the header.
> 
> AFAIK, on the TX side, Linux will always compute the IP checksum,
> but with this suggestion, not the TCP checksum which is assumed to
> be computed by the card... so we will have a TCP packet on the wire
> without checksum. And if this packet goes through gateway it will be
> dropped at some point, agree?

I remember this was discussed a few years ago on this list.

To do this, you need to transfer the offload state across the wire, so
on receive you inject the packet with the proper tag that the csum is
not computed but ready for offload. A node receiving a packet like
this would have to compute the csum before sending it onwards, so no,
if done properly it will not break gateways.

All the core infrastructure is there, all the virtualization drivers
work like this - the guest side does not compute the csum, and the
hypervisor side receives the packet with that flag, and the csum
ultimately is offloaded to the physical NIC. Look at the xen net
driver for an example.

The main thing is to negotiate this and other features at RC
connection time. Be sure to leave room for other optimizations, for
instance IPOIB could forward a GSO packet unbroken to the remote side.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]       ` <20140915165820.GB12397-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-09-15 17:20         ` Or Gerlitz
       [not found]           ` <CAJ3xEMir_FgqS7j+fuhugocawdZXHG9hAK-jpArZ_5vkzVjZeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-15 19:03         ` Yuval Shaia
  1 sibling, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2014-09-15 17:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yuval Shaia, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 15, 2014 at 7:58 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> To do this, you need to transfer the offload state across the wire, so
> on receive you inject the packet with the proper tag that the csum is
> not computed but ready for offload. A node receiving a packet like
> this would have to compute the csum before sending it onwards, so no,
> if done properly it will not break gateways.
>
> All the core infrastructure is there, all the virtualization drivers
> work like this - the guest side does not compute the csum, and the
> hyperviser side receives the packet with that flag, and the csum
> ultimately is offloaded to the physical NIC. Look at the xen net
> driver for an example.

But it is done on the transmitting hypervisor, isn't it? If this is the case,
I don't see the similarity to the IPoIB CM case.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]           ` <CAJ3xEMir_FgqS7j+fuhugocawdZXHG9hAK-jpArZ_5vkzVjZeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-15 17:45             ` Jason Gunthorpe
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2014-09-15 17:45 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Yuval Shaia, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 15, 2014 at 08:20:25PM +0300, Or Gerlitz wrote:
> On Mon, Sep 15, 2014 at 7:58 PM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > To do this, you need to transfer the offload state across the wire, so
> > on receive you inject the packet with the proper tag that the csum is
> > not computed but ready for offload. A node receiving a packet like
> > this would have to compute the csum before sending it onwards, so no,
> > if done properly it will not break gateways.
> >
> > All the core infrastructure is there, all the virtualization drivers
> > work like this - the guest side does not compute the csum, and the
> > hyperviser side receives the packet with that flag, and the csum
> > ultimately is offloaded to the physical NIC. Look at the xen net
> > driver for an example.
> 
> But is done on the xmitting hypervisor, isn't it? if this is the case,
> I don't see
> the similarity to the IPoIB CM case.

I'm not sure what you mean?

You raised the concern about gateways, which is identical to the
hypervisor case:

G-LINUX --(NO CSUM)--> ring buffer --> H-LINUX --(NO CSUM)--> NIC->WIRE

A-LINUX --(NO CSUM)-->     RC QP   --> B-LINUX --(NO CSUM)--> NIC->WIRE

The key is that csum state is placed in the ring buffer/RC QP with
every packet. Basically, you serialize the entire offload state the
IPoIB send receives from the kernel net stack, dump that onto the
wire, and restore that exact same semantic state on the receive side.

The NIC sees the same packet, with the same offload meta data, as
though it were directly connected to the sending Linux kernel.

The *typical* IPoIB CM case is similar to a guest talking to another
guest:

G1 --(NO CSUM)--> ring buffer --> H-LINUX --(NO CSUM)--> ring buffer --(NO CSUM)--> G2

Here the packet is never csum'd - the 2nd guest simply accepts the
packet with an uncsum'd tag. If you flatten the above it looks
identical to the typical IPoIB case.

Hypervisors are now also doing the same trick with GSO: they send
large packets without a high MTU, because they can take the GSO
master packet state from the sending guest and shuttle the whole thing
without segmentation to the receiving guest (or NIC). IPoIB should do
the same.
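A minimal sketch of what "serialize the offload state with every packet" could look like. The header layout, flag value and function names below are all hypothetical; the fields are loosely modeled on the ones virtio-net carries (flags, gso_type, gso_size, csum_start, csum_offset), not on any agreed IPoIB format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFL_F_NEEDS_CSUM 0x01u	/* hypothetical: L4 csum not filled in yet */
#define OFFL_HDR_WIRE_LEN 8u	/* hypothetical on-wire header size */

/* Hypothetical per-packet offload state sent in front of the payload. */
struct offload_hdr {
	uint8_t  flags;		/* OFFL_F_* */
	uint8_t  gso_type;	/* 0 = not a GSO packet */
	uint16_t gso_size;	/* segment size when gso_type != 0 */
	uint16_t csum_start;	/* offset where checksumming starts */
	uint16_t csum_offset;	/* where to store it, relative to csum_start */
};

static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xffu);
}

static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)(((unsigned int)p[0] << 8) | p[1]);
}

/* Sender: dump the offload state onto the wire in a fixed byte order. */
static void offload_hdr_pack(const struct offload_hdr *h, uint8_t *wire)
{
	assert(h != NULL && wire != NULL);
	wire[0] = h->flags;
	wire[1] = h->gso_type;
	put_be16(wire + 2, h->gso_size);
	put_be16(wire + 4, h->csum_start);
	put_be16(wire + 6, h->csum_offset);
}

/* Receiver: restore the same semantic state before injecting the packet. */
static void offload_hdr_unpack(const uint8_t *wire, struct offload_hdr *h)
{
	assert(h != NULL && wire != NULL);
	h->flags = wire[0];
	h->gso_type = wire[1];
	h->gso_size = get_be16(wire + 2);
	h->csum_start = get_be16(wire + 4);
	h->csum_offset = get_be16(wire + 6);
}

/*
 * Gateway decision: software has to finish the checksum only when the
 * packet still needs one and the egress device cannot do it.
 */
static int must_checksum_in_software(const struct offload_hdr *h,
				     int egress_can_offload)
{
	return (h->flags & OFFL_F_NEEDS_CSUM) && !egress_can_offload;
}
```

The point of the last helper is the gateway case raised earlier in the thread: a forwarding node either hands the still-unchecksummed packet to an offload-capable NIC or completes the checksum itself, so nothing leaves the host without one.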

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]   ` <CAJ3xEMhEzdyzcAufQU--VbM7aoAzsw7wV2i_i=kjcS9PbdC0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-15 16:58     ` Jason Gunthorpe
@ 2014-09-15 18:55     ` Yuval Shaia
  2014-09-16  6:47       ` Or Gerlitz
  1 sibling, 1 reply; 14+ messages in thread
From: Yuval Shaia @ 2014-09-15 18:55 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:
> On Sun, Sep 14, 2014 at 9:46 PM, Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
> > This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages
> > (order 5),
> 
> So if we make sure that the advertized netdevice MTU is 64K minus that
> over head we're back to order four
> allocation and problem is solved?   note that RFC 4755 makes sure that
> the MTU is negotiated in both directions,
> so it can have any value, specifically 64K - that epsilon which will
> hopefully make you happy
Interesting point.
But please note that in any case, when not using scatter/gather we force the
allocation of a large block of contiguous physical memory.
> 
> 
> > [...] The proposal is to tell to network stack that IPoIB-CM supports IP
> > Checksum offload. This enables Linux IPoIB-CM driver to use Scatter/Gather feature. Network
> > sends the IP packet without adding the IP Checksum to the header.
> 
> AFAIK, on the TX side, Linux will always compute the IP checksum, but
> with this suggestion,
> not the TCP checksum which is assumed to be computed by the card... so
> we will have a TCP
> packet on the wire without checksum. And if this packet goes through
> gateway it will be dropped at
> some point, agree?
Agree.
This is why the driver must expose an interface that lets the system administrator
disable or enable this feature according to the network topology.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]       ` <20140915165820.GB12397-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-09-15 17:20         ` Or Gerlitz
@ 2014-09-15 19:03         ` Yuval Shaia
  1 sibling, 0 replies; 14+ messages in thread
From: Yuval Shaia @ 2014-09-15 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 15, 2014 at 10:58:20AM -0600, Jason Gunthorpe wrote:
> On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:
> 
> > > [...] The proposal is to tell to network stack that IPoIB-CM supports IP
> > > Checksum offload. This enables Linux IPoIB-CM driver to use Scatter/Gather feature. Network
> > > sends the IP packet without adding the IP Checksum to the header.
> > 
> > AFAIK, on the TX side, Linux will always compute the IP checksum,
> > but with this suggestion, not the TCP checksum which is assumed to
> > be computed by the card... so we will have a TCP packet on the wire
> > without checksum. And if this packet goes through gateway it will be
> > dropped at some point, agree?
> 
> I remember this was discussed a few years ago on this list.
> 
> To do this, you need to transfer the offload state across the wire, so
> on receive you inject the packet with the proper tag that the csum is
> not computed but ready for offload. A node receiving a packet like
> this would have to compute the csum before sending it onwards, so no,
> if done properly it will not break gateways.
> 
> All the core infrastructure is there, all the virtualization drivers
> work like this - the guest side does not compute the csum, and the
> hyperviser side receives the packet with that flag, and the csum
> ultimately is offloaded to the physical NIC. Look at the xen net
> driver for an example.
> 
> The main thing is to negotiate this and other features at RC
> connection time. Be sure to leave room for other optimizations, for
> instance IPOIB could forward a GSO packet unbroken to the remote side.
Correct, the driver must handle the case where the peer does not support this feature.
Currently I have defined a 16-bit capability field that is exchanged at
RC setup time.
struct ipoib_cm_data {
	__be32 qpn; /* High byte MUST be ignored on receive */
	__be32 mtu;
+	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
+	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
};
This change breaks the RFC, but we will get to that as soon as the idea is
accepted here.
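A small userspace sketch of how the two new fields could be handled. Everything concrete here is an assumption for illustration: the signature value, the capability bit, and the choice of putting the 4-bit version in the top nibble are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons, ntohs */

#define IPOIB_CM_PROTO_SIG		0x2211u	/* hypothetical value */
#define IPOIB_CM_CAP_IB_CRC_AS_CSUM	0x001u	/* hypothetical capability bit */

/* Assumed layout: version in the top 4 bits, capabilities in the low 12. */
static uint16_t caps_pack(unsigned int ver, unsigned int caps)
{
	assert(ver <= 0xfu && caps <= 0xfffu);
	return htons((uint16_t)((ver << 12) | caps));
}

static unsigned int caps_ver(uint16_t wire_caps)
{
	return (unsigned int)ntohs(wire_caps) >> 12;
}

static unsigned int caps_bits(uint16_t wire_caps)
{
	return (unsigned int)ntohs(wire_caps) & 0xfffu;
}

/*
 * A peer whose private data does not carry the signature is treated as
 * legacy, so the sender falls back to computing checksums in software.
 */
static int peer_supports(uint16_t wire_sig, uint16_t wire_caps,
			 unsigned int cap)
{
	if (ntohs(wire_sig) != IPOIB_CM_PROTO_SIG)
		return 0;
	return (caps_bits(wire_caps) & cap) != 0;
}
```

Checking the signature first is what keeps old peers working: without it, whatever bytes happen to follow mtu would be misread as capabilities.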
> 
> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
  2014-09-15 18:55     ` Yuval Shaia
@ 2014-09-16  6:47       ` Or Gerlitz
       [not found]         ` <5417DD0F.9090201-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2014-09-16  6:47 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org


On 9/15/2014 9:55 PM, Yuval Shaia wrote:
> On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:
>> >On Sun, Sep 14, 2014 at 9:46 PM, Yuval Shaia<yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>  wrote:
>>> >>By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
>>> >>This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5),
>>
>> >So if we make sure that the advertized netdevice MTU is 64K minus that
>> >over head we're back to order four
>> >allocation and problem is solved?   note that RFC 4755 makes sure that
>> >the MTU is negotiated in both directions,
>> >so it can have any value, specifically 64K - that epsilon which will
>> >hopefully make you happy
> Interesting point. But please note that in any case, when not using scatter/gather we force the allocation of large contiguous physical memory.

In the post you wrote "[...] resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux".

I assume you refer to NETIF_F_SG, right? So your claim is that Linux will not effectively use the driver's ability to serve SG skbs unless the driver also advertises (say) NETIF_F_IP_CSUM?!

I thought it's the other way around -- that is, supporting checksum offloading is useless unless SG is supported. Can you provide a pointer into the network stack code/documentation that supports your claim?

Or.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]         ` <5417DD0F.9090201-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-09-22 19:28           ` Yuval Shaia
  2014-09-30  8:39             ` Yuval Shaia
  2014-10-02 13:00           ` Yuval Shaia
  1 sibling, 1 reply; 14+ messages in thread
From: Yuval Shaia @ 2014-09-22 19:28 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Vadim Makhervaks

On Tue, Sep 16, 2014 at 09:47:43AM +0300, Or Gerlitz wrote:
> 
> On 9/15/2014 9:55 PM, Yuval Shaia wrote:
> >On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:
> >>>On Sun, Sep 14, 2014 at 9:46 PM, Yuval Shaia<yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>  wrote:
> >>>>>By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
> >>>>>This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5),
> >>
> >>>So if we make sure that the advertized netdevice MTU is 64K minus that
> >>>over head we're back to order four
> >>>allocation and problem is solved?   note that RFC 4755 makes sure that
> >>>the MTU is negotiated in both directions,
> >>>so it can have any value, specifically 64K - that epsilon which will
> >>>hopefully make you happy
> >Interesting point. But please note that in any case, when not using scatter/gather we force the allocation of large contiguous physical memory.
> 
> On the post you wrote "[...] resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux".
> 
> I assume you refer to NETIF_F_SG, right? so your claim is that Linux will not effectively use the driver ability to serve SG skbs unless the driver also advertizes (say) NETIF_F_IP_CSUM?!
> 
> I thought it's the other way around -- that is supporting checksum offloading is useless unless SG is supported. Can you provide pointer into the network stack code/documentation that supports your claim?
> Or.
While porting the patch to the latest kernel I learned that this limitation no longer exists.
The limitation I was talking about was in older versions of the function netdev_fix_features():
	/* Fix illegal SG+CSUM combinations. */
	if ((features & NETIF_F_SG) &&
	    !(features & NETIF_F_ALL_CSUM)) {
		netdev_dbg(dev,
			"Dropping NETIF_F_SG since no checksum feature.\n");
		features &= ~NETIF_F_SG;
	}
Now, with the removal of this limitation, the argument that "...In order to use SG....IPoIB-CM must support IP checksum offload" is no longer applicable.
We are now left with the performance benefit argument only. Can we ignore the performance improvement achieved by eliminating the need to calculate the checksum in SW?
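To show the effect of that old check in isolation, here is a userspace model of it. The F_* bit values and the old_fix_features() name are made up for this sketch and do not match the kernel's real NETIF_F_* constants; only the if-condition mirrors the snippet quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define F_SG		(1u << 0)
#define F_IP_CSUM	(1u << 1)
#define F_HW_CSUM	(1u << 2)
#define F_ANY_CSUM	(F_IP_CSUM | F_HW_CSUM)

/*
 * Model of the old rule: scatter/gather is silently dropped when the
 * device advertises no checksum offload at all.
 */
static uint32_t old_fix_features(uint32_t features)
{
	if ((features & F_SG) && !(features & F_ANY_CSUM))
		features &= ~F_SG;
	return features;
}
```

Under this rule a driver that sets only the SG bit ends up with no SG, which is why advertising a checksum feature used to be a precondition for fragmented skbs.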
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
  2014-09-22 19:28           ` Yuval Shaia
@ 2014-09-30  8:39             ` Yuval Shaia
  0 siblings, 0 replies; 14+ messages in thread
From: Yuval Shaia @ 2014-09-30  8:39 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Vadim Makhervaks

On Mon, Sep 22, 2014 at 10:28:02PM +0300, Yuval Shaia wrote:
> On Tue, Sep 16, 2014 at 09:47:43AM +0300, Or Gerlitz wrote:
> > 
> > On 9/15/2014 9:55 PM, Yuval Shaia wrote:
> > >On Mon, Sep 15, 2014 at 05:47:19PM +0300, Or Gerlitz wrote:
> > >>>On Sun, Sep 14, 2014 at 9:46 PM, Yuval Shaia<yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>  wrote:
> > >>>>>By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
> > >>>>>This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5),
> > >>
> > >>>So if we make sure that the advertized netdevice MTU is 64K minus that
> > >>>over head we're back to order four
> > >>>allocation and problem is solved?   note that RFC 4755 makes sure that
> > >>>the MTU is negotiated in both directions,
> > >>>so it can have any value, specifically 64K - that epsilon which will
> > >>>hopefully make you happy
> > >Interesting point. But please note that in any case, when not using scatter/gather we force the allocation of large contiguous physical memory.
> > 
> > On the post you wrote "[...] resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux".
> > 
> > I assume you refer to NETIF_F_SG, right? so your claim is that Linux will not effectively use the driver ability to serve SG skbs unless the driver also advertizes (say) NETIF_F_IP_CSUM?!
> > 
> > I thought it's the other way around -- that is supporting checksum offloading is useless unless SG is supported. Can you provide pointer into the network stack code/documentation that supports your claim?
> > Or.
> While porting the patch to latest kernel i have learned that this limitation is no longer exist.
> The limitation i was talking about was in older versions of the function netdev_fix_features().
> 	/* Fix illegal SG+CSUM combinations. */
> 	if ((features & NETIF_F_SG) &&
> 	    !(features & NETIF_F_ALL_CSUM)) {
> 		netdev_dbg(dev,
> 			"Dropping NETIF_F_SG since no checksum feature.\n");
> 		features &= ~NETIF_F_SG;
> 	}
> Now, with the removal of this limitation the argument of "...In order to use SG....IPoIB-CM must support IP checksum offload" is no longer applicable.
> We now left with the performance benefit argument only, can we ignore the performance improvement achieved by eliminating the need to calculate checksum in SW?
Another benefit we should take into account is that the IB CRC is **much more** reliable than the checksum!
> > 
> > 
> > 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
  2014-09-14 18:46 ib_ipoib: CSUM support in connected mode Yuval Shaia
  2014-09-15 14:47 ` Or Gerlitz
@ 2014-10-01 11:55 ` Yuval Shaia
  2014-10-01 12:13   ` Or Gerlitz
  1 sibling, 1 reply; 14+ messages in thread
From: Yuval Shaia @ 2014-10-01 11:55 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, Sep 14, 2014 at 09:46:22PM +0300, Yuval Shaia wrote:
> Hi,
> Lately i was working on fixing an issue with IPoIB driver and i'd like to share the 
> details with you.
> 
> By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance.
> This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages
> (order 5), which have to be contiguous. When the system memory under pressure, it was
> observed that allocating 128k contiguous physical memory is difficult and causes serious
> errors (such as system becomes unusable).
> This enhancement resolve the issue by removing the physically contiguous memory requirement
> using Scatter/Gather feature that exists in Linux stack. In order to use Scatter/Gather
> feature in Linux IPoIB-CM driver, Linux IPoIB-CM must support IP checksum offload feature
> (requirements as per the current Linux N/W implementation). But IB HCA hardware does not
> support this feature and hence IPoIB cannot support the same.
> IPoIB Connected Mode driver uses RC (Reliable Connection) which guarantees the corruption
> free delivery of the packet. InfiniBand uses 32b CRC which provides stronger data integrity
> protection compare to 16b IP Checksum. So, there is no added value that IP Checksum provides
> in the IB world. The proposal is to tell to network stack that IPoIB-CM supports IP
> Checksum offload. This enables Linux IPoIB-CM driver to use Scatter/Gather feature. Network
> sends the IP packet without adding the IP Checksum to the header. On the receive side, IPoIB
> driver again tells the network stack that IP Checksum is good for the incoming packets and
> network stack avoids the IP Checksum calculations.
> During connection establishment the driver determine if the other end supports IB CRC
> as checksum. This is done so driver will be able to calculate checksum before transmitting
> the packet in case the other end does not support this feature.
> A support for fragmented skb is added to transmit path.
> 
> Please note that a "very welcome side-effect" of this feature is a high of degree performance 
> improvement as a result of the removal of csum calculation.
> I will be happy to share results with you if needed.
I ran a simple performance test using iperf with the latest stable kernel.
Do not consider the results as *best results*; they are just meant to give a sense of the performance improvement.
For the test I used two servers: a "Sender" running the iperf client and a "Receiver" running the iperf server.
From the results we can see that the benefit is on the "Receiver" side, where we save the time needed for checksum calculation.
On transmit, the checksum calculation is done while copying the buffer from user space, so there is not much benefit.

Sender		Receiver	Results 
---------------------------------------
Legacy		Legacy		12 Gbps
Legacy		Patched		20 Gbps
Patched		Legacy		12 Gbps
Patched		Patched		20 Gbps

Legacy is a server which runs a kernel without the patch.

I think we can't ignore an improvement of roughly 67% in performance (12 Gbps to 20 Gbps).
> 
> At present i have patch ready which was tested with an old kernel and i'm working to port 
> it to latest.
> 
> Please review and comment.
> 
> Yuval

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
  2014-10-01 11:55 ` Yuval Shaia
@ 2014-10-01 12:13   ` Or Gerlitz
       [not found]     ` <542BEFED.6050203-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2014-10-01 12:13 UTC (permalink / raw)
  To: Yuval Shaia, linux-rdma-u79uwXL29TY76Z2rM5mHXA


On 10/1/2014 2:55 PM, Yuval Shaia wrote:
> On transmit checksum calculation is done while copying buffer from user-space so not much benefit
But this copying can go away, right?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]         ` <5417DD0F.9090201-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-09-22 19:28           ` Yuval Shaia
@ 2014-10-02 13:00           ` Yuval Shaia
  1 sibling, 0 replies; 14+ messages in thread
From: Yuval Shaia @ 2014-10-02 13:00 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Sep 16, 2014 at 09:47:43AM +0300, Or Gerlitz wrote:
> 
> On the post you wrote "[...] resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux".
> 
> I assume you refer to NETIF_F_SG, right? so your claim is that Linux will not effectively use the driver ability to serve SG skbs unless the driver also advertizes (say) NETIF_F_IP_CSUM?!
Or,
Correct me if I'm wrong here, but I didn't see any handling of fragmented skbs in the driver.
The current implementation DMA-maps only skb->data, whereas for a fragmented skb all the frags need to be DMA-mapped, right?
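As a rough illustration of what handling a fragmented skb involves, here is a userspace model that turns a linear head plus page fragments into a gather list. The mock_* types and build_sge_list() are stand-ins invented for this sketch; in the driver the corresponding steps would be mapping the skb_headlen(skb) bytes at skb->data and then each entry of skb_shinfo(skb)->frags[].

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAGS 17	/* hypothetical cap for this model */

struct mock_frag {
	const void *addr;
	size_t len;
};

/* Userspace stand-in for an skb: linear head plus page fragments. */
struct mock_skb {
	const void *data;	/* linear part */
	size_t headlen;		/* bytes in the linear part */
	size_t nr_frags;
	struct mock_frag frags[MAX_FRAGS];
};

struct mock_sge {
	const void *addr;
	size_t len;
};

/*
 * Fill sge[] with one entry for the head (if any) plus one per fragment.
 * Returns the number of entries used, or -1 if they do not fit.
 */
static int build_sge_list(const struct mock_skb *skb,
			  struct mock_sge *sge, size_t max_sge)
{
	size_t n = 0, i;

	assert(skb != NULL && sge != NULL);
	if (skb->nr_frags > MAX_FRAGS)
		return -1;

	if (skb->headlen) {
		if (n == max_sge)
			return -1;
		sge[n].addr = skb->data;
		sge[n].len = skb->headlen;
		n++;
	}

	for (i = 0; i < skb->nr_frags; i++) {
		if (n == max_sge)
			return -1;
		sge[n].addr = skb->frags[i].addr;
		sge[n].len = skb->frags[i].len;
		n++;
	}
	return (int)n;
}
```

Mapping only the head, as the current transmit path does, corresponds to stopping after the first entry, so the payload held in the fragments would never reach the wire.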
> 
> I thought it's the other way around -- that is supporting checksum offloading is useless unless SG is supported. Can you provide pointer into the network stack code/documentation that supports your claim?
> Or.
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ib_ipoib: CSUM support in connected mode
       [not found]     ` <542BEFED.6050203-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-04 18:36       ` Yuval Shaia
  0 siblings, 0 replies; 14+ messages in thread
From: Yuval Shaia @ 2014-10-04 18:36 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 01, 2014 at 03:13:33PM +0300, Or Gerlitz wrote:
> 
> On 10/1/2014 2:55 PM, Yuval Shaia wrote:
> >On transmit checksum calculation is done while copying buffer from user-space so not much benefit
> but this copying can go away, right?
I assume this is exactly what the networking layer does when the driver reports NETIF_F_IP_CSUM.

^ permalink raw reply	[flat|nested] 14+ messages in thread
