linux-rdma.vger.kernel.org archive mirror
* rsockets and other performance
@ 2012-06-14 15:24 Pradeep Satyanarayana
       [not found] ` <4FDA0233.6090409-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Pradeep Satyanarayana @ 2012-06-14 15:24 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA, sri-r/Jw6+rmf7HQT0dZR+AlfA

Traditional sockets-based applications wanting high throughput could use 
rsockets. Since it is layered on top of uverbs, we expected to see good 
throughput numbers.
So, we started to run netperf and iperf. We observed that throughput tops 
out at about 20 Gb/s with QDR adapters. A quick "perf top" revealed a lot 
of cycles spent in memcpy().
We had hoped these numbers would be somewhat higher, since we did not 
expect the memcpy() to have such a large overhead.

Given the copy overhead, we wanted to revisit IPoIB and SDP 
performance. Hence we installed OFED-1.5.4.1 on RHEL 6.2. We found 
that for small packets SDP starts with low throughput, but seems to 
catch up with rsockets at about 16 KB packets. On the other hand, 
IPoIB CM tops out at about 10 Gb/s.

Since SDP does in-kernel RDMA, we expected the IPoIB CM and SDP numbers 
to be much closer. Again, "perf top" revealed that IPoIB was spending a 
large number of cycles in checksum computation. Out of curiosity, 
Sridhar made the following changes:

--- ipoib_cm.c.orig    2012-06-10 15:27:10.589325138 -0400
+++ ipoib_cm.c    2012-06-12 11:29:49.073262516 -0400
@@ -670,6 +670,7 @@ copied:
      skb->dev = dev;
      /* XXX get correct PACKET_ type here */
      skb->pkt_type = PACKET_HOST;
+    skb->ip_summed = CHECKSUM_UNNECESSARY;
      netif_receive_skb(skb);

@@ -1464,7 +1464,8 @@ static ssize_t set_mode(struct device *d
                 "will cause multicast packet drops\n");

          rtnl_lock();
-        dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO);
+        dev->features &= ~(NETIF_F_SG | NETIF_F_TSO);
          priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;

          if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu)


With these minimal changes, IPoIB throughput reached 19-20 Gb/s 
with just 2 threads. This was really unexpected. Given that, we wanted 
to revisit the usage of checksums in IPoIB.
So, it looks worthwhile to allow for 'checksum-less' IPoIB-CM within a 
cluster on a single subnet. From a checksum perspective, this would be 
no different from RDMA. What are your thoughts?

Thanks
Pradeep



* RE: rsockets and other performance
       [not found] ` <4FDA0233.6090409-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2012-06-14 16:02   ` Hefty, Sean
  2012-06-14 16:09   ` Sridhar Samudrala
  2012-06-14 17:22   ` Jason Gunthorpe
  2 siblings, 0 replies; 10+ messages in thread
From: Hefty, Sean @ 2012-06-14 16:02 UTC (permalink / raw)
  To: Pradeep Satyanarayana
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org,
	sri-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

> Traditional sockets-based applications wanting high throughput could use
> rsockets. Since it is layered on top of uverbs, we expected to see good
> throughput numbers.
> So, we started to run netperf and iperf. We observed that throughput tops
> out at about 20 Gb/s with QDR adapters. A quick "perf top" revealed a lot
> of cycles spent in memcpy().
> We had hoped these numbers would be somewhat higher, since we did not
> expect the memcpy() to have such a large overhead.

Someone more familiar with ipoib needs to respond regarding those changes.
For rsockets, please make sure that you've pulled the latest code. You can
improve performance by adjusting the QP size and the send/receive buffer
sizes, which can be done through config files. I started an rsocket man
page that describes those files and just pushed it out to my git tree.
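
As a minimal usage sketch (rsocket/rconnect/rsend/rclose are the librdmacm
calls; the host, port, and the example tuning-file name are illustrative
assumptions, so check the man page for the actual files):

#include <rdma/rsocket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/* Connect and send over an rsocket; build with -lrdmacm. Per-system
 * defaults such as QP and buffer sizes are read from files under
 * /etc/rdma/rsocket/, e.g. sqsize_default (names per the man page). */
int rsocket_ping(void)
{
	struct sockaddr_in addr;
	int fd = rsocket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(7471);				/* assumed port */
	inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr);	/* assumed host */

	if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
		rsend(fd, "ping", 4, 0);
	rclose(fd);
	return 0;
}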

- Sean


* Re: rsockets and other performance
       [not found] ` <4FDA0233.6090409-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2012-06-14 16:02   ` Hefty, Sean
@ 2012-06-14 16:09   ` Sridhar Samudrala
       [not found]     ` <1339690194.14317.5.camel-5vSEHtyIv2TJ4MwkZ4db91aTQe2KTcn/@public.gmane.org>
  2012-06-14 17:22   ` Jason Gunthorpe
  2 siblings, 1 reply; 10+ messages in thread
From: Sridhar Samudrala @ 2012-06-14 16:09 UTC (permalink / raw)
  To: Pradeep Satyanarayana
  Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA

On Thu, 2012-06-14 at 08:24 -0700, Pradeep Satyanarayana wrote:
> Traditional sockets-based applications wanting high throughput could use 
> rsockets. Since it is layered on top of uverbs, we expected to see good 
> throughput numbers.
> So, we started to run netperf and iperf. We observed that throughput tops 
> out at about 20 Gb/s with QDR adapters. A quick "perf top" revealed a lot 
> of cycles spent in memcpy().
> We had hoped these numbers would be somewhat higher, since we did not 
> expect the memcpy() to have such a large overhead.
> 
> Given the copy overhead, we wanted to revisit IPoIB and SDP 
> performance. Hence we installed OFED-1.5.4.1 on RHEL 6.2. We found 
> that for small packets SDP starts with low throughput, but seems to 
> catch up with rsockets at about 16 KB packets. On the other hand, 
> IPoIB CM tops out at about 10 Gb/s.
> 
> Since SDP does in-kernel RDMA, we expected the IPoIB CM and SDP numbers 
> to be much closer. Again, "perf top" revealed that IPoIB was spending a 
> large number of cycles in checksum computation. Out of curiosity, 
> Sridhar made the following changes:
> 
> --- ipoib_cm.c.orig    2012-06-10 15:27:10.589325138 -0400
> +++ ipoib_cm.c    2012-06-12 11:29:49.073262516 -0400
> @@ -670,6 +670,7 @@ copied:
>       skb->dev = dev;
>       /* XXX get correct PACKET_ type here */
>       skb->pkt_type = PACKET_HOST;
> +    skb->ip_summed = CHECKSUM_UNNECESSARY;
>       netif_receive_skb(skb);
> 
> @@ -1464,7 +1464,8 @@ static ssize_t set_mode(struct device *d
>                  "will cause multicast packet drops\n");
> 
>           rtnl_lock();
> -        dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO);
> +        dev->features &= ~(NETIF_F_SG | NETIF_F_TSO);

Enabling NETIF_F_SG improves the throughput further by avoiding an
additional kernel memcpy caused by skb_linearize() in dev_queue_xmit().
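
As a rough illustration (a sketch of the logic, not the literal
net/core/dev.c code):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Without NETIF_F_SG the stack must linearize a fragmented skb before
 * handing it to the driver; skb_linearize() memcpy()s every fragment
 * into one contiguous buffer. */
static int xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_shinfo(skb)->nr_frags && !(dev->features & NETIF_F_SG)) {
		if (skb_linearize(skb))		/* the extra memcpy */
			return -ENOMEM;
	}
	return dev->netdev_ops->ndo_start_xmit(skb, dev);
}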

Thanks
Sridhar

>           priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
> 
>           if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu)
> 
> 
> With these minimal changes, IPoIB throughput reached 19-20 Gb/s 
> with just 2 threads. This was really unexpected. Given that, we wanted 
> to revisit the usage of checksums in IPoIB.
> So, it looks worthwhile to allow for 'checksum-less' IPoIB-CM within a 
> cluster on a single subnet. From a checksum perspective, this would be 
> no different from RDMA. What are your thoughts?
> 
> Thanks
> Pradeep




* Re: rsockets and other performance
       [not found] ` <4FDA0233.6090409-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2012-06-14 16:02   ` Hefty, Sean
  2012-06-14 16:09   ` Sridhar Samudrala
@ 2012-06-14 17:22   ` Jason Gunthorpe
       [not found]     ` <20120614172245.GC6552-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2012-06-14 17:22 UTC (permalink / raw)
  To: Pradeep Satyanarayana
  Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA, sri-r/Jw6+rmf7HQT0dZR+AlfA

On Thu, Jun 14, 2012 at 08:24:35AM -0700, Pradeep Satyanarayana wrote:

> With these minimal changes, IPoIB throughput reached 19-20 Gb/s
> with just 2 threads. This was really unexpected. Given that, we wanted
> to revisit the usage of checksums in IPoIB.
> So, it looks worthwhile to allow for 'checksum-less' IPoIB-CM within
> a cluster on a single subnet. From a checksum perspective, this
> would be no different from RDMA. What are your thoughts?

There have been discussions around a 'checksum-less' IPoIB operation
for a little while.

The basic notion was to enable the checksum offload mechanism, pass
the offload information from Linux straight through to the other
side (e.g. via an extra header or something), and have the other side
reconstruct the offload indication on RX and inject it back into the
net stack.

This would be similar to the way checksum bypass works in
virtualization (Xen/KVM), where the virtualized net TX just packages
the offload data and sends it to the hypervisor kernel, which then
RX's it and restores the very same checksum offload information.

During the CM process this feature would be negotiated.

I don't think anyone ever made patches for this, but considering the
performance delta you see, it really seems worthwhile.
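
For illustration, a minimal sketch of such a prefixed header, modeled on
struct virtio_net_hdr (the struct and field names here are hypothetical;
nothing like this exists in ipoib today):

#include <linux/types.h>

/* Hypothetical on-wire header prefixed to each CM payload. The sender
 * copies the skb's offload state in; the receiver uses it to rebuild
 * ip_summed/csum_start/csum_offset, or to forward the still-unsummed
 * packet out another interface. */
struct ipoib_cm_offload_hdr {	/* hypothetical name */
	__u8	flags;		/* e.g. a hypothetical OFFLOAD_F_NEEDS_CSUM bit */
	__u8	gso_type;	/* room to forward GSO state as well */
	__be16	hdr_len;	/* length of headers preceding the payload */
	__be16	gso_size;	/* segment size if GSO state is carried */
	__be16	csum_start;	/* where checksumming should begin */
	__be16	csum_offset;	/* where the computed checksum is stored */
};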

Jason


* Re: rsockets and other performance
       [not found]     ` <20120614172245.GC6552-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2012-06-15  6:48       ` Vivek Kashyap
       [not found]         ` <alpine.LRH.2.00.1206142336220.13728-Nzfcc2Us4m+Xfo/+vza+31aTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Vivek Kashyap @ 2012-06-15  6:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pradeep Satyanarayana, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	sri-r/Jw6+rmf7HQT0dZR+AlfA



On Thu, 14 Jun 2012, Jason Gunthorpe wrote:

> On Thu, Jun 14, 2012 at 08:24:35AM -0700, Pradeep Satyanarayana wrote:
>
>> With these minimal changes, IPoIB throughput reached 19-20 Gb/s
>> with just 2 threads. This was really unexpected. Given that, we wanted
>> to revisit the usage of checksums in IPoIB.
>> So, it looks worthwhile to allow for 'checksum-less' IPoIB-CM within
>> a cluster on a single subnet. From a checksum perspective, this
>> would be no different from RDMA. What are your thoughts?
>
> There have been discussions around a 'checksum-less' IPoIB operation
> for a little while.
>
> The basic notion was to enable the checksum offload mechanism, pass
> the offload information from Linux straight through to the other
> side (e.g. via an extra header or something), and have the other side
> reconstruct the offload indication on RX and inject it back into the
> net stack.
>
> This would be similar to the way checksum bypass works in
> virtualization (Xen/KVM), where the virtualized net TX just packages
> the offload data and sends it to the hypervisor kernel, which then
> RX's it and restores the very same checksum offload information.
>
> During the CM process this feature would be negotiated.
>
> I don't think anyone ever made patches for this, but considering the
> performance delta you see, it really seems worthwhile.

How about something like below? Basically, the 'checksum-less' operation
is only enabled between hosts that both support it, by extending the
existing IB connection setup mechanism. The following also keeps the
changes confined to the ipoib-cm module.


- Add a sysctl variable, csum_simulate.

- In the ipoib-cm module:
 	if (csum_simulate)
 		advertise hardware checksum offload capabilities

- When a QP is created to a remote host, check csum_simulate:
 	if (csum_simulate)
 		include a CSUM_SIMULATE field in the RC private data when
 		setting up the connection.
 		Note: RFC 4755 utilizes this private data to exchange
 		the receive MTU and UD QPN; we just add another parameter.

 		If accepted by the other end during connection negotiation,
 		set csum_simulate_on = 1 (for the QP).

- When a QP connection request is received:
 	if (csum_simulate)
 		look for the CSUM_SIMULATE field in the private data;
 		if present, respond with CSUM_SIMULATE and set
 		csum_simulate_on = 1 for the QP, else zero it in the response.

In the above two steps one would also want to check that the peer is a
directly connected host.

- When sending data:
 	if (csum_simulate_on)
 		send the data over the ipoib-cm link normally (no data
 		checksum added)
 	else /* sending over a QP not enabled for checksum offload */
 		calculate the overall checksum and send the data

- When receiving data:
 	if (csum_simulate_on)
 		set CHECKSUM_UNNECESSARY, indicating the csum has been
 		validated

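As a sketch of the private-data extension itself (all names hypothetical;
today's ipoib_cm exchange carries only the QPN and receive MTU):

#include <linux/types.h>
#include <asm/byteorder.h>

#define IPOIB_CM_CAP_CSUM_SIMULATE	(1 << 0)	/* hypothetical capability bit */

struct ipoib_cm_data_ext {	/* hypothetical layout */
	__be32 qpn;		/* as in the existing exchange */
	__be32 mtu;
	__be32 caps;		/* new: capabilities offered/accepted */
};

/* The active side advertises; the passive side echoes only the
 * intersection, so csum_simulate_on is set on a QP only when both
 * ends agreed. */
static __u32 ipoib_cm_negotiate_caps(__u32 offered, int csum_simulate)
{
	__u32 ours = csum_simulate ? IPOIB_CM_CAP_CSUM_SIMULATE : 0;
	return offered & ours;
}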

thanks
Vivek

>
> Jason



* Re: rsockets and other performance
       [not found]         ` <alpine.LRH.2.00.1206142336220.13728-Nzfcc2Us4m+Xfo/+vza+31aTQe2KTcn/@public.gmane.org>
@ 2012-06-15 17:19           ` Jason Gunthorpe
       [not found]             ` <20120615171919.GA17224-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2012-06-15 17:19 UTC (permalink / raw)
  To: Vivek Kashyap
  Cc: Pradeep Satyanarayana, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	sri-r/Jw6+rmf7HQT0dZR+AlfA

On Thu, Jun 14, 2012 at 11:48:09PM -0700, Vivek Kashyap wrote:
> >I don't think anyone ever made patches for this, but considering the
> >performance delta you see, it really seems worthwhile.
> 
> How about something like below? Basically, the 'checksum-less'
> operation is only enabled between hosts that both support it, by
> extending the existing IB connection setup mechanism. The following
> also keeps the changes confined to the ipoib-cm module.

It is much better to do things properly (as I described) so you don't
have a gotcha when a packet gets routed.

We don't want to fake that we are doing csum, we want to forward the
csum offload data to the other side, which might discard it or might
use it to forward the packet out another interface. This is not hard,
just fiddly work.

-- 
Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>        (780)4406067x832
Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada


* Re: rsockets and other performance
       [not found]     ` <1339690194.14317.5.camel-5vSEHtyIv2TJ4MwkZ4db91aTQe2KTcn/@public.gmane.org>
@ 2012-06-17  7:22       ` Or Gerlitz
       [not found]         ` <4FDD85B3.2090406-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2012-06-17  7:22 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Pradeep Satyanarayana, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA, Shlomo Pongratz

On 6/14/2012 7:09 PM, Sridhar Samudrala wrote:
> Enabling NETIF_F_SG improves the throughput further by avoiding an 
> additional kernel memcpy caused by skb_linearize() in dev_queue_xmit().

Hi Sridhar,

If you **only** enable NETIF_F_SG for CM, does this yield any gain? Did
you have to patch the code for that?

Or.



* Re: rsockets and other performance
       [not found]         ` <4FDD85B3.2090406-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-06-17 21:20           ` Pradeep Satyanarayana
  0 siblings, 0 replies; 10+ messages in thread
From: Pradeep Satyanarayana @ 2012-06-17 21:20 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Sridhar Samudrala, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kashyapv-r/Jw6+rmf7HQT0dZR+AlfA, Shlomo Pongratz

On 06/17/2012 12:22 AM, Or Gerlitz wrote:
> On 6/14/2012 7:09 PM, Sridhar Samudrala wrote:
>> Enabling NETIF_F_SG improves the throughput further by avoiding an 
>> additional kernel memcpy caused by skb_linearize() in dev_queue_xmit().
>
> Hi Sridhar,
>
> If you **only** enable NETIF_F_SG for CM, does this yield any gain?
> Did you have to patch the code for that?
>
Hi Or, we did not try NETIF_F_SG alone for IPoIB CM; we tried it along 
with NETIF_F_IP_CSUM. We did run into issues with netperf, i.e. netperf 
errors out with NETIF_F_SG enabled. However, iperf worked fine. We did 
not investigate that further.

Thanks
Pradeep




* Re: rsockets and other performance
       [not found]             ` <20120615171919.GA17224-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2012-06-18  6:37               ` Vivek Kashyap
       [not found]                 ` <alpine.LRH.2.00.1206172330220.13728-Nzfcc2Us4m+Xfo/+vza+31aTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Vivek Kashyap @ 2012-06-18  6:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pradeep Satyanarayana, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	sri-r/Jw6+rmf7HQT0dZR+AlfA


On Fri, 15 Jun 2012, Jason Gunthorpe wrote:

> On Thu, Jun 14, 2012 at 11:48:09PM -0700, Vivek Kashyap wrote:
>>> I don't think anyone ever made patches for this, but considering the
>>> performance delta you see, it really seems worthwhile.
>>
>> How about something like below? Basically, the 'checksum-less'
>> operation is only enabled between hosts that both support it, by
>> extending the existing IB connection setup mechanism. The following
>> also keeps the changes confined to the ipoib-cm module.
>
> It is much better to do things properly (as I described) so you don't
> have a gotcha when a packet gets routed.
>
> We don't want to fake that we are doing csum, we want to forward the
> csum offload data to the other side, which might discard it or might
> use it to forward the packet out another interface. This is not hard,
> just fiddly work.

We certainly need to ensure that the routed packet is not left without a
checksum. Either one does not enable a 'checksum-less' link on routers, or
else the router must be able to take care of it when it forwards the packet.

By 'csum offload data' do you mean something akin to csum_start/csum_offset?
How would one transmit this information over IPoIB-CM from the
sending host to the router? If we can do that, it will certainly be useful.

Otherwise, what I was proposing was that a router is either configured not
to accept a checksum-less configuration on IPoIB-CM, or, if checksum-less
links are supported, then when forwarding to a 'checksum-less' interface it
does nothing, but when forwarding to any other interface the checksum is added.
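
As a sketch of the router-side fixup under the second option (assuming the
deferred-checksum state arrives as a CHECKSUM_PARTIAL skb with valid
csum_start/csum_offset; skb_checksum_help() is the existing kernel helper
that resolves such a checksum in software):

#include <linux/skbuff.h>

/* Before forwarding out an interface that did not negotiate the
 * checksum-less feature, compute the real checksum in software. */
static int forward_csum_fixup(struct sk_buff *skb, bool out_link_csum_less)
{
	if (!out_link_csum_less && skb->ip_summed == CHECKSUM_PARTIAL)
		return skb_checksum_help(skb);	/* fills in the checksum */
	return 0;
}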

This addition is really done in the ipoib-cm driver, to insulate it from the
rest of the stack. The 'faking' of checksum offload is only toward the local
IP stack, so that it does not do the calculation but leaves it to the
IPoIB-CM driver. That is essentially what we do with virtio or hardware
offload mechanisms.

We are working on a test patch. That should make it easier to discuss the
gaps, or whether the solution is complete.

thanks
Vivek

>
> -- 
> Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>        (780)4406067x832
> Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada



* Re: rsockets and other performance
       [not found]                 ` <alpine.LRH.2.00.1206172330220.13728-Nzfcc2Us4m+Xfo/+vza+31aTQe2KTcn/@public.gmane.org>
@ 2012-06-18 17:53                   ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2012-06-18 17:53 UTC (permalink / raw)
  To: Vivek Kashyap
  Cc: Pradeep Satyanarayana, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	sri-r/Jw6+rmf7HQT0dZR+AlfA

On Sun, Jun 17, 2012 at 11:37:33PM -0700, Vivek Kashyap wrote:

> >On Thu, Jun 14, 2012 at 11:48:09PM -0700, Vivek Kashyap wrote:

> >We don't want to fake that we are doing csum, we want to forward the
> >csum offload data to the other side, which might discard it or might
> >use it to forward the packet out another interface. This is not hard,
> >just fiddly work.
> 
> We certainly need to ensure that the routed packet is not left without a
> checksum. Either one does not enable a 'checksum-less' link on
> routers, or else the router must be able to take care of it when it
> forwards the packet.

The latter is vastly preferred.

> By 'csum offload data' do you mean something akin to csum_start/csum_offset?
> How would one transmit this information over IPoIB-CM from the
> sending host to the router? If we can do that, it will certainly be useful.

That, plus an ip_summed "needed" flag, seems to match what virtio_net is
doing. Review virtio_net.c:receive_buf and xmit_skb to see how it
handles the packet.
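
As a sketch of that RX restore (the prefixed header, its flag, and the
struct name are the hypothetical ones sketched earlier in the thread;
skb_partial_csum_set() is the real helper virtio_net uses):

#include <linux/skbuff.h>

#define OFFLOAD_F_NEEDS_CSUM	(1 << 0)	/* hypothetical flag */

static void cm_rx_restore_csum(struct sk_buff *skb,
			       const struct ipoib_cm_offload_hdr *hdr)
{
	if (hdr->flags & OFFLOAD_F_NEEDS_CSUM) {
		/* Sender deferred the checksum: rebuild partial-csum state. */
		if (!skb_partial_csum_set(skb, be16_to_cpu(hdr->csum_start),
					  be16_to_cpu(hdr->csum_offset)))
			return;	/* malformed offsets: leave ip_summed alone */
	} else {
		/* Sender side already verified (or never needed) the csum. */
		skb->ip_summed = CHECKSUM_UNNECESSARY;
	}
}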

Also, I am just reminded that there has been some interest in
forwarding GSO through the CM connection as well. This would speed
things up and avoid the need for the wonky 64k MTU.

That is, rework the CM mode to follow more closely how virtio-net works.

Like virtio_net, you'd have to prefix a small header or, if you are
lucky, maybe encode things in the 32 bits of immediate data.

> Otherwise, what I was proposing was that a router is either
> configured not to accept a checksum-less configuration on IPoIB-CM, or

This sort of solution is something I'd like to avoid; it is not
necessary to make something so fragile.

> This addition is really done in the ipoib-cm driver, to insulate it from
> the rest of the stack. The 'faking' of checksum offload is only toward
> the local IP stack, so that it does not do the calculation but leaves it
> to the IPoIB-CM driver. That is essentially what we do with virtio or
> hardware offload mechanisms.

I mean 'faking' in the sense that you tell the stack you will compute the
checksum, throw away the information needed to do that calculation,
and then send a packet on the wire that can never be check-summed
properly.

I don't see the point of keeping things contained to the CM part of
the ipoib driver; if the ipoib packet handling needs to be changed, then
change it.

Jason

