public inbox for linux-rdma@vger.kernel.org
* sense remote hardware address change by rdma-cm applications
@ 2010-07-19 21:42 Or Gerlitz
       [not found] ` <AANLkTimmWiNqHJIqSEKbY-X6mSx6zA19p__JDYPEmp8b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2010-07-19 21:42 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Jason Gunthorpe, Steve Wise, linux-rdma

Today, the kernel neighbour maintenance state machine / engine
doesn't come into play for neighbours created on behalf of rdma-cm
consumers. This is b/c the send path is offloaded away from the
network stack to the app QP, and as such the neighbour created
following the ARP request / reply initiated by rdma_resolve_addr is
quickly getting aged and deleted, am I correct in that?

This behaviour makes rdma-cm RC apps sense a remote hardware address
change based only on the RC QP timeout, whereas UD apps have no way
other than implementing some sort of keep-alive / probing mechanism to
make sure their AH is valid, so how about


A. ref a neighbour created on behalf of or used by an rdma-cm ID  (*)

B. enhance the rdma-cm address_change event to report on remote
hardware address change, based on neighbour events

Or.

(*) would a per-ID neigh_hold() call (paired with neigh_release() when
the ID gets destroyed) work to that end?
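
A rough user-space model of the pairing asked about in (*), as a sketch
only. Everything prefixed model_ is a made-up stand-in, not the kernel's
struct neighbour or the rdma-cm ID; the point is just that the reference
taken when the ID attaches to the neighbour is dropped exactly once, when
the ID is destroyed. Whether holding a reference is enough to keep the
neighbour probes going is a separate question that this model does not
answer.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for a neighbour entry; only the refcount matters. */
struct model_neigh {
    int refcnt;
    int freed;      /* set to 1 when the last reference is dropped */
};

/* Hypothetical per-ID state: the ID pins the neighbour it resolved. */
struct model_cm_id {
    struct model_neigh *neigh;
};

static void model_neigh_hold(struct model_neigh *n)
{
    n->refcnt++;
}

static void model_neigh_release(struct model_neigh *n)
{
    assert(n->refcnt > 0);
    if (--n->refcnt == 0)
        n->freed = 1;
}

/* Called once address resolution has produced a neighbour entry. */
static void model_id_attach_neigh(struct model_cm_id *id, struct model_neigh *n)
{
    model_neigh_hold(n);
    id->neigh = n;
}

/* Called when the ID is destroyed; drops the reference taken above. */
static void model_id_destroy(struct model_cm_id *id)
{
    if (id->neigh) {
        model_neigh_release(id->neigh);
        id->neigh = NULL;
    }
}
```

In this model the entry outlives the neighbour table's own reference for
as long as the ID exists, and goes away only after model_id_destroy().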
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: sense remote hardware address change by rdma-cm applications
       [not found] ` <AANLkTimmWiNqHJIqSEKbY-X6mSx6zA19p__JDYPEmp8b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-20  0:14   ` Jason Gunthorpe
       [not found]     ` <20100720001436.GH7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2010-07-20  0:14 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Sean Hefty, Steve Wise, linux-rdma

On Tue, Jul 20, 2010 at 12:42:06AM +0300, Or Gerlitz wrote:
> Today, the kernel neighbour maintenance state machine / engine
> doesn't come into play for neighbours created on behalf of rdma-cm
> consumers. This is b/c the send path is offloaded away from the
> network stack to the app QP, and as such the neighbour created
> following the ARP request / reply initiated by rdma_resolve_addr is
> quickly getting aged and deleted, am I correct in that?

It is a bit wider problem than just ND entries, changes in routing can
also alter the L2 address, so that needs to be tracked as well. 

Bit of a rat hole unfortunately, this is back to the original criticisms
from netdev of this whole separated stack idea - it isn't integrated,
so where do you draw the line? What gets left out?

Today, it is pretty clear that only the CM portion integrates at all
with netdev and after that things are separate.

So.. I think to tackle this you need to start looking at how the
dst_entry structure works in netdev and apply the same idea to RDMA-CM
and reflect the changes in AH back to the QP owner.

Basically, holding the ND and route structure should work identically
to TCP, not be different and half baked. If you recall Sean recently
put through a big patch set fixing this kind of divergence in the
route lookup area.. Doing anything different from TCP should be well
and completely justified.

Is this an iwarp problem too? Not sure how L3->L2 translation works
there.

> This behaviour makes rdma-cm RC apps sense a remote hardware address
> change based only on the RC QP timeout, whereas UD apps have no way
> other than implementing some sort of keep-alive / probing mechanism to
> make sure their AH is valid, so how about

Not sure what you do about UD.. Maybe RDMA-CM learns to do UC where
the only action is to register notification monitors for L2 addressing
changes in the kernel?

Ugly in user-space though.. Can this be hidden with Sean's recent work
on simplified programming models?

Jason

* Re: sense remote hardware address change by rdma-cm applications
       [not found]     ` <20100720001436.GH7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-20  7:25       ` Or Gerlitz
       [not found]         ` <4C454F80.1060808-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2010-07-20  7:25 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Sean Hefty, Steve Wise, linux-rdma

Jason Gunthorpe wrote:
> It is a bit wider problem than just ND entries, changes in routing can
> also alter the L2 address, so that needs to be tracked as well. 

sure, when we did the address change work, see commit dd5bdff "RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event", the problem I wanted to solve was related to 
the local bonding. Over the review thread, remote address change related 
to bonding fail-over and routing changes were mentioned, and left to future work.


> this is back to original criticisms from netdev of this whole separated 
> stack idea - it isn't integrated, so where do you draw the line? What gets left out? 
> Today, it is pretty clear that only the CM portion integrates at all
> with netdev and after that things are separate.

the address change event was an attempt to make the CM part which integrates with netdev
go a step further and help the data path which is offloaded to be more consistent with netdev,
this email is about going another step.

> So.. I think to tackle this you need to start looking at how the
> dst_entry structure works in netdev and apply the same idea to RDMA-CM
> and reflect the changes in AH back to the QP owner.

I can take a look (pointer would be very much appreciated...) still, the dst entry is used
for every netdev xmit where here the xmit is offloaded, so I don't see what could be really used from the dst code, but I might be wrong. The rdma app uses the neighbour once, upon address resolving, and I was trying to see if we can ref the neighbour so the neigh sub-system probes would keep going even though the neighbour is not directly used.

> Is this an iwarp problem too? Not sure how L3->L2 translation works there.

I never managed to understand how address resolving really works with iwarp... 

Doing a bit of detective work... you can see that addr4_resolve says

>         /* If the device does ARP internally, return 'done' */
>         if (rt->idev->dev->flags & IFF_NOARP) {
>                 rdma_copy_addr(addr, rt->idev->dev, NULL);
>                 goto put;
>         }

and later cma_connect_iw places into the iwarp cm the src/dst IP addresses

>         sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr;
>         cm_id->local_addr = *sin;
>         sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr;
>         cm_id->remote_addr = *sin;

so all the iwarp providers do ARP resolving in their TOE stack?! Steve, can you
clarify that?

 
> Not sure what you do about UD.. Maybe RDMA-CM learns to do UC where
> the only action is to register notification monitors for L2 addressing
> changes in the kernel?

The problem exists for all IB transports (even for RD, if it had been implemented...), the only difference between the U and R ones is that for the R's, if the remote side vanished, eventually the IB HW would let you know about that in the form of a CQ error.

> Can this be hidden with Sean's recent work on simplified programming models?

not sure how Sean's work relates to this proposed change.

Or.

* Re: sense remote hardware address change by rdma-cm applications
       [not found]         ` <4C454F80.1060808-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-07-20 17:22           ` Jason Gunthorpe
  2010-07-20 18:12           ` Steve Wise
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2010-07-20 17:22 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Sean Hefty, Steve Wise, linux-rdma

On Tue, Jul 20, 2010 at 10:25:52AM +0300, Or Gerlitz wrote:

> > So.. I think to tackle this you need to start looking at how the
> > dst_entry structure works in netdev and apply the same idea to RDMA-CM
> > and reflect the changes in AH back to the QP owner.
> 
> I can take a look (pointer would be very much appreciated...) still,
> the dst entry is used for every netdev xmit where here the xmit is
> offloaded, so I don't see what could be really used from the dst
> code, but I might be wrong. The rdma app uses the neighbour once,
> upon address resolving, and I was trying to see if we can ref the
> neighbour so the neigh sub-system probes would keep going even
> though the neighbour is not directly used.

It has been a while since I looked through this .. but, IIRC, the
general idea was that the socket held onto a cached dst and then at
each send it would use that dst to generate the L2 headers. Somehow
the dst would become invalidated when the routing cache was flushed
out.

So, basically, if you can add to RDMA-CM a way to get, hold and re-get
the dst you have solved the first problem - how do you know the current
routing information, and hold onto it, keep it in caches, etc.

The second problem is how you get notified that the dst may have
been changed. Sockets seem to basically just poll every packet, so you
might need to use some netdev notifications, maybe also a timer?

I'd see a flow like this:
 - in the current route lookup code stash the dst
 - add a function to freshen the dst
 - hook events that might indicate the dst is invalid
 - on event trigger freshen the dst, regenerate the L2 address info
   and compare it to what is already in use
 - If different, send an event to user space.

stashing the dst lets you get back to the L2 information by using the
routing cache, and by holding onto a neighbor reference (in the dst)

Also, while doing this you are going to need to do something to have
the kernel send ND probes to keep the ND entry fresh when the
connection is open. Not sure, but I think this also has something to
do with the DST.
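
The flow listed above can be sketched in user-space C. This is a model
under stated assumptions, not kernel code: struct model_dst, struct
model_id and both functions are made-up names, the "freshen" step is
represented by the caller passing in the freshly resolved entry, and the
event sent to user space is reduced to a counter.

```c
#include <assert.h>
#include <string.h>

#define MODEL_L2_ADDR_LEN 6

/* Stand-in for what a refreshed dst/neighbour lookup would yield. */
struct model_dst {
    unsigned char l2_addr[MODEL_L2_ADDR_LEN];
};

/* Hypothetical per-ID tracking state. */
struct model_id {
    unsigned char in_use_l2[MODEL_L2_ADDR_LEN]; /* address the AH was built from */
    int addr_change_events;                     /* events raised so far */
};

/* Step 1: stash the result of the initial route lookup. */
static void model_stash(struct model_id *id, const struct model_dst *dst)
{
    memcpy(id->in_use_l2, dst->l2_addr, MODEL_L2_ADDR_LEN);
    id->addr_change_events = 0;
}

/*
 * Steps 3-5: called from whatever hook signals that the dst may be stale.
 * 'fresh' is the re-resolved entry. Returns 1 if an address change event
 * was raised, 0 if the L2 address is unchanged.
 */
static int model_on_net_event(struct model_id *id, const struct model_dst *fresh)
{
    assert(id != NULL && fresh != NULL);
    if (memcmp(fresh->l2_addr, id->in_use_l2, MODEL_L2_ADDR_LEN) == 0)
        return 0;
    memcpy(id->in_use_l2, fresh->l2_addr, MODEL_L2_ADDR_LEN);
    id->addr_change_events++;
    return 1;
}
```

Comparing against the address actually in use, rather than reacting to
every notification, keeps spurious events away from the consumer when a
route or neighbour update leaves the L2 address unchanged.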

Jason

* Re: sense remote hardware address change by rdma-cm applications
       [not found]         ` <4C454F80.1060808-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  2010-07-20 17:22           ` Jason Gunthorpe
@ 2010-07-20 18:12           ` Steve Wise
       [not found]             ` <4C45E701.7030501-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Steve Wise @ 2010-07-20 18:12 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Jason Gunthorpe, Sean Hefty, linux-rdma

Or Gerlitz wrote:
> Jason Gunthorpe wrote:
>   
>> It is a bit wider problem than just ND entries, changes in routing can
>> also alter the L2 address, so that needs to be tracked as well. 
>>     
>
> sure, when we did the address change work, see commit dd5bdff "RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event", the problem I wanted to solve was related to 
> the local bonding. Over the review thread, remote address change related 
> to bonding fail-over and routing changes were mentioned, and left to future work.
>
>
>   
>> this is back to original criticisms from netdev of this whole separated 
>> stack idea - it isn't integrated, so where do you draw the line? What gets left out? 
>> Today, it is pretty clear that only the CM portion integrates at all
>> with netdev and after that things are separate.
>>     
>
> the address change event was an attempt to make the CM part which integrates with netdev
> go a step further and help the data path which is offloaded to be more consistent with netdev,
> this email is about going another step.
>
>   
>> So.. I think to tackle this you need to start looking at how the
>> dst_entry structure works in netdev and apply the same idea to RDMA-CM
>> and reflect the changes in AH back to the QP owner.
>>     
>
> I can take a look (pointer would be very much appreciated...) still, the dst entry is used
> for every netdev xmit where here the xmit is offloaded, so I don't see what could be really used from the dst code, but I might be wrong. The rdma app uses the neighbour once, upon address resolving, and I was trying to see if we can ref the neighbour so the neigh sub-system probes would keep going even though the neighbour is not directly used.
>
>   
>> Is this an iwarp problem too? Not sure how L3->L2 translation works there.
>>     
>
> I never managed to understand how address resolving really works with iwarp... 
>
> Doing a bit of detective work... you can see that addr4_resolve says
>
>   
>>         /* If the device does ARP internally, return 'done' */
>>         if (rt->idev->dev->flags & IFF_NOARP) {
>>                 rdma_copy_addr(addr, rt->idev->dev, NULL);
>>                 goto put;
>>         }
>>     
>
> and later cma_connect_iw places into the iwarp cm the src/dst IP addresses
>
>   
>>         sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr;
>>         cm_id->local_addr = *sin;
>>         sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr;
>>         cm_id->remote_addr = *sin;
>>     
>
> so all the iwarp providers do ARP resolving in their TOE stack?! Steve, can you
> clarify that?
>
>   

The Ammasso driver uses the IFF_NOARP, and I think actually that is the 
only iwarp driver that uses it. 


The cxgb3/4 drivers do not set IFF_NOARP and rely on ND being done as 
part of connection setup.  The driver will initiate ND if there isn't a 
neigh entry available at the time the iwarp driver tries to send a SYN 
or SYN/ACK.  So even though the rdma_cm does ND initially, the cxgb* 
drivers don't assume that.  The code that handles all this is in 
cxgb3.ko.  See drivers/net/cxgb3/l2t.c.  The iwarp driver code that uses 
the L2T services is mainly in drivers/infiniband/hw/cxgb3/iwch_cm.c. 


The cxgb* drivers actually reference the neigh and dst structs until the
offload connection is gone.  Also if the offloaded connection has
problems transmitting (due to an L2 address change, for example), then
the driver will initiate ND again by calling neigh_event_send().  See
t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
peer_abort() from iwch_cm.c when the HW tells us it's retransmitting too
much.
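
As a rough illustration of that recovery path, here is a user-space model
in C. It is a sketch only: the model_ names are invented for this example
and do not correspond to the cxgb* sources, and the probe request is
reduced to a counter where the real driver would ask the neighbour layer
to resolve the address again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values reported by the offload hardware. */
enum model_hw_status {
    MODEL_HW_OK,
    MODEL_HW_TOO_MANY_RETRANSMITS
};

/* Stand-in for the neighbour entry the connection keeps referenced. */
struct model_neigh {
    int probes_requested;
};

struct model_conn {
    struct model_neigh *neigh;
};

/* Stand-in for asking the neighbour layer to re-run discovery. */
static void model_request_probe(struct model_neigh *n)
{
    n->probes_requested++;
}

/* Returns 1 if a new probe was requested, 0 otherwise. */
static int model_on_hw_status(struct model_conn *c, enum model_hw_status st)
{
    assert(c != NULL && c->neigh != NULL);
    if (st != MODEL_HW_TOO_MANY_RETRANSMITS)
        return 0;
    model_request_probe(c->neigh);
    return 1;
}
```

The connection can only do this because it still holds the neighbour it
was set up with, which is the same reference question raised at the start
of the thread.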


The cxgb* drivers also handle routing redirects, but I think that path 
has bugs.


What doesn't happen is active positive feedback during the connection to
avoid NUD.  I.e. once the connection is set up, nobody calls
dst_confirm().  It is only called during connection setup/teardown.



Steve.





>  
>   
>> Not sure what you do about UD.. Maybe RDMA-CM learns to do UC where
>> the only action is to register notification monitors for L2 addressing
>> changes in the kernel?
>>     
>
> The problem exists for all IB transports (even for RD, if it had been implemented...), the only difference between the U and R ones is that for the R's, if the remote side vanished, eventually the IB HW would let you know about that in the form of a CQ error.
>
>   
>> Can this be hidden with Sean's recent work on simplified programming models?
>>     
>
> not sure how Sean's work relates to this proposed change.
>
> Or.

* Re: sense remote hardware address change by rdma-cm applications
       [not found]             ` <4C45E701.7030501-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-20 18:46               ` Jason Gunthorpe
       [not found]                 ` <20100720184620.GJ7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-07-21 14:33               ` Or Gerlitz
  1 sibling, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2010-07-20 18:46 UTC (permalink / raw)
  To: Steve Wise; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

On Tue, Jul 20, 2010 at 01:12:17PM -0500, Steve Wise wrote:

> The cxgb* drivers actually reference the neigh and dst structs until the
> offload connection is gone.  Also if the offloaded connection has
> problems transmitting (due to an L2 address change, for example), then
> the driver will initiate ND again by calling neigh_event_send().  See
> t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
> peer_abort() from iwch_cm.c when the HW tells us it's retransmitting too
> much.

That strikes me as mildly scary.. The cxgb can't possibly get the
right dst (ie, the same dst that the RDMA CM got) in all the corner
cases? Ie how can setting oif to 0 in iwch_cm.c:find_route be right??

So, looks like there is a larger cleanup here, if the RDMACM holds the
dst and has functions to freshen it/track it then the iwarp driver
should rely on the RDMACM to manage the dst..

In other words, moving the dst handling from iwch_cm into RDMACM would
also mostly satisfy what Or is trying to do.

Does that make sense to you Steve?

How does the cxgb3 driver know when to update the HW if the dst/nd
entries change?

Jason

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                 ` <20100720184620.GJ7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-20 19:20                   ` Steve Wise
       [not found]                     ` <4C45F6F5.6050008-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2010-07-20 19:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

Jason Gunthorpe wrote:
> On Tue, Jul 20, 2010 at 01:12:17PM -0500, Steve Wise wrote:
>
>   
>> The cxgb* drivers actually reference the neigh and dst structs until the
>> offload connection is gone.  Also if the offloaded connection has
>> problems transmitting (due to an L2 address change, for example), then
>> the driver will initiate ND again by calling neigh_event_send().  See
>> t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
>> peer_abort() from iwch_cm.c when the HW tells us it's retransmitting too
>> much.
>>     
>
> That strikes me as mildly scary.. The cxgb can't possibly get the
> right dst (ie, the same dst that the RDMA CM got) in all the corner
> cases? Ie how can setting oif to 0 in iwch_cm.c:find_route be right??
>
>   


I guess it should be using the oif from the cm_id?


> So, looks like there is a larger cleanup here, if the RDMACM holds the
> dst and has functions to freshen it/track it then the iwarp driver
> should rely on the RDMACM to manage the dst..
>
> In other words, moving the dst handling from iwch_cm into RDMACM would
> also mostly satisfy what Or is trying to do.
>
> Does that make sense to you Steve?
>   

Yes, in principle. 

If you want to move all this into the RDMACM, then an interface must be 
devised so the drivers can tell the RDMACM that an offload connection is 
failing and probably needs ND/NUD done.  Or some such feedback 
interface.  And the RDMACM needs to call the devices if something 
changes like routing redirects I guess.   You might want the device to 
specify whether it wants the rdma-cm to handle all this or not.  Some 
devices might be better able to handle this stuff.


> How does the cxgb3 driver know when to update the HW if the dst/nd
> entries change?
>   

It uses netevents.  See nb_callback() in drivers/net/cxgb3/cxgb3_offload.c.



> Jason
>   


* Re: sense remote hardware address change by rdma-cm applications
       [not found]                     ` <4C45F6F5.6050008-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-20 20:30                       ` Jason Gunthorpe
       [not found]                         ` <20100720203044.GK7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2010-07-20 20:30 UTC (permalink / raw)
  To: Steve Wise; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

On Tue, Jul 20, 2010 at 02:20:21PM -0500, Steve Wise wrote:

> I guess it should be using the oif from the cm_id?

Not sure exactly what is best here :|

>> So, looks like there is a larger cleanup here, if the RDMACM holds the
>> dst and has functions to freshen it/track it then the iwarp driver
>> should rely on the RDMACM to manage the dst..
>>
>> In other words, moving the dst handling from iwch_cm into RDMACM would
>> also mostly satisfy what Or is trying to do.
>>
>> Does that make sense to you Steve?
>>   
>
> Yes, in principle. 
>
> If you want to move all this into the RDMACM, then an interface must be  
> devised so the drivers can tell the RDMACM that an offload connection is  
> failing and probably needs ND/NUD done.  Or some such feedback  
> interface.  And the RDMACM needs to call the devices if something  
> changes like routing redirects I guess.   

I think if RDMACM manages the dst and lets the devices access it then
all the existing netdev infrastructure for poking at a dst should be
available to the device?

> You might want the device to specify whether it wants the rdma-cm to
> handle all this or not.  Some devices might be better able to handle
> this stuff.

?? either you integrate with netdev in this area or your device is
broken :( :( Ie doing ND under the covers is broken, it breaks corner
case netdev ND management stuff like static ND entries. Same for ICMP
redirects, same for route lookups and caching, same for route PMTU
.. :(

IMHO, going down the path of integration is all or nothing, you don't
get to support things like Ammasso doing separate ND while providing
much fuller integration for cxgb. That just creates a huge complex
mess for end users.

>> How does the cxgb3 driver know when to update the HW if the dst/nd
>> entries change?

> It uses netevents.  See nb_callback() in
> drivers/net/cxgb3/cxgb3_offload.c.

What about route table changes?

Jason

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                         ` <20100720203044.GK7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-20 20:50                           ` Steve Wise
       [not found]                             ` <4C460BFD.5010707-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2010-07-20 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

Jason Gunthorpe wrote:
> On Tue, Jul 20, 2010 at 02:20:21PM -0500, Steve Wise wrote:
>
>   
>> I guess it should be using the oif from the cm_id?
>>     
>
> Not sure exactly what is best here :|
>
>   
>>> So, looks like there is a larger cleanup here, if the RDMACM holds the
>>> dst and has functions to freshen it/track it then the iwarp driver
>>> should rely on the RDMACM to manage the dst..
>>>
>>> In other words, moving the dst handling from iwch_cm into RDMACM would
>>> also mostly satisfy what Or is trying to do.
>>>
>>> Does that make sense to you Steve?
>>>   
>>>       
>> Yes, in principle. 
>>
>> If you want to move all this into the RDMACM, then an interface must be  
>> devised so the drivers can tell the RDMACM that an offload connection is  
>> failing and probably needs ND/NUD done.  Or some such feedback  
>> interface.  And the RDMACM needs to call the devices if something  
>> changes like routing redirects I guess.   
>>     
>
> I think if RDMACM manages the dst and lets the devices access it then
> all the existing netdev infrastructure for poking at a dst should be
> available to the device?
>   


Yes. But I'm not sure exactly how the logic I described previously for
cxgb* would be handled in the design being ironed out here.



>   
>> You might want the device to specify whether it wants the rdma-cm to
>> handle all this or not.  Some devices might be better able to handle
>> this stuff.
>>     
>
> ?? either you integrate with netdev in this area or your device is
> broken :( :( Ie doing ND under the covers is broken, it breaks corner
> case netdev ND management stuff like static ND entries. Same for ICMP
> redirects, same for route lookups and caching, same for route PMTU
> .. :(
>
> IMHO, going down the path of integration is all or nothing, you don't
> get to support things like Ammasso doing separate ND while providing
> much fuller integration for cxgb. That just creates a huge complex
> mess for end users.
>
>   


Guess you'd have to remove the Ammasso driver then. ;)


>>> How does the cxgb3 driver know when to update the HW if the dst/nd
>>> entries change?
>>>       
>
>   
>> It uses netevents.  See nb_callback() in
>> drivers/net/cxgb3/cxgb3_offload.c.
>>     
>
> What about route table changes?
>
>   

Currently route table changes don't have any effect on existing
connections. Only new connections would be affected.


Steve.

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                             ` <4C460BFD.5010707-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-20 20:57                               ` Jason Gunthorpe
       [not found]                                 ` <20100720205746.GL7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2010-07-20 20:57 UTC (permalink / raw)
  To: Steve Wise; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

On Tue, Jul 20, 2010 at 03:50:05PM -0500, Steve Wise wrote:

>> I think if RDMACM manages the dst and lets the devices access it then
>> all the existing netdev infrastructure for poking at a dst should be
>> available to the device?
>
> Yes. But I'm not sure exactly how the logic I described previously for
> cxgb* would be handled in the design being ironed out here.

I'm thinking something like this..

- The RDMA CM gets the dst from its route lookup, locks it and stores
  it.
- Instead of doing a route lookup cxgb gets the dst from RDMA CM,
  locks it and stores it
- RDMA CM traps all notifications/etc and generates callback to cxgb
  to say the dst has changed.
- cxgb releases the old dst and grabs the new one, updates the HW,
  etc.

Basically the same as what you have now, but all the logic to find
and monitor the dst moves to RDMA CM..

redirects/etc are all handled by netdev/rdma cm and just generate the
same 'dst has changed' call back to cxgb..

Or's user space notification stuff hooks the same callback to generate
a notification to userspace about the new dst.

All the stuff you do now with the dst you can keep doing, you just
remove all the route lookup and netdev hooking and get the dst from
RDMA CM instead.
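
A user-space C model of the ownership split described in this message,
as a sketch under stated assumptions. None of the model_ names exist in
the kernel or in any driver; the listener stands in for a driver such as
cxgb (or for the user-space notification path), and "locking" the dst is
modelled as a plain reference count.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LISTENERS 4

/* Stand-in for a dst entry; refcnt models who is holding it. */
struct model_dst {
    int id;
    int refcnt;
};

typedef void (*model_dst_changed_fn)(void *ctx, struct model_dst *new_dst);

/* Hypothetical CM-side state: the CM owns lookup and monitoring. */
struct model_cm {
    struct model_dst *dst;
    model_dst_changed_fn cb[MODEL_MAX_LISTENERS];
    void *ctx[MODEL_MAX_LISTENERS];
    size_t nr;
};

/* Hypothetical driver-side state: it only consumes what the CM hands out. */
struct model_driver {
    struct model_dst *dst;
    int hw_updates;
};

/* Driver callback: release the old dst, hold the new one, update the HW. */
static void model_driver_dst_changed(void *ctx, struct model_dst *new_dst)
{
    struct model_driver *drv = ctx;

    assert(new_dst != NULL);
    if (drv->dst)
        drv->dst->refcnt--;
    new_dst->refcnt++;
    drv->dst = new_dst;
    drv->hw_updates++;
}

static int model_cm_register(struct model_cm *cm, model_dst_changed_fn fn,
                             void *ctx)
{
    if (cm->nr == MODEL_MAX_LISTENERS)
        return -1;
    cm->cb[cm->nr] = fn;
    cm->ctx[cm->nr] = ctx;
    cm->nr++;
    return 0;
}

/* CM side: swap in the new dst, then tell every listener about it. */
static void model_cm_set_dst(struct model_cm *cm, struct model_dst *new_dst)
{
    size_t i;

    if (cm->dst)
        cm->dst->refcnt--;
    new_dst->refcnt++;
    cm->dst = new_dst;
    for (i = 0; i < cm->nr; i++)
        cm->cb[i](cm->ctx[i], new_dst);
}
```

Routing every change (initial lookup, redirect, neighbour update) through
the one 'dst has changed' callback means a second listener for user-space
notification can be added without the driver knowing about it.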

Jason

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                                 ` <20100720205746.GL7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-20 21:03                                   ` Steve Wise
       [not found]                                     ` <4C460F08.7030304-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  2010-07-21 14:40                                   ` Or Gerlitz
  1 sibling, 1 reply; 16+ messages in thread
From: Steve Wise @ 2010-07-20 21:03 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, Sean Hefty, linux-rdma

Jason Gunthorpe wrote:
> On Tue, Jul 20, 2010 at 03:50:05PM -0500, Steve Wise wrote:
>
>   
>>> I think if RDMACM manages the dst and lets the devices access it then
>>> all the existing netdev infrastructure for poking at a dst should be
>>> available to the device?
>>>       
>> Yes. But I'm not sure exactly how the logic I described previously for
>> cxgb* would be handled in the design being ironed out here.
>>     
>
> I'm thinking something like this..
>
> - The RDMA CM gets the dst from its route lookup, locks it and stores
>   it.
> - Instead of doing a route lookup cxgb gets the dst from RDMA CM,
>   locks it and stores it
> - RDMA CM traps all notifications/etc and generates callback to cxgb
>   to say the dst has changed.
> - cxgb releases the old dst and grabs the new one, updates the HW,
>   etc.
>
> Basically the same as what you have now, but all the logic to find
> and monitor the dst moves to RDMA CM..
>
> redirects/etc are all handled by netdev/rdma cm and just generate the
> same 'dst has changed' call back to cxgb..
>
> Or's user space notification stuff hooks the same callback to generate
> a notification to userspace about the new dst.
>
> All the stuff you do now with the dst you can keep doing, you just
> remove all the route lookup and netdev hooking and get the dst from
> RDMA CM instead.
>
> Jason
>   

Sounds like this would work nicely.


Steve.



* Re: sense remote hardware address change by rdma-cm applications
       [not found]                                     ` <4C460F08.7030304-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-20 21:15                                       ` Steve Wise
  0 siblings, 0 replies; 16+ messages in thread
From: Steve Wise @ 2010-07-20 21:15 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, Sean Hefty, linux-rdma, Tung, Chien Tin


Steve Wise wrote:
> Jason Gunthorpe wrote:
>> On Tue, Jul 20, 2010 at 03:50:05PM -0500, Steve Wise wrote:
>>
>>  
>>>> I think if RDMACM manages the dst and lets the devices access it then
>>>> all the existing netdev infrastructure for poking at a dst should be
>>>> available to the device?
>>>>       
>>> Yes. But I'm not sure exactly how the logic I described previously
>>> for cxgb* would be handled in the design being ironed out here.
>>>     
>>
>> I'm thinking something like this..
>>
>> - The RDMA CM gets the dst from its route lookup locks it and stores
>>   it.
>> - Instead of doing a route lookup cxgb gets the dst from RDMA CM,
>>   locks it and stores it
>> - RDMA CM traps all notifications/etc and generates callback to cxgb
>>   to say the dst has changed.
>> - cxgb releases the old dst and grabs the new one, updates the HW,
>>   etc.
>>
>> Basically the same as what you have now, but all the logic to find
>> and monitor the dst moves to RDMA CM..
>>
>> redirects/etc are all handled by netdev/rdma cm and just generate the
>> same 'dst has changed' call back to cxgb..
>>
>> Or's user space notification stuff hooks the same callback to generate
>> a notification to userspace about the new dst.
>>
>> All the stuff you do now with the dst you can keep doing, you just
>> remove all the route lookup and netdev hooking to get the dst from
>> RDMA CM.
>>
>> Jason
>>   
>
> Sounds like this would work nicely.
>
>
> Steve.
>

Need to hear from Intel.  CCing Chien.



* Re: sense remote hardware address change by rdma-cm applications
       [not found]             ` <4C45E701.7030501-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  2010-07-20 18:46               ` Jason Gunthorpe
@ 2010-07-21 14:33               ` Or Gerlitz
       [not found]                 ` <4C47053B.3000802-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2010-07-21 14:33 UTC (permalink / raw)
  To: Steve Wise; +Cc: Jason Gunthorpe, Sean Hefty, linux-rdma

Steve Wise wrote:
> The cxgb3/4 drivers do not set IFF_NOARP and rely on ND being done as
> part of connection setup.  The driver will initiate ND if there isn't a
> neigh entry available at the time the iwarp driver tries to send a SYN or SYN/ACK.  

Okay, understood, thanks for clarifying this.

> The cxgb* drivers actually reference the neigh and dst structs until the
> offload connection is gone.  Also if the offloaded connection has
> problems transmitting (due to a L2 address change, for example), then
> the driver will initiate ND again by calling neigh_event_send().  See
> t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
> peer_abort() from iwch_cm.c when the HW tells us it's retransmitting too much.

In the general case of an rdma-cm consumer, e.g. IB RC based and/or UD
unicast based, we don't have such a feedback mechanism from the HW. As
such, I would draw the line at adopting into the rdma-cm the behavior of
referencing the neigh and dst structures until the connection is gone.
(Could you point me to the function/path in drivers/net/cxgb3/l2t.c
which does this? I wasn't sure.)

> What doesn't happen is active positive feedback during the connection to
> avoid NUD.  IE once the connection is setup, nobody calls dst_confirm()
> It is only called during connection setup/teardown.

I think we can live with that. This is similar to the case of an app
using UDP in a uni-directional manner between host A --> B, where the
NUD part of the network stack at host A has to issue timely probes to
validate the L2 address of host B. The only difference is that we have
the A --> B communication offloaded, and without keeping the ref the
neighbour and dst are eventually deleted; the proposed patch eliminates
this deletion.

Or.

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                                 ` <20100720205746.GL7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-07-20 21:03                                   ` Steve Wise
@ 2010-07-21 14:40                                   ` Or Gerlitz
  1 sibling, 0 replies; 16+ messages in thread
From: Or Gerlitz @ 2010-07-21 14:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Steve Wise, Sean Hefty, linux-rdma

Jason Gunthorpe wrote:
> I'm thinking something like this..
> - The RDMA CM gets the dst from its route lookup locks it and stores it.
> - Instead of doing a route lookup cxgb gets the dst from RDMA CM,
>   locks it and stores it
> - RDMA CM traps all notifications/etc and generates callback to cxgb
>   to say the dst has changed.
> - cxgb releases the old dst and grabs the new one, updates the HW, etc.
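
The four steps quoted above can be sketched in user space as below. This
is only an illustration under stated assumptions: every mock_* name is
hypothetical, struct mock_dst merely stands in for the kernel's dst entry
as a bare reference count, and nothing here is an existing RDMA CM or
cxgb interface.

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel dst entry: only a refcount and a tag. */
struct mock_dst {
	int refcnt;
	int id;
};

static void mock_dst_hold(struct mock_dst *d)    { d->refcnt++; }
static void mock_dst_release(struct mock_dst *d) { d->refcnt--; }

/* Callback a device driver would register to hear "the dst has changed". */
typedef void (*mock_dst_changed_cb)(void *drv_ctx, struct mock_dst *new_dst);

/* Hypothetical per-ID state owned by the CM. */
struct mock_cm_id {
	struct mock_dst *dst;
	mock_dst_changed_cb cb;
	void *drv_ctx;
};

/* Hypothetical driver-side state (the role cxgb plays in the proposal). */
struct mock_drv {
	struct mock_dst *dst;
	int hw_updates;
};

/* Step 1: the CM obtains the dst from its route lookup, holds and stores it. */
static void mock_cm_resolve(struct mock_cm_id *id, struct mock_dst *dst)
{
	mock_dst_hold(dst);
	id->dst = dst;
}

/* Step 2: the driver takes the dst from the CM instead of doing a lookup. */
static void mock_drv_attach(struct mock_drv *drv, struct mock_cm_id *id)
{
	drv->dst = id->dst;
	mock_dst_hold(drv->dst);
}

/* Step 4: the driver drops the old dst, grabs the new one, updates the HW. */
static void mock_drv_dst_changed(void *ctx, struct mock_dst *new_dst)
{
	struct mock_drv *drv = ctx;

	mock_dst_hold(new_dst);
	mock_dst_release(drv->dst);
	drv->dst = new_dst;
	drv->hw_updates++;	/* placeholder for reprogramming the HW */
}

/* Step 3: the CM traps the change and generates the callback. */
static void mock_cm_dst_update(struct mock_cm_id *id, struct mock_dst *new_dst)
{
	struct mock_dst *old = id->dst;

	mock_dst_hold(new_dst);
	id->dst = new_dst;
	if (id->cb)
		id->cb(id->drv_ctx, new_dst);
	mock_dst_release(old);
}
```

In this sketch the CM and the driver each hold their own reference, so
the old dst is only fully dropped once both sides have switched over.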


Jason,

I'm up for extending the rdma-cm address change event, on which an app
can decide whether to react or not. For example, the in-tree iser and
rds code treat this event the same as an arriving disconnection request,
which means the higher layer (e.g. the user space iscsi daemon in the
iser case) would try to re-connect. This has the advantage of
simplifying the ULP state-machine, so there's no need for special
handling of address-change; just treat it as a hint that re-connection
is needed.

The cxgb* code takes this deeper, as it handles L2 changes at the driver
level rather than as an event delivered to the ULP, which can optionally
address or ignore it.
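
A minimal sketch of that ULP-side policy, i.e. folding the address
change into the disconnect path. The mock_* enums and struct are
hypothetical stand-ins for illustration only; they are not the rdma-cm
event types or the actual iser/rds handlers.

```c
/* Hypothetical event and state names, local to this sketch. */
enum mock_cm_event {
	MOCK_EV_ESTABLISHED,
	MOCK_EV_DISCONNECTED,
	MOCK_EV_ADDR_CHANGE
};

enum mock_conn_state {
	MOCK_CONN_UP,
	MOCK_CONN_RECONNECT_NEEDED
};

struct mock_conn {
	enum mock_conn_state state;
};

/*
 * Returns 1 when the higher layer should tear down and re-connect,
 * 0 otherwise. An address change shares the disconnect case, so the
 * state machine needs no dedicated address-change state.
 */
static int mock_ulp_event_handler(struct mock_conn *c, enum mock_cm_event ev)
{
	switch (ev) {
	case MOCK_EV_ESTABLISHED:
		c->state = MOCK_CONN_UP;
		return 0;
	case MOCK_EV_ADDR_CHANGE:	/* treated as a hint to re-connect */
	case MOCK_EV_DISCONNECTED:
		c->state = MOCK_CONN_RECONNECT_NEEDED;
		return 1;
	}
	return 0;
}
```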

Or.

* Re: sense remote hardware address change by rdma-cm applications
       [not found]                 ` <4C47053B.3000802-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-07-21 15:48                   ` Steve Wise
       [not found]                     ` <4C4716D8.2040902-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2010-07-21 15:48 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Jason Gunthorpe, Sean Hefty, linux-rdma

Or Gerlitz wrote:
> Steve Wise wrote:
>   
>> The cxgb3/4 drivers do not set IFF_NOARP and rely on ND being done as
>> part of connection setup.  The driver will initiate ND if there isn't a
>> neigh entry available at the time the iwarp driver tries to send a SYN or SYN/ACK.  
>>     
>
> okay, understood, thanks for clarifying this out.
>
>   
>> The cxgb* drivers actually reference the neigh and dst structs until the
>> offload connection is gone.  Also if the offloaded connection has
>> problems transmitting (due to a L2 address change, for example), then
>> the driver will initiate ND again by calling neigh_event_send().  See
>> t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
>> peer_abort() from iwch_cm.c when the HW tells us it's retransmitting too much.
>>     
>
> In the general case of rdma-cm consumer, e.g IB RC based and/or UD unicast based, 
> we don't have such feedback mechanism from the HW. As such, I would draw the line here around adopting into the rdma-cm the behavior of referencing the neigh and dst structures until the connection is gone (could you point on the func/path in drivers/net/cxgb3/l2t.c which does this? i wasn't sure).
>
>   


Actually the dst entry ref/deref is really done in iw_cxgb3.  The 
dst/neigh entries are referenced in iwch_connect() and pass_accept_req() 
by calling ip_route_output() via find_route().  They are released in 
__free_ep() when the endpoint is finally freed after connection shutdown.
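
As a rough illustration of that lifetime (not the iw_cxgb3 code itself),
the route reference can be tied to the endpoint's own reference count so
that it is dropped only when the last endpoint reference goes away. All
mock_* names below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-ins: a route entry and an endpoint, both refcounted. */
struct mock_route {
	int refcnt;
};

struct mock_ep {
	int refcnt;
	struct mock_route *rt;
};

/* Connect/accept time: look up the route and keep a reference on it. */
static void mock_ep_connect(struct mock_ep *ep, struct mock_route *rt)
{
	rt->refcnt++;
	ep->rt = rt;
	ep->refcnt = 1;
}

static void mock_ep_get(struct mock_ep *ep)
{
	ep->refcnt++;
}

/* The route reference is released only when the endpoint is finally freed. */
static void mock_ep_put(struct mock_ep *ep)
{
	if (--ep->refcnt == 0) {
		ep->rt->refcnt--;
		ep->rt = NULL;
	}
}
```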


The L2T code deals with maintaining the HW L2 entries and dealing with 
neighbour change events from the kernel.



Steve.


* Re: sense remote hardware address change by rdma-cm applications
       [not found]                     ` <4C4716D8.2040902-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-22  8:18                       ` Or Gerlitz
  0 siblings, 0 replies; 16+ messages in thread
From: Or Gerlitz @ 2010-07-22  8:18 UTC (permalink / raw)
  To: Steve Wise; +Cc: Jason Gunthorpe, Sean Hefty, linux-rdma

Steve Wise wrote:
> Actually the dst entry ref/deref is really done in iw_cxgb3.  The
> dst/neigh entries are referenced in iwch_connect() and pass_accept_req()
> by calling ip_route_output() via find_route(). 

okay, I see now (more or less) how this part works, thanks for the pointer.

Or.

end of thread, other threads:[~2010-07-22  8:18 UTC | newest]

Thread overview: 16+ messages
2010-07-19 21:42 sense remote hardware address change by rdma-cm applications Or Gerlitz
     [not found] ` <AANLkTimmWiNqHJIqSEKbY-X6mSx6zA19p__JDYPEmp8b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-20  0:14   ` Jason Gunthorpe
     [not found]     ` <20100720001436.GH7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-20  7:25       ` Or Gerlitz
     [not found]         ` <4C454F80.1060808-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
2010-07-20 17:22           ` Jason Gunthorpe
2010-07-20 18:12           ` Steve Wise
     [not found]             ` <4C45E701.7030501-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-20 18:46               ` Jason Gunthorpe
     [not found]                 ` <20100720184620.GJ7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-20 19:20                   ` Steve Wise
     [not found]                     ` <4C45F6F5.6050008-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-20 20:30                       ` Jason Gunthorpe
     [not found]                         ` <20100720203044.GK7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-20 20:50                           ` Steve Wise
     [not found]                             ` <4C460BFD.5010707-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-20 20:57                               ` Jason Gunthorpe
     [not found]                                 ` <20100720205746.GL7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-20 21:03                                   ` Steve Wise
     [not found]                                     ` <4C460F08.7030304-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-20 21:15                                       ` Steve Wise
2010-07-21 14:40                                   ` Or Gerlitz
2010-07-21 14:33               ` Or Gerlitz
     [not found]                 ` <4C47053B.3000802-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
2010-07-21 15:48                   ` Steve Wise
     [not found]                     ` <4C4716D8.2040902-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-22  8:18                       ` Or Gerlitz
