Advice needed on IP-over-InfiniBand driver

netdev.vger.kernel.org archive mirror

* Advice needed on IP-over-InfiniBand driver
@ 2004-09-19  4:08 Roland Dreier
  2004-09-19 21:01 ` David S. Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Roland Dreier @ 2004-09-19  4:08 UTC (permalink / raw)
  To: netdev

Hi, I'm looking for guidance on the "right" way to implement an
IP-over-InfiniBand (IPoIB) driver for Linux.  Right now, we have
something that works, but we are cleaning it up for upstream
submission (along with the rest of the OpenIB code).  IPoIB is a
network device driver (in the sense of "struct net_device"), but there
are a few complications beyond the usual Ethernet NIC case, which I'll
describe below.  For full details you can look at the drafts from the
IETF ipoib working group: http://ietf.org/html.charters/ipoib-charter.html
If you want to look at the existing code, the best place to look is in
my Subversion branch, specifically

https://openib.org/svn/gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib

for IPoIB code.

IPoIB uses the usual ARP protocol for IPv4 (everything is analogous
for IPv6 neighbor discovery but I'll focus on IPv4 for simplicity).  A
hardware address is 20 bytes: 1 reserved byte, 3 bytes of queue pair
number (QPN) and 16 bytes of global identifier (GID).  ARP works as
specified in RFC 826 (with hardware type 32).
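
The 20-byte layout described above can be sketched in user-space C as
follows.  This is only an illustration: the struct and function names
are hypothetical (not taken from the OpenIB code), and storing the QPN
in big-endian order is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20   /* 1 reserved + 3 QPN + 16 GID bytes */

/* Hypothetical decoded form of an IPoIB hardware address. */
struct ipoib_hwaddr {
    uint32_t qpn;       /* 24-bit queue pair number */
    uint8_t  gid[16];   /* 128-bit global identifier */
};

/* Pack into the 20-byte layout; big-endian QPN is an assumption. */
static void ipoib_hwaddr_pack(const struct ipoib_hwaddr *a,
                              uint8_t out[IPOIB_HWADDR_LEN])
{
    assert(a->qpn <= 0xffffff);          /* QPN must fit in 3 bytes */
    out[0] = 0;                          /* reserved byte */
    out[1] = (uint8_t)(a->qpn >> 16);
    out[2] = (uint8_t)(a->qpn >> 8);
    out[3] = (uint8_t)a->qpn;
    memcpy(out + 4, a->gid, 16);
}

/* Inverse of ipoib_hwaddr_pack(); the reserved byte is ignored. */
static void ipoib_hwaddr_unpack(const uint8_t in[IPOIB_HWADDR_LEN],
                                struct ipoib_hwaddr *a)
{
    a->qpn = ((uint32_t)in[1] << 16) | ((uint32_t)in[2] << 8) | in[3];
    memcpy(a->gid, in + 4, 16);
}
```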

The wrinkle is that while this 20 byte address is enough to uniquely
identify a destination, it is not enough to actually send a packet
there.  Once an ARP reply comes back, the IPoIB driver must then send
a query to the IB subnet manager (which is a remote server on the IB
fabric) and obtain a path to the destination GID -- a path is a 2 byte
local identifier (LID) and a few other pieces of information.  Once we
have the path, we give that to the IB hardware and get an address
handle, which can finally be used to send a packet.
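
The chain from GID to path to address handle might be modelled roughly
as below.  Everything here is a stand-in: the types, the callback
signatures and the fake lookups are hypothetical, and the "sl" field is
just one assumed example of the "few other pieces of information" in a
path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path record: the LID plus other path attributes. */
struct ib_path {
    uint16_t dlid;   /* 2-byte destination local identifier */
    uint8_t  sl;     /* assumed example of an extra attribute */
};

typedef uint32_t ib_ah_t;   /* hypothetical opaque address handle */

/* Stand-ins for the subnet manager query and the hardware call. */
typedef bool (*sm_query_fn)(const uint8_t gid[16], struct ib_path *out);
typedef bool (*hw_create_ah_fn)(const struct ib_path *path, ib_ah_t *out);

/* Turn the GID learned via ARP into something that can send. */
static bool gid_to_ah(const uint8_t gid[16], sm_query_fn query,
                      hw_create_ah_fn create_ah, ib_ah_t *ah)
{
    struct ib_path path;

    assert(ah != NULL);
    if (!query(gid, &path))        /* ask the subnet manager for a path */
        return false;
    return create_ah(&path, ah);   /* hand the path to the hardware */
}

/* Fake lookups for illustration: the LID is the last GID byte. */
static bool fake_sm_query(const uint8_t gid[16], struct ib_path *out)
{
    if (gid[15] == 0)
        return false;              /* pretend this GID is unknown */
    out->dlid = gid[15];
    out->sl = 0;
    return true;
}

static bool fake_create_ah(const struct ib_path *path, ib_ah_t *out)
{
    *out = 0x1000u + path->dlid;
    return true;
}
```

The point of the sketch is that the 20-byte address only feeds the first
call; what the send path ultimately needs is the handle.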

This means there are a few things we would like to be able to do in
the IPoIB driver.  First of all, it would be good to be able to hook
into the ARP code so that we add the GID->path lookup after the normal
ARP (and have the kernel keep queuing packets until that lookup
completes).  Also, once the whole process is complete and we have an
address handle, the driver doesn't actually care about the 20-byte
destination address when it's getting packets to send -- it just needs
the address handle.  So it would be nice to have some way to stash
that in struct neighbour and get it back in our hard_header method (rather
than having to keep our own cache mapping 20-byte addresses back to
address handles).

It seems that some combination of clever neigh_setup and
hard_header_cache/header_cache_update methods should be enough to make
this work, but I don't know enough about the network stack to see how
to do it.

I'd really appreciate guidance on how to implement this, and I'm happy
to answer any questions about the IPoIB architecture.

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19  4:08 Advice needed on IP-over-InfiniBand driver Roland Dreier
@ 2004-09-19 21:01 ` David S. Miller
  2004-09-19 21:19   ` jamal
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: David S. Miller @ 2004-09-19 21:01 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev

You probably want to be editing net/ipv4/arp.c and testing
it about InfiniBand.  Keep neighbour entries in the unresolved
state until both transitions are made:

1) obtain 20 byte address
2) get response from IB subnet manager

Only store the destination GID in the neighbour entry,
and only mark the neighbour entry as resolved once
#2 above completes successfully.
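
A toy model of that rule, assuming a small hold queue for packets sent
while the entry is unresolved.  The names and the queue limit are
hypothetical; this is not how net/ipv4/arp.c is structured, only a
sketch of the two-step gating.

```c
#include <assert.h>
#include <stdbool.h>

enum ib_neigh_state { IB_NEIGH_INCOMPLETE, IB_NEIGH_RESOLVED };

#define IB_NEIGH_QUEUE_MAX 3   /* assumed hold-queue limit */

struct ib_neigh {
    bool have_gid;    /* step 1: ARP reply gave us the 20-byte address */
    bool have_path;   /* step 2: subnet manager answered */
    int  queued;      /* packets held while unresolved */
};

/* Resolved only when BOTH steps have completed. */
static enum ib_neigh_state ib_neigh_get_state(const struct ib_neigh *n)
{
    return (n->have_gid && n->have_path) ? IB_NEIGH_RESOLVED
                                         : IB_NEIGH_INCOMPLETE;
}

/* Returns true if the packet may go out now; otherwise it is held
 * (or silently dropped once the hold queue is full). */
static bool ib_neigh_output(struct ib_neigh *n)
{
    if (ib_neigh_get_state(n) == IB_NEIGH_RESOLVED)
        return true;
    if (n->queued < IB_NEIGH_QUEUE_MAX)
        n->queued++;
    return false;
}

static void ib_neigh_arp_reply(struct ib_neigh *n)
{
    n->have_gid = true;   /* still unresolved: path query comes next */
}

/* Returns the number of held packets released for transmission. */
static int ib_neigh_path_reply(struct ib_neigh *n)
{
    int flushed;

    assert(n->have_gid);  /* the path query is only issued after ARP */
    n->have_path = true;
    flushed = n->queued;
    n->queued = 0;
    return flushed;
}
```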


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19 21:01 ` David S. Miller
@ 2004-09-19 21:19   ` jamal
  2004-09-20  2:34     ` David S. Miller
  2004-09-20  4:49     ` Roland Dreier
  2004-09-20  4:42   ` Roland Dreier
  2004-09-28  4:41   ` Roland Dreier
  2 siblings, 2 replies; 14+ messages in thread
From: jamal @ 2004-09-19 21:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: Roland Dreier, netdev

On Sun, 2004-09-19 at 17:01, David S. Miller wrote:
> You probably want to be editing net/ipv4/arp.c and testing
> it about InfiniBand.  Keep neighbour entries in the unresolved
> state until both transitions are made:
> 
> 1) obtain 20 byte address
> 2) get response from IB subnet manager
> 
> Only store the destination GID in the neighbour entry,
> and only mark the neighbour entry as resolved once
> #2 above completes successfully.

Probably just easier to have his own private tables holding references to
neighbour entries instead of polluting the neighbour tables.
He listens for ARP events, and on a state transition to/from the reachable
state he queries his remote manager.
Curious, though, whether ARP still works even when that "path" thing hasn't
been resolved.
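
One way to picture the private-table idea, in user-space C.  The event
callback is purely hypothetical: nothing here identifies an existing
hook for neighbour state changes, and the fixed-size table is only for
brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SHADOW_SLOTS 8   /* assumed table size */

enum neigh_event { NEIGH_BECAME_REACHABLE, NEIGH_LEFT_REACHABLE };

struct shadow_entry {
    const void *neigh;     /* reference to the stack's neighbour entry;
                              NULL marks a free slot */
    bool path_pending;     /* a path query would be outstanding */
};

static struct shadow_entry shadow[SHADOW_SLOTS];

/* Find the slot for a neighbour (or a free slot when neigh is NULL). */
static struct shadow_entry *shadow_find(const void *neigh)
{
    for (size_t i = 0; i < SHADOW_SLOTS; i++)
        if (shadow[i].neigh == neigh)
            return &shadow[i];
    return NULL;
}

/* Hypothetical callback invoked on each neighbour state change.
 * Returns false only if the private table is full. */
static bool shadow_on_event(const void *neigh, enum neigh_event ev)
{
    struct shadow_entry *e;

    assert(neigh != NULL);
    e = shadow_find(neigh);
    if (ev == NEIGH_BECAME_REACHABLE) {
        if (!e)
            e = shadow_find(NULL);   /* take the first free slot */
        if (!e)
            return false;
        e->neigh = neigh;
        e->path_pending = true;      /* the path query would start here */
        return true;
    }
    if (e) {                         /* left reachable: drop our state */
        e->neigh = NULL;
        e->path_pending = false;
    }
    return true;
}
```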

cheers,
jamal


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19 21:19   ` jamal
@ 2004-09-20  2:34     ` David S. Miller
  2004-09-20  4:51       ` Roland Dreier
  2004-09-20  4:49     ` Roland Dreier
  1 sibling, 1 reply; 14+ messages in thread
From: David S. Miller @ 2004-09-20  2:34 UTC (permalink / raw)
  To: hadi; +Cc: roland, netdev

On 19 Sep 2004 17:19:19 -0400
jamal <hadi@cyberus.ca> wrote:

> Probably just easier to have his own private tables holding reference to
> neighbor entries instead of polluting the neighbor tables.
> Listens to ARP events - on state transition to/from reachable state he
> queries his remote manager. 
> Curious though if ARP still works even when that "path" thing hasn't been
> resolved.

It sounds like a two-stage thing, the first stage lets you talk
on the subnet and get to the IB subnet manager, and the second
stage lets you actually speak IP.  My understanding, from his
description, is that once the second stage part is complete you
don't need the first stage address information at all.

The reason I suggested the scheme the way that I did was so
that the hh_cache can work fully for IP over IB.


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19 21:01 ` David S. Miller
  2004-09-19 21:19   ` jamal
@ 2004-09-20  4:42   ` Roland Dreier
  2004-09-28  4:41   ` Roland Dreier
  2 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2004-09-20  4:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

    David> You probably want to be editing net/ipv4/arp.c and testing

"teaching" not "testing" I assume :)

    David> it about Infiniband.  Keep neighbour entries in the
    David> unresolved state until both transitions are made:

    David> 1) obtain 20 byte address
    David> 2) get response from IB subnet manager

    David> Only store the destination GID in the neighbour entry, and
    David> only mark the neighbour entry as resolved once #2 above
    David> completes successfully.

Hmm... it looks like the place you're telling me to hook into is in
arp_process() right before the neigh state gets set to NUD_REACHABLE.
But where should I stick the path information that I get back from the
subnet manager once the query finishes?  And how does the path get
passed into the device driver to actually send an skb?

(Sorry to be so dense but I'm afraid my little brain needs things
broken down into bite-sized pieces...)

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19 21:19   ` jamal
  2004-09-20  2:34     ` David S. Miller
@ 2004-09-20  4:49     ` Roland Dreier
  2004-09-21 11:35       ` jamal
  1 sibling, 1 reply; 14+ messages in thread
From: Roland Dreier @ 2004-09-20  4:49 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev

    jamal> Probably just easier to have his own private tables holding
    jamal> reference to neighbor entries instead of polluting the
    jamal> neighbor tables.  Listens to ARP events - on state
    jamal> transition to/from reachable state he queries his remote
    jamal> manager.

This does seem neater, but I don't know how to implement it.  How does
one hook into ARP events?

    jamal> Curious though if ARP still works even when that "path"
    jamal> thing hasn't been resolved.

ARP works because we can send broadcasts even without a path to a
specific destination.  (I'm leaving out the details of how IP
broadcast gets mapped to InfiniBand multicast)  When the system with
the IP we're looking for receives a broadcast ARP, it can use the HW
address in the ARP request to look up a path, so it can send an ARP
reply.

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-20  2:34     ` David S. Miller
@ 2004-09-20  4:51       ` Roland Dreier
  0 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2004-09-20  4:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, netdev

    David> It sounds like a two-stage thing, the first stage lets you
    David> talk on the subnet and get to the IB subnet manager, and
    David> the second stage lets you actually speak IP.  My
    David> understanding, from his description, is that once the
    David> second stage part is complete you don't need the first stage
    David> address information at all.

Pretty much... ARP gives you a unique identifier for the port with the
IP address you're looking for.  Then you need to take that unique
identifier and ask the subnet manager what path to use to get from
your local port to that destination port.  (Talking to the subnet
manager uses an InfiniBand native, non-IP mechanism -- one of the
first things the subnet manager does is go over the whole fabric and
tell each port what path to use to send it queries)

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-20  4:49     ` Roland Dreier
@ 2004-09-21 11:35       ` jamal
  2004-09-21 15:23         ` Roland Dreier
  0 siblings, 1 reply; 14+ messages in thread
From: jamal @ 2004-09-21 11:35 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David S. Miller, netdev

On Mon, 2004-09-20 at 00:49, Roland Dreier wrote:
>     jamal> Probably just easier to have his own private tables holding
>     jamal> reference to neighbor entries instead of polluting the
>     jamal> neighbor tables.  Listens to ARP events - on state
>     jamal> transition to/from reachable state he queries his remote
>     jamal> manager.
> 
> This does seem neater, but I don't know how to implement it.  How does
> one hook into ARP events?

Are you doing the path manager from user space or kernel?
It's easy to generate netlink events to user space; you could then
have the manager create the path from user space.

>     jamal> Curious though if ARP still works even when that "path"
>     jamal> thing hasn't been resolved.
> 
> ARP works because we can send broadcasts even without a path to a
> specific destination.  (I'm leaving out the details of how IP
> broadcast gets mapped to InfiniBand multicast)  When the system with
> the IP we're looking for receives a broadcast ARP, it can use the HW
> address in the ARP request to look up a path, so it can send an ARP
> reply.

So the broadcast send path essentially exists already.

Like I said above, it depends on how you do this remote manager; doing
it in the kernel would be a little more involved.

cheers,
jamal


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-21 11:35       ` jamal
@ 2004-09-21 15:23         ` Roland Dreier
  0 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2004-09-21 15:23 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev

    jamal> Are you doing the path manager from user space or kernel?
    jamal> It's easy to generate netlink events to user space; you
    jamal> could then have the manager create path from user space.

The subnet manager (== big application that assigns paths to everyone
on a fabric, etc) will be in user space running on a single node.  But
I would prefer to have the IPoIB driver be contained within the kernel
to avoid complications like needing to start a userspace helper from
an initrd for NFS root, etc.  Sending path queries to the subnet
manager is pretty simple so I don't think there's an issue with having
that piece of code in the kernel.

Also, if the path record lookup is done in userspace, it seems the
driver will be passed 20-byte hardware addresses and need to look up
the path in some shadow ARP table for every packet, which doesn't seem
very efficient.
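
To make the per-packet cost concrete, here is a deliberately naive
shadow table keyed by the 20-byte address.  The table layout, the
"ah_id" field and the linear scan are all assumptions for illustration;
a real table would presumably hash, but it would still pay for a keyed
lookup on every packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HWADDR_LEN 20

struct shadow_arp_entry {
    uint8_t  hwaddr[HWADDR_LEN];
    uint32_t ah_id;               /* hypothetical address handle id */
};

/* Linear scan; *compares reports how many 20-byte comparisons it cost. */
static const struct shadow_arp_entry *
shadow_arp_lookup(const struct shadow_arp_entry *tbl, size_t n,
                  const uint8_t hwaddr[HWADDR_LEN], size_t *compares)
{
    assert(compares != NULL);
    *compares = 0;
    for (size_t i = 0; i < n; i++) {
        (*compares)++;
        if (memcmp(tbl[i].hwaddr, hwaddr, HWADDR_LEN) == 0)
            return &tbl[i];
    }
    return NULL;
}
```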

I'd like to understand David's approach better, since it seems he
knows how to avoid that.  Unfortunately I don't understand the
hard_header_cache() etc. methods well enough for his original
explanation to make sense to me.  Hopefully he'll have time to explain
in a little more detail...

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-19 21:01 ` David S. Miller
  2004-09-19 21:19   ` jamal
  2004-09-20  4:42   ` Roland Dreier
@ 2004-09-28  4:41   ` Roland Dreier
  2004-09-28  4:52     ` David S. Miller
  2 siblings, 1 reply; 14+ messages in thread
From: Roland Dreier @ 2004-09-28  4:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

I've been mulling over the previous advice I got and reading over the
networking code, and I hope I have some more intelligent questions
now.  It seems I really have two somewhat independent issues to
resolve with respect to IP-over-IB address resolution:

 - IPoIB adds a second InfiniBand-specific path record lookup after
   the normal ARP/ND lookup.  I think I have a handle on how this can
   be added to net/ipv4/arp.c.

 - For InfiniBand, the layer 2 header is built and parsed by the
   hardware without a chance for software to see it.  In fact, once we
   have completed the 2 stage resolution as described above, the
   network driver has to pass this path information to the IB hardware
   and get an "address handle" back, which it uses to actually send
   packets.  It seems the existing hard_header, hard_header_cache and
   header_cache_update are not really applicable.

   My ideal solution would be some way for the driver to have the packets
   passed to its hard_start_xmit method with the address handle in the
   skb->cb field (or another field if it's acceptable/desirable to add
   a field for this to struct sk_buff -- I notice a number of
   #ifdef'ed fields in there already, but I'm not sure if that type of
   thing is just old cruft or still OK).

   Also it would be nice if that address handle could be taken
   directly from the struct neighbour -- after all, we should be able
   to get it without requiring the driver to do any lookup from HW
   address to address handle.

   Finally, address handles need to be freed eventually, and I
   don't see a way for a driver to find out when an ARP entry is being
   destroyed.  Am I missing something, or is this something else I'll
   need to add?  What would be an acceptable way to add this to the
   networking code?
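
The missing piece in the last item could look something like a
per-entry destructor callback.  This is a sketch of the idea only: the
"destructor" field and all demo_* names are hypothetical, and nothing
here claims the neighbour layer offers such a hook.

```c
#include <assert.h>
#include <stddef.h>

struct demo_neigh;
typedef void (*neigh_destructor_fn)(struct demo_neigh *);

struct demo_neigh {
    void *drv_priv;                   /* driver data, e.g. an address handle */
    neigh_destructor_fn destructor;   /* hypothetical teardown hook */
};

static int live_handles;   /* address handles created but not yet freed */

/* Driver side: release the handle when the entry goes away. */
static void demo_free_ah(struct demo_neigh *n)
{
    if (n->drv_priv) {
        live_handles--;
        n->drv_priv = NULL;
    }
}

/* Driver side: stash the handle and register for teardown. */
static void demo_attach_ah(struct demo_neigh *n, void *ah)
{
    assert(n->drv_priv == NULL);      /* one handle per entry */
    n->drv_priv = ah;
    n->destructor = demo_free_ah;
    live_handles++;
}

/* Stack side: what the neighbour layer would do on discarding an entry. */
static void demo_neigh_destroy(struct demo_neigh *n)
{
    if (n->destructor)
        n->destructor(n);
}
```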

I'd really appreciate another dose of cluestick...

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-28  4:41   ` Roland Dreier
@ 2004-09-28  4:52     ` David S. Miller
  2004-09-30 18:41       ` Roland Dreier
  0 siblings, 1 reply; 14+ messages in thread
From: David S. Miller @ 2004-09-28  4:52 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev

On Mon, 27 Sep 2004 21:41:12 -0700
Roland Dreier <roland@topspin.com> wrote:

> I've been mulling over the previous advice I got and reading over the
> networking code, and I hope I have some more intelligent questions
> now.  It seems I really have two somewhat independent issues to
> resolve with respect to IP-over-IB address resolution:
> 
>  - IPoIB adds a second InfiniBand-specific path record lookup after
>    the normal ARP/ND lookup.  I think I have a handle on how this can
>    be added to net/ipv4/arp.c.

I think you might learn something by having a look at
what net/atm/clip.c is doing, it creates its own neighbour
layer for CLIP ATM neighbours.  It is in a similar boat to
your IPoIB stuff.


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-28  4:52     ` David S. Miller
@ 2004-09-30 18:41       ` Roland Dreier
  2004-09-30 21:21         ` David Stevens
  0 siblings, 1 reply; 14+ messages in thread
From: Roland Dreier @ 2004-09-30 18:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, openib-general

    David> I think you might learn something by having a look at what
    David> net/atm/clip.c is doing, it creates its own neighbour
    David> layer for CLIP ATM neighbours.  It is in a similar boat to
    David> your IPoIB stuff.

Thanks, this suggestion was very helpful.  I think I'm making
progress.  Now I know my next question :)

CLIP ATM is a little different from IPoIB in that it completely
replaces the ARP layer with its own ARP daemon.  For IPoIB I don't
want to reinvent the ARP and ND code -- I just want to add a secondary
lookup after the response comes back.  I think I have an idea of how
to do that and then stash the information in the struct neighbour, so
that my hard_start_xmit method can get it from skb->dst (a la clip.c).

However, it seems that broadcast ARP packets have skb->dst == NULL.
Is it safe for me to assume that packets with skb->dst == NULL are
broadcast packets?  Will multicast packets have a non-NULL dst?
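
The lookup in question, sketched with stand-in types (the demo_*
structs only mimic the skb -> dst -> neighbour chain and are not the
kernel's definitions).  The sketch reports "no dst" as its own outcome
and deliberately leaves open what that case means, since that is the
question being asked.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel structures; field names are illustrative. */
struct demo_neigh { const void *ah; };            /* stashed address handle */
struct demo_dst   { struct demo_neigh *neighbour; };
struct demo_skb   { struct demo_dst *dst; };

enum ah_lookup {
    AH_FOUND,        /* *ah is valid */
    AH_NO_DST,       /* nothing to follow; meaning left to the caller */
    AH_UNRESOLVED    /* neighbour exists but has no handle yet */
};

static enum ah_lookup demo_skb_ah(const struct demo_skb *skb,
                                  const void **ah)
{
    assert(skb != NULL && ah != NULL);
    if (!skb->dst || !skb->dst->neighbour)
        return AH_NO_DST;
    if (!skb->dst->neighbour->ah)
        return AH_UNRESOLVED;
    *ah = skb->dst->neighbour->ah;
    return AH_FOUND;
}
```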

Thanks,
  Roland


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-30 18:41       ` Roland Dreier
@ 2004-09-30 21:21         ` David Stevens
  2004-09-30 21:48           ` Roland Dreier
  0 siblings, 1 reply; 14+ messages in thread
From: David Stevens @ 2004-09-30 21:21 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, David S. Miller, openib-general

> However, it seems that broadcast ARP packets have skb->dst == NULL.
> Is it safe for me to assume that packets with skb->dst == NULL are
> broadcast packets?  Will multicast packets have a non-NULL dst?

I think it would be a mistake to use skb->dst as a flag for unicast
or not. Even if it is correct in all cases you care about now (I don't
know either way), it would be a hidden dependency with high potential
to break something eventually.

                                                +-DLS


* Re: Advice needed on IP-over-InfiniBand driver
  2004-09-30 21:21         ` David Stevens
@ 2004-09-30 21:48           ` Roland Dreier
  0 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2004-09-30 21:48 UTC (permalink / raw)
  To: David Stevens; +Cc: netdev, David S. Miller, openib-general

    David> I think it would be a mistake to use skb->dst as a flag for
    David> unicast or not. Even if it is correct in all cases you care
    David> about now (I don't know either way), it would be a hidden
    David> dependency with high potential to break something
    David> eventually.

That's kind of what I thought.  But since my packets have no L2 header
in them, I don't know what hard_start_xmit can look at other than
skb->dst.

I guess hard_header could put some info in skb->cb -- is cb available
for net device use between hard_header and hard_start_xmit, or does
someone else still own it?
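
For concreteness, the idea would be something like the following, with
an explicit "kind" tag so that transmit never has to infer unicast
versus multicast from skb->dst.  The scratch-area size, the struct
layout and the demo_* names are assumptions, and the sketch does not
answer whether cb really is free for driver use at that point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CB_SIZE 40   /* assumed size of the per-packet scratch area */

struct demo_skb { unsigned char cb[DEMO_CB_SIZE]; };

enum demo_tx_kind { DEMO_TX_UNICAST = 1, DEMO_TX_MULTICAST = 2 };

/* What the header-building step would leave for the transmit step. */
struct demo_tx_cb {
    uint32_t kind;    /* enum demo_tx_kind, stored explicitly */
    uint32_t ah_id;   /* hypothetical address handle id */
};

/* The private data must fit in the scratch area. */
_Static_assert(sizeof(struct demo_tx_cb) <= DEMO_CB_SIZE,
               "demo_tx_cb does not fit in cb");

/* Header-building side: record the destination kind and handle. */
static void demo_hard_header(struct demo_skb *skb, enum demo_tx_kind kind,
                             uint32_t ah_id)
{
    struct demo_tx_cb cb = { .kind = (uint32_t)kind, .ah_id = ah_id };

    assert(skb != NULL);
    memcpy(skb->cb, &cb, sizeof(cb));
}

/* Transmit side: read back what the header step recorded. */
static struct demo_tx_cb demo_xmit_read(const struct demo_skb *skb)
{
    struct demo_tx_cb cb;

    assert(skb != NULL);
    memcpy(&cb, skb->cb, sizeof(cb));
    return cb;
}
```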

Thanks,
  Roland

