netdev.vger.kernel.org archive mirror
From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>,
	davem@davemloft.net, roland@kernel.org, netdev@vger.kernel.org,
	ali@mellanox.com, sean.hefty@intel.com,
	Erez Shitrit <erezsh@mellanox.co.il>
Subject: Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
Date: Sun, 5 Aug 2012 21:50:31 +0300
Message-ID: <20120805185031.GA18640@redhat.com>
In-Reply-To: <87boitz044.fsf@xmission.com>

On Thu, Aug 02, 2012 at 10:15:23AM -0700, Eric W. Biederman wrote:
> Or Gerlitz <ogerlitz@mellanox.com> writes:
> 
> > From: Erez Shitrit <erezsh@mellanox.co.il>
> >
> > The eipoib driver provides a standard Ethernet netdevice over
> > the InfiniBand IPoIB interface.
> >
> > Some services can run only on top of Ethernet L2 interfaces, and cannot be
> > bound to an IPoIB interface. With this new driver, these services can run
> > seamlessly.
> 
> Do I read this code correctly that what you are doing is not tunneling
> ethernet over IB but instead you are removing an ethernet header and
> replacing it with an IB header?
> 
> Do I also read this code correctly if you can't find your destination
> mac address in your "neighbor table" you do a normal IPoIB arp
> for the infiniband GUID?
> 
> Do I read this right that if presented with a non-IPv4 or ARP packet
> this code will do something undefined and unpredictable?
> 
> Maybe this makes some sense but just skimming it looks like you
> are trying to force a square peg into a round hole resulting in
> some weird code and some very weird maintainability issues.
> 
> I am honestly surprised at this approach.  I would think it would be
> faster and simpler to run an IB queue pair directly to the hypervisor or
> possibly even the guest operating system bypassing the kernel and doing
> all of this translation in userspace.
> 
> Eric

I'm on vacation and have not looked at the patches; at Erez' request,
I'm just reacting to the presentation and the discussion.

Bypassing the kernel has its own set of issues, not the
least of which is the need to lock all of guest memory which breaks
overcommit. Running an IB queue pair directly to the hypervisor
will also break live migration.

Another problem with exposing IB to guests is that IB addresses
(combinations of LIDs, GIDs and QPNs), to the best of my knowledge,
do not support soft hardware address setting, which interferes
with live migration.

So it seems that a sane solution would involve an extra level of
indirection, with guest addresses being translated to host IB addresses.

As long as you do this, maybe using an ethernet frame format makes
sense.
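
To make the indirection concrete, here is a minimal sketch of such a
translation table. It is an illustration only, not taken from the patches:
the names (struct ib_addr, xlate_set, xlate_lookup), the fixed table size
and the choice of LID plus QPN as the host-side address are all assumptions.
The point is that the guest-visible MAC stays fixed while the host IB
address behind it can be rewritten, for example after a migration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_GUESTS 16	/* arbitrary size for the sketch */

/* Hypothetical host-side IB address; real IB addressing has more fields. */
struct ib_addr {
	uint16_t lid;	/* local identifier, assigned by the subnet manager */
	uint32_t qpn;	/* queue pair number, only 24 bits are significant */
};

struct xlate_entry {
	uint8_t guest_mac[6];	/* address the guest sees; never changes */
	struct ib_addr host;	/* where the guest currently lives */
	int in_use;
};

static struct xlate_entry xlate_table[MAX_GUESTS];

/* Insert or update a mapping. Returns 0 on success, -1 if the table is full. */
static int xlate_set(const uint8_t mac[6], struct ib_addr host)
{
	struct xlate_entry *free_slot = NULL;

	assert((host.qpn >> 24) == 0);
	for (int i = 0; i < MAX_GUESTS; i++) {
		struct xlate_entry *e = &xlate_table[i];

		if (e->in_use && memcmp(e->guest_mac, mac, 6) == 0) {
			e->host = host;	/* e.g. guest moved to another host */
			return 0;
		}
		if (!e->in_use && !free_slot)
			free_slot = e;
	}
	if (!free_slot)
		return -1;
	memcpy(free_slot->guest_mac, mac, 6);
	free_slot->host = host;
	free_slot->in_use = 1;
	return 0;
}

/* Returns the current host IB address for a guest MAC, or NULL if unknown. */
static const struct ib_addr *xlate_lookup(const uint8_t mac[6])
{
	for (int i = 0; i < MAX_GUESTS; i++)
		if (xlate_table[i].in_use &&
		    memcmp(xlate_table[i].guest_mac, mac, 6) == 0)
			return &xlate_table[i].host;
	return NULL;
}
```

With this shape, migrating a guest amounts to calling xlate_set() again
with the same MAC and the new host's address; nothing inside the guest
has to change.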

So much for the things that make sense. Here are some that don't, to me:

- Is a pdf presentation all you have in terms of documentation?
  We are talking communication protocols here - I would expect a
  proper spec, and some effort to standardize, otherwise where's the
  guarantee it won't change in an incompatible way?
  Other things that I would expect such a spec to address are the
  interaction with other IPoIB features, such as connected mode and
  checksum offloading, and with IB features such as multipath.

- The way you encode LID/QPN in the MAC seems questionable. IIRC there's
  more to IB addressing than just the LID.  Since everyone on the subnet
  needs access to this translation, I think it makes sense to store it in
  the SM. I think this would also obviate some IPv4 specific hacks
  in the kernel.

- IGMP/MAC snooping in a driver is just too hairy.
  As you point out, bridge currently needs the uplink in promisc mode.
  I don't think a driver should work around that limitation.
  For some setups, it might be interesting to remove the
  promisc mode requirement; failing that,
  I think you could use macvtap passthrough.

- Currently migration works without host kernel help; it would be
  preferable to keep it that way.
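
To illustrate the concern about encoding LID/QPN in the MAC, here is a
rough sketch of what such a packing could look like. The byte layout below
is hypothetical and is not claimed to match the one in the patches; the
function names are made up as well. A LID is 16 bits and a QPN is 24 bits,
so together they use 40 of the 48 bits in a MAC. That leaves no room for a
128-bit GID or for any path information.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the one used by the patches:
 *   byte 0     flags (0x02 = locally administered, unicast)
 *   bytes 1-2  LID, big endian
 *   bytes 3-5  QPN, big endian
 */
static void mac_pack(uint8_t mac[6], uint16_t lid, uint32_t qpn)
{
	assert((qpn >> 24) == 0);	/* a QPN has only 24 bits */
	mac[0] = 0x02;
	mac[1] = (uint8_t)(lid >> 8);
	mac[2] = (uint8_t)(lid & 0xff);
	mac[3] = (uint8_t)((qpn >> 16) & 0xff);
	mac[4] = (uint8_t)((qpn >> 8) & 0xff);
	mac[5] = (uint8_t)(qpn & 0xff);
}

/* Recover the LID and QPN from a MAC built by mac_pack(). */
static void mac_unpack(const uint8_t mac[6], uint16_t *lid, uint32_t *qpn)
{
	*lid = (uint16_t)((mac[1] << 8) | mac[2]);
	*qpn = ((uint32_t)mac[3] << 16) | ((uint32_t)mac[4] << 8) | mac[5];
}
```

Note that any MAC built this way changes whenever the LID or QPN behind it
changes, which is the soft-address problem again. A translation kept in
the SM would not have that coupling.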


Hope this helps,
MST

