From: Roland Dreier <rdreier@cisco.com>
To: Stephen Hemminger <shemminger@vyatta.com>
Cc: Marc Aurele La France <tsi@ualberta.ca>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Ben Hutchings <bhutchings@solarflare.com>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	"Pekka Savola (ipv6)" <pekkas@netcore.fi>,
	James Morris <jmorris@namei.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Patrick McHardy <kaber@trash.net>
Subject: Re: RFC: MTU for serving NFS on Infiniband
Date: Fri, 27 Aug 2010 09:20:15 -0700
Message-ID: <adaiq2w2fk0.fsf@cisco.com>
In-Reply-To: <20100826165359.3b79b27d@nehalam> (Stephen Hemminger's message of "Thu, 26 Aug 2010 16:53:59 -0700")

 > Infiniband device driver needs to be fixed to do SG and checksum offload.
 > Otherwise it is insane to try and run large MTU over it. I even wonder if
 > the dev_change_mtu() function should reject > PAGESIZE mtu for devices
 > that don't do scatter/gather or at least raise a warning.

It's not possible to "fix" the driver to do checksum offload, since the
underlying hardware does not support it.  Theoretically we could handle
SG but of course there's no point in that without checksum offload.

I think there is some confusion about what IPoIB is in this thread, so
let me try to give some basic background to help the discussion.  There
are two "modes" that an IPoIB interface can operate in: datagram mode
and connected mode.

In datagram mode, packets given to the IPoIB driver are sent as IB
unreliable datagram messages, which means each skb turns into one packet
on the wire -- very much like the ethernet case.  In this mode, the MTU
is limited by the MTU on the IB side, which is typically either 2K or 4K
depending on the adapter and the switches involved.  Modern IB adapters
do support checksum offload and large send offload for datagrams, so we
can and do enable SG and IP_CSUM.
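
As a rough illustration of the datagram-mode limit described above, the
sketch below derives the IP MTU from the IB link MTU.  It assumes the
4-byte IPoIB encapsulation header from RFC 4391; the function and
constant names are made up for this example and are not taken from the
driver.

```python
# Hypothetical helper, not driver code: IP MTU available in datagram mode.
IPOIB_ENCAP_LEN = 4  # assumption: 4-byte IPoIB encapsulation header (RFC 4391)


def datagram_ip_mtu(ib_mtu: int) -> int:
    """Return the largest IP packet that fits in one IB datagram packet."""
    if ib_mtu <= IPOIB_ENCAP_LEN:
        raise ValueError("IB MTU too small to carry an IP packet")
    return ib_mtu - IPOIB_ENCAP_LEN


if __name__ == "__main__":
    # The two IB MTUs mentioned above: 2K and 4K.
    for ib_mtu in (2048, 4096):
        print(ib_mtu, datagram_ip_mtu(ib_mtu))
```

Under that assumption, a 2K IB MTU leaves 2044 bytes for the IP packet
and a 4K IB MTU leaves 4092, so a datagram-mode skb always fits in a
single 4K page.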

In connected mode, the IPoIB driver actually makes a reliable connection
to each peer.  For reliable connections, IB adapters can actually send
messages up to 4GB, with the adapter handling all the segmentation and
transport level acks etc. -- the host system simply queues one work
request for each message of any size.  These work requests do support
gather/scatter, but no existing adapter supports checksum offload for
messages on reliable connections.

However, since reliable connections support arbitrary sized messages, in
connected mode the IPoIB driver allows an MTU up to roughly the maximum
64K IP message size.  (I don't think anyone has tried it with bigger
IPv6 jumbograms ;)

It does seem that even with all the horrible memory allocation problems
caused by requiring huge linear skbs, connected mode does offer very
good performance for at least some real-world uses (although apparently
NFS is not one such use).  In fact as far as I know, connected mode with
a huge MTU continues to outperform datagram mode even with LSO and LRO
(although I don't have any particularly recent numbers).  So I don't
think we want to completely disallow such uses.
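
To put rough numbers on the linear skb problem, here is a
back-of-the-envelope sketch of how many physically contiguous pages a
linear buffer needs for a given MTU.  It assumes 4K pages and
power-of-two page allocations, and it ignores skb headers and other
per-packet overhead, so the results are lower bounds.  The function name
is invented for illustration.

```python
def linear_alloc_order(mtu: int, page_size: int = 4096) -> int:
    """Smallest order such that 2**order contiguous pages hold `mtu` bytes.

    Assumption: payload only; real allocations also carry skb overhead.
    """
    if mtu <= 0 or page_size <= 0:
        raise ValueError("mtu and page_size must be positive")
    pages = -(-mtu // page_size)        # ceiling division
    return (pages - 1).bit_length()     # round page count up to a power of two


if __name__ == "__main__":
    # Datagram-sized MTUs versus a connected-mode MTU just under 64K.
    for mtu in (2044, 4092, 65520):
        order = linear_alloc_order(mtu)
        print(mtu, "order", order, "pages", 2 ** order)
```

With these assumptions the datagram-sized MTUs need a single page
(order 0), while an MTU just under 64K needs at least 16 contiguous
pages (order 4) for every packet, which is where the allocation trouble
comes from.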

 - R.
