* [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload
@ 2016-03-18 23:24 Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID Alexander Duyck
` (9 more replies)
0 siblings, 10 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:24 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This patch series addresses two things.
First, it enables what I am calling RFC6864-compliant GRO. Basically I am
allowing one of two patterns for incoming frames: either the IP ID
increments, or, if the DF bit is set, it may either increment or hold the
same value. This allows us to perform GRO when the IP ID is forced to stay
at a static value, as may occur if we are replicating an IP header instead
of actually having it offloaded.
The last three patches introduce what I am calling GSO partial. The best
way to describe it is as a GSO offload in which the portion pointed to by
csum_start must be handled by the hardware, while the region before that
can optionally be handled as well. For example, with i40e the only piece
missing from the full offload was the checksum, so that is computed in
software and the hardware updates both the inner and outer IP headers. In
the ixgbe example the hardware will only update the outer IP header; the
outer UDP or GRE header and the inner IP header are left unmodified.
The bits remaining to be worked out concern drivers that cannot handle
updating the outer IP header. For example, should I require the driver to
update the outer IP header, or should that be handled in software? The one
concern here is that if the outer IP header does not have the DF bit set
and the IP ID field is not updated, we run the risk of causing all sorts
of problems should the packet be fragmented in flight.
---
Alexander Duyck (9):
ipv4/GRO: Allow multiple frames to use the same IP ID
gre: Enforce IP ID verification on outer headers
geneve: Enforce IP ID verification on outer headers
vxlan: Enforce IP ID verification on outer headers
gue: Enforce IP ID verification on outer headers
ethtool: Add support for toggling any of the GSO offloads
GSO: Support partial segmentation offload
i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
ixgbe/ixgbevf: Add support for GSO partial
drivers/net/ethernet/intel/i40e/i40e_main.c | 8 ++
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 +++-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 14 +++-
drivers/net/ethernet/intel/i40evf/i40evf_main.c | 8 ++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 72 +++++++++++++-------
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 77 ++++++++++++++-------
drivers/net/geneve.c | 14 ++++
drivers/net/vxlan.c | 14 ++++
include/linux/netdev_features.h | 7 ++
include/linux/netdevice.h | 7 ++
include/linux/skbuff.h | 7 ++
net/core/dev.c | 31 ++++++++
net/core/ethtool.c | 4 +
net/core/skbuff.c | 21 +++++-
net/ipv4/af_inet.c | 20 ++++-
net/ipv4/fou.c | 14 ++++
net/ipv4/gre_offload.c | 37 +++++++++-
net/ipv4/tcp_offload.c | 25 ++++++-
net/ipv4/udp_offload.c | 20 ++++-
net/ipv6/ip6_offload.c | 9 ++
20 files changed, 344 insertions(+), 79 deletions(-)
^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
@ 2016-03-18 23:24 ` Alexander Duyck
2016-03-24 1:43 ` Jesse Gross
2016-03-18 23:24 ` [RFC PATCH 2/9] gre: Enforce IP ID verification on outer headers Alexander Duyck
` (8 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:24 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
RFC 6864 states that we can essentially ignore the IPv4 ID field if we
have not used and will not use fragmentation. Such a frame is defined as
having the DF flag set to 1, and the MF flag and fragment offset set to 0.
Currently for GRO we require that the inner header always have an
increasing IPv4 ID, while the outer value is ignored.
This patch is a first step toward reversing some of that. Specifically,
it allows us to coalesce frames that carry a static IPv4 ID value. So, for
example, if we had a series of frames in which the DF flag was set, we
would allow the same IPv4 ID value to be used for all the frames belonging
to that flow. This becomes the standard behavior for TCP, which will then
support either a fixed IPv4 ID value or one in which the value increments.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
include/linux/netdevice.h | 5 ++++-
net/ipv4/af_inet.c | 8 ++++++--
net/ipv4/tcp_offload.c | 15 ++++++++++++++-
3 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be693b34662f..31474d9d8a96 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2113,7 +2113,10 @@ struct napi_gro_cb {
/* Used in foo-over-udp, set in udp[46]_gro_receive */
u8 is_ipv6:1;
- /* 7 bit hole */
+ /* Flag indicating if IP ID can be ignored on receive */
+ u8 rfc6864:1;
+
+ /* 6 bit hole */
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 0cc923f83e10..5e3885672907 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1320,6 +1320,8 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
id = ntohl(*(__be32 *)&iph->id);
flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
+
+ NAPI_GRO_CB(skb)->rfc6864 = !!(id & IP_DF);
id >>= 16;
for (p = *head; p; p = p->next) {
@@ -1352,8 +1354,10 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
* This is because some GSO/TSO implementations do not
* correctly increment the IP ID for the outer hdrs.
*/
- NAPI_GRO_CB(p)->flush_id =
- ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
+ NAPI_GRO_CB(p)->flush_id = (u16)(id - ntohs(iph2->id));
+ if (!NAPI_GRO_CB(p)->rfc6864 || !NAPI_GRO_CB(skb)->rfc6864)
+ NAPI_GRO_CB(p)->flush_id ^= NAPI_GRO_CB(p)->count;
+
NAPI_GRO_CB(p)->flush |= flush;
}
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 773083b7f1e9..1a2e9957c177 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -237,7 +237,7 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
found:
/* Include the IP ID check below from the inner most IP hdr */
- flush = NAPI_GRO_CB(p)->flush | NAPI_GRO_CB(p)->flush_id;
+ flush = NAPI_GRO_CB(p)->flush;
flush |= (__force int)(flags & TCP_FLAG_CWR);
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
@@ -246,6 +246,19 @@ found:
flush |= *(u32 *)((u8 *)th + i) ^
*(u32 *)((u8 *)th2 + i);
+ /* Some devices may make use of RFC6864 to skip the need to
+ * increment the IP ID on inner headers of tunneled frames. This
+ * allows them to make use of TSO by not updating the headers from
+ * the outer L4 to inner L3. We should be able to identify any such
+ * frames on the second frame received and then after that we expect
+ * the same ID on all remaining frames in the flow.
+ */
+ if ((NAPI_GRO_CB(p)->flush_id == 1) && NAPI_GRO_CB(p)->rfc6864 &&
+ (NAPI_GRO_CB(p)->count == 1))
+ NAPI_GRO_CB(p)->rfc6864 = 0;
+ else
+ flush |= NAPI_GRO_CB(p)->flush_id;
+
mss = skb_shinfo(p)->gso_size;
flush |= (len - 1) >= mss;
* [RFC PATCH 2/9] gre: Enforce IP ID verification on outer headers
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID Alexander Duyck
@ 2016-03-18 23:24 ` Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 3/9] geneve: " Alexander Duyck
` (7 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:24 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This change enforces IP ID verification on outer headers. As a result, if
the DF flag is not set on the outer header, we will force the flow to be
flushed whenever the IP ID is out of sequence with the existing flow.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
net/ipv4/gre_offload.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 540866dbd27d..7ea14ced9222 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -203,6 +203,20 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head,
continue;
}
}
+
+ /* Include the IP ID check from the outer IP hdr */
+ if (!NAPI_GRO_CB(p)->flush_id)
+ continue;
+
+ /* If flush_id is non-zero and rfc6864 is enabled for
+ * the new frame the only possibility is that we are
+ * incrementing so we can xor by count to cancel out
+ * the one acceptable value.
+ */
+ NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(skb)->rfc6864 ?
+ NAPI_GRO_CB(p)->flush_id ^
+ NAPI_GRO_CB(p)->count :
+ NAPI_GRO_CB(p)->flush_id;
}
skb_gro_pull(skb, grehlen);
* [RFC PATCH 3/9] geneve: Enforce IP ID verification on outer headers
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 2/9] gre: Enforce IP ID verification on outer headers Alexander Duyck
@ 2016-03-18 23:24 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 4/9] vxlan: " Alexander Duyck
` (6 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:24 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This change enforces IP ID verification on outer headers. As a result, if
the DF flag is not set on the outer header, we will force the flow to be
flushed whenever the IP ID is out of sequence with the existing flow.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
drivers/net/geneve.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 192631a345df..2dac5496c965 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -473,6 +473,20 @@ static struct sk_buff **geneve_gro_receive(struct sk_buff **head,
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
+
+ /* Include the IP ID check from the outer IP hdr */
+ if (!NAPI_GRO_CB(p)->flush_id)
+ continue;
+
+ /* If flush_id is non-zero and rfc6864 is enabled for
+ * the new frame the only possibility is that we are
+ * incrementing so we can xor by count to cancel out
+ * the one acceptable value.
+ */
+ NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(skb)->rfc6864 ?
+ NAPI_GRO_CB(p)->flush_id ^
+ NAPI_GRO_CB(p)->count :
+ NAPI_GRO_CB(p)->flush_id;
}
type = gh->proto_type;
* [RFC PATCH 4/9] vxlan: Enforce IP ID verification on outer headers
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (2 preceding siblings ...)
2016-03-18 23:24 ` [RFC PATCH 3/9] geneve: " Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 5/9] gue: " Alexander Duyck
` (5 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This change enforces IP ID verification on outer headers. As a result, if
the DF flag is not set on the outer header, we will force the flow to be
flushed whenever the IP ID is out of sequence with the existing flow.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
drivers/net/vxlan.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 800106a7246c..d6838a30e772 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -601,6 +601,20 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
+
+ /* Include the IP ID check from the outer IP hdr */
+ if (!NAPI_GRO_CB(p)->flush_id)
+ continue;
+
+ /* If flush_id is non-zero and rfc6864 is enabled for
+ * the new frame the only possibility is that we are
+ * incrementing so we can xor by count to cancel out
+ * the one acceptable value.
+ */
+ NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(skb)->rfc6864 ?
+ NAPI_GRO_CB(p)->flush_id ^
+ NAPI_GRO_CB(p)->count :
+ NAPI_GRO_CB(p)->flush_id;
}
pp = eth_gro_receive(head, skb);
* [RFC PATCH 5/9] gue: Enforce IP ID verification on outer headers
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (3 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 4/9] vxlan: " Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads Alexander Duyck
` (4 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This change enforces IP ID verification on outer headers. As a result, if
the DF flag is not set on the outer header, we will force the flow to be
flushed whenever the IP ID is out of sequence with the existing flow.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
net/ipv4/fou.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 780484243e14..2be35e4f6d88 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -341,6 +341,20 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head,
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
+
+ /* Include the IP ID check from the outer IP hdr */
+ if (!NAPI_GRO_CB(p)->flush_id)
+ continue;
+
+ /* If flush_id is non-zero and rfc6864 is enabled for
+ * the new frame the only possibility is that we are
+ * incrementing so we can xor by count to cancel out
+ * the one acceptable value.
+ */
+ NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(skb)->rfc6864 ?
+ NAPI_GRO_CB(p)->flush_id ^
+ NAPI_GRO_CB(p)->count :
+ NAPI_GRO_CB(p)->flush_id;
}
rcu_read_lock();
* [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (4 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 5/9] gue: " Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-19 0:18 ` Ben Hutchings
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
` (3 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
The strings were missing for several of the GSO offloads that are
available. This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
net/core/ethtool.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 2966cd0d7c93..b3c39d531469 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -82,9 +82,12 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
[NETIF_F_GSO_GRE_BIT] = "tx-gre-segmentation",
+ [NETIF_F_GSO_GRE_CSUM_BIT] = "tx-gre-csum-segmentation",
[NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
[NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
[NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
+ [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
+ [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
[NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
[NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp",
* [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (5 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-22 17:00 ` Edward Cree
` (2 more replies)
2016-03-18 23:25 ` [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM Alexander Duyck
` (2 subsequent siblings)
9 siblings, 3 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support segmentation on a broader range of
devices if we use fixed outer headers and have the hardware deal only with
segmenting the inner header. The name comes from the fact that everything
before csum_start will be fixed headers, and everything after will be the
region that is handled by hardware.
With the current implementation it allows us to add support for the
following GSO types with an inner TSO or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_UDP_TUNNEL
NETIF_F_GSO_UDP_TUNNEL_CSUM
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
include/linux/netdev_features.h | 7 ++++++-
include/linux/netdevice.h | 2 ++
include/linux/skbuff.h | 7 ++++++-
net/core/dev.c | 31 ++++++++++++++++++++++++++++++-
net/core/ethtool.c | 1 +
net/core/skbuff.c | 21 ++++++++++++++++++++-
net/ipv4/af_inet.c | 12 ++++++++++--
net/ipv4/gre_offload.c | 23 +++++++++++++++++++----
net/ipv4/tcp_offload.c | 10 ++++++++--
net/ipv4/udp_offload.c | 20 ++++++++++++++++----
net/ipv6/ip6_offload.c | 9 ++++++++-
11 files changed, 126 insertions(+), 17 deletions(-)
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index a734bf43d190..8df3c5553af0 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,8 +48,12 @@ enum {
NETIF_F_GSO_UDP_TUNNEL_BIT, /* ... UDP TUNNEL with TSO */
NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
+ NETIF_F_GSO_PARTIAL_BIT, /* ... Only segment inner-most L4
+ * in hardware and all other
+ * headers in software.
+ */
/**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
- NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
+ NETIF_F_GSO_PARTIAL_BIT,
NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
@@ -121,6 +125,7 @@ enum {
#define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
#define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
#define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
+#define NETIF_F_GSO_PARTIAL __NETIF_F(GSO_PARTIAL)
#define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
#define NETIF_F_HW_VLAN_STAG_RX __NETIF_F(HW_VLAN_STAG_RX)
#define NETIF_F_HW_VLAN_STAG_TX __NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31474d9d8a96..427d748ad8f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1647,6 +1647,7 @@ struct net_device {
netdev_features_t vlan_features;
netdev_features_t hw_enc_features;
netdev_features_t mpls_features;
+ netdev_features_t gso_partial_features;
int ifindex;
int group;
@@ -4014,6 +4015,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
+ BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO_SHIFT));
return (features & feature) == feature;
}
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df943466..c291a282f8b6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -482,6 +482,8 @@ enum {
SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
+
+ SKB_GSO_PARTIAL = 1 << 13,
};
#if BITS_PER_LONG > 32
@@ -3584,7 +3586,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
* Keeps track of level of encapsulation of network headers.
*/
struct skb_gso_cb {
- int mac_offset;
+ union {
+ int mac_offset;
+ int data_offset;
+ };
int encap_level;
__wsum csum;
__u16 csum_start;
diff --git a/net/core/dev.c b/net/core/dev.c
index edb7179bc051..666cf427898b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
return ERR_PTR(err);
}
+ /* Only report GSO partial support if it will enable us to
+ * support segmentation on this frame without needing additional
+ * work.
+ */
+ if (features & NETIF_F_GSO_PARTIAL) {
+ netdev_features_t partial_features;
+ struct net_device *dev = skb->dev;
+
+ partial_features = dev->features & dev->gso_partial_features;
+ if (!skb_gso_ok(skb, features | partial_features))
+ features &= ~NETIF_F_GSO_PARTIAL;
+ }
+
BUILD_BUG_ON(SKB_SGO_CB_OFFSET +
sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb));
@@ -2841,6 +2854,14 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
if (skb->encapsulation)
features &= dev->hw_enc_features;
+ /* Support for GSO partial features requires software intervention
+ * before we can actually process the packets so we need to strip
+ * support for any partial features now and we can pull them back
+ * in after we have partially segmented the frame.
+ */
+ if (skb_is_gso(skb) && !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
+ features &= ~dev->gso_partial_features;
+
if (skb_vlan_tagged(skb))
features = netdev_intersect_features(features,
dev->vlan_features |
@@ -6702,6 +6723,14 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
}
}
+ /* GSO partial features require GSO partial be set */
+ if ((features & dev->gso_partial_features) &&
+ !(features & NETIF_F_GSO_PARTIAL)) {
+ netdev_dbg(dev,
+ "Dropping partially supported GSO features since no GSO partial.\n");
+ features &= ~dev->gso_partial_features;
+ }
+
#ifdef CONFIG_NET_RX_BUSY_POLL
if (dev->netdev_ops->ndo_busy_poll)
features |= NETIF_F_BUSY_POLL;
@@ -6982,7 +7011,7 @@ int register_netdevice(struct net_device *dev)
/* Make NETIF_F_SG inheritable to tunnel devices.
*/
- dev->hw_enc_features |= NETIF_F_SG;
+ dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL;
/* Make NETIF_F_SG inheritable to MPLS.
*/
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index b3c39d531469..d1b278c6c29f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -88,6 +88,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
[NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
[NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
+ [NETIF_F_GSO_PARTIAL_BIT] = "tx-gso-partial",
[NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
[NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp",
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f044f970f1a6..bdcba77e164c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3076,8 +3076,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
struct sk_buff *frag_skb = head_skb;
unsigned int offset = doffset;
unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
+ unsigned int partial_segs = 0;
unsigned int headroom;
- unsigned int len;
+ unsigned int len = head_skb->len;
__be16 proto;
bool csum;
int sg = !!(features & NETIF_F_SG);
@@ -3094,6 +3095,15 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
csum = !!can_checksum_protocol(features, proto);
+ /* GSO partial only requires that we trim off any excess that
+ * doesn't fit into an MSS sized block, so take care of that
+ * now.
+ */
+ if (features & NETIF_F_GSO_PARTIAL) {
+ partial_segs = len / mss;
+ mss *= partial_segs;
+ }
+
headroom = skb_headroom(head_skb);
pos = skb_headlen(head_skb);
@@ -3281,6 +3291,15 @@ perform_csum_check:
*/
segs->prev = tail;
+ /* Update GSO info on first skb in partial sequence. */
+ if (partial_segs) {
+ skb_shinfo(segs)->gso_size = mss / partial_segs;
+ skb_shinfo(segs)->gso_segs = partial_segs;
+ skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
+ SKB_GSO_PARTIAL;
+ SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
+ }
+
/* Following permits correct backpressure, for protocols
* using skb_set_owner_w().
* Idea is to tranfert ownership from head_skb to last segment.
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5e3885672907..d091f91fa25d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1199,7 +1199,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
unsigned int offset = 0;
bool udpfrag, encap;
struct iphdr *iph;
- int proto;
+ int proto, tot_len;
int nhoff;
int ihl;
int id;
@@ -1269,10 +1269,18 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
if (skb->next)
iph->frag_off |= htons(IP_MF);
offset += skb->len - nhoff - ihl;
+ tot_len = skb->len - nhoff;
+ } else if (skb_is_gso(skb)) {
+ iph->id = htons(id);
+ id += skb_shinfo(skb)->gso_segs;
+ tot_len = skb_shinfo(skb)->gso_size +
+ SKB_GSO_CB(skb)->data_offset +
+ skb->head - (unsigned char *)iph;
} else {
iph->id = htons(id++);
+ tot_len = skb->len - nhoff;
}
- iph->tot_len = htons(skb->len - nhoff);
+ iph->tot_len = htons(tot_len);
ip_send_check(iph);
if (encap)
skb_reset_inner_headers(skb);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 7ea14ced9222..dea0390d65bb 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -85,7 +85,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
skb = segs;
do {
struct gre_base_hdr *greh;
- __be32 *pcsum;
+ __sum16 *pcsum;
/* Set up inner headers if we are offloading inner checksum */
if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -105,10 +105,25 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
continue;
greh = (struct gre_base_hdr *)skb_transport_header(skb);
- pcsum = (__be32 *)(greh + 1);
+ pcsum = (__sum16 *)(greh + 1);
+
+ if (skb_is_gso(skb)) {
+ unsigned int partial_adj;
+
+ /* Adjust checksum to account for the fact that
+ * the partial checksum is based on actual size
+ * whereas headers should be based on MSS size.
+ */
+ partial_adj = skb->len + skb_headroom(skb) -
+ SKB_GSO_CB(skb)->data_offset -
+ skb_shinfo(skb)->gso_size;
+ *pcsum = ~csum_fold((__force __wsum)htonl(partial_adj));
+ } else {
+ *pcsum = 0;
+ }
- *pcsum = 0;
- *(__sum16 *)pcsum = gso_make_checksum(skb, 0);
+ *(pcsum + 1) = 0;
+ *pcsum = gso_make_checksum(skb, 0);
} while ((skb = skb->next));
out:
return segs;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 1a2e9957c177..4e9b8f011473 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -107,6 +107,12 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
goto out;
}
+ /* GSO partial only requires splitting the frame into an MSS
+ * multiple and possibly a remainder. So update the mss now.
+ */
+ if (features & NETIF_F_GSO_PARTIAL)
+ mss = skb->len - (skb->len % mss);
+
copy_destructor = gso_skb->destructor == tcp_wfree;
ooo_okay = gso_skb->ooo_okay;
/* All segments but the first should have ooo_okay cleared */
@@ -131,7 +137,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
(__force u32)delta));
- do {
+ while (skb->next) {
th->fin = th->psh = 0;
th->check = newcheck;
@@ -151,7 +157,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
th->seq = htonl(seq);
th->cwr = 0;
- } while (skb->next);
+ }
/* Following permits TCP Small Queues to work well with GSO :
* The callback to TCP stack will be called at the time last frag
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8a3405a80260..5fcb93269afb 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -100,7 +100,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
udp_offset = outer_hlen - tnl_hlen;
skb = segs;
do {
- __be16 len;
+ unsigned int len;
if (remcsum)
skb->ip_summed = CHECKSUM_NONE;
@@ -118,14 +118,26 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
skb_reset_mac_header(skb);
skb_set_network_header(skb, mac_len);
skb_set_transport_header(skb, udp_offset);
- len = htons(skb->len - udp_offset);
+ len = skb->len - udp_offset;
uh = udp_hdr(skb);
- uh->len = len;
+
+ /* If we are only performing partial GSO the inner header
+ * will be using a length value equal to only one MSS sized
+ * segment instead of the entire frame.
+ */
+ if (skb_is_gso(skb)) {
+ uh->len = htons(skb_shinfo(skb)->gso_size +
+ SKB_GSO_CB(skb)->data_offset +
+ skb->head - (unsigned char *)uh);
+ } else {
+ uh->len = htons(len);
+ }
if (!need_csum)
continue;
- uh->check = ~csum_fold(csum_add(partial, (__force __wsum)len));
+ uh->check = ~csum_fold(csum_add(partial,
+ (__force __wsum)htonl(len)));
if (skb->encapsulation || !offload_csum) {
uh->check = gso_make_checksum(skb, ~uh->check);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index eeca943f12dc..d467053c226a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -63,6 +63,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
int proto;
struct frag_hdr *fptr;
unsigned int unfrag_ip6hlen;
+ unsigned int payload_len;
u8 *prevhdr;
int offset = 0;
bool encap, udpfrag;
@@ -117,7 +118,13 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
for (skb = segs; skb; skb = skb->next) {
ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
- ipv6h->payload_len = htons(skb->len - nhoff - sizeof(*ipv6h));
+ if (skb_is_gso(skb))
+ payload_len = skb_shinfo(skb)->gso_size +
+ SKB_GSO_CB(skb)->data_offset +
+ skb->head - (unsigned char *)(ipv6h + 1);
+ else
+ payload_len = skb->len - nhoff - sizeof(*ipv6h);
+ ipv6h->payload_len = htons(payload_len);
skb->network_header = (u8 *)ipv6h - skb->head;
if (udpfrag) {
* [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (6 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-23 19:35 ` Jesse Gross
2016-03-18 23:25 ` [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial Alexander Duyck
2016-03-21 18:50 ` [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload David Miller
9 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This patch makes it so that i40e and i40evf can use GSO_PARTIAL to support
segmentation for frames with checksums enabled in outer headers. As a
result we can now send data over these types of tunnels at over 20Gb/s
versus the 12Gb/s that was previously possible on my system.
The advantage with the i40e parts is that this offload is mostly
transparent: the hardware still deals with the inner and/or outer IPv4
headers, so the IP ID is still incrementing for both when this offload is
performed.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++++++-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 +++++++++++---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 14 +++++++++++---
drivers/net/ethernet/intel/i40evf/i40evf_main.c | 8 +++++++-
4 files changed, 36 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 39b0009253c2..ac3964a9f5c0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9050,6 +9050,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
NETIF_F_TSO6 |
NETIF_F_TSO_ECN |
NETIF_F_GSO_GRE |
+ NETIF_F_GSO_GRE_CSUM |
NETIF_F_GSO_UDP_TUNNEL |
NETIF_F_GSO_UDP_TUNNEL_CSUM |
0;
@@ -9074,7 +9075,12 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
if (!(pf->flags & I40E_FLAG_MFP_ENABLED))
netdev->features |= NETIF_F_NTUPLE;
if (pf->flags & I40E_FLAG_OUTER_UDP_CSUM_CAPABLE)
- netdev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+ netdev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+ else
+ netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
+
+ netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
+ netdev->features |= NETIF_F_GSO_PARTIAL | netdev->gso_partial_features;
/* copy netdev features into list of user selectable features */
netdev->hw_features |= netdev->features;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5d5fa5359a1d..50aa76d7f92e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2297,9 +2297,16 @@ static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
ip.v6->payload_len = 0;
}
- if (skb_shinfo(skb)->gso_type & (SKB_GSO_UDP_TUNNEL | SKB_GSO_GRE |
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
+ SKB_GSO_GRE_CSUM |
+ SKB_GSO_UDP_TUNNEL |
SKB_GSO_UDP_TUNNEL_CSUM)) {
- if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) {
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_UDP_TUNNEL |
+ SKB_GSO_UDP_TUNNEL_CSUM))
+ l4.udp->len = 0;
+
+ if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
+ (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)) {
/* determine offset of outer transport header */
l4_offset = l4.hdr - skb->data;
@@ -2470,7 +2477,8 @@ static int i40e_tx_enable_csum(struct sk_buff *skb, u32 *tx_flags,
/* indicate if we need to offload outer UDP header */
if ((*tx_flags & I40E_TX_FLAGS_TSO) &&
- (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM))
+ (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) &&
+ !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
tunnel |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
/* record tunnel offload values */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 04aabc52ba0d..1bb7c3efa36c 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1564,9 +1564,16 @@ static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
ip.v6->payload_len = 0;
}
- if (skb_shinfo(skb)->gso_type & (SKB_GSO_UDP_TUNNEL | SKB_GSO_GRE |
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
+ SKB_GSO_GRE_CSUM |
+ SKB_GSO_UDP_TUNNEL |
SKB_GSO_UDP_TUNNEL_CSUM)) {
- if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) {
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_UDP_TUNNEL |
+ SKB_GSO_UDP_TUNNEL_CSUM))
+ l4.udp->len = 0;
+
+ if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
+ (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)) {
/* determine offset of outer transport header */
l4_offset = l4.hdr - skb->data;
@@ -1695,7 +1702,8 @@ static int i40e_tx_enable_csum(struct sk_buff *skb, u32 *tx_flags,
/* indicate if we need to offload outer UDP header */
if ((*tx_flags & I40E_TX_FLAGS_TSO) &&
- (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM))
+ (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) &&
+ !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
tunnel |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
/* record tunnel offload values */
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index e3973684746b..ed3c9310c3e0 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2331,7 +2331,7 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
NETIF_F_TSO |
NETIF_F_TSO6 |
NETIF_F_TSO_ECN |
- NETIF_F_GSO_GRE |
+ NETIF_F_GSO_GRE |
NETIF_F_GSO_UDP_TUNNEL |
NETIF_F_RXCSUM |
NETIF_F_GRO;
@@ -2342,11 +2342,17 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
NETIF_F_TSO6 |
NETIF_F_TSO_ECN |
NETIF_F_GSO_GRE |
+ NETIF_F_GSO_GRE_CSUM |
NETIF_F_GSO_UDP_TUNNEL |
NETIF_F_GSO_UDP_TUNNEL_CSUM;
if (adapter->flags & I40EVF_FLAG_OUTER_UDP_CSUM_CAPABLE)
netdev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+ else
+ netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
+
+ netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
+ netdev->features |= NETIF_F_GSO_PARTIAL | netdev->gso_partial_features;
/* copy netdev features into list of user selectable features */
netdev->hw_features |= netdev->features;
* [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (7 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM Alexander Duyck
@ 2016-03-18 23:25 ` Alexander Duyck
2016-03-19 2:05 ` Jesse Gross
2016-03-21 18:50 ` [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload David Miller
9 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-18 23:25 UTC (permalink / raw)
To: ecree, netdev, davem, alexander.duyck, tom
This patch adds support for partial GSO segmentation in the case of GRE or
UDP encapsulated frames.
The one bit in this patch that is a bit controversial is the fact that we
are leaving the inner IPv4 IP ID as a static value in the case of
segmentation. As per RFC6864 this should be acceptable as TCP frames set
the DF bit so the IP ID should be ignored. However this is not always the
case as header compression schemes for PPP and SLIP can end up taking a
performance hit as they have to record the fact that the ID didn't change
as expected.
In addition GRO was examining the IP ID field as well. As such on older
GRO implementations TSO frames from this driver may end up blocking GRO on
the other end which will likely hurt performance instead of helping it.
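The ID behavior described above, with the outer IPv4 ID advancing per segment while the inner ID stays frozen at the template value, can be sketched with a toy model (a standalone illustration under assumed names, not driver or stack code):

```c
#include <stddef.h>
#include <stdint.h>

struct seg_ids {
	uint16_t outer_ip_id;	/* device replicates and bumps this */
	uint16_t inner_ip_id;	/* left at the template value (DF set) */
};

/* Toy model of GSO partial: every segment reuses the template headers
 * verbatim except for the outer IPv4 ID, which the device increments.
 * Leaving the inner ID constant is what RFC 6864 permits for atomic
 * (DF-set) datagrams, with the caveats the commit message notes. */
static size_t gso_partial_seg_ids(uint16_t outer_id0, uint16_t inner_id,
				  size_t payload_len, size_t mss,
				  struct seg_ids *out, size_t max_segs)
{
	size_t n = (payload_len + mss - 1) / mss;

	if (n > max_segs)
		n = max_segs;
	for (size_t i = 0; i < n; i++) {
		out[i].outer_ip_id = (uint16_t)(outer_id0 + i);
		out[i].inner_ip_id = inner_id;
	}
	return n;
}
```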
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 72 +++++++++++++-------
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 77 ++++++++++++++-------
2 files changed, 99 insertions(+), 50 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 59b43ce200be..bef69306fb65 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7178,9 +7178,18 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
struct ixgbe_tx_buffer *first,
u8 *hdr_len)
{
+ u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
struct sk_buff *skb = first->skb;
- u32 vlan_macip_lens, type_tucmd;
- u32 mss_l4len_idx, l4len;
+ union {
+ struct iphdr *v4;
+ struct ipv6hdr *v6;
+ unsigned char *hdr;
+ } ip;
+ union {
+ struct tcphdr *tcp;
+ unsigned char *hdr;
+ } l4;
+ u32 paylen, l4_offset;
int err;
if (skb->ip_summed != CHECKSUM_PARTIAL)
@@ -7193,46 +7202,52 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
if (err < 0)
return err;
+ ip.hdr = skb_network_header(skb);
+ l4.hdr = skb_checksum_start(skb);
+
/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
- if (first->protocol == htons(ETH_P_IP)) {
- struct iphdr *iph = ip_hdr(skb);
- iph->tot_len = 0;
- iph->check = 0;
- tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
- iph->daddr, 0,
- IPPROTO_TCP,
- 0);
+ /* initialize outer IP header fields */
+ if (ip.v4->version == 4) {
+ /* IP header will have to cancel out any data that
+ * is not a part of the outer IP header
+ */
+ ip.v4->check = csum_fold(csum_add(lco_csum(skb),
+ csum_unfold(l4.tcp->check)));
type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
+
+ ip.v4->tot_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
IXGBE_TX_FLAGS_CSUM |
IXGBE_TX_FLAGS_IPV4;
- } else if (skb_is_gso_v6(skb)) {
- ipv6_hdr(skb)->payload_len = 0;
- tcp_hdr(skb)->check =
- ~csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr,
- 0, IPPROTO_TCP, 0);
+ } else {
+ ip.v6->payload_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
IXGBE_TX_FLAGS_CSUM;
}
- /* compute header lengths */
- l4len = tcp_hdrlen(skb);
- *hdr_len = skb_transport_offset(skb) + l4len;
+ /* determine offset of inner transport header */
+ l4_offset = l4.hdr - skb->data;
+
+ /* compute length of segmentation header */
+ *hdr_len = (l4.tcp->doff * 4) + l4_offset;
+
+ /* remove payload length from inner checksum */
+ paylen = skb->len - l4_offset;
+ csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
/* update gso size and bytecount with header size */
first->gso_segs = skb_shinfo(skb)->gso_segs;
first->bytecount += (first->gso_segs - 1) * *hdr_len;
/* mss_l4len_id: use 0 as index for TSO */
- mss_l4len_idx = l4len << IXGBE_ADVTXD_L4LEN_SHIFT;
+ mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;
/* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
- vlan_macip_lens = skb_network_header_len(skb);
- vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
+ vlan_macip_lens = l4.hdr - ip.hdr;
+ vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd,
@@ -9201,6 +9216,14 @@ skip_sriov:
NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_HW_VLAN_CTAG_FILTER;
+ netdev->gso_partial_features = NETIF_F_GSO_GRE |
+ NETIF_F_GSO_GRE_CSUM |
+ NETIF_F_GSO_UDP_TUNNEL |
+ NETIF_F_GSO_UDP_TUNNEL_CSUM;
+
+ netdev->features |= NETIF_F_GSO_PARTIAL |
+ netdev->gso_partial_features;
+
if (hw->mac.type >= ixgbe_mac_82599EB)
netdev->features |= NETIF_F_SCTP_CRC;
@@ -9219,7 +9242,10 @@ skip_sriov:
NETIF_F_SCTP_CRC;
netdev->mpls_features |= NETIF_F_HW_CSUM;
- netdev->hw_enc_features |= NETIF_F_HW_CSUM;
+ netdev->hw_enc_features |= NETIF_F_TSO |
+ NETIF_F_TSO6 |
+ NETIF_F_HW_CSUM |
+ netdev->gso_partial_features;
netdev->priv_flags |= IFF_UNICAST_FLT;
netdev->priv_flags |= IFF_SUPP_NOFCS;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 0c3e29b55b45..1ece15ee9834 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3272,9 +3272,18 @@ static int ixgbevf_tso(struct ixgbevf_ring *tx_ring,
struct ixgbevf_tx_buffer *first,
u8 *hdr_len)
{
+ u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
struct sk_buff *skb = first->skb;
- u32 vlan_macip_lens, type_tucmd;
- u32 mss_l4len_idx, l4len;
+ union {
+ struct iphdr *v4;
+ struct ipv6hdr *v6;
+ unsigned char *hdr;
+ } ip;
+ union {
+ struct tcphdr *tcp;
+ unsigned char *hdr;
+ } l4;
+ u32 paylen, l4_offset;
int err;
if (skb->ip_summed != CHECKSUM_PARTIAL)
@@ -3287,49 +3296,53 @@ static int ixgbevf_tso(struct ixgbevf_ring *tx_ring,
if (err < 0)
return err;
+ ip.hdr = skb_network_header(skb);
+ l4.hdr = skb_checksum_start(skb);
+
/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
- if (first->protocol == htons(ETH_P_IP)) {
- struct iphdr *iph = ip_hdr(skb);
-
- iph->tot_len = 0;
- iph->check = 0;
- tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
- iph->daddr, 0,
- IPPROTO_TCP,
- 0);
+ /* initialize outer IP header fields */
+ if (ip.v4->version == 4) {
+ /* IP header will have to cancel out any data that
+ * is not a part of the outer IP header
+ */
+ ip.v4->check = csum_fold(csum_add(lco_csum(skb),
+ csum_unfold(l4.tcp->check)));
type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
+
+ ip.v4->tot_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
IXGBE_TX_FLAGS_CSUM |
IXGBE_TX_FLAGS_IPV4;
- } else if (skb_is_gso_v6(skb)) {
- ipv6_hdr(skb)->payload_len = 0;
- tcp_hdr(skb)->check =
- ~csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr,
- 0, IPPROTO_TCP, 0);
+ } else {
+ ip.v6->payload_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
IXGBE_TX_FLAGS_CSUM;
}
- /* compute header lengths */
- l4len = tcp_hdrlen(skb);
- *hdr_len += l4len;
- *hdr_len = skb_transport_offset(skb) + l4len;
+ /* determine offset of inner transport header */
+ l4_offset = l4.hdr - skb->data;
- /* update GSO size and bytecount with header size */
+ /* compute length of segmentation header */
+ *hdr_len = (l4.tcp->doff * 4) + l4_offset;
+
+ /* remove payload length from inner checksum */
+ paylen = skb->len - l4_offset;
+ csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+
+ /* update gso size and bytecount with header size */
first->gso_segs = skb_shinfo(skb)->gso_segs;
first->bytecount += (first->gso_segs - 1) * *hdr_len;
/* mss_l4len_id: use 1 as index for TSO */
- mss_l4len_idx = l4len << IXGBE_ADVTXD_L4LEN_SHIFT;
+ mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;
mss_l4len_idx |= 1 << IXGBE_ADVTXD_IDX_SHIFT;
/* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
- vlan_macip_lens = skb_network_header_len(skb);
- vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
+ vlan_macip_lens = l4.hdr - ip.hdr;
+ vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
ixgbevf_tx_ctxtdesc(tx_ring, vlan_macip_lens,
@@ -3992,12 +4005,19 @@ static int ixgbevf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_sw_init;
}
+ netdev->gso_partial_features = NETIF_F_GSO_GRE |
+ NETIF_F_GSO_GRE_CSUM |
+ NETIF_F_GSO_UDP_TUNNEL |
+ NETIF_F_GSO_UDP_TUNNEL_CSUM;
+
netdev->hw_features = NETIF_F_SG |
NETIF_F_TSO |
NETIF_F_TSO6 |
NETIF_F_RXCSUM |
NETIF_F_HW_CSUM |
- NETIF_F_SCTP_CRC;
+ NETIF_F_SCTP_CRC |
+ NETIF_F_GSO_PARTIAL |
+ netdev->gso_partial_features;
netdev->features = netdev->hw_features |
NETIF_F_HW_VLAN_CTAG_TX |
@@ -4011,7 +4031,10 @@ static int ixgbevf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
NETIF_F_SCTP_CRC;
netdev->mpls_features |= NETIF_F_HW_CSUM;
- netdev->hw_enc_features |= NETIF_F_HW_CSUM;
+ netdev->hw_enc_features |= NETIF_F_TSO |
+ NETIF_F_TSO6 |
+ NETIF_F_HW_CSUM |
+ netdev->gso_partial_features;
if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;
* Re: [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-18 23:25 ` [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads Alexander Duyck
@ 2016-03-19 0:18 ` Ben Hutchings
2016-03-19 0:30 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Ben Hutchings @ 2016-03-19 0:18 UTC (permalink / raw)
To: Alexander Duyck, ecree, netdev, davem, alexander.duyck, tom
On Fri, 2016-03-18 at 16:25 -0700, Alexander Duyck wrote:
> The strings were missing for several of the GSO offloads that are
> available. This patch provides the missing strings so that we can toggle
> or query any of them via the ethtool command.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
> net/core/ethtool.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 2966cd0d7c93..b3c39d531469 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -82,9 +82,12 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
> [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
> [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
> [NETIF_F_GSO_GRE_BIT] = "tx-gre-segmentation",
> + [NETIF_F_GSO_GRE_CSUM_BIT] = "tx-gre-csum-segmentation",
All of the existing checksum offload names include the word "checksum"
in full, so I think the new names should do the same.
> [NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
> [NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
> [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
> + [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
> + [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
I think this should be "tx-tunnel-remote-checksum-segmentation", though
that is getting quite unwieldy.
Ben.
> [NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
> [NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp",
>
--
Ben Hutchings
To err is human; to really foul things up requires a computer.
* Re: [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-19 0:18 ` Ben Hutchings
@ 2016-03-19 0:30 ` Alexander Duyck
2016-03-19 1:42 ` Ben Hutchings
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-19 0:30 UTC (permalink / raw)
To: Ben Hutchings
Cc: Alexander Duyck, Edward Cree, Netdev, David Miller, Tom Herbert
On Fri, Mar 18, 2016 at 5:18 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Fri, 2016-03-18 at 16:25 -0700, Alexander Duyck wrote:
>
>> The strings were missing for several of the GSO offloads that are
>> available. This patch provides the missing strings so that we can toggle
>> or query any of them via the ethtool command.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
>> net/core/ethtool.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
>> index 2966cd0d7c93..b3c39d531469 100644
>> --- a/net/core/ethtool.c
>> +++ b/net/core/ethtool.c
>> @@ -82,9 +82,12 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
>> [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
>> [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
>> [NETIF_F_GSO_GRE_BIT] = "tx-gre-segmentation",
>> + [NETIF_F_GSO_GRE_CSUM_BIT] = "tx-gre-csum-segmentation",
>
> All of the existing checksum offload names include the word "checksum"
> in full, so I think the new names should do the same.
It isn't a checksum offload though, it is a segmentation offload for a
tunnel that has an outer checksum. I was hoping to avoid using the
word checksum as that might make it confusing as it is a segmentation
offload, not a checksum offload.
>> [NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
>> [NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
>> [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
>> + [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
>> + [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
>
> I think this should be "tx-tunnel-remote-checksum-segmentation", though
> that is getting quite unwieldy.
Right. As it is I think we might be coming up on the 32 character
limit for the strings. Replacing csum with checksum would probably
push us over.
- Alex
* Re: [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-19 0:30 ` Alexander Duyck
@ 2016-03-19 1:42 ` Ben Hutchings
2016-03-19 2:01 ` Jesse Gross
0 siblings, 1 reply; 51+ messages in thread
From: Ben Hutchings @ 2016-03-19 1:42 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, Edward Cree, Netdev, David Miller, Tom Herbert
On Fri, 2016-03-18 at 17:30 -0700, Alexander Duyck wrote:
> On Fri, Mar 18, 2016 at 5:18 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> >
> > On Fri, 2016-03-18 at 16:25 -0700, Alexander Duyck wrote:
> >
> > >
> > > The strings were missing for several of the GSO offloads that are
> > > available. This patch provides the missing strings so that we can toggle
> > > or query any of them via the ethtool command.
> > >
> > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> > > ---
> > > net/core/ethtool.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> > > index 2966cd0d7c93..b3c39d531469 100644
> > > --- a/net/core/ethtool.c
> > > +++ b/net/core/ethtool.c
> > > @@ -82,9 +82,12 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
> > > [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
> > > [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
> > > [NETIF_F_GSO_GRE_BIT] = "tx-gre-segmentation",
> > > + [NETIF_F_GSO_GRE_CSUM_BIT] = "tx-gre-csum-segmentation",
> > All of the existing checksum offload names include the word "checksum"
> > in full, so I think the new names should do the same.
> It isn't a checksum offload though, it is a segmentation offload for a
> tunnel that has an outer checksum. I was hoping to avoid using the
> word checksum as that might make it confusing as it is a segmentation
> offload, not a checksum offload.
Yes, but my point is that we haven't used the abbreviation "csum".
> > >
> > > [NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
> > > [NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
> > > [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
> > > + [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
> > > + [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
> > I think this should be "tx-tunnel-remote-checksum-segmentation", though
> > that is getting quite unwieldy.
> Right. As it is I think we might be coming up on the 32 character
> limit for the strings. Replacing csum with checksum would probably
> push us over.
Right, I wasn't even thinking about the static limit! That does weigh
rather heavily in favour of abbreviation here.
Please do at least hyphenate "remcsum" though.
Ben.
--
Ben Hutchings
To err is human; to really foul things up requires a computer.
* Re: [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-19 1:42 ` Ben Hutchings
@ 2016-03-19 2:01 ` Jesse Gross
2016-03-19 2:43 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-19 2:01 UTC (permalink / raw)
To: Ben Hutchings
Cc: Alexander Duyck, Alexander Duyck, Edward Cree, Netdev,
David Miller, Tom Herbert
On Fri, Mar 18, 2016 at 6:42 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Fri, 2016-03-18 at 17:30 -0700, Alexander Duyck wrote:
>> On Fri, Mar 18, 2016 at 5:18 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
>> > On Fri, 2016-03-18 at 16:25 -0700, Alexander Duyck wrote:
>> > > [NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
>> > > [NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
>> > > [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
>> > > + [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
>> > > + [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
>> > I think this should be "tx-tunnel-remote-checksum-segmentation", though
>> > that is getting quite unwieldy.
>> Right. As it is I think we might be coming up on the 32 character
>> limit for the strings. Replacing csum with checksum would probably
>> push us over.
>
> Right, I wasn't even thinking about the static limit! That does weigh
> rather heavily in favour of abbreviation here.
>
> Please do at least hyphenate "remcsum" though.
I think that remote checksum offload is just a purely internal feature
- that is no device will ever expose support for it, since it is
explicitly to work around lack of hardware support. As a result, I
don't know if it makes sense to show it through ethtool at all.
* Re: [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial
2016-03-18 23:25 ` [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial Alexander Duyck
@ 2016-03-19 2:05 ` Jesse Gross
2016-03-19 2:42 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-19 2:05 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Linux Kernel Network Developers, David Miller,
Alexander Duyck, Tom Herbert
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This patch adds support for partial GSO segmentation in the case of GRE or
> UDP encapsulated frames.
>
> The one bit in this patch that is a bit controversial is the fact that we
> are leaving the inner IPv4 IP ID as a static value in the case of
> segmentation. As per RFC6864 this should be acceptable as TCP frames set
> the DF bit so the IP ID should be ignored. However this is not always the
> case as header compression schemes for PPP and SLIP can end up taking a
> performance hit as they have to record the fact that the ID didn't change
> as expected.
>
> In addition GRO was examining the IP ID field as well. As such on older
> GRO implementations TSO frames from this driver may end up blocking GRO on
> the other end which will likely hurt performance instead of helping it.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
I wasn't able to apply this patch, it seems like it might be based on
some Intel driver patches that haven't been merged into net-next yet?
* Re: [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial
2016-03-19 2:05 ` Jesse Gross
@ 2016-03-19 2:42 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-19 2:42 UTC (permalink / raw)
To: Jesse Gross
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David Miller, Tom Herbert
On Fri, Mar 18, 2016 at 7:05 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch adds support for partial GSO segmentation in the case of GRE or
>> UDP encapsulated frames.
>>
>> The one bit in this patch that is a bit controversial is the fact that we
>> are leaving the inner IPv4 IP ID as a static value in the case of
>> segmentation. As per RFC6864 this should be acceptable as TCP frames set
>> the DF bit so the IP ID should be ignored. However this is not always the
>> case as header compression schemes for PPP and SLIP can end up taking a
>> performance hit as they have to record the fact that the ID didn't change
>> as expected.
>>
>> In addition GRO was examining the IP ID field as well. As such on older
>> GRO implementations TSO frames from this driver may end up blocking GRO on
>> the other end which will likely hurt performance instead of helping it.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> I wasn't able to apply this patch, it seems like it might be based on
> some Intel driver patches that haven't been merged into net-next yet?
Yeah, I based it off of Jeff Kirsher's dev-queue branch of his
next-queue git repo. Odds are it isn't working because net-next still
doesn't have the HW_CSUM patches I submitted a couple months ago. I
ended up doing that as I needed to pull in that and a couple of other
fixes for i40e in order to apply the patches for those two drivers.
- Alex
* Re: [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads
2016-03-19 2:01 ` Jesse Gross
@ 2016-03-19 2:43 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-19 2:43 UTC (permalink / raw)
To: Jesse Gross
Cc: Ben Hutchings, Alexander Duyck, Edward Cree, Netdev, David Miller,
Tom Herbert
On Fri, Mar 18, 2016 at 7:01 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 6:42 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
>> On Fri, 2016-03-18 at 17:30 -0700, Alexander Duyck wrote:
>>> On Fri, Mar 18, 2016 at 5:18 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
>>> > On Fri, 2016-03-18 at 16:25 -0700, Alexander Duyck wrote:
>>> > > [NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
>>> > > [NETIF_F_GSO_SIT_BIT] = "tx-sit-segmentation",
>>> > > [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
>>> > > + [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
>>> > > + [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
>>> > I think this should be "tx-tunnel-remote-checksum-segmentation", though
>>> > that is getting quite unwieldy.
>>> Right. As it is I think we might be coming up on the 32 character
>>> limit for the strings. Replacing csum with checksum would probably
>>> push us over.
>>
>> Right, I wasn't even thinking about the static limit! That does weigh
>> rather heavily in favour of abbreviation here.
>>
>> Please do at least hyphenate "remcsum" though.
>
> I think that remote checksum offload is just a purely internal feature
> - that is no device will ever expose support for it, since it is
> explicitly to work around lack of hardware support. As a result, I
> don't know if it makes sense to show it through ethtool at all.
That's true. I can probably drop that. The two I cared about were
the GRE and UDP bits anyway.
- Alex
* Re: [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
` (8 preceding siblings ...)
2016-03-18 23:25 ` [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial Alexander Duyck
@ 2016-03-21 18:50 ` David Miller
2016-03-21 19:46 ` Alexander Duyck
9 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2016-03-21 18:50 UTC (permalink / raw)
To: aduyck; +Cc: ecree, netdev, alexander.duyck, tom
From: Alexander Duyck <aduyck@mirantis.com>
Date: Fri, 18 Mar 2016 16:24:38 -0700
> This patch series addresses two things.
>
> First it enables what I am calling RFC6864 compliant GRO. Basically what
> I am doing is allowing one of two patterns for incoming frames. Either the
> IP ID will increment, or if the DF bit is set it can either increment or
> stay the same value. This allows us to perform GRO if the IP ID is forced
> to stay at a static value as may occur if we are replicating an IP header
> instead of actually having it offloaded.
>
> The last 3 patches introduce what I am calling GSO partial. The best way
> to describe it is that it is a GSO offload in which the portion pointed to
> by csum_start must be handled by the hardware, and the region before that
> can be optionally handled. So for example with i40e the only pieces it was
> missing from the full offload was the checksum so this is computed in
> software and the hardware will update inner and outer IP headers. In the
> example for ixgbe the hardware will only update the outer IP header. The
> outer UDP or GRE header and inner IP header are unmodified.
Conceptually I am completely fine with these changes.
> The one concern here is that if the outer IP header does not have
> the DF bit set and does not update the IP ID field we run the risk
> of causing all sorts of problems if the packet is fragmented in
> flight.
I think we absolutely cannot let such a packet be output from our
stack.
* Re: [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload
2016-03-21 18:50 ` [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload David Miller
@ 2016-03-21 19:46 ` Alexander Duyck
2016-03-21 20:10 ` Jesse Gross
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-21 19:46 UTC (permalink / raw)
To: David Miller; +Cc: Alex Duyck, Edward Cree, Netdev, Tom Herbert
On Mon, Mar 21, 2016 at 11:50 AM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <aduyck@mirantis.com>
> Date: Fri, 18 Mar 2016 16:24:38 -0700
>
>> This patch series addresses two things.
>>
>> First it enables what I am calling RFC6864 compliant GRO. Basically what
>> I am doing is allowing one of two patterns for incoming frames. Either the
>> IP ID will increment, or if the DF bit is set it can either increment or
>> stay the same value. This allows us to perform GRO if the IP ID is forced
>> to stay at a static value as may occur if we are replicating an IP header
>> instead of actually having it offloaded.
>>
>> The last 3 patches introduce what I am calling GSO partial. The best way
>> to describe it is that it is a GSO offload in which the portion pointed to
>> by csum_start must be handled by the hardware, and the region before that
>> can be optionally handled. So for example with i40e the only pieces it was
>> missing from the full offload was the checksum so this is computed in
>> software and the hardware will update inner and outer IP headers. In the
>> example for ixgbe the hardware will only update the outer IP header. The
>> outer UDP or GRE header and inner IP header are unmodified.
>
> Conceptually I am completely fine with these changes.
>
>> The one concern here is that if the outer IP header does not have
>> the DF bit set and does not update the IP ID field we run the risk
>> of causing all sorts of problems if the packet is fragmented in
>> flight.
>
> I think we absolutely cannot let such a packet be output from our
> stack.
My concern is that there may be devices doing just that. That is one
of the motivations behind updating the GRO code so that if DF is not
set we refuse to aggregate frames that do not increment the IP ID. If
nothing else we just have to wait and see who comes screaming about
performance regressions once the patch is applied. That will tell us
who is cheating and segmenting frames with the IP ID not being updated
and DF bit not being set. There was already the comment in
inet_gro_receive about how some devices weren't updating this
correctly so we probably need to figure out if those devices are at
least setting the DF bit.
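[Editorial note: the acceptance rule being discussed can be sketched in a few lines of C. This is an illustration of the RFC 6864 policy only, not the actual inet_gro_receive() code; the function name and the segs_merged bookkeeping are invented.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the GRO policy described above: accept a
 * frame whose IP ID increments by one per merged segment, or, only
 * when DF is set, a frame whose IP ID is held at a static value.
 * Frames with DF clear and a non-incrementing ID are refused, since
 * they could not survive fragmentation in flight. */
static bool gro_ip_id_ok(uint16_t first_id, uint16_t new_id,
                         bool df, unsigned int segs_merged)
{
    uint16_t expected = (uint16_t)(first_id + segs_merged);

    if (new_id == expected)
        return true;        /* normal incrementing sender */
    if (df && new_id == first_id)
        return true;        /* static ID is acceptable with DF set */
    return false;           /* refuse to aggregate */
}
```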
The other bit I would be interested in seeing is how this will work
with drivers from other vendors. In the case of the Intel parts the
outer IPv4/IPv6 header doesn't really matter since we zero the length
and recompute the checksum anyway. I'm wondering if it would be
preferred to leave the outer IPv4/IPv6 header unmodified for the
larger frame or to modify it as I currently am so that it matches the
setup for the headers that were going to be transmitted.
- Alex

* Re: [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload
2016-03-21 19:46 ` Alexander Duyck
@ 2016-03-21 20:10 ` Jesse Gross
0 siblings, 0 replies; 51+ messages in thread
From: Jesse Gross @ 2016-03-21 20:10 UTC (permalink / raw)
To: Alexander Duyck
Cc: David Miller, Alex Duyck, Edward Cree, Netdev, Tom Herbert
On Mon, Mar 21, 2016 at 12:46 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 21, 2016 at 11:50 AM, David Miller <davem@davemloft.net> wrote:
>> From: Alexander Duyck <aduyck@mirantis.com>
>> Date: Fri, 18 Mar 2016 16:24:38 -0700
>>
>>> This patch series addresses two things.
>>>
>>> First it enables what I am calling RFC6864 compliant GRO. Basically what
>>> I am doing is allowing one of two patterns for incoming frames. Either the
>>> IP ID will increment, or if the DF bit is set it can either increment or
>>> stay the same value. This allows us to perform GRO if the IP ID is forced
>>> to stay at a static value as may occur if we are replicating an IP header
>>> instead of actually having it offloaded.
>>>
>>> The last 3 patches introduce what I am calling GSO partial. The best way
>>> to describe it is that it is a GSO offload in which the portion pointed to
>>> by csum_start must be handled by the hardware, and the region before that
>>> can be optionally handled. So for example with i40e the only pieces it was
>>> missing from the full offload was the checksum so this is computed in
>>> software and the hardware will update inner and outer IP headers. In the
>>> example for ixgbe the hardware will only update the outer IP header. The
>>> outer UDP or GRE header and inner IP header are unmodified.
>>
>> Conceptually I am completely fine with these changes.
>>
>>> The one concern here is that if the outer IP header does not have
>>> the DF bit set and does not update the IP ID field we run the risk
>>> of causing all sorts of problems if the packet is fragmented in
>>> flight.
>>
>> I think we absolutely cannot let such a packet be output from our
>> stack.
>
> My concern is that there may be devices doing just that. That is one
> of the motivations behind updating the GRO code so that if DF is not
> set we refuse to aggregate frames that do not increment the IP ID. If
> nothing else we just have to wait and see who comes screaming about
> performance regressions once the patch is applied. That will tell us
> who is cheating and segmenting frames with the IP ID not being updated
> and DF bit not being set. There was already the comment in
> inet_gro_receive about how some devices weren't updating this
> correctly so we probably need to figure out if those devices are at
> least setting the DF bit.
It's clearly wrong to send packets without the DF bit and the same IP
ID (but I agree that there may be some devices that are doing this by
accident). However, I don't think that we want the device to be
changing the value of the DF bit to somehow compensate for that.
I was surprised to see that the DF bit is not on for outer IP headers
- it seems like that should at least be the default. If it is turned
off and the device isn't capable of updating the outer IP ID, well,
then it shouldn't be doing TSO.
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
@ 2016-03-22 17:00 ` Edward Cree
2016-03-22 17:47 ` Alexander Duyck
2016-03-23 17:09 ` Tom Herbert
2016-03-28 5:36 ` Jesse Gross
2 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-22 17:00 UTC (permalink / raw)
To: Alexander Duyck, netdev, davem, alexander.duyck, tom
On 18/03/16 23:25, Alexander Duyck wrote:
> This patch adds support for something I am referring to as GSO partial.
> The basic idea is that we can support a broader range of devices for
> segmentation if we use fixed outer headers and have the hardware only
> really deal with segmenting the inner header. The idea behind the naming
> is due to the fact that everything before csum_start will be fixed headers,
> and everything after will be the region that is handled by hardware.
>
> With the current implementation it allows us to add support for the
> following GSO types with an inner TSO or TSO6 offload:
> NETIF_F_GSO_GRE
> NETIF_F_GSO_GRE_CSUM
> NETIF_F_UDP_TUNNEL
> NETIF_F_UDP_TUNNEL_CSUM
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
If I'm correctly understanding what you're doing, you're building a large
TCP segment, feeding it through the encapsulation drivers as normal, then
at GSO time you're fixing up length fields, checksums etc. in the headers.
I think we can do this more simply, by making it so that at the time when
we _generate_ the TCP segment, we give it headers saying it's one MSS big,
but have several MSS of data. Similarly when adding the encap headers,
they all need to get their lengths from what the layer below tells them,
rather than the current length of data in the SKB. Then at GSO time all
the headers already have the right things in, and you don't need to call
any per-protocol GSO callbacks for them.
Any protocol that noticed it was putting something non-copyable in its
headers (e.g. GRE with the Counter field, or an outer IP layer without DF
set needing real IPIDs) would set a flag in the SKB to indicate that we
really do need to call through the per-protocol GSO stuff. (Ideally, if
we had a separate skb->gso_start field rather than piggybacking on
csum_start, we could reset it to point just before us, so that any further
headers outside us still can be copied rather than taking callbacks. But
I'm not sure whether that's worth using up sk_buff real estate for.)
(It might still be necessary to put the true length in the TCP header, if
hardware is using that as an input to segmentation. I think sfc hardware
just uses 'total length of all payload DMA descriptors', but others might
behave differently.)
However, I haven't yet had the time to attempt to implement this, so there
might be some obvious reason I'm missing why this is impossible.
Also, it's possible that I've completely misunderstood your patch and it's
orthogonal to and can coexist with what I'm suggesting.
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 17:00 ` Edward Cree
@ 2016-03-22 17:47 ` Alexander Duyck
2016-03-22 19:40 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-22 17:47 UTC (permalink / raw)
To: Edward Cree; +Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 18/03/16 23:25, Alexander Duyck wrote:
>> This patch adds support for something I am referring to as GSO partial.
>> The basic idea is that we can support a broader range of devices for
>> segmentation if we use fixed outer headers and have the hardware only
>> really deal with segmenting the inner header. The idea behind the naming
>> is due to the fact that everything before csum_start will be fixed headers,
>> and everything after will be the region that is handled by hardware.
>>
>> With the current implementation it allows us to add support for the
>> following GSO types with an inner TSO or TSO6 offload:
>> NETIF_F_GSO_GRE
>> NETIF_F_GSO_GRE_CSUM
>> NETIF_F_UDP_TUNNEL
>> NETIF_F_UDP_TUNNEL_CSUM
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
> If I'm correctly understanding what you're doing, you're building a large
> TCP segment, feeding it through the encapsulation drivers as normal, then
> at GSO time you're fixing up length fields, checksums etc. in the headers.
> I think we can do this more simply, by making it so that at the time when
> we _generate_ the TCP segment, we give it headers saying it's one MSS big,
> but have several MSS of data. Similarly when adding the encap headers,
> they all need to get their lengths from what the layer below tells them,
> rather than the current length of data in the SKB. Then at GSO time all
> the headers already have the right things in, and you don't need to call
> any per-protocol GSO callbacks for them.
One issue I have to deal with here is that we have no way of knowing
what the underlying hardware can support at the time of segment being
created. You have to keep in mind that what we have access to is the
tunnel dev in many cases, not the underlying dev so we don't know if
things can be offloaded to hardware or not. By pushing this logic
into the GSO code we can actually implement it without much overhead
since we either segment it into an MSS multiple, or into single MSS
sized chunks. This way we defer the decision until the very last
moment when we actually know if we can offload some portion of this in
hardware or not.
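[Editorial note: the two outcomes of that deferred decision can be made concrete with a hypothetical helper. The name and structure are invented; the real decision is driven by the GSO code and driver feature flags.]

```c
#include <stdbool.h>

/* If the egress device turns out to support partial offload, emit the
 * payload trimmed to an exact MSS multiple and let hardware segment
 * it; otherwise fall back to single-MSS software segments.  Returns
 * the payload length of the first frame to emit. */
static unsigned int first_seg_payload(unsigned int payload,
                                      unsigned int mss,
                                      bool hw_partial)
{
    if (hw_partial && payload > mss)
        return payload - (payload % mss);   /* MSS multiple for hardware */
    return payload < mss ? payload : mss;   /* plain software GSO chunk */
}
```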
> Any protocol that noticed it was putting something non-copyable in its
> headers (e.g. GRE with the Counter field, or an outer IP layer without DF
> set needing real IPIDs) would set a flag in the SKB to indicate that we
> really do need to call through the per-protocol GSO stuff. (Ideally, if
> we had a separate skb->gso_start field rather than piggybacking on
> csum_start, we could reset it to point just before us, so that any further
> headers outside us still can be copied rather than taking callbacks. But
> I'm not sure whether that's worth using up sk_buff real estate for.)
The idea behind piggybacking on csum_start was due to the fact that we
cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I
know in the case of TCP offloads this always ends up being the
inner-most L4 header so it works out in that it actually reduces code
path as we were having to deal with all the skb->encapsulation checks.
It was a relationship that already existed, I just decided to make use
of it since it simplifies things pretty significantly.
As far as retreating goes, I don't really see how that would work. In most
cases it is an all-or-nothing proposition to setup these outer
headers. Either we can segment the frame with the outer headers
replicated or we cannot. I suspect it would end up being a common
case where the hardware will update the outer IP and inner TCP
headers, but I think the outer L4 and inner IP headers will be the
ones that most likely always end up being static. Also we already
have code paths in place in the GRE driver for instance that prevent
us from using GSO in the case of TUNNEL_SEQ being enabled.
> (It might still be necessary to put the true length in the TCP header, if
> hardware is using that as an input to segmentation. I think sfc hardware
> just uses 'total length of all payload DMA descriptors', but others might
> behave differently.)
That is what most drivers do. The way I kind of retained that is that
the TCP header doesn't include an actual length field, but I left the
pseudo-header using the full length of all data. My thought was to
end up using something like the ixgbe approach for most devices. What
I did there was replicate the tunnel headers and inner IPv4 or IPv6
header. In the case of ixgbe and i40e I can throw away the checksum
and length values for the outer IP header; one thing I was curious
about is whether I really needed to retain the full packet size for those.
> However, I haven't yet had the time to attempt to implement this, so there
> might be some obvious reason I'm missing why this is impossible.
> Also, it's possible that I've completely misunderstood your patch and it's
> orthogonal to and can coexist with what I'm suggesting.
The one piece I could really use would be an understanding of what
inputs your hardware is expecting in order for us to extend TSO to
support this kind of approach. Then I could start tailoring the
output generated so that we had something that would work with more
devices. I was thinking the approach I have taken is fairly generic
since essentially it allows us to get away with TSO as long as we are
allowed to provide the offsets for the IP header and the TCP header.
From what I can tell it looks like the Solarflare drivers do something
similar so you might even try making changes similar to what I did for
ixgbe to see if you can get a proof of concept working for sfc.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 17:47 ` Alexander Duyck
@ 2016-03-22 19:40 ` Edward Cree
2016-03-22 20:11 ` Jesse Gross
2016-03-22 21:38 ` Alexander Duyck
0 siblings, 2 replies; 51+ messages in thread
From: Edward Cree @ 2016-03-22 19:40 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On 22/03/16 17:47, Alexander Duyck wrote:
> On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote:
>> On 18/03/16 23:25, Alexander Duyck wrote:
>>> This patch adds support for something I am referring to as GSO partial.
>>> The basic idea is that we can support a broader range of devices for
>>> segmentation if we use fixed outer headers and have the hardware only
>>> really deal with segmenting the inner header. The idea behind the naming
>>> is due to the fact that everything before csum_start will be fixed headers,
>>> and everything after will be the region that is handled by hardware.
>>>
>>> With the current implementation it allows us to add support for the
>>> following GSO types with an inner TSO or TSO6 offload:
>>> NETIF_F_GSO_GRE
>>> NETIF_F_GSO_GRE_CSUM
>>> NETIF_F_UDP_TUNNEL
>>> NETIF_F_UDP_TUNNEL_CSUM
>>>
>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>>> ---
>> If I'm correctly understanding what you're doing, you're building a large
>> TCP segment, feeding it through the encapsulation drivers as normal, then
>> at GSO time you're fixing up length fields, checksums etc. in the headers.
>> I think we can do this more simply, by making it so that at the time when
>> we _generate_ the TCP segment, we give it headers saying it's one MSS big,
>> but have several MSS of data. Similarly when adding the encap headers,
>> they all need to get their lengths from what the layer below tells them,
>> rather than the current length of data in the SKB. Then at GSO time all
>> the headers already have the right things in, and you don't need to call
>> any per-protocol GSO callbacks for them.
> One issue I have to deal with here is that we have no way of knowing
> what the underlying hardware can support at the time of segment being
> created. You have to keep in mind that what we have access to is the
> tunnel dev in many cases, not the underlying dev so we don't know if
> things can be offloaded to hardware or not. By pushing this logic
> into the GSO code we can actually implement it without much overhead
> since we either segment it into an MSS multiple, or into single MSS
> sized chunks. This way we defer the decision until the very last
> moment when we actually know if we can offload some portion of this in
> hardware or not.
But won't the tunnel dev have the feature flag for GSO_PARTIAL depending
on what the underlying dev advertises? (Or, at least, could we make it
be that way?)
Alternatively, have per-protocol GSO callbacks to do the fixup in the
opposite direction to what you have now - in the long term we hope that
hardware supporting GSO partial will become the common case, so that
should be the fast path without bouncing backwards through GSO callbacks.
Then, if you find out at GSO time that the hardware wants to do old-style
TSO, you call those callbacks so as to give it a superframe with the long
lengths filled in everywhere. (Even that might not be necessary; it's a
question of whether hardware makes assumptions about what those fields
contain when folding its packet edits into checksums. Since this is
going to be driver-specific and drivers doing these things will have a
fixed list of what encaps they can parse and therefore do this for, maybe
all these fixups could be done by the driver - using common helper
functions, of course - in its TSO path.)
>> Any protocol that noticed it was putting something non-copyable in its
>> headers (e.g. GRE with the Counter field, or an outer IP layer without DF
>> set needing real IPIDs) would set a flag in the SKB to indicate that we
>> really do need to call through the per-protocol GSO stuff. (Ideally, if
>> we had a separate skb->gso_start field rather than piggybacking on
>> csum_start, we could reset it to point just before us, so that any further
>> headers outside us still can be copied rather than taking callbacks. But
>> I'm not sure whether that's worth using up sk_buff real estate for.)
> The idea behind piggybacking on csum_start was due to the fact that we
> cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I
> know in the case of TCP offloads this always ends up being the
> inner-most L4 header so it works out in that it actually reduces code
> path as we were having to deal with all the skb->encapsulation checks.
> It was a relationship that already existed, I just decided to make use
> of it since it simplifies things pretty significantly.
Yes; it's a clever idea. Only trouble is that we really want the inner IP
header rather than the inner TCP header, so that we can (if we want to)
increment the inner IP IDs. Of course, if we Officially Don't Care about
inner IP IDs that's not a problem.
I wonder if we could just always fill in inner_network_header even if we're
not doing encapsulation. Or does it end up pointing to a 'middle' header
in the case of nested encap?
> As far as retreating I don't really see how that would work. In most
> cases it is an all-or-nothing proposition to setup these outer
> headers. Either we can segment the frame with the outer headers
> replicated or we cannot. I suspect it would end up being a common
> case where the hardware will update the outer IP and inner TCP
> headers, but I think the outer L4 and inner IP headers will be the
> ones that most likely always end up being static.
Having thought a bit more about this, I think supporting anything other
than "hardware updates inner [IP and] TCPheaders" is needlessly complex
(well, we still have to handle "hardware updates everything 'cos it
thinks it knows best", because that already exists in the wild in
hardware that might not support the new way). I don't think there's
likely to be a case where hardware can do half of the segmentation at
the same time as copying headers for the other half.
I also still don't see why hardware would want to update the outer IP
header - can you explain?
> Also we already
> have code paths in place in the GRE driver for instance that prevent
> us from using GSO in the case of TUNNEL_SEQ being enabled.
Oh good, one less thing to worry about.
>> (It might still be necessary to put the true length in the TCP header, if
>> hardware is using that as an input to segmentation. I think sfc hardware
>> just uses 'total length of all payload DMA descriptors', but others might
>> behave differently.)
> That is what most drivers do. The way I kind of retained that is that
> the TCP header doesn't include an actual length field, but I left the
> pseudo-header using the full length of all data.
But then you're guaranteed to have to update the outer L4 checksum when
you segment (because outer LCO reads the inner pseudo-header checksum).
Why not use the single-segment length in the pseudo-header, then the
outer L4 checksum is already the right thing? (And if your hardware
can't be told to leave the outer L4 checksum alone, then it's not worth
the trouble of trying to support GSO partial for it, since it clearly
wants to do old-style "NIC knows best" TSO.)
Then if the hardware is assuming the (inner) pseudo is using the full
length, and is going to include that edit in its checksum calculation,
you can just do the opposite edit in the driver, just before handing
the packet off to the hardware.
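[Editorial note: the "opposite edit" suggested here amounts to a ones'-complement substitution of one length for another in the stored pseudo-header checksum. A hedged sketch follows; the helper name is invented, and the kernel does the equivalent with its csum helpers.]

```c
#include <stdint.h>

/* Swap old_len for new_len inside a 16-bit ones'-complement sum, as a
 * driver might do to convert a pseudo-header checksum computed for a
 * single-MSS length into one covering the full payload (or back). */
static uint16_t pseudo_len_fixup(uint16_t csum,
                                 uint16_t old_len, uint16_t new_len)
{
    uint32_t sum = csum;

    sum += new_len;
    sum += (uint16_t)~old_len;              /* ones'-complement subtract */
    sum = (sum & 0xffff) + (sum >> 16);     /* fold carries back in */
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Note the round trip is exact: applying the fixup with the lengths swapped recovers the original sum.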
Again, the idea is that we optimise for GSO partial by making it a plain
header copy everywhere, and put all the 'fix things up' on the _other_
path.
And yes, I forgot (and keep forgetting) that the TCP header doesn't have
an explicit length field.
> My thought was to
> end up using something like the ixgbe approach for most devices. What
> I did there was replicate the tunnel headers and inner IPv4 or IPv6
> header. In the case of ixgbe and i40e I can throw away the checksum
> and length values for the outer IP header, one thing I was curious
> about is if I really needed to retain the full packet size for those.
Again, the outer IP header should be computed for a single segment
rather than for the superframe, so that it doesn't need to be edited
later. It should be possible to send a "GSO partial" frame to TSO
without a single GSO callback needing to be called; similarly,
software GSO should be able to just copy the outer headers, and only
need to know how to update the TCP header. (See below for my "what
a NIC should do" TSO design, which software can easily emulate.)
>> However, I haven't yet had the time to attempt to implement this, so there
>> might be some obvious reason I'm missing why this is impossible.
>> Also, it's possible that I've completely misunderstood your patch and it's
>> orthogonal to and can coexist with what I'm suggesting.
> The one piece I could really use would be an understanding of what
> inputs your hardware is expecting in order for us to extend TSO to
> support this kind of approach. Then I could start tailoring the
> output generated so that we had something that would work with more
> devices. I was thinking the approach I have taken is fairly generic
> since essentially it allows us to get away with TSO as long as we are
> allowed to provide the offsets for the IP header and the TCP header.
> From what I can tell it looks like the Solarflare drivers do something
> similar so you might even try making changes similar to what I did for
> ixgbe to see if you can get a proof of concept working for sfc.
So, this is all still slightly speculative because while I've talked to
some of our firmware developers, we haven't got as far as actually writing
the new firmware. I'd also like to make clear that this isn't "what
Solarflare has officially decided to do"; rather it's "what I'm currently
trying to convince people at Solarflare to do".
But what I think we're going to end up with is this:
The kernel will give us a packet that looks like a single MSS-sized segment
except that the payload is too long; the length fields in all the headers
are for an MSS-sized segment, and the checksums are correct for that
(except that the inner TCP checksum is, of course, the pseudo-header sum
rather than a sum over the whole payload). The kernel will also tell us
where in the packet the inner IP header begins. The driver will then give
the following descriptors to the hardware:
* A TSO descriptor, containing the offset of the inner IP header, and the
MSS to use for segmentation.
* A DMA descriptor containing all the headers (i.e. up to the end of the
inner TCP header).
* A series of DMA descriptors containing the payload, with a total length
divisible by the MSS we thought of earlier.
The NIC can now read IHL from the inner IP header, and thereby compute the
offset of the inner TCP header, and the csum_start/offset values.
Then for each MSS-sized block of payload, the NIC will do the following:
* transmit header + payload block
* increment inner IP ID, and decrement inner IP checksum (ones-complement)
* add MSS to TCP sequence number
I believe this is something that any NIC with TSO support should be able to
learn to do, with appropriate firmware changes. It might be a while before
there are NICs in the wild that can do this, though.
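[Editorial note: a software emulation of that per-segment step might look like the following. The structure and names are invented for illustration; this covers only the header edits, not the descriptor or DMA plumbing.]

```c
#include <stdint.h>

struct seg_state {
    uint16_t ip_id;     /* inner IP ID */
    uint16_t ip_check;  /* inner IP header checksum */
    uint32_t tcp_seq;   /* inner TCP sequence number */
};

/* After emitting one MSS-sized block: bump the inner IP ID, fold the
 * +1 into the IP checksum in ones'-complement arithmetic, and advance
 * the TCP sequence number by one MSS. */
static void seg_advance(struct seg_state *s, uint16_t mss)
{
    uint32_t sum = (uint16_t)~s->ip_check;  /* recover the raw sum */

    s->ip_id++;
    sum += 1;                               /* the ID field grew by 1 */
    sum = (sum & 0xffff) + (sum >> 16);     /* fold any carry */
    s->ip_check = (uint16_t)~sum;
    s->tcp_seq += mss;
}
```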
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 19:40 ` Edward Cree
@ 2016-03-22 20:11 ` Jesse Gross
2016-03-22 20:17 ` David Miller
2016-03-22 21:38 ` Alexander Duyck
1 sibling, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-22 20:11 UTC (permalink / raw)
To: Edward Cree
Cc: Alexander Duyck, Alexander Duyck, Netdev, David Miller,
Tom Herbert
On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 22/03/16 17:47, Alexander Duyck wrote:
>> On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 18/03/16 23:25, Alexander Duyck wrote:
>>>> This patch adds support for something I am referring to as GSO partial.
>>>> The basic idea is that we can support a broader range of devices for
>>>> segmentation if we use fixed outer headers and have the hardware only
>>>> really deal with segmenting the inner header. The idea behind the naming
>>>> is due to the fact that everything before csum_start will be fixed headers,
>>>> and everything after will be the region that is handled by hardware.
>>>>
>>>> With the current implementation it allows us to add support for the
>>>> following GSO types with an inner TSO or TSO6 offload:
>>>> NETIF_F_GSO_GRE
>>>> NETIF_F_GSO_GRE_CSUM
>>>> NETIF_F_UDP_TUNNEL
>>>> NETIF_F_UDP_TUNNEL_CSUM
>>>>
>>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>>>> ---
>>> If I'm correctly understanding what you're doing, you're building a large
>>> TCP segment, feeding it through the encapsulation drivers as normal, then
>>> at GSO time you're fixing up length fields, checksums etc. in the headers.
>>> I think we can do this more simply, by making it so that at the time when
>>> we _generate_ the TCP segment, we give it headers saying it's one MSS big,
>>> but have several MSS of data. Similarly when adding the encap headers,
>>> they all need to get their lengths from what the layer below tells them,
>>> rather than the current length of data in the SKB. Then at GSO time all
>>> the headers already have the right things in, and you don't need to call
>>> any per-protocol GSO callbacks for them.
>> One issue I have to deal with here is that we have no way of knowing
>> what the underlying hardware can support at the time of segment being
>> created. You have to keep in mind that what we have access to is the
>> tunnel dev in many cases, not the underlying dev so we don't know if
>> things can be offloaded to hardware or not. By pushing this logic
>> into the GSO code we can actually implement it without much overhead
>> since we either segment it into an MSS multiple, or into single MSS
>> sized chunks. This way we defer the decision until the very last
>> moment when we actually know if we can offload some portion of this in
>> hardware or not.
> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending
> on what the underlying dev advertises? (Or, at least, could we make it
> be that way?)
Features that have been designed this way in the past are usually
pretty fragile. Not only would you have to track changes in the
routing table but you could have bridges, tc, vlan devices, etc. all
of which might change the path of the packet and would have to somehow
propagate this information. It's much more robust to resolve the
device capabilities just before you hand the packet to that device.
Plus, anything along the path of the packet (iptables, for example)
that looks at the headers might potentially need to be aware of this
optimization.
You're also assuming that the generating TCP stack is resident on the
same machine as the device that does the offloads. That's not
necessarily true in the case of VMs or remote senders whose packets
have been GRO'ed.
Keeping the core stack consistent and just handling this at the
GRO/driver layer as Alex has here seems preferable to me.
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 20:11 ` Jesse Gross
@ 2016-03-22 20:17 ` David Miller
0 siblings, 0 replies; 51+ messages in thread
From: David Miller @ 2016-03-22 20:17 UTC (permalink / raw)
To: jesse; +Cc: ecree, alexander.duyck, aduyck, netdev, tom
From: Jesse Gross <jesse@kernel.org>
Date: Tue, 22 Mar 2016 13:11:21 -0700
> Features that have been designed this way in the past are usually
> pretty fragile. Not only would you have to track changes in the
> routing table but you could have bridges, tc, vlan devices, etc. all
> of which might change the path of the packet and would have to somehow
> propagate this information. It's much more robust to resolve the
> device capabilities just before you hand the packet to that device.
> Plus, anything along the path of the packet (iptables, for example)
> that looks at the headers might potentially need to be aware of this
> optimization.
Indeed, this is a major fundamental issue in our stack right now. I
keep being reminded of that ugly change we had to make to accommodate
scatter-gather limitations for Infiniband devices, exactly because
properties don't propagate properly through all of the layers right
now.
But we have to solve this somehow, the packetizer has to know certain
basic properties of the ultimate device in order to function properly.
This requirement is unavoidable.
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 19:40 ` Edward Cree
2016-03-22 20:11 ` Jesse Gross
@ 2016-03-22 21:38 ` Alexander Duyck
2016-03-23 16:27 ` Edward Cree
1 sibling, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-22 21:38 UTC (permalink / raw)
To: Edward Cree; +Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 22/03/16 17:47, Alexander Duyck wrote:
>> On Tue, Mar 22, 2016 at 10:00 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 18/03/16 23:25, Alexander Duyck wrote:
>>>> This patch adds support for something I am referring to as GSO partial.
>>>> The basic idea is that we can support a broader range of devices for
>>>> segmentation if we use fixed outer headers and have the hardware only
>>>> really deal with segmenting the inner header. The idea behind the naming
>>>> is due to the fact that everything before csum_start will be fixed headers,
>>>> and everything after will be the region that is handled by hardware.
>>>>
>>>> With the current implementation it allows us to add support for the
>>>> following GSO types with an inner TSO or TSO6 offload:
>>>> NETIF_F_GSO_GRE
>>>> NETIF_F_GSO_GRE_CSUM
>>>> NETIF_F_UDP_TUNNEL
>>>> NETIF_F_UDP_TUNNEL_CSUM
>>>>
>>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>>>> ---
>>> If I'm correctly understanding what you're doing, you're building a large
>>> TCP segment, feeding it through the encapsulation drivers as normal, then
>>> at GSO time you're fixing up length fields, checksums etc. in the headers.
>>> I think we can do this more simply, by making it so that at the time when
>>> we _generate_ the TCP segment, we give it headers saying it's one MSS big,
>>> but have several MSS of data. Similarly when adding the encap headers,
>>> they all need to get their lengths from what the layer below tells them,
>>> rather than the current length of data in the SKB. Then at GSO time all
>>> the headers already have the right things in, and you don't need to call
>>> any per-protocol GSO callbacks for them.
>> One issue I have to deal with here is that we have no way of knowing
>> what the underlying hardware can support at the time of segment being
>> created. You have to keep in mind that what we have access to is the
>> tunnel dev in many cases, not the underlying dev so we don't know if
>> things can be offloaded to hardware or not. By pushing this logic
>> into the GSO code we can actually implement it without much overhead
>> since we either segment it into an MSS multiple, or into single MSS
>> sized chunks. This way we defer the decision until the very last
>> moment when we actually know if we can offload some portion of this in
>> hardware or not.
> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending
> on what the underlying dev advertises? (Or, at least, could we make it
> be that way?)
This stuff doesn't work. That is why tunnels now advertise all
available features that can be offloaded via software. Basically if
we can advertise a feature we do, and then we sort things out in
software if we cannot actually do it in hardware.
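The "segment it into an MSS multiple, or into single MSS sized chunks"
split I mentioned above is just integer arithmetic; a minimal sketch
(the helper name is mine, not a kernel function):

```c
#include <assert.h>

/* Sketch of the GSO partial trim: hand the hardware the largest chunk
 * whose length is an exact multiple of the MSS; anything left over is
 * segmented normally in software. Illustrative only.
 */
static unsigned int gso_partial_len(unsigned int payload_len,
				    unsigned int mss)
{
	return (payload_len / mss) * mss;
}
```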
> Alternatively, have per-protocol GSO callbacks to do the fixup in the
> opposite direction to what you have now - in the long term we hope that
> hardware supporting GSO partial will become the common case, so that
> should be the fast path without bouncing backwards through GSO callbacks.
> Then, if you find out at GSO time that the hardware wants to do old-style
> TSO, you call those callbacks so as to give it a superframe with the long
> lengths filled in everywhere. (Even that might not be necessary; it's a
> question of whether hardware makes assumptions about what those fields
> contain when folding its packet edits into checksums. Since this is
> going to be driver-specific and drivers doing these things will have a
> fixed list of what encaps they can parse and therefore do this for, maybe
> all these fixups could be done by the driver - using common helper
> functions, of course - in its TSO path.)
I thought about doing that but decided it was much simpler to simply
update all headers. For now I want to keep this as simple as possible
while we get the first few drivers on board. If we later want to
optimize and add complexity we can go that route, but for now this
change is more than fast enough as it allows me to push an i40e at
20Gb/s while sending frames with outer checksums enabled.
>>> Any protocol that noticed it was putting something non-copyable in its
>>> headers (e.g. GRE with the Counter field, or an outer IP layer without DF
>>> set needing real IPIDs) would set a flag in the SKB to indicate that we
>>> really do need to call through the per-protocol GSO stuff. (Ideally, if
>>> we had a separate skb->gso_start field rather than piggybacking on
>>> csum_start, we could reset it to point just before us, so that any further
>>> headers outside us still can be copied rather than taking callbacks. But
>>> I'm not sure whether that's worth using up sk_buff real estate for.)
>> The idea behind piggybacking on csum_start was due to the fact that we
>> cannot perform GSO/TSO unless CHECKSUM_PARTIAL is set. As far as I
>> know in the case of TCP offloads this always ends up being the
>> inner-most L4 header so it works out in that it actually reduces code
>> path as we were having to deal with all the skb->encapsulation checks.
>> It was a relationship that already existed, I just decided to make use
>> of it since it simplifies things pretty significantly.
> Yes; it's a clever idea. Only trouble is that we really want the inner IP
> header rather than the inner TCP header, so that we can (if we want to)
> increment the inner IP IDs. Of course, if we Officially Don't Care about
> inner IP IDs that's not a problem.
The inner IP IDs are the ones that are guaranteed to be the ones we
can leave fixed since TCP will require that the DF bit be set. The
current VXLAN implementation does not set DF for the outer headers.
So really we don't have too many options right now if we are wanting
to support tunnels.
> I wonder if we could just always fill in inner_network_header even if we're
> not doing encapsulation. Or does it end up pointing to a 'middle' header
> in the case of nested encap?
Right now nested encap is not supported because tunnels don't
advertise hw_encap_features.
>> As far as retreating I don't really see how that would work. In most
>> cases it is an all-or-nothing proposition to setup these outer
>> headers. Either we can segment the frame with the outer headers
>> replicated or we cannot. I suspect it would end up being a common
>> case where the hardware will update the outer IP and inner TCP
>> headers, but I think the outer L4 and inner IP headers will be the
>> ones that most likely always end up being static.
> Having thought a bit more about this, I think supporting anything other
> than "hardware updates inner [IP and] TCP headers" is needlessly complex
> (well, we still have to handle "hardware updates everything 'cos it
> thinks it knows best", because that already exists in the wild in
> hardware that might not support the new way). I don't think there's
> likely to be a case where hardware can do half of the segmentation at
> the same time as copying headers for the other half.
The ixgbe approach skips the inner IP header. The inner IPv4 ID field
can be fixed since TCP requires the DF bit be set.
> I also still don't see why hardware would want to update the outer IP
> header - can you explain?
Inner IP header has DF bit set. As such we can ignore IPv4 ID field.
The outer header will not. As such RFC6864 says we have to increment
it since it could be fragmented.
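The rule being enforced here fits in a few lines; a toy version of the
check (names are mine, not the GRO code's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy version of the RFC 6864 style check described above: when DF is
 * set the IPv4 ID may stay fixed or increment between frames; when DF
 * is clear the ID must increment, because the packet may legitimately
 * be fragmented in flight and then the ID matters. Illustrative only.
 */
static bool ip_id_ok(uint16_t prev_id, uint16_t next_id, bool df_set)
{
	uint16_t delta = (uint16_t)(next_id - prev_id);

	if (df_set)
		return delta == 0 || delta == 1;
	return delta == 1;
}
```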
>> Also we already
>> have code paths in place in the GRE driver for instance that prevent
>> us from using GSO in the case of TUNNEL_SEQ being enabled.
> Oh good, one less thing to worry about.
>
>>> (It might still be necessary to put the true length in the TCP header, if
>>> hardware is using that as an input to segmentation. I think sfc hardware
>>> just uses 'total length of all payload DMA descriptors', but others might
>>> behave differently.)
>> That is what most drivers do. The way I kind of retained that is that
>> the TCP header doesn't include an actual length field, but I left the
>> pseudo-header using the full length of all data.
> But then you're guaranteed to have to update the outer L4 checksum when
> you segment (because outer LCO reads the inner pseudo-header checksum).
Yes, but the LCO checksum is computed when we are constructing the
headers. I have already taken that into account in my patches.
> Why not use the single-segment length in the pseudo-header, then the
> outer L4 checksum is already the right thing? (And if your hardware
> can't be told to leave the outer L4 checksum alone, then it's not worth
> the trouble of trying to support GSO partial for it, since it clearly
> wants to do old-style "NIC knows best" TSO.)
The problem then becomes that I need special case code: one for
standard TSO and one for this special case. If I leave it as is I can
use the same code to cancel out the length in the TCP pseudo-header
checksum for either case.
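In ones-complement arithmetic, cancelling the length back out of a
seeded pseudo-header checksum is just an add of the complement; a
minimal sketch, with helper names that are mine rather than the
kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit ones-complement accumulator down to 16 bits. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Remove a length that was previously summed into a checksum seed:
 * in ones-complement terms, subtracting b is the same as adding ~b.
 * This is the "cancel out the length" step, illustrative only.
 */
static uint16_t csum_sub_len(uint16_t seed, uint16_t len)
{
	return csum_fold16((uint32_t)seed + (uint16_t)~len);
}
```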
> Then if the hardware is assuming the (inner) pseudo is using the full
> length, and is going to include that edit in its checksum calculation,
> you can just do the opposite edit in the driver, just before handing
> the packet off to the hardware.
I want to keep the scope of this change limited. By keeping the inner
TCP checksum calculation working the same way I don't have to write up
some new function.
The general idea of my approach is to treat the UDP or GRE header
to the inner IP header as an IP header extension type value. That
will allow us to trick most hardware into being able to accept it.
> Again, the idea is that we optimise for GSO partial by making it a plain
> header copy everywhere, and put all the 'fix things up' on the _other_
> path.
That is what I went for. The only part that isn't a plain header copy
is the TCP pseudo-header checksum.
> And yes, I forgot (and keep forgetting) that the TCP header doesn't have
> an explicit length field.
Right. That is one of the other reasons for not wanting to pull that
length out: the actual length is the only real length value we can
track with the TCP header, since it doesn't contain one of its own.
>> My thought was to
>> end up using something like the ixgbe approach for most devices. What
>> I did there was replicate the tunnel headers and inner IPv4 or IPv6
>> header. In the case of ixgbe and i40e I can throw away the checksum
>> and length values for the outer IP header, one thing I was curious
>> about is if I really needed to retain the full packet size for those.
> Again, the outer IP header should be computed for a single segment
> rather than for the superframe, so that it doesn't need to be edited
> later. It should be possible to send a "GSO partial" frame to TSO
> without a single GSO callback needing to be called; similarly,
> software GSO should be able to just copy the outer headers, and only
> need to know how to update the TCP header. (See below for my "what
> a NIC should do" TSO design, which software can easily emulate.)
>>> However, I haven't yet had the time to attempt to implement this, so there
>>> might be some obvious reason I'm missing why this is impossible.
>>> Also, it's possible that I've completely misunderstood your patch and it's
>>> orthogonal to and can coexist with what I'm suggesting.
>> The one piece I could really use would be an understanding of what
>> inputs your hardware is expecting in order for us to extend TSO to
>> support this kind of approach. Then I could start tailoring the
>> output generated so that we had something that would work with more
>> devices. I was thinking the approach I have taken is fairly generic
>> since essentially it allows us to get away with TSO as long as we are
>> allowed to provide the offsets for the IP header and the TCP header.
>> From what I can tell it looks like the Solarflare drivers do something
>> similar so you might even try making changes similar to what I did for
>> ixgbe to see if you can get a proof of concept working for sfc.
> So, this is all still slightly speculative because while I've talked to
> some of our firmware developers, we haven't got as far as actually writing
> the new firmware. I'd also like to make clear that this isn't "what
> Solarflare has officially decided to do"; rather it's "what I'm currently
> trying to convince people at Solarflare to do".
> But what I think we're going to end up with is this:
>
> The kernel will give us a packet that looks like a single MSS-sized segment
> except that the payload is too long; the length fields in all the headers
> are for an MSS-sized segment, and the checksums are correct for that
> (except that the inner TCP checksum is, of course, the pseudo-header sum
> rather than a sum over the whole payload). The kernel will also tell us
> where in the packet the inner IP header begins. The driver will then give
> the following descriptors to the hardware:
> * A TSO descriptor, containing the offset of the inner IP header, and the
> MSS to use for segmentation.
> * A DMA descriptor containing all the headers (i.e. up to the end of the
> inner TCP header).
> * A series of DMA descriptors containing the payload, with a total length
> divisible by the MSS we thought of earlier.
> The NIC can now read IHL from the inner IP header, and thereby compute the
> offset of the inner TCP header, and the csum_start/offset values.
> Then for each MSS-sized block of payload, the NIC will do the following:
> * transmit header + payload block
> * increment inner IP ID, and decrement inner IP checksum (ones-complement)
> * add MSS to TCP sequence number
>
> I believe this is something that any NIC with TSO support should be able to
> learn to do, with appropriate firmware changes. It might be a while before
> there are NICs in the wild that can do this, though.
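Concretely, the per-segment edit sequence described above amounts to
something like the following (plain C with made-up names; the checksum
fixup is the RFC 1624 incremental update):

```c
#include <assert.h>
#include <stdint.h>

struct seg_state {
	uint16_t ip_id;		/* inner IPv4 ID */
	uint16_t ip_csum;	/* stored IPv4 header checksum (~sum) */
	uint32_t tcp_seq;	/* inner TCP sequence number */
};

static uint16_t csum16_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffff) + (s >> 16));
}

/* Advance the header state by one MSS-sized segment: bump the IP ID,
 * patch the stored checksum incrementally (RFC 1624:
 * HC' = ~(~HC + ~m + m')), and add the MSS to the sequence number.
 */
static void next_segment(struct seg_state *st, uint16_t mss)
{
	uint16_t old_id = st->ip_id;
	uint16_t new_id = (uint16_t)(old_id + 1);

	st->ip_csum = (uint16_t)~csum16_add(csum16_add((uint16_t)~st->ip_csum,
						       (uint16_t)~old_id),
					    new_id);
	st->ip_id = new_id;
	st->tcp_seq += mss;
}
```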
The only issue I see is that the expectation that the outer headers go
untouched is problematic. The outer headers are where things like
fragmentation will occur in transit. In addition I suspect a number
of devices other than the Intel NICs probably use the network header
to determine where to insert L2 tags such as VLAN.
Like I have said with the current solution I could probably update
igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this.
That covers pretty much the entire Intel series of drivers in terms of
enterprise. The question is what do I need to enable to support other
drivers. It doesn't seem like there would be too much but the bit I
would need to know is what the expectations are when computing outer
IPv4 checksums.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-22 21:38 ` Alexander Duyck
@ 2016-03-23 16:27 ` Edward Cree
2016-03-23 18:06 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-23 16:27 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On 22/03/16 21:38, Alexander Duyck wrote:
> On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote:
>> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending
>> on what the underlying dev advertises? (Or, at least, could we make it
>> be that way?)
> This stuff doesn't work. That is why tunnels now advertise all
> available features that can be offloaded via software. Basically if
> we can advertise a feature we do, and then we sort things out in
> software if we cannot actually do it in hardware.
Fair enough; then go with the other approach:
>> Alternatively, have per-protocol GSO callbacks to do the fixup in the
>> opposite direction to what you have now - in the long term we hope that
>> hardware supporting GSO partial will become the common case, so that
>> should be the fast path without bouncing backwards through GSO callbacks.
>> Then, if you find out at GSO time that the hardware wants to do old-style
>> TSO, you call those callbacks so as to give it a superframe with the long
>> lengths filled in everywhere.
> I thought about doing that but decided it was much simpler to simply
> update all headers. For now I want to keep this as simple as possible
> while we get the first few drivers on board. If we later want to
> optimize and add complexity we can go that route, but for now this
> change is more than fast enough as it allows me to push an i40e at
> 20Gb/s while sending frames with outer checksums enabled.
My belief is that my way is (in the long run) simpler: ultimately it gets
rid of per-protocol GSO callbacks entirely. Every header gets written
correctly* when the packet initially traverses the stack, and then at
transmit time you either hand that off to TSO, or do the equivalent thing
in software: segment at the TCP layer while treating everything above it
as an opaque pseudo-L2 header.
*That is, correctly for a single segment, rather than correctly for the
superframe.
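The software path for that can be modelled in a few lines; a toy
sketch of the idea (not skb_segment(), just an illustration):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the "opaque pseudo-L2 header" segmentation described
 * above: every segment is a verbatim copy of the entire header block
 * followed by one MSS-sized slice of payload, so only the TCP layer
 * needs per-segment fixups. Returns the number of bytes written.
 */
static size_t emit_segments(const unsigned char *hdrs, size_t hlen,
			    const unsigned char *payload, size_t plen,
			    size_t mss, unsigned char *out)
{
	size_t off = 0, written = 0;

	while (off < plen) {
		size_t chunk = plen - off < mss ? plen - off : mss;

		memcpy(out + written, hdrs, hlen);	/* headers copied as-is */
		memcpy(out + written + hlen, payload + off, chunk);
		written += hlen + chunk;
		off += chunk;
	}
	return written;
}
```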
>> Yes; it's a clever idea. Only trouble is that we really want the inner IP
>> header rather than the inner TCP header, so that we can (if we want to)
>> increment the inner IP IDs. Of course, if we Officially Don't Care about
>> inner IP IDs that's not a problem.
> The inner IP IDs are the ones that are guaranteed to be the ones we
> can leave fixed since TCP will require that the DF bit be set.
I was worrying about the SLHC stuff; I thought the inner ones were precisely
the ones where that was a problem. If we really don't care about it, then
we do just need the inner TCP header, and we can stick with using your
csum_start == gso_start trick :)
> The
> current VXLAN implementation does not set DF for the outer headers.
> So really we don't have too many options right now if we are wanting
> to support tunnels.
I was predicating all this on the assumption that tunnels would be changed
to set DF in their outer frames; I thought I saw a patch recently to do
that, but maybe I was mistaken. In any case I think that's the right thing
to do, and it's a necessary prerequisite for truly tunnel-agnostic TSO.
>> I wonder if we could just always fill in inner_network_header even if we're
>> not doing encapsulation. Or does it end up pointing to a 'middle' header
>> in the case of nested encap?
> Right now nested encap is not supported because tunnels don't
> advertise hw_encap_features.
You mean not supported by offloads, right? We can still _create_ a nested
tunnel device, it just won't use hardware offloads. And in the long run,
if we can make tunnel-agnostic TSO as I'm proposing, we should be able to
support it for nested tunnels too (the giant L2 header just gets more giant).
>> Why not use the single-segment length in the pseudo-header, then the
>> outer L4 checksum is already the right thing? (And if your hardware
>> can't be told to leave the outer L4 checksum alone, then it's not worth
>> the trouble of trying to support GSO partial for it, since it clearly
>> wants to do old-style "NIC knows best" TSO.)
> The problem then becomes that I need special case code: one for
> standard TSO and one for this special case. If I leave it as is I can
> use the same code to cancel out the length in the TCP pseudo-header
> checksum for either case.
I don't think that's true. Look at it this way: there are two choices.
1) The normal path builds things for standard TSO, and we have code to
fix it up for this new-style TSO.
2) The normal path builds things for new-style TSO, and we have code to
fix it up for standard TSO.
Now the fix-up code should be of equal complexity in either case, because
the two cases are inverses of each other. So we should choose whichever
approach makes the normal path simpler. I think having TCP 'lie' to all
the outer layers about the length of the packet is simpler, because then
they don't even have to know that GSO is happening. They just append a
header appropriate for an MSS-sized packet, blissfully unaware that the
payload data carries on for longer than that.
Moreover, the fixup in (2) can sometimes be avoided...
I don't know how your hardware does encapsulated TSO, but what ours does
(in the old "I'm-a-smart-NIC-I-know-best" world) is that it computes both
checksums itself from scratch. So it doesn't care what it was given as
the pre-seeded value, because it's reconstructing the pseudo-header and
zeroing the checksum field to begin with.
I would have expected this to be the common behaviour for "smart" NICs
doing encap TSO, in which case (2) is a clear winner. And, of course,
once "less is more" hardware becomes common, we will want (2) anyway.
> The general idea of my approach is to treat the UDP or GRE header
> to the inner IP header as an IP header extension type value. That
> will allow us to trick most hardware into being able to accept it.
I think an L2 header extension is a better semantic match for what's
happening (the tunnel is an L2 device and we're sending L3 packets over
it). But why does it matter? Are you giving the hardware the L2 and
L3 headers in separate DMA descriptors or something? The way I see it,
all hardware needs to be told is "where to start TSO from", and how it
thinks of the stuff before that doesn't matter, because it's not
supposed to touch it anyway.
>> Again, the idea is that we optimise for GSO partial by making it a plain
>> header copy everywhere, and put all the 'fix things up' on the _other_
>> path.
> That is what I went for. The only part that isn't a plain header copy
> is the TCP pseudo-header checksum.
Right, but that means someone, somewhere, has to fix up the outer UDP
checksum to account for that (because the TCP phdr sum leaks out of LCO).
I realise that's a per-superframe job, rather than a per-segment job, but
it still means that code at GSO time (when we decide whether to do GSO or
GSO-partial) has to know enough about the tunnel headers to find the outer
checksum and fix it up.
And it's a question of who pays that cost, the old TSO or the new one; see
above for why I think the old TSO should pay it.
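A toy model of LCO makes the leak explicit: the outer checksum is a
ones-complement sum that includes the inner pseudo-header sum, so any
change to the latter perturbs the former. (A sketch only, not the
kernel's lco_csum() helper.)

```c
#include <assert.h>
#include <stdint.h>

static uint16_t fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Outer L4 checksum under LCO as sketched above: sum of the outer
 * pseudo-header, the bytes between the two checksum-start points, and
 * the inner pseudo-header sum already seeded in the inner checksum
 * field. Illustrative only.
 */
static uint16_t lco_outer_csum(uint16_t outer_phdr_sum,
			       uint16_t mid_bytes_sum,
			       uint16_t inner_phdr_sum)
{
	return (uint16_t)~fold32((uint32_t)outer_phdr_sum +
				 mid_bytes_sum + inner_phdr_sum);
}
```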
> The only issue I see is that the expectation that the outer headers go
> untouched is problematic. The outer headers are where things like
> fragmentation will occur in transit.
Like I say, I'm assuming we'll start setting DF on outer frames.
Besides, it doesn't matter what happens "in transit" - as long as the
outer headers aren't touched by the transmitting NIC, the network can
do what it likes to them afterwards, without it breaking our TSO code.
> In addition I suspect a number
> of devices other than the Intel NICs probably use the network header
> to determine where to insert L2 tags such as VLAN.
Ah, VLAN is interesting, because there's two things you might want:
VLAN inside the tunnel, or outside of it. Presumably if you're having
the NIC insert the VLAN, you want it outside (e.g. you're doing SR-IOV
and you're putting the VF on a VLAN).
But then it doesn't make sense to work backwards from the network
header, because that'll get confused by traffic that's already VLAN
tagged - because again, you want to insert the new VLAN as the outer
VLAN. You need to find the Ethertype, as well; it just seems like
working backwards from L3 is crazy. Particularly when working forwards
from L2 is so easy - you know your L2 header begins at the start of the
packet, you (hopefully) know it's Ethernet, so you know where the VLAN
tags and Ethertype go. (Are you /sure/ Intel works backwards from the
L3 header? I'm pretty confident we don't.)
Then of course this works fine with the 'giant L2 header', because the
NIC just inserts the VLAN as normal in the outer Ethernet header and
shunts the remaining headers along to make room for it. In fact, in
our NICs I think that's done by an entirely separate bit of hardware
that doesn't even have to know that TSO was considering much more of
the packet to be "L2 header".
_However_, if we don't need to update the IP IDs, then we can just take
the offset of the inner L4 header, and it doesn't make any difference
whether you choose to think of the stuff before that as L2 + giant L3
or as giant L2 + normal L3, because it's not part of the OS->NIC
interface (which is just "L4 starts here"). If your NIC needs to be
told where the outer L3 starts as well, then, I guess that's just a
wart you need in your drivers. You have skb->network_header, so that
shouldn't be difficult - that will always point to the outer one.
> Like I have said with the current solution I could probably update
> igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this.
> That covers pretty much the entire Intel series of drivers in terms of
> enterprise. The question is what do I need to enable to support other
> drivers. It doesn't seem like there would be too much but the bit I
> would need to know is what the expectations are when computing outer
> IPv4 checksums.
Like I say, the plan we had for Solarflare NICs before "less is more"
happened was that the device would reconstruct the outer pseudo-header
(same as it does with the inner) and ignore any pre-seeded value in the
checksum field. I'd expected other vendors to have gone down the same
route, but if Intel didn't then maybe others didn't either. It'd be
nice if some of them would chime in and let us know what they want...
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
2016-03-22 17:00 ` Edward Cree
@ 2016-03-23 17:09 ` Tom Herbert
2016-03-23 18:19 ` Alexander Duyck
2016-03-28 5:36 ` Jesse Gross
2 siblings, 1 reply; 51+ messages in thread
From: Tom Herbert @ 2016-03-23 17:09 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Linux Kernel Network Developers, David S. Miller,
Alexander Duyck
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This patch adds support for something I am referring to as GSO partial.
> The basic idea is that we can support a broader range of devices for
> segmentation if we use fixed outer headers and have the hardware only
> really deal with segmenting the inner header. The idea behind the naming
> is due to the fact that everything before csum_start will be fixed headers,
> and everything after will be the region that is handled by hardware.
>
Personally, I don't like the name ;-) This technique should be
"generic" to handle almost all GSO except those case where headers are
not fixed which we should be able to avoid as a design point in any
new encapsulations. Besides, what if someday we perform GSO on
something where csum_start is not set?
Can you add some description about strategy for dealing with ip_id?
Thanks,
Tom
> With the current implementation it allows us to add support for the
> following GSO types with an inner TSO or TSO6 offload:
> NETIF_F_GSO_GRE
> NETIF_F_GSO_GRE_CSUM
> NETIF_F_UDP_TUNNEL
> NETIF_F_UDP_TUNNEL_CSUM
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
> include/linux/netdev_features.h | 7 ++++++-
> include/linux/netdevice.h | 2 ++
> include/linux/skbuff.h | 7 ++++++-
> net/core/dev.c | 31 ++++++++++++++++++++++++++++++-
> net/core/ethtool.c | 1 +
> net/core/skbuff.c | 21 ++++++++++++++++++++-
> net/ipv4/af_inet.c | 12 ++++++++++--
> net/ipv4/gre_offload.c | 23 +++++++++++++++++++----
> net/ipv4/tcp_offload.c | 10 ++++++++--
> net/ipv4/udp_offload.c | 20 ++++++++++++++++----
> net/ipv6/ip6_offload.c | 9 ++++++++-
> 11 files changed, 126 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index a734bf43d190..8df3c5553af0 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -48,8 +48,12 @@ enum {
> NETIF_F_GSO_UDP_TUNNEL_BIT, /* ... UDP TUNNEL with TSO */
> NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
> NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
> + NETIF_F_GSO_PARTIAL_BIT, /* ... Only segment inner-most L4
> + * in hardware and all other
> + * headers in software.
> + */
> /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
> - NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
> + NETIF_F_GSO_PARTIAL_BIT,
>
> NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
> NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
> @@ -121,6 +125,7 @@ enum {
> #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
> #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
> #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
> +#define NETIF_F_GSO_PARTIAL __NETIF_F(GSO_PARTIAL)
> #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
> #define NETIF_F_HW_VLAN_STAG_RX __NETIF_F(HW_VLAN_STAG_RX)
> #define NETIF_F_HW_VLAN_STAG_TX __NETIF_F(HW_VLAN_STAG_TX)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 31474d9d8a96..427d748ad8f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1647,6 +1647,7 @@ struct net_device {
> netdev_features_t vlan_features;
> netdev_features_t hw_enc_features;
> netdev_features_t mpls_features;
> + netdev_features_t gso_partial_features;
>
> int ifindex;
> int group;
> @@ -4014,6 +4015,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
> BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
> BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
> + BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO_SHIFT));
>
> return (features & feature) == feature;
> }
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 15d0df943466..c291a282f8b6 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -482,6 +482,8 @@ enum {
> SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
>
> SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
> +
> + SKB_GSO_PARTIAL = 1 << 13,
> };
>
> #if BITS_PER_LONG > 32
> @@ -3584,7 +3586,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
> * Keeps track of level of encapsulation of network headers.
> */
> struct skb_gso_cb {
> - int mac_offset;
> + union {
> + int mac_offset;
> + int data_offset;
> + };
> int encap_level;
> __wsum csum;
> __u16 csum_start;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index edb7179bc051..666cf427898b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
> return ERR_PTR(err);
> }
>
> + /* Only report GSO partial support if it will enable us to
> + * support segmentation on this frame without needing additional
> + * work.
> + */
> + if (features & NETIF_F_GSO_PARTIAL) {
> + netdev_features_t partial_features;
> + struct net_device *dev = skb->dev;
> +
> + partial_features = dev->features & dev->gso_partial_features;
> + if (!skb_gso_ok(skb, features | partial_features))
> + features &= ~NETIF_F_GSO_PARTIAL;
> + }
> +
> BUILD_BUG_ON(SKB_SGO_CB_OFFSET +
> sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb));
>
> @@ -2841,6 +2854,14 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
> if (skb->encapsulation)
> features &= dev->hw_enc_features;
>
> + /* Support for GSO partial features requires software intervention
> + * before we can actually process the packets so we need to strip
> + * support for any partial features now and we can pull them back
> + * in after we have partially segmented the frame.
> + */
> + if (skb_is_gso(skb) && !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
> + features &= ~dev->gso_partial_features;
> +
> if (skb_vlan_tagged(skb))
> features = netdev_intersect_features(features,
> dev->vlan_features |
> @@ -6702,6 +6723,14 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
> }
> }
>
> + /* GSO partial features require GSO partial be set */
> + if ((features & dev->gso_partial_features) &&
> + !(features & NETIF_F_GSO_PARTIAL)) {
> + netdev_dbg(dev,
> + "Dropping partially supported GSO features since no GSO partial.\n");
> + features &= ~dev->gso_partial_features;
> + }
> +
> #ifdef CONFIG_NET_RX_BUSY_POLL
> if (dev->netdev_ops->ndo_busy_poll)
> features |= NETIF_F_BUSY_POLL;
> @@ -6982,7 +7011,7 @@ int register_netdevice(struct net_device *dev)
>
> /* Make NETIF_F_SG inheritable to tunnel devices.
> */
> - dev->hw_enc_features |= NETIF_F_SG;
> + dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL;
>
> /* Make NETIF_F_SG inheritable to MPLS.
> */
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index b3c39d531469..d1b278c6c29f 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -88,6 +88,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
> [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
> [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
> [NETIF_F_GSO_TUNNEL_REMCSUM_BIT] = "tx-remcsum-segmentation",
> + [NETIF_F_GSO_PARTIAL_BIT] = "tx-gso-partial",
>
> [NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
> [NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp",
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f044f970f1a6..bdcba77e164c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3076,8 +3076,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> struct sk_buff *frag_skb = head_skb;
> unsigned int offset = doffset;
> unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
> + unsigned int partial_segs = 0;
> unsigned int headroom;
> - unsigned int len;
> + unsigned int len = head_skb->len;
> __be16 proto;
> bool csum;
> int sg = !!(features & NETIF_F_SG);
> @@ -3094,6 +3095,15 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
> csum = !!can_checksum_protocol(features, proto);
>
> + /* GSO partial only requires that we trim off any excess that
> + * doesn't fit into an MSS sized block, so take care of that
> + * now.
> + */
> + if (features & NETIF_F_GSO_PARTIAL) {
> + partial_segs = len / mss;
> + mss *= partial_segs;
> + }
> +
> headroom = skb_headroom(head_skb);
> pos = skb_headlen(head_skb);
>
> @@ -3281,6 +3291,15 @@ perform_csum_check:
> */
> segs->prev = tail;
>
> + /* Update GSO info on first skb in partial sequence. */
> + if (partial_segs) {
> + skb_shinfo(segs)->gso_size = mss / partial_segs;
> + skb_shinfo(segs)->gso_segs = partial_segs;
> + skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
> + SKB_GSO_PARTIAL;
> + SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
> + }
> +
> /* Following permits correct backpressure, for protocols
> * using skb_set_owner_w().
> * Idea is to tranfert ownership from head_skb to last segment.
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 5e3885672907..d091f91fa25d 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1199,7 +1199,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> unsigned int offset = 0;
> bool udpfrag, encap;
> struct iphdr *iph;
> - int proto;
> + int proto, tot_len;
> int nhoff;
> int ihl;
> int id;
> @@ -1269,10 +1269,18 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> if (skb->next)
> iph->frag_off |= htons(IP_MF);
> offset += skb->len - nhoff - ihl;
> + tot_len = skb->len - nhoff;
> + } else if (skb_is_gso(skb)) {
> + iph->id = htons(id);
> + id += skb_shinfo(skb)->gso_segs;
> + tot_len = skb_shinfo(skb)->gso_size +
> + SKB_GSO_CB(skb)->data_offset +
> + skb->head - (unsigned char *)iph;
> } else {
> iph->id = htons(id++);
> + tot_len = skb->len - nhoff;
> }
> - iph->tot_len = htons(skb->len - nhoff);
> + iph->tot_len = htons(tot_len);
> ip_send_check(iph);
> if (encap)
> skb_reset_inner_headers(skb);
> diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
> index 7ea14ced9222..dea0390d65bb 100644
> --- a/net/ipv4/gre_offload.c
> +++ b/net/ipv4/gre_offload.c
> @@ -85,7 +85,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
> skb = segs;
> do {
> struct gre_base_hdr *greh;
> - __be32 *pcsum;
> + __sum16 *pcsum;
>
> /* Set up inner headers if we are offloading inner checksum */
> if (skb->ip_summed == CHECKSUM_PARTIAL) {
> @@ -105,10 +105,25 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
> continue;
>
> greh = (struct gre_base_hdr *)skb_transport_header(skb);
> - pcsum = (__be32 *)(greh + 1);
> + pcsum = (__sum16 *)(greh + 1);
> +
> + if (skb_is_gso(skb)) {
> + unsigned int partial_adj;
> +
> + /* Adjust checksum to account for the fact that
> + * the partial checksum is based on actual size
> + * whereas headers should be based on MSS size.
> + */
> + partial_adj = skb->len + skb_headroom(skb) -
> + SKB_GSO_CB(skb)->data_offset -
> + skb_shinfo(skb)->gso_size;
> + *pcsum = ~csum_fold((__force __wsum)htonl(partial_adj));
> + } else {
> + *pcsum = 0;
> + }
>
> - *pcsum = 0;
> - *(__sum16 *)pcsum = gso_make_checksum(skb, 0);
> + *(pcsum + 1) = 0;
> + *pcsum = gso_make_checksum(skb, 0);
> } while ((skb = skb->next));
> out:
> return segs;
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 1a2e9957c177..4e9b8f011473 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -107,6 +107,12 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> goto out;
> }
>
> + /* GSO partial only requires splitting the frame into an MSS
> + * multiple and possibly a remainder. So update the mss now.
> + */
> + if (features & NETIF_F_GSO_PARTIAL)
> + mss = skb->len - (skb->len % mss);
> +
> copy_destructor = gso_skb->destructor == tcp_wfree;
> ooo_okay = gso_skb->ooo_okay;
> /* All segments but the first should have ooo_okay cleared */
> @@ -131,7 +137,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
> (__force u32)delta));
>
> - do {
> + while (skb->next) {
> th->fin = th->psh = 0;
> th->check = newcheck;
>
> @@ -151,7 +157,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
>
> th->seq = htonl(seq);
> th->cwr = 0;
> - } while (skb->next);
> + }
>
> /* Following permits TCP Small Queues to work well with GSO :
> * The callback to TCP stack will be called at the time last frag
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 8a3405a80260..5fcb93269afb 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -100,7 +100,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
> udp_offset = outer_hlen - tnl_hlen;
> skb = segs;
> do {
> - __be16 len;
> + unsigned int len;
>
> if (remcsum)
> skb->ip_summed = CHECKSUM_NONE;
> @@ -118,14 +118,26 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
> skb_reset_mac_header(skb);
> skb_set_network_header(skb, mac_len);
> skb_set_transport_header(skb, udp_offset);
> - len = htons(skb->len - udp_offset);
> + len = skb->len - udp_offset;
> uh = udp_hdr(skb);
> - uh->len = len;
> +
> + /* If we are only performing partial GSO the inner header
> + * will be using a length value equal to only one MSS sized
> + * segment instead of the entire frame.
> + */
> + if (skb_is_gso(skb)) {
> + uh->len = htons(skb_shinfo(skb)->gso_size +
> + SKB_GSO_CB(skb)->data_offset +
> + skb->head - (unsigned char *)uh);
> + } else {
> + uh->len = htons(len);
> + }
>
> if (!need_csum)
> continue;
>
> - uh->check = ~csum_fold(csum_add(partial, (__force __wsum)len));
> + uh->check = ~csum_fold(csum_add(partial,
> + (__force __wsum)htonl(len)));
>
> if (skb->encapsulation || !offload_csum) {
> uh->check = gso_make_checksum(skb, ~uh->check);
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index eeca943f12dc..d467053c226a 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -63,6 +63,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
> int proto;
> struct frag_hdr *fptr;
> unsigned int unfrag_ip6hlen;
> + unsigned int payload_len;
> u8 *prevhdr;
> int offset = 0;
> bool encap, udpfrag;
> @@ -117,7 +118,13 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
>
> for (skb = segs; skb; skb = skb->next) {
> ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
> - ipv6h->payload_len = htons(skb->len - nhoff - sizeof(*ipv6h));
> + if (skb_is_gso(skb))
> + payload_len = skb_shinfo(skb)->gso_size +
> + SKB_GSO_CB(skb)->data_offset +
> + skb->head - (unsigned char *)(ipv6h + 1);
> + else
> + payload_len = skb->len - nhoff - sizeof(*ipv6h);
> + ipv6h->payload_len = htons(payload_len);
> skb->network_header = (u8 *)ipv6h - skb->head;
>
> if (udpfrag) {
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 16:27 ` Edward Cree
@ 2016-03-23 18:06 ` Alexander Duyck
2016-03-23 21:05 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-23 18:06 UTC (permalink / raw)
To: Edward Cree, Or Gerlitz
Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 22/03/16 21:38, Alexander Duyck wrote:
>> On Tue, Mar 22, 2016 at 12:40 PM, Edward Cree <ecree@solarflare.com> wrote:
>>> But won't the tunnel dev have the feature flag for GSO_PARTIAL depending
>>> on what the underlying dev advertises? (Or, at least, could we make it
>>> be that way?)
>> This stuff doesn't work. That is why tunnels now advertise all
>> available features that can be offloaded via software. Basically if
>> we can advertise a feature we do, and then we sort things out in
>> software if we cannot actually do it in hardware.
> Fair enough; then go with the other approach:
>>> Alternatively, have per-protocol GSO callbacks to do the fixup in the
>>> opposite direction to what you have now - in the long term we hope that
>>> hardware supporting GSO partial will become the common case, so that
>>> should be the fast path without bouncing backwards through GSO callbacks.
>>> Then, if you find out at GSO time that the hardware wants to do old-style
>>> TSO, you call those callbacks so as to give it a superframe with the long
>>> lengths filled in everywhere.
>> I thought about doing that but decided it was much simpler to simply
>> update all headers. For now I want to keep this as simple as possible
>> while we get the first few drivers on board. If we later want to
>> optimize and add complexity we can go that route, but for now this
>> change is more than fast enough as it allows me to push an i40e at
>> 20Gb/s while sending frames with outer checksums enabled.
> My belief is that my way is (in the long run) simpler: ultimately it gets
> rid of per-protocol GSO callbacks entirely. Every header gets written
> correctly* when the packet initially traverses the stack, and then at
> transmit time you either hand that off to TSO, or do the equivalent thing
> in software: segment at the TCP layer while treating everything above it
> as an opaque pseudo-L2 header.
I'm pretty sure that isn't a safe approach to take. With GSO we are
holding off until we are about to hand the packets off to the device
before we segment it. If we do it earlier it will lead to more issues
as you could have the packet route off to somewhere you were not
expecting and having it already "soft segmented" could lead to issues
where the packet would no longer be coherent.
> *That is, correctly for a single segment, rather than correctly for the
> superframe.
Yes, but what about the cases where a packet gets switched from Tx
back to Rx? How would you expect the stack to handle that then?
>>> Yes; it's a clever idea. Only trouble is that we really want the inner IP
>>> header rather than the inner TCP header, so that we can (if we want to)
>>> increment the inner IP IDs. Of course, if we Officially Don't Care about
>>> inner IP IDs that's not a problem.
>> The inner IP IDs are the ones that are guaranteed to be the ones we
>> can leave fixed since TCP will require that the DF bit be set.
> I was worrying about the SLHC stuff, I thought the inner ones were precisely
> the ones where that was a problem. If we really don't care about it, then
> we do just need the inner TCP header, and we can stick with using your
> csum_start == gso_start trick :)
>> The
>> current VXLAN implementation does not set DF for the outer headers.
>> So really we don't have too many options right now if we are wanting
>> to support tunnels.
> I was predicating all this on the assumption that tunnels would be changed
> to set DF in their outer frames; I thought I saw a patch recently to do
> that, but maybe I was mistaken. In any case I think that's the right thing
> to do, and it's a necessary prerequisite for truly tunnel-agnostic TSO.
>>> I wonder if we could just always fill in inner_network_header even if we're
>>> not doing encapsulation. Or does it end up pointing to a 'middle' header
>>> in the case of nested encap?
>> Right now nested encap is not supported because tunnels don't
>> advertise hw_encap_features.
> You mean not supported by offloads, right? We can still _create_ a nested
> tunnel device, it just won't use hardware offloads. And in the long run,
> if we can make tunnel-agnostic TSO as I'm proposing, we should be able to
> support it for nested tunnels too (the giant L2 header just gets more giant).
Right. Basically what will currently happen is that if you were to
encapsulate an ipip tunnel inside of a VXLAN for instance the GSO
would kick in before the VXLAN tunnel has even added the outer headers
because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. That is
the kind of thing we want to have happen though anyway since a
physical device wouldn't know how to deal with such a scenario anyway.
>>> Why not use the single-segment length in the pseudo-header, then the
>>> outer L4 checksum is already the right thing? (And if your hardware
>>> can't be told to leave the outer L4 checksum alone, then it's not worth
>>> the trouble of trying to support GSO partial for it, since it clearly
>>> wants to do old-style "NIC knows best" TSO.)
>> The problem then becomes that I need special case code. One for
>> standard TSO and one for this special case. If I leave it as is I can
>> use the same code to cancel out the length in the TCP pseudo-header
>> checksum for either case.
> I don't think that's true. Look at it this way: there are two choices.
> 1) The normal path builds things for standard TSO, and we have code to
> fix it up for this new-style TSO.
> 2) The normal path builds things for new-style TSO, and we have code to
> fix it up for standard TSO.
> Now the fix-up code should be of equal complexity in either case, because
> the two cases are inverses of each other. So we should choose whichever
> approach makes the normal path simpler. I think having TCP 'lie' to all
> the outer layers about the length of the packet is simpler, because then
> they don't even have to know that GSO is happening. They just append a
> header appropriate for an MSS-sized packet, blissfully unaware that the
> payload data carries on for longer than that.
> Moreover, the fixup in (2) can sometimes be avoided...
> I don't know how your hardware does encapsulated TSO, but what ours does
> (in the old "I'm-a-smart-NIC-I-know-best" world) is that it computes both
> checksums itself from scratch. So it doesn't care what it was given as
> the pre-seeded value, because it's reconstructing the pseudo-header and
> zeroing the checksum field to begin with.
> I would have expected this to be the common behaviour for "smart" NICs
> doing encap TSO, in which case (2) is a clear winner. And, of course,
> once "less is more" hardware becomes common, we will want (2) anyway.
The argument here is a matter of complexity. I really don't think
we gain much by recomputing the pseudo-header with the segmented
length versus the entire packet length. In the case of gso it is more
complicated to figure out the length if we are having to take gso_size
and add the size for a TCP header rather than just subtracting the
inner transport offset from skb->len.
>> The general idea of my approach is to treat the UDP or GRE header
>> to the inner IP header as an IP header extension type value. That
>> will allow us to trick most hardware into being able to accept it.
> I think an L2 header extension is a better semantic match for what's
> happening (the tunnel is an L2 device and we're sending L3 packets over
> it). But why does it matter? Are you giving the hardware the L2 and
> L3 headers in separate DMA descriptors or something? The way I see it,
> all hardware needs to be told is "where to start TSO from", and how it
> thinks of the stuff before that doesn't matter, because it's not
> supposed to touch it anyway.
The problem is setups like VFs where they want to know where the
network header starts so they know where to insert a VLAN tag. Most
hardware supports IP options or IPv6 extension headers. What I am
doing is exploiting that in the case of the Intel hardware by telling
it the IP header is the outer IP header and the transport header is
the inner transport header. Then all I have to deal with is the fact
that hardware will try to compute the entire IPv4 header checksum over
the range so I cancel that out by seeding the IP checksum with the
lco_csum added to the inner pseudo-header checksum.
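[Editor's note: the cancellation trick described above relies on ordinary one's-complement arithmetic. The following is a hedged, standalone sketch — plain C, not the kernel's csum helpers; the function names are made up — showing that pre-seeding a checksum with the folded complement of an unwanted region makes the hardware's extra coverage drop out.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-bit one's-complement sum over an even-length byte buffer */
static uint32_t ocsum(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	return sum;
}

/* fold carries into 16 bits and invert, like the kernel's csum_fold() */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Usage sketch: if hardware insists on checksumming an outer header
 * plus some extra region, seeding the accumulator with ~sum(extra)
 * (which is exactly fold(ocsum(extra, len, 0))) cancels the extra
 * region, so the folded result equals a checksum computed over the
 * outer header alone:
 *
 *   uint16_t seed = fold(ocsum(extra, elen, 0));
 *   fold(ocsum(extra, elen, ocsum(outer, olen, seed)))
 *       == fold(ocsum(outer, olen, 0))
 */
```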
>>> Again, the idea is that we optimise for GSO partial by making it a plain
>>> header copy everywhere, and put all the 'fix things up' on the _other_
>>> path.
>> That is what I went for. The only part that isn't a plain header copy
>> is the TCP pseudo-header checksum.
> Right, but that means someone, somewhere, has to fix up the outer UDP
> checksum to account for that (because the TCP phdr sum leaks out of LCO).
> I realise that's a per-superframe job, rather than a per-segment job, but
> it still means that code at GSO time (when we decide whether to do GSO or
> GSO-partial) has to know enough about the tunnel headers to find the outer
> checksum and fix it up.
The fix-up already happens in the GSO code path I have coded up. The
outer checksum already takes it into account.
If you look at my patches you should see a block in both the UDP
tunnel and GRE tunnel code that handles the "skb_is_gso()" case. That
is where we deal with the checksum adjustment. In the case of UDP
tunnels it actually requires very little work as the only field that
actually has to be updated is uh->len since the outer pseudo-header
checksum effectively cancels out the length for the inner anyway. In
the case of GRE it took a bit more as I had to seed the checksum with
the delta of the length in order to correctly account for the change.
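[Editor's note: to make "seed the checksum with the delta of the length" concrete, here is a hedged standalone sketch — plain C with invented names, not the kernel's implementation. In one's-complement arithmetic, a pseudo-header checksum computed with the superframe length can be retargeted to the per-segment length by adding the complement of the old length and the new length.]

```c
#include <assert.h>
#include <stdint.h>

/* fold a 32-bit one's-complement accumulator into 16 bits and invert */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Given a pseudo-header accumulator that already includes old_len,
 * return the folded checksum as if new_len had been used instead:
 * subtract old_len (by adding its complement), then add new_len.
 */
static uint16_t retarget_len(uint32_t phdr_sum, uint16_t old_len,
			     uint16_t new_len)
{
	return fold16(phdr_sum + (uint16_t)~old_len + new_len);
}
```

The same identity is what lets the GSO-partial path leave the per-segment headers untouched: only the length term differs between superframe and segment, so only that term needs cancelling.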
> And it's a question of who pays that cost, the old TSO or the new one; see
> above for why I think the old TSO should pay it.
Generally I don't agree with slowing down existing paths in order to
improve the performance for new ones. One of my goals with this was
to keep the code minimally intrusive. I would rather not have to
rewrite all driver TSO paths in order to support a new segmentation
approach.
>> The only issue I see is the expectation that the outer headers go
>> untouched is problematic. The outer headers are where things like
>> fragmentation will occur in transit.
> Like I say, I'm assuming we'll start setting DF on outer frames.
> Besides, it doesn't matter what happens "in transit" - as long as the
> outer headers aren't touched by the transmitting NIC, the network can
> do what it likes to them afterwards, without it breaking our TSO code.
The IP ID bit is going to be something where I don't want to break
things. One of the things I have seen is there ends up being a number
of occasions where VXLAN gets fragmented due to incorrectly configured
MTU. I would rather not have it get completely screwed up when this
occurs. A performance hit for fragmenting is one thing. Having
people start complaining because previously working tunnels suddenly
stop functioning is another. The fact is the whole point of VXLAN is
to support sending the tunneled frames between data centers. With the
DF bit set on a tunnel header we end up making it so that frames get
dropped instead of fragmented which would be a change of behavior.
>> In addition I suspect a number
>> of devices other than the Intel NICs probably use the network header
>> to determine where to insert L2 tags such as VLAN.
> Ah, VLAN is interesting, because there's two things you might want:
> VLAN inside the tunnel, or outside of it. Presumably if you're having
> the NIC insert the VLAN, you want it outside (e.g. you're doing SR-IOV
> and you're putting the VF on a VLAN).
> But then it doesn't make sense to work backwards from the network
> header, because that'll get confused by traffic that's already VLAN
> tagged - because again, you want to insert the new VLAN as the outer
> VLAN. You need to find the Ethertype, as well; it just seems like
> working backwards from L3 is crazy. Particularly when working forwards
> from L2 is so easy - you know your L2 header begins at the start of the
> packet, you (hopefully) know it's Ethernet, so you know where the VLAN
> tags and Ethertype go. (Are you /sure/ Intel works backwards from the
> L3 header? I'm pretty confident we don't.)
Yes, I am very confident of that. For Intel hardware the outer VLAN
tag would be the one inserted via software, the inner VLAN tag is the
one inserted via hardware.
> Then of course this works fine with the 'giant L2 header', because the
> NIC just inserts the VLAN as normal in the outer Ethernet header and
> shunts the remaining headers along to make room for it. In fact, in
> our NICs I think that's done by an entirely separate bit of hardware
> that doesn't even have to know that TSO was considering much more of
> the packet to be "L2 header".
Well the Intel hardware doesn't work that way. It would be nice if it
did because then you could actually support Q-in-Q on the VFs without
much issue, but due to the limiting nature of a 64b transmit
descriptor they opted to insert the VLAN tags at the start of the
network header.
> _However_, if we don't need to update the IP IDs, then we can just take
> the offset of the inner L4 header, and it doesn't make any difference
> whether you choose to think of the stuff before that as L2 + giant L3
> or as giant L2 + normal L3, because it's not part of the OS->NIC
> interface (which is just "L4 starts here"). If your NIC needs to be
> told where the outer L3 starts as well, then, I guess that's just a
> wart you need in your drivers. You have skb->network_header, so that
> shouldn't be difficult - that will always point to the outer one.
Right. That is kind of what I was going for with this setup. The
only issue is that the VXLAN tunnels not setting the DF bit kind of
get in the way of the giant L3 approach.
Dealing with the outer header needing the DF bit is something I have
left unaddressed at this point. The question would be what is the
correct approach to take for all this. I know RFC2003 for IPv4 in
IPv4 says you must set the DF bit if the inner header has the DF bit
set. I'm just wondering if we can apply the same type of logic to GRE
and UDP tunnels.
>> Like I have said with the current solution I could probably update
>> igb, igbvf, fm10k, ixgbe, ixgbevf, i40e, and i40evf to support this.
>> That covers pretty much the entire Intel series of drivers in terms of
>> enterprise. The question is what do I need to enable to support other
>> drivers. It doesn't seem like there would be too much but the bit I
>> would need to know is what the expectations are when computing outer
>> IPv4 checksums.
> Like I say, the plan we had for Solarflare NICs before "less is more"
> happened was that the device would reconstruct the outer pseudo-header
> (same as it does with the inner) and ignore any pre-seeded value in the
> checksum field. I'd expected other vendors to have gone down the same
> route, but if Intel didn't then maybe others didn't either. It'd be
> nice if some of them would chime in and let us know what they want...
Right. I added Or to the Cc. Maybe we can get some input from
Mellanox on how they do TSO and what changes they would need in order
to support TSO for GRE or UDP tunnels with checksum.
My concern is that I am not sure how much value there is to add with
this type of code if the hardware is parsing headers. In the case of
most of the Intel parts you specify offsets for the network and
transport headers so it gives you some degree of flexibility. If
however the hardware parses headers it becomes problematic as we can
only support protocols that can be parsed by the hardware. So for
example on the Intel parts using the drivers igb and ixgbe I will be
able to support VXLAN-GPE using a partial GSO approach. However for
fm10k which does some parsing for the tunnel headers I probably won't
as it will not know what to do if there are any extra headers included
and won't be able to determine the offset of the inner transport
header.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 17:09 ` Tom Herbert
@ 2016-03-23 18:19 ` Alexander Duyck
2016-03-24 1:37 ` Jesse Gross
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-23 18:19 UTC (permalink / raw)
To: Tom Herbert
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David S. Miller
On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch adds support for something I am referring to as GSO partial.
>> The basic idea is that we can support a broader range of devices for
>> segmentation if we use fixed outer headers and have the hardware only
>> really deal with segmenting the inner header. The idea behind the naming
>> is due to the fact that everything before csum_start will be fixed headers,
>> and everything after will be the region that is handled by hardware.
>>
> Personally, I don't like the name ;-) This technique should be
> "generic" to handle almost all GSO except those case where headers are
> not fixed which we should be able to avoid as a design point in any
> new encapsulations. Besides, what if someday we perform GSO on
> something where csum_start is not set?
Well the "partial" component of it refers to the fact that we have to
do a certain amount of this in software before we offload the work to
hardware. So it isn't a full segmentation offload, it is a partial
one.
The csum_start is a bit of a red herring in terms of the naming. I
was using csum_start as that works for TCP offloads since it will
always point to the inner-most transport header. It was basically
just a convenient optimization for the drivers and makes it so that we
can essentially use just one code path in the case of ixgbe since
network header and checksum_start will always represent the two
headers we need to update for TSO regardless of it being partial or
complete offload.
> Can you add some description about strategy for dealing with ip_id?
Yeah. I still need to add more documentation. I just didn't want to
get into details on it until we have finalized things a bit more. I'm
still wondering if we should follow the pattern of ipip tunnels in
terms of setting the DF bit if the inner header has the DF bit set.
If we end up needing to add code to do that then it makes it so that
the ip_id value can be fixed for both inner and outer which makes the
segmentation much simpler since the only header that would ever really
need to be updated would be the transport header in order to get the
checksum correct.
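[Editor's note: if the ipip pattern were followed, the RFC 2003 section 3.1 style rule would be roughly the sketch below — illustrative only; the constant matches the kernel's IP_DF but the helper is invented.]

```c
#include <assert.h>
#include <stdint.h>

#define IP_DF 0x4000 /* don't-fragment flag in iphdr frag_off, host order */

/*
 * RFC 2003-style inheritance: the outer header sets DF whenever the
 * inner header has DF set; fragment offset and MF start out cleared.
 */
static uint16_t outer_frag_off(uint16_t inner_frag_off)
{
	return inner_frag_off & IP_DF;
}
```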
- Alex
* Re: [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
2016-03-18 23:25 ` [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM Alexander Duyck
@ 2016-03-23 19:35 ` Jesse Gross
2016-03-23 20:21 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-23 19:35 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Linux Kernel Network Developers, David Miller,
Alexander Duyck, Tom Herbert
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 39b0009253c2..ac3964a9f5c0 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -9074,7 +9075,12 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
> if (!(pf->flags & I40E_FLAG_MFP_ENABLED))
> netdev->features |= NETIF_F_NTUPLE;
> if (pf->flags & I40E_FLAG_OUTER_UDP_CSUM_CAPABLE)
> - netdev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
> + netdev->features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
Looks like there might be a typo - I guess this is supposed to still
OR with the previous features instead of replacing them?
* Re: [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
2016-03-23 19:35 ` Jesse Gross
@ 2016-03-23 20:21 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-23 20:21 UTC (permalink / raw)
To: Jesse Gross
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 12:35 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index 39b0009253c2..ac3964a9f5c0 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -9074,7 +9075,12 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
>> if (!(pf->flags & I40E_FLAG_MFP_ENABLED))
>> netdev->features |= NETIF_F_NTUPLE;
>> if (pf->flags & I40E_FLAG_OUTER_UDP_CSUM_CAPABLE)
>> - netdev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>> + netdev->features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
>
> Looks like there might be a typo - I guess this is supposed to still
> OR with the previous features instead of replacing them?
Yes, that was a typo. It looks like I got it right on the i40evf half
of the patch but didn't get it correct here.
Thanks for catching that, I'll have it fixed for when I push it for net-next.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 18:06 ` Alexander Duyck
@ 2016-03-23 21:05 ` Edward Cree
2016-03-23 22:36 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-23 21:05 UTC (permalink / raw)
To: Alexander Duyck, Or Gerlitz
Cc: Alexander Duyck, Netdev, David Miller, Tom Herbert
On 23/03/16 18:06, Alexander Duyck wrote:
> On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote:
>> My belief is that my way is (in the long run) simpler: ultimately it gets
>> rid of per-protocol GSO callbacks entirely. Every header gets written
>> correctly* when the packet initially traverses the stack, and then at
>> transmit time you either hand that off to TSO, or do the equivalent thing
>> in software: segment at the TCP layer while treating everything above it
>> as an opaque pseudo-L2 header.
> I'm pretty sure that isn't a safe approach to take. With GSO we are
> holding off until we are about to hand the packets off to the device
> before we segment it. If we do it earlier it will lead to more issues
> as you could have the packet route off to somewhere you were not
> expecting and having it already "soft segmented" could lead to issues
> where the packet would no longer be coherent.
Ah, yes, that is a potential problem. And now that I think about it, we
might not know what the MSS is until segmentation time, either, even if
we did know for sure we would want to segment.
So my approach doesn't work after all, the superframe has to be coherent
when it traverses the stack.
>> You mean not supported by offloads, right? We can still _create_ a nested
>> tunnel device, it just won't use hardware offloads. And in the long run,
>> if we can make tunnel-agnostic TSO as I'm proposing, we should be able to
>> support it for nested tunnels too (the giant L2 header just gets more giant).
> Right. Basically what will currently happen is that if you were to
> encapsulate an ipip tunnel inside of a VXLAN for instance the GSO
> would kick in before the VXLAN tunnel has even added the outer headers
> because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. That is
> the kind of thing we want to have happen though anyway since a
> physical device wouldn't know how to deal with such a scenario anyway.
I disagree. Surely we should be able to "soft segment" the packet just
before we give it to the physical device, and then tell it to do dumb copying
of both the VXLAN and IPIP headers? At this point, we don't have the problem
you identified above, because we've arrived at the device now.
So we can chase through some per-protocol callbacks to shorten all the outer
lengths and adjust all the outer checksums, then hand it to the device for
TSO. The device is treating the extra headers as an opaque blob, so it
doesn't know or care whether it's one layer of encapsulation or forty-two.
>> I think an L2 header extension is a better semantic match for what's
>> happening (the tunnel is an L2 device and we're sending L3 packets over
>> it). But why does it matter? Are you giving the hardware the L2 and
>> L3 headers in separate DMA descriptors or something? The way I see it,
>> all hardware needs to be told is "where to start TSO from", and how it
>> thinks of the stuff before that doesn't matter, because it's not
>> supposed to touch it anyway.
> The problem is setups like VFs where they want to know where the
> network header starts so they know where to insert a VLAN tag. Most
> hardware supports IP options or IPv6 extension headers. What I am
> doing is exploiting that in the case of the Intel hardware by telling
> it the IP header is the outer IP header and the transport header is
> the inner transport header. Then all I have to deal with is the fact
> that hardware will try to compute the entire IPv4 header checksum over
> the range so I cancel that out by seeding the IP checksum with the
> lco_csum added to the inner pseudo-header checksum.
Ok, it sounds like the interface to Intel hardware is just Very Different
to Solarflare hardware on this point: we don't tell our hardware anything
about where the various headers start, it just parses them to figure it
out. (And for new-style TSO we'd tell it where the TCP header starts, as
I described before.)
But this sounds like a driver-level thing: you have to undo some of what
your hardware will do because you're having to lie to it about what you're
giving it. So that all happens in the driver, the stack's GSO code isn't
affected. In which case I'm much less bothered by this; I thought it was
an assumption you were baking into the stack. As far as I'm concerned you
can do whatever ugly hacks you like in Intel drivers, as long as I don't
have to do them in sfc ;)
Although, why is your device computing the IPv4 header checksum? Those
aren't supposed to be offloaded, the stack always already filled them in
in software. (I think sfc makes the same mistake actually.) Is there not
a way for you to tell your device to skip IP header checksum offload?
Then you wouldn't have this problem in the first place, and you could tell
it the IP header was as big as you like without having to seed it with
this correction value.
>> Like I say, I'm assuming we'll start setting DF on outer frames.
>> Besides, it doesn't matter what happens "in transit" - as long as the
>> outer headers aren't touched by the transmitting NIC, the network can
>> do what it likes to them afterwards, without it breaking our TSO code.
> The IP ID bit is going to be something where I don't want to break
> things. One of the things I have seen is there ends up being a number
> of occasions where VXLAN gets fragmented due to incorrectly configured
> MTU. I would rather not have it get completely screwed up when this
> occurs. A performance hit for fragmenting is one thing. Having
> people start complaining because previously working tunnels suddenly
> stop functioning is another. The fact is the whole point of VXLAN is
> to support sending the tunneled frames between data centers. With the
> DF bit set on a tunnel header we end up making it so that frames get
> dropped instead of fragmented which would be a change of behavior.
I agree this isn't something we can do silently. But we _can_ make it a
condition for enabling gso-partial. And I think it's a necessary
condition for truly generic TSO. Sure, your 'L3 extension header' works
fine for a single tunnel. But if you nest tunnels, you now need to
update the outer _and_ middle IP IDs, and you can't do that because you
only have one L3 header pointer.
OTOH, for a single tunnel I think we could implement your 'L3 extension
header' trick in firmware, by making it always parse the outer packet up
to the outer L3 header and increment the IP ID in that. So I could live
with that approach if necessary.
> Yes, I am very confident of that. For Intel hardware the outer VLAN
> tag would be the one inserted via software, the inner VLAN tag is the
> one inserted via hardware.
That's really weird; why does it do that?
For sfc, the only reason we do hardware VLANs at all is to transparently
tag a VF (or non-primary PF) that's being passed through into an
(untrusted) VM. For that use case you'd always want to insert the outer
VLAN tag, because otherwise the VM can escape the isolation by inserting
a different VLAN tag in software.
>> _However_, if we don't need to update the IP IDs, then we can just take
>> the offset of the inner L4 header, and it doesn't make any difference
>> whether you choose to think of the stuff before that as L2 + giant L3
>> or as giant L2 + normal L3, because it's not part of the OS->NIC
>> interface (which is just "L4 starts here"). If your NIC needs to be
>> told where the outer L3 starts as well, then, I guess that's just a
>> wart you need in your drivers. You have skb->network_header, so that
>> shouldn't be difficult - that will always point to the outer one.
> Right. That is kind of what I was going for with this setup. The
> only issue is that the VXLAN tunnels not setting the DF bit kind of
> get in the way of the giant L3 approach.
On the contrary; your giant L3 approach is exactly what solves this case
(for non-nested tunnels) - the hardware has the outer IP and inner TCP
header offsets, which are exactly the two headers it needs to alter.
And if the hardware is sensible, it won't try to re-checksum the whole
giant L3 header, it'll just decrement the (outer) IP checksum to account
for incrementing the (outer) IP ID. If the hardware isn't sensible,
then you have to play games like you are doing in the Intel drivers ;)
> Dealing with the outer header needing the DF bit is something I have
> left unaddressed at this point. The question would be what is the
> correct approach to take for all this. I know RFC2003 for IPv4 in
> IPv4 says you must set the DF bit if the inner header has the DF bit
> set. I'm just wondering if we can apply the same type of logic to GRE
> and UDP tunnels.
I wonder if _not_ setting the DF bit in the outer header can harm TCP
congestion control at all? If so, then we'd pretty much have to set it
in that case.
> My concern is that I am not sure how much value there is to add with
> this type of code if the hardware is parsing headers. In the case of
> most of the Intel parts you specify offsets for the network and
> transport headers so it gives you some degree of flexibility. If
> however the hardware parses headers it becomes problematic as we can
> only support protocols that can be parsed by the hardware.
Solarflare parts will _normally_ parse headers to get that information.
But, when doing TSO, we do get a chance to specify some extra
information, in the TSO option descriptor. Enough of the datapath is
under firmware control that that should be enough; as long as the outer
frame is IP over Ethernet, the hardware will parse that fine, and we
*should* be able to make it just accept that it doesn't know what's
going on between that and the start of the TCP header. And, it
shouldn't matter that the hardware can parse some types of tunnel
headers, because we'll just tell it to ignore that.
Of course, that means changing the firmware; luckily we haven't got any
parts in the wild doing tunnel offloads yet, so we still have a chance
to do that without needing driver code to work around our past
mistakes...
But this stuff does definitely add value for us, it means we could TSO
any tunnel type whatsoever; even nested tunnels as long as only the
outermost IP ID needs to change.
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 21:05 ` Edward Cree
@ 2016-03-23 22:36 ` Alexander Duyck
2016-03-23 23:00 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-23 22:36 UTC (permalink / raw)
To: Edward Cree
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/03/16 18:06, Alexander Duyck wrote:
>> On Wed, Mar 23, 2016 at 9:27 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> My belief is that my way is (in the long run) simpler: ultimately it gets
>>> rid of per-protocol GSO callbacks entirely. Every header gets written
>>> correctly* when the packet initially traverses the stack, and then at
>>> transmit time you either hand that off to TSO, or do the equivalent thing
>>> in software: segment at the TCP layer while treating everything above it
>>> as an opaque pseudo-L2 header.
>> I'm pretty sure that isn't a safe approach to take. With GSO we are
>> holding off until we are about to hand the packets off to the device
>> before we segment it. If we do it earlier it will lead to more issues
>> as you could have the packet route off to somewhere you were not
>> expecting and having it already "soft segmented" could lead to issues
>> where the packet would no longer be coherent.
> Ah, yes, that is a potential problem. And now that I think about it, we
> might not know what the MSS is until segmentation time, either, even if
> we did know for sure we would want to segment.
> So my approach doesn't work after all, the superframe has to be coherent
> when it traverses the stack.
Right.
>>> You mean not supported by offloads, right? We can still _create_ a nested
>>> tunnel device, it just won't use hardware offloads. And in the long run,
>>> if we can make tunnel-agnostic TSO as I'm proposing, we should be able to
>>> support it for nested tunnels too (the giant L2 header just gets more giant).
>> Right. Basically what will currently happen is that if you were to
>> encapsulate an ipip tunnel inside of a VXLAN for instance the GSO
>> would kick in before the VXLAN tunnel has even added the outer headers
>> because the VXLAN netdev does not advertise NETIF_F_GSO_IPIP. That is
>> the kind of thing we want to have happen though anyway since a
>> physical device wouldn't know how to deal with such a scenario anyway.
> I disagree. Surely we should be able to "soft segment" the packet just
> before we give it to the physical device, and then tell it to do dumb copying
> of both the VXLAN and IPIP headers? At this point, we don't have the problem
> you identified above, because we've arrived at the device now.
One issue here is that all levels of IP headers would have to have the
DF bit set. I don't think that happens right now.
> So we can chase through some per-protocol callbacks to shorten all the outer
> lengths and adjust all the outer checksums, then hand it to the device for
> TSO. The device is treating the extra headers as an opaque blob, so it
> doesn't know or care whether it's one layer of encapsulation or forty-two.
So if we do pure software offloads this is doable. However the GSO
flags are meant to have hardware feature equivalents. The problem is
if you combine an IPIP and VXLAN header how do you know what header is
what and which order things are in, and what is the likelihood of
having a device that would get things right when dealing with 3 levels
of IP headers. This is one of the reasons why we don't support
multiple levels of tunnels in the GSO code. GSO is just meant to be a
fall-back for hardware offloads.
>>> I think an L2 header extension is a better semantic match for what's
>>> happening (the tunnel is an L2 device and we're sending L3 packets over
>>> it). But why does it matter? Are you giving the hardware the L2 and
>>> L3 headers in separate DMA descriptors or something? The way I see it,
>>> all hardware needs to be told is "where to start TSO from", and how it
>>> thinks of the stuff before that doesn't matter, because it's not
>>> supposed to touch it anyway.
>> The problem is setups like VFs where they want to know where the
>> network header starts so they know where to insert a VLAN tag. Most
>> hardware supports IP options or IPv6 extension headers. What I am
>> doing is exploiting that in the case of the Intel hardware by telling
>> it the IP header is the outer IP header and the transport header is
>> the inner transport header. Then all I have to deal with is the fact
>> that hardware will try to compute the entire IPv4 header checksum over
>> the range so I cancel that out by seeding the IP checksum with the
>> lco_csum added to the inner pseudo-header checksum.
> Ok, it sounds like the interface to Intel hardware is just Very Different
> to Solarflare hardware on this point: we don't tell our hardware anything
> about where the various headers start, it just parses them to figure it
> out. (And for new-style TSO we'd tell it where the TCP header starts, as
> I described before.)
That is kind of what I figured. So does that mean for IPv6 you guys
are parsing through extension headers? I believe one of the reasons
Intel did things the way they did was to avoid having to parse
through any IPv4 options or IPv6 extension headers.
> But this sounds like a driver-level thing: you have to undo some of what
> your hardware will do because you're having to lie to it about what you're
> giving it. So that all happens in the driver, the stack's GSO code isn't
> affected. In which case I'm much less bothered by this; I thought it was
> an assumption you were baking into the stack. As far as I'm concerned you
> can do whatever ugly hacks you like in Intel drivers, as long as I don't
> have to do them in sfc ;)
The thing is I have to have the hardware recompute the outer IPv4
checksum because it is updating the ID field. If I were able to leave
the IP ID static then I wouldn't need to update the checksum. The
only bit that makes it ugly is the fact that the hardware is being
misled so it thinks the IPv4 header has 50 bytes of options when those
actually represent the outer L4 through to the inner L3 headers.
> Although, why is your device computing the IPv4 header checksum? Those
> aren't supposed to be offloaded, the stack always already filled them in
> in software. (I think sfc makes the same mistake actually.) Is there not
> a way for you to tell your device to skip IP header checksum offload?
> Then you wouldn't have this problem in the first place, and you could tell
> it the IP header was as big as you like without having to seed it with
> this correction value.
It is because we are rewriting the IP ID. We cannot have a static
checksum if the IP ID is getting updated.
>>> Like I say, I'm assuming we'll start setting DF on outer frames.
>>> Besides, it doesn't matter what happens "in transit" - as long as the
>>> outer headers aren't touched by the transmitting NIC, the network can
>>> do what it likes to them afterwards, without it breaking our TSO code.
>> The IP ID bit is going to be something where I don't want to break
>> things. One of the things I have seen is there ends up being a number
>> of occasions where VXLAN gets fragmented due to incorrectly configured
>> MTU. I would rather not have it get completely screwed up when this
>> occurs. A performance hit for fragmenting is one thing. Having
>> people start complaining because previously working tunnels suddenly
>> stop functioning is another. The fact is the whole point of VXLAN is
>> to support sending the tunneled frames between data centers. With the
>> DF bit set on a tunnel header we end up making it so that frames get
>> dropped instead of fragmented which would be a change of behavior.
> I agree this isn't something we can do silently. But we _can_ make it a
> condition for enabling gso-partial. And I think it's a necessary
> condition for truly generic TSO. Sure, your 'L3 extension header' works
> fine for a single tunnel. But if you nest tunnels, you now need to
> update the outer _and_ middle IP IDs, and you can't do that because you
> only have one L3 header pointer.
This is getting away from the 'less is more' concept. If we are doing
multiple levels of tunnels we have already made things far too
complicated and it is unlikely hardware will ever support anything
like that.
> OTOH, for a single tunnel I think we could implement your 'L3 extension
> header' trick in firmware, by making it always parse the outer packet up
> to the outer L3 header and increment the IP ID in that. So I could live
> with that approach if necessary.
Good to hear.
>> Yes, I am very confident of that. For Intel hardware the outer VLAN
>> tag would be the one inserted via software, the inner VLAN tag is the
>> one inserted via hardware.
> That's really weird; why does it do that?
> For sfc, the only reason we do hardware VLANs at all is to transparently
> tag a VF (or non-primary PF) that's being passed through into an
> (untrusted) VM. For that use case you'd always want to insert the outer
> VLAN tag, because otherwise the VM can escape the isolation by inserting
> a different VLAN tag in software.
The problem is I think the feature was implemented long before any of
the SR-IOV stuff was added and it has been carried that way through
the igb and ixgbe drivers. I haven't looked at i40e but it doesn't
report doing STAG so I don't know how it handles QinQ.
>>> _However_, if we don't need to update the IP IDs, then we can just take
>>> the offset of the inner L4 header, and it doesn't make any difference
>>> whether you choose to think of the stuff before that as L2 + giant L3
>>> or as giant L2 + normal L3, because it's not part of the OS->NIC
>>> interface (which is just "L4 starts here"). If your NIC needs to be
>>> told where the outer L3 starts as well, then, I guess that's just a
>>> wart you need in your drivers. You have skb->network_header, so that
>>> shouldn't be difficult - that will always point to the outer one.
>> Right. That is kind of what I was going for with this setup. The
>> only issue is that the VXLAN tunnels not setting the DF bit kind of
>> get in the way of the giant L3 approach.
Sorry, I meant the giant L2 approach here, not L3.
> On the contrary; your giant L3 approach is exactly what solves this case
> (for non-nested tunnels) - the hardware has the outer IP and inner TCP
> header offsets, which are exactly the two headers it needs to alter.
> And if the hardware is sensible, it won't try to re-checksum the whole
> giant L3 header, it'll just decrement the (outer) IP checksum to account
> for incrementing the (outer) IP ID. If the hardware isn't sensible,
> then you have to play games like you are doing in the Intel drivers ;)
Agreed.
>> Dealing with the outer header needing the DF bit is something I have
>> left unaddressed at this point. The question would be what is the
>> correct approach to take for all this. I know RFC2003 for IPv4 in
>> IPv4 says you must set the DF bit if the inner header has the DF bit
>> set. I'm just wondering if we can apply the same type of logic to GRE
>> and UDP tunnels.
> I wonder if _not_ setting the DF bit in the outer header can harm TCP
> congestion control at all? If so, then we'd pretty much have to set it
> in that case.
Really the only concern with DF based on RFC 2003 is path MTU
discovery. What happens right now if you set the VXLAN MTU
incorrectly is that you take a performance hit as the tunneled frames
are fragmented and have to be reassembled on the other end. I might
need to look into this. It is possible that we are already doing the
wrong thing here and need to look at setting the DF bit anyway.
>> My concern is that I am not sure how much value there is to add with
>> this type of code if the hardware is parsing headers. In the case of
>> most of the Intel parts you specify offsets for the network and
>> transport headers so it gives you some degree of flexibility. If
>> however the hardware parses headers it becomes problematic as we can
>> only support protocols that can be parsed by the hardware.
> Solarflare parts will _normally_ parse headers to get that information.
> But, when doing TSO, we do get a chance to specify some extra
> information, in the TSO option descriptor. Enough of the datapath is
> under firmware control that that should be enough; as long as the outer
> frame is IP over Ethernet, the hardware will parse that fine, and we
> *should* be able to make it just accept that it doesn't know what's
> going on between that and the start of the TCP header. And, it
> shouldn't matter that the hardware can parse some types of tunnel
> headers, because we'll just tell it to ignore that.
Right. Just skipping the tunnel headers makes it quite a bit easier.
> Of course, that means changing the firmware; luckily we haven't got any
> parts in the wild doing tunnel offloads yet, so we still have a chance
> to do that without needing driver code to work around our past
> mistakes...
>
> But this stuff does definitely add value for us, it means we could TSO
> any tunnel type whatsoever; even nested tunnels as long as only the
> outermost IP ID needs to change.
Right. In your case it sounds like you would have the advantage of
just having to run essentially two counters, one increments the IPv4
ID and the other decrements the IPv4 checksum. Beyond that the outer
headers wouldn't need to change at all.
The only other issue would be determining how the inner pseudo-header
checksum is updated. If you were parsing out header fields from the
IP header previously to generate it you would instead need to update
things so that you could use the partial checksum that is already
stored in the TCP header checksum field.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 22:36 ` Alexander Duyck
@ 2016-03-23 23:00 ` Edward Cree
2016-03-23 23:15 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-23 23:00 UTC (permalink / raw)
To: Alexander Duyck
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On 23/03/16 22:36, Alexander Duyck wrote:
> On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote:
>> I disagree. Surely we should be able to "soft segment" the packet just
>> before we give it to the physical device, and then tell it to do dumb copying
>> of both the VXLAN and IPIP headers? At this point, we don't have the problem
>> you identified above, because we've arrived at the device now.
> One issue here is that all levels of IP headers would have to have the
> DF bit set. I don't think that happens right now.
Yes, that's still a requirement. (Well, except for the outermost IP header.)
>> So we can chase through some per-protocol callbacks to shorten all the outer
>> lengths and adjust all the outer checksums, then hand it to the device for
>> TSO. The device is treating the extra headers as an opaque blob, so it
>> doesn't know or care whether it's one layer of encapsulation or forty-two.
> So if we do pure software offloads this is doable. However the GSO
> flags are meant to have hardware feature equivalents. The problem is
> if you combine an IPIP and VXLAN header how do you know what header is
> what and which order things are in, and what is the likelihood of
> having a device that would get things right when dealing with 3 levels
> of IP headers. This is one of the reasons why we don't support
> multiple levels of tunnels in the GSO code. GSO is just meant to be a
> fall-back for hardware offloads.
Right, but if the hardware does things "the new way" it should work fine:
Packet still starts with Eth + IP. Packet still has TCP headers at some
specified offset. So it all works, as long as you don't have to update
any IP IDs except possibly the outermost one.
>> Ok, it sounds like the interface to Intel hardware is just Very Different
>> to Solarflare hardware on this point: we don't tell our hardware anything
>> about where the various headers start, it just parses them to figure it
>> out. (And for new-style TSO we'd tell it where the TCP header starts, as
>> I described before.)
> That is kind of what I figured. So does that mean for IPv6 you guys
> are parsing through extension headers? I believe one of the reasons
> Intel did things the way they did was to avoid having to parse
> through any IPv4 options or IPv6 extension headers.
I believe so, but I'd have to check with our firmware team to be sure.
The hardware needs to have that capability for RX processing, where it
wants to figure out things like the l4proto for IPv6: you have to walk
the extension headers until you get a layer 4 nexthdr. I wonder how
Intel manage without that?
>> I agree this isn't something we can do silently. But we _can_ make it a
>> condition for enabling gso-partial. And I think it's a necessary
>> condition for truly generic TSO. Sure, your 'L3 extension header' works
>> fine for a single tunnel. But if you nest tunnels, you now need to
>> update the outer _and_ middle IP IDs, and you can't do that because you
>> only have one L3 header pointer.
> This is getting away from the 'less is more' concept. If we are doing
> multiple levels of tunnels we have already made things far too
> complicated and it is unlikely hardware will ever support anything
> like that.
That's not how I understood the concept. I parsed it as "if hardware knows
less, we can get more out of it", i.e. by having the hardware blithely paste
together whatever headers you give it, you can support things like nested
tunnels. As long as your 'middle' IP header has DF set, this can be done
without the hardware needing to know a thing about it. And while we don't
need to implement that straight away, we should take care to design our
interfaces to ensure we can do that in the future without too much trouble.
>> Of course, that means changing the firmware; luckily we haven't got any
>> parts in the wild doing tunnel offloads yet, so we still have a chance
>> to do that without needing driver code to work around our past
>> mistakes...
>>
>> But this stuff does definitely add value for us, it means we could TSO
>> any tunnel type whatsoever; even nested tunnels as long as only the
>> outermost IP ID needs to change.
> Right. In your case it sounds like you would have the advantage of
> just having to run essentially two counters, one increments the IPv4
> ID and the other decrements the IPv4 checksum. Beyond that the outer
> headers wouldn't need to change at all.
Exactly.
> The only other issue would be determining how the inner pseudo-header
> checksum is updated. If you were parsing out header fields from the
> IP header previously to generate it you would instead need to update
> things so that you could use the partial checksum that is already
> stored in the TCP header checksum field.
Right, but again that's sufficiently under firmware control (AFAIK) that
that should just be a SMOP for the firmware. Though I will ask about
that tomorrow, just in case.
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 23:00 ` Edward Cree
@ 2016-03-23 23:15 ` Alexander Duyck
2016-03-24 17:12 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-23 23:15 UTC (permalink / raw)
To: Edward Cree
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/03/16 22:36, Alexander Duyck wrote:
>> On Wed, Mar 23, 2016 at 2:05 PM, Edward Cree <ecree@solarflare.com> wrote:
>>> I disagree. Surely we should be able to "soft segment" the packet just
>>> before we give it to the physical device, and then tell it to do dumb copying
>>> of both the VXLAN and IPIP headers? At this point, we don't have the problem
>>> you identified above, because we've arrived at the device now.
>> One issue here is that all levels of IP headers would have to have the
>> DF bit set. I don't think that happens right now.
> Yes, that's still a requirement. (Well, except for the outermost IP header.)
>>> So we can chase through some per-protocol callbacks to shorten all the outer
>>> lengths and adjust all the outer checksums, then hand it to the device for
>>> TSO. The device is treating the extra headers as an opaque blob, so it
>>> doesn't know or care whether it's one layer of encapsulation or forty-two.
>> So if we do pure software offloads this is doable. However the GSO
>> flags are meant to have hardware feature equivalents. The problem is
>> if you combine an IPIP and VXLAN header how do you know what header is
>> what and which order things are in, and what is the likelihood of
>> having a device that would get things right when dealing with 3 levels
>> of IP headers. This is one of the reasons why we don't support
>> multiple levels of tunnels in the GSO code. GSO is just meant to be a
>> fall-back for hardware offloads.
> Right, but if the hardware does things "the new way" it should work fine:
> Packet still starts with Eth + IP. Packet still has TCP headers at some
> specified offset. So it all works, as long as you don't have to update
> any IP IDs except possibly the outermost one.
Right, but the problem becomes how do you identify what tunnel wants
what. So for example we could theoretically have a UDP tunnel nested
inside a UDP tunnel with checksums. How would we tell which one wants
the checksum set and which one doesn't? The fact is we cannot. You
are looking too far ahead. We haven't gotten to tunnel in tunnel yet.
The approach as it stands doesn't have any issues that necessarily
prevent that as long as the outer is the only IP ID that has to
increment, but we don't support anything like that now so we don't
need to worry about it too much.
>>> Ok, it sounds like the interface to Intel hardware is just Very Different
>>> to Solarflare hardware on this point: we don't tell our hardware anything
>>> about where the various headers start, it just parses them to figure it
>>> out. (And for new-style TSO we'd tell it where the TCP header starts, as
>>> I described before.)
>> That is kind of what I figured. So does that mean for IPv6 you guys
>> are parsing through extension headers? I believe one of the reasons
>> Intel did things the way they did was to avoid having to parse
>> through any IPv4 options or IPv6 extension headers.
> I believe so, but I'd have to check with our firmware team to be sure.
> The hardware needs to have that capability for RX processing, where it
> wants to figure out things like the l4proto for IPv6: you have to walk
> the extension headers until you get a layer 4 nexthdr. I wonder how
> Intel manage without that?
They have some parsing in the Rx. That is one of the reasons why
there was all the arguing about adding GENEVE port numbers a few
months ago. They just don't make use of it in the Tx path with the
exception of the fm10k parts.
>>> I agree this isn't something we can do silently. But we _can_ make it a
>>> condition for enabling gso-partial. And I think it's a necessary
>>> condition for truly generic TSO. Sure, your 'L3 extension header' works
>>> fine for a single tunnel. But if you nest tunnels, you now need to
>>> update the outer _and_ middle IP IDs, and you can't do that because you
>>> only have one L3 header pointer.
>> This is getting away from the 'less is more' concept. If we are doing
>> multiple levels of tunnels we have already made things far too
>> complicated and it is unlikely hardware will ever support anything
>> like that.
> That's not how I understood the concept. I parsed it as "if hardware knows
> less, we can get more out of it", i.e. by having the hardware blithely paste
> together whatever headers you give it, you can support things like nested
> tunnels. As long as your 'middle' IP header has DF set, this can be done
> without the hardware needing to know a thing about it. And while we don't
> need to implement that straight away, we should care to design our
> interfaces to ensure we can do that in the future without too much trouble.
The design as is does nothing to prevent that. One of the reasons why
I prefer to keep the outer IP ID incrementing is in order to support
that kind of concept. It also shields us a bit, as we typically cannot
control the network between the tunnel endpoints since it is usually
traversing a WAN. What we need to do, though, is go through and see if
we can get away with something like "if the inner IP DF is set, the
outer IP DF bit must be set" kind of logic for GRE and UDP tunnels. If
we can push that, it would allow us to essentially fix all the tunnel
logic in one shot, since TCP requires the DF bit to be set, so all
levels of headers would have the DF bit set.
>>> Of course, that means changing the firmware; luckily we haven't got any
>>> parts in the wild doing tunnel offloads yet, so we still have a chance
>>> to do that without needing driver code to work around our past
>>> mistakes...
>>>
>>> But this stuff does definitely add value for us, it means we could TSO
>>> any tunnel type whatsoever; even nested tunnels as long as only the
>>> outermost IP ID needs to change.
>> Right. In your case it sounds like you would have the advantage of
>> just having to run essentially two counters, one increments the IPv4
>> ID and the other decrements the IPv4 checksum. Beyond that the outer
>> headers wouldn't need to change at all.
> Exactly.
>> The only other issue would be determining how the inner pseudo-header
>> checksum is updated. If you were parsing out header fields from the
>> IP header previously to generate it you would instead need to update
>> things so that you could use the partial checksum that is already
>> stored in the TCP header checksum field.
> Right, but again that's sufficiently under firmware control (AFAIK) that
> that should just be a SMOP for the firmware. Though I will ask about
> that tomorrow, just in case.
There shouldn't be much to it. In the case of the Intel parts they
want the length cancelled out of the checksum by the driver and then
fold it back in via hardware. I would imagine that your hardware
could probably do something similar or may already be doing it since
the length has to be handled differently for IPv4 vs IPv6.
- Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 18:19 ` Alexander Duyck
@ 2016-03-24 1:37 ` Jesse Gross
2016-03-24 2:53 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-24 1:37 UTC (permalink / raw)
To: Alexander Duyck
Cc: Tom Herbert, Alexander Duyck, Edward Cree,
Linux Kernel Network Developers, David S. Miller
On Wed, Mar 23, 2016 at 11:19 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote:
>> Can you add some description about strategy for dealing with ip_id?
>
> Yeah. I still need to add more documentation. I just didn't want to
> get into details on it until we have finalized things a bit more. I'm
> still wondering if we should follow the pattern of ipip tunnels in
> terms of setting the DF bit if the inner header has the DF bit set.
> If we end up needing to add code to do that then it makes it so that
> the ip_id value can be fixed for both inner and outer which makes the
> segmentation much simpler since the only header that would ever really
> need to be updated would be the transport header in order to get the
> checksum correct.
I tried to do this years ago but in practice it broke things.
There's enough middleboxes/firewalls/etc. out there that filter ICMP
messages that path MTU discovery isn't necessarily reliable. And while
you might argue that if the box is breaking things then the same would
be true for the original, unencapsulated TCP stream but a lot of times
there are some other hacks built in (like MSS clamping) that make
assumptions that the traffic is TCP. So at the minimum it is generally
good to have an option to force the DF bit off.
That being said, I actually think that it is good to have the DF bit
on by default for encapsulation headers being added. Unintentional
(and perhaps multiple layers of) fragmentation usually results in
unusably bad performance, so it is best to try to correct it,
hopefully automatically in most cases. And, of course, this is the
direction that IPv6 has already gone. If we can assume that this is
the most common case then in practice we can keep the outer headers
constant for the high performance path.
To me, incrementing the inner IP really seems the best choice. The
inner header is most likely someone else's traffic, so it is best not
to mess with that, whereas the outer headers are likely ours and we know
the parameters for them (and can set the DF bit as we believe is
correct). Also, if you are looking forward to the future as far as
stacking multiple layers of tunnels, I think the only consistent thing
to do is have the inner ID increment and all of the tunnel headers
stay fixed - it is hard to justify why the first tunnel header should
increment but not the second one. And finally, as a nice bonus, this
is what the GRO code has been expecting already so you won't cause any
performance regressions with existing systems.
* Re: [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID
2016-03-18 23:24 ` [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID Alexander Duyck
@ 2016-03-24 1:43 ` Jesse Gross
2016-03-24 2:21 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-24 1:43 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Linux Kernel Network Developers, David Miller,
Alexander Duyck, Tom Herbert
On Fri, Mar 18, 2016 at 4:24 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> In RFC 6864 it is stated that we can essentially ignore the IPv4 ID field
> if we have not and will not use fragmentation. Such a frame is defined
> as having the DF flag set to 1, and the MF and frag_offset as 0. Currently
> for GRO we were requiring that the inner header always have an increasing
> IPv4 ID, but we are ignoring the outer value.
>
> This patch is a first step in trying to reverse some of that. Specifically
> what this patch does is allow us to coalesce frames that have a static IPv4
> ID value. So for example if we had a series of frames where the DF flag
> was set we would allow the same IPv4 ID value to be used for all the frames
> belonging to that flow. This would become the standard behavior for TCP so
> it would support either a fixed IPv4 ID value, or one in which the value
> increments.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
One thing that is a bit odd is that the TSO output procedure stays the
same. So that means that if we get a stream of packets with the DF bit
set and a constant IP ID, aggregate them with GRO, and then retransmit
with GSO/TSO then we'll get packets with IDs that increment for each
burst and then start back again to the original value. I guess it
doesn't matter in practice since the IDs are supposed to be ignored
but it does seem a little strange - especially because the new packets
will now be violating the rules of the same GRO implementation that
produced them.
* Re: [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID
2016-03-24 1:43 ` Jesse Gross
@ 2016-03-24 2:21 ` Alexander Duyck
2016-03-28 4:57 ` Jesse Gross
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-24 2:21 UTC (permalink / raw)
To: Jesse Gross
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 6:43 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 4:24 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> In RFC 6864 it is stated that we can essentially ignore the IPv4 ID field
>> if we have not and will not use fragmentation. Such a frame is defined
>> as having the DF flag set to 1, and the MF and frag_offset as 0. Currently
>> for GRO we were requiring that the inner header always have an increasing
>> IPv4 ID, but we are ignoring the outer value.
>>
>> This patch is a first step in trying to reverse some of that. Specifically
>> what this patch does is allow us to coalesce frames that have a static IPv4
>> ID value. So for example if we had a series of frames where the DF flag
>> was set we would allow the same IPv4 ID value to be used for all the frames
>> belonging to that flow. This would become the standard behavior for TCP so
>> it would support either a fixed IPv4 ID value, or one in which the value
>> increments.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> One thing that is a bit odd is that the TSO output procedure stays the
> same. So that means that if we get a stream of packets with the DF bit
> set and a constant IP ID, aggregate them with GRO, and then retransmit
> with GSO/TSO then we'll get packets with IDs that increment for each
> burst and then start back again to the original value. I guess it
> doesn't matter in practice since the IDs are supposed to be ignored
> but it does seem a little strange - especially because the new packets
> will now be violating the rules of the same GRO implementation that
> produced them.
Yes and no. The rule for GRO with this patch is that the IP ID has to
be either incrementing or if DF is set it has the option to be a fixed
value for a given grouping of packets. In that regard, both GSO
partial and standard GSO are still reversible via GRO, so that you
can aggregate to get back to the original frame (ignoring the IP ID)
that GSO segmented. The bit I am still trying to work out is what to
do about the case where we GRO 2 frames out of one GSO segment. I
wonder if I should just totally ignore the IP ID value since it ends
up creating an artificial boundary between the two frames if they are
segmented using the incrementing GSO versus the fixed IP ID GSO.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 1:37 ` Jesse Gross
@ 2016-03-24 2:53 ` Alexander Duyck
2016-03-28 5:35 ` Jesse Gross
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-24 2:53 UTC (permalink / raw)
To: Jesse Gross
Cc: Tom Herbert, Alexander Duyck, Edward Cree,
Linux Kernel Network Developers, David S. Miller
On Wed, Mar 23, 2016 at 6:37 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Wed, Mar 23, 2016 at 11:19 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Mar 23, 2016 at 10:09 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> Can you add some description about strategy for dealing with ip_id?
>>
>> Yeah. I still need to add more documentation. I just didn't want to
>> get into details on it until we have finalized things a bit more. I'm
>> still wondering if we should follow the pattern of ipip tunnels in
>> terms of setting the DF bit if the inner header has the DF bit set.
>> If we end up needing to add code to do that then it makes it so that
>> the ip_id value can be fixed for both inner and outer which makes the
>> segmentation much simpler since the only header that would ever really
>> need to be updated would be the transport header in order to get the
>> checksum correct.
>
> I tried to do this years ago but in practice it broke things.
>
> There's enough middleboxes/firewalls/etc. out there that filter ICMP
> messages that path MTU discovery isn't necessarily reliable. And while
> you might argue that if the box is breaking things then the same would
> be true for the original, unencapsulated TCP stream but a lot of times
> there are some other hacks built in (like MSS clamping) that make
> assumptions that the traffic is TCP. So at the minimum it is generally
> good to have an option to force the DF bit off.
Agreed, I think if we tried anything we would have to have some sort
of bit remaining to shut it off.
> That being said, I actually think that it is good to have the DF bit
> on by default for encapsulation headers being added. Unintentional
> (and perhaps multiple layers of) fragmentation usually results in
> unusably bad performance, so it is best to try to correct it,
> hopefully automatically in most cases. And, of course, this is the
> direction that IPv6 has already gone. If we can assume that this is
> the most common case then in practice we can keep the outer headers
> constant for the high performance path.
I'm still weighing my options on how to address the DF issue. I'm
not really sure having DF always on for outer headers is the best way
to go either. I kind of like the idea that if DF is set for the inner
headers then we set it for the outer headers. I just have to see how
hard something like that would be to implement.
> To me, incrementing the inner IP really seems the best choice. The
> inner header is most likely someone else's traffic, so it is best not
> to mess with that, whereas the outer headers are likely ours and we know
> the parameters for them (and can set the DF bit as we believe is
> correct). Also, if you are looking forward to the future as far as
> stacking multiple layers of tunnels, I think the only consistent thing
> to do is have the inner ID increment and all of the tunnel headers
> stay fixed - it is hard to justify why the first tunnel header should
> increment but not the second one. And finally, as a nice bonus, this
> is what the GRO code has been expecting already so you won't cause any
> performance regressions with existing systems.
Agreed. However in the case of the igb and ixgbe incrementing the
inner isn't really going to work well as it introduces a number of
issues. I've considered not enabling GSO partial by default on those
parts, but leaving it as an available feature. In the case of i40e we
will not have any of those issues as both the inner and outer IP IDs
will increment.
The regression issue shouldn't be all that big. In most cases
tunnels are point to point links. Odds are if someone is running a
network they should have similar devices on both ends. So in the case
of the Intel drivers I have patched here updating i40e isn't that big
of a deal since it retains the original GSO behavior. In the case of
ixgbe it didn't support any tunnel offloads so there is no GRO support
without checksums being enabled in the tunnel, and Tx checksum support
via HW_CSUM has yet to be pushed upstream from Jeff's tree, so things
are currently throttled by the Tx side.
GSO partial is going to end up being very driver specific in terms of
what increments and what doesn't as we are having to really hack our
way around to make it do things the spec sheets say it cannot do.
Some implementations are going to end up working out better than
others in terms of interoperability and what can make use of other
advanced features such as GRO on legacy systems.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-23 23:15 ` Alexander Duyck
@ 2016-03-24 17:12 ` Edward Cree
2016-03-24 18:43 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-24 17:12 UTC (permalink / raw)
To: Alexander Duyck
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On 23/03/16 23:15, Alexander Duyck wrote:
> Right, but the problem becomes how do you identify what tunnel wants
> what. So for example we could theoretically have a UDP tunnel nested
> inside a UDP tunnel with checksums. How would we tell which one wants
> the checksum set and which one doesn't? The fact is we cannot.
I think we can still handle that, assuming the device is only touching the
innermost checksum (i.e. it's obeying csum_start/offset). We don't need
flags to tell us what to fill in in GSO, we can work it all out:
Make the series of per-protocol callbacks for GSO partial run inner-
outwards, by using recursion at the head. Make each return a csum_edit
value. Then for example:
For IPv4 header, our checksum covers only our header, so we fold any edits
into our own checksum, and pass csum_edit through unchanged.
For UDP header, we look to see if the current checksum field is zero. If
so, we leave it as zero, fold our edits into csum_edit and return the
result. Otherwise, we fold our edits and csum_edit into our checksum
field, and return zero.
For GRE, we look at the checksumming bit in the GRE header, and behave
similarly to UDP.
Etcetera...
This should even be a worthwhile simplification of the non-nested case,
because (if I understand correctly) it means GSO partial doesn't need the
various gso_type flags we already have to specify tunnel type and checksum
status; it just figures it out as it goes.
If your device is touching other checksums as well, then of course you need
to figure that out in your driver so you can cancel it out. But the device
will only fiddle with the headers you tell it about (in your case I think
that's outermost L3), not any others in the middle. So it should still all
work, without the driver having to know about the nesting.
> You are
> looking too far ahead. We haven't gotten to tunnel in tunnel yet.
IMHO, if our offloads are truly generic, tunnel in tunnel should be low-
hanging fruit. (In principle, "VxLAN + Ethernet + IP + GRE" is just
another encapsulation header, albeit a rather long one). Therefore, if
it _isn't_ low-hanging fruit for us, we should suspect that we aren't
generic. So even if it's not currently useful in itself, it's still a
convenient canary.
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 17:12 ` Edward Cree
@ 2016-03-24 18:43 ` Alexander Duyck
2016-03-24 20:17 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-24 18:43 UTC (permalink / raw)
To: Edward Cree
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/03/16 23:15, Alexander Duyck wrote:
>> Right, but the problem becomes how do you identify what tunnel wants
>> what. So for example we could theoretically have a UDP tunnel nested
>> inside a UDP tunnel with checksums. How would we tell which one wants
>> the checksum set and which one doesn't? The fact is we cannot.
> I think we can still handle that, assuming the device is only touching the
> innermost checksum (i.e. it's obeying csum_start/offset). We don't need
> flags to tell us what to fill in in GSO, we can work it all out:
> Make the series of per-protocol callbacks for GSO partial run inner-
> outwards, by using recursion at the head. Make each return a csum_edit
> value. Then for example:
> For IPv4 header, our checksum covers only our header, so we fold any edits
> into our own checksum, and pass csum_edit through unchanged.
Right. IPv4 is easy because it is a localized checksum that is always present.
> For UDP header, we look to see if the current checksum field is zero. If
> so, we leave it as zero, fold our edits into csum_edit and return the
> result. Otherwise, we fold our edits and csum_edit into our checksum
> field, and return zero.
This would require changing how we handle partial checksums so that in
the case of UDP we don't allow 0 as a valid value. Currently we do.
It isn't till we get to the final checksum that we take care of the
bit flip in the case of 0.
> For GRE, we look at the checksumming bit in the GRE header, and behave
> similarly to UDP.
> Etcetera...
Right. In the case of GRE we at least have a flag we could check.
> This should even be a worthwhile simplification of the non-nested case,
> because (if I understand correctly) it means GSO partial doesn't need the
> various gso_type flags we already have to specify tunnel type and checksum
> status; it just figures it out as it goes.
Yes, but doing packet inspection can get to be expensive as we crawl
through the headers. In addition it gets into the whole software
versus hardware offloads thing.
> If your device is touching other checksums as well, then of course you need
> to figure that out in your driver so you can cancel it out. But the device
> will only fiddle with the headers you tell it about (in your case I think
> that's outermost L3), not any others in the middle. So it should still all
> work, without the driver having to know about the nesting.
>
>> You are
>> looking too far ahead. We haven't gotten to tunnel in tunnel yet.
> IMHO, if our offloads are truly generic, tunnel in tunnel should be low-
> hanging fruit. (In principle, "VxLAN + Ethernet + IP + GRE" is just
> another encapsulation header, albeit a rather long one). Therefore, if
> it _isn't_ low-hanging fruit for us, we should suspect that we aren't
> generic. So even if it's not currently useful in itself, it's still a
> convenient canary.
Honestly I think tunnel-in-tunnel is not going to be doable due to the
fact that we would have to increment multiple layers of IP ID in order
to do it correctly. The more I look into the whole DF on outer
headers thing the more I see RFCs such as RFC 2784 that say not to do
it unless you want to implement PMTU discovery, and PMTU discovery is
inherently flawed since it would require ICMP messages to be passed
which may be filtered out by firewalls.
On top of that it occurred to me that GRE cannot be screened in GRO
for the outer-IP-ID check. Basically what can happen is on devices
that don't parse inner headers for GRE we can end up with two TCP
flows from the same tunnel essentially stomping on each other and
causing one another to get evicted for having an outer IP-ID that
doesn't match up with the expected value.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 18:43 ` Alexander Duyck
@ 2016-03-24 20:17 ` Edward Cree
2016-03-24 21:50 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-24 20:17 UTC (permalink / raw)
To: Alexander Duyck
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On 24/03/16 18:43, Alexander Duyck wrote:
> On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote:
>> For UDP header, we look to see if the current checksum field is zero. If
>> so, we leave it as zero, fold our edits into csum_edit and return the
>> result. Otherwise, we fold our edits and csum_edit into our checksum
>> field, and return zero.
> This would require changing how we handle partial checksums so that in
> the case of UDP we don't allow 0 as a valid value. Currently we do.
> It isn't till we get to the final checksum that we take care of the
> bit flip in the case of 0.
No, the UDP checksum will have been filled in by LCO and thus have been bit-
flipped already if it was zero. Only the innermost L4 header will have a
partial checksum, and that's TCP so the checksum is required. (Alternatively:
whichever header has the partial checksum - and there is at most one - is
identified by skb->csum_start, and by definition the checksum must be enabled
for that header, so we can skip the 'check for zero' heuristic there.)
(Besides, I thought it was impossible for the partial checksum to be zero
anyway because at least one of the inputs must be nonzero, and the end-
around carry can never produce a zero. But maybe I'm getting confused here.)
>> This should even be a worthwhile simplification of the non-nested case,
>> because (if I understand correctly) it means GSO partial doesn't need the
>> various gso_type flags we already have to specify tunnel type and checksum
>> status; it just figures it out as it goes.
> Yes, but doing packet inspection can get to be expensive as we crawl
> through the headers. In addition it gets into the whole software
> versus hardware offloads thing.
The headers should already be in cache, I thought, and this is only happening
once per superframe. We're already going to have to crawl through the headers
anyway to edit the lengths, I don't think it should cost much more to also
inspect things like GRE csum bit or the UDP checksum field. And by
identifying the 'next header' from this method as well, we don't need to add a
new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to
the kernel.
(And remember that this is separate from - and doesn't impact - the existing
GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from
the foo_gso_segment() ones; and it's about preparing a superframe for hardware
offload rather than doing the segmentation in software. I think it's
preferable to have some preparation happen in software if that allows hardware
to be totally dumb. (Apologies if that was already all obvious.))
> Honestly I think tunnel-in-tunnel is not going to be doable due to the
> fact that we would have to increment multiple layers of IP ID in order
> to do it correctly. The more I look into the whole DF on outer
> headers thing the more I see RFCs such as RFC 2784 that say not to do
> it unless you want to implement PMTU discovery, and PMTU discovery is
> inherently flawed since it would require ICMP messages to be passed
> which may be filtered out by firewalls.
If PMTU discovery really is "inherently flawed", then TCP is broken and so
is IPv6. I think the Right Thing here is for us to translate and forward
ICMP errors to the originator of the traffic, as RFC 2784 vaguely suggests.
It should also be possible to configure the tunnel endpoint's MTU to a
sufficiently low value that in practice it will fit within the path MTU;
then the sender will discover the tunnel's MTU restriction and stay within
that. (I am assuming here that ICMP won't be filtered on the overlay
network - which should be under the control of the tunnel administrator -
even if it might be on the underlay network.)
> On top of that it occurred to me that GRE cannot be screened in GRO
> for the outer-IP-ID check. Basically what can happen is on devices
> that don't parse inner headers for GRE we can end up with two TCP
> flows from the same tunnel essentially stomping on each other and
> causing one another to get evicted for having an outer IP-ID that
> doesn't match up with the expected value.
Yes, that does seem problematic - you'd need to save away the IPID check
result and only evict once you'd ascertained they were the same flow. All
the more reason to try to make your tunnels use DF *grin*.
On the subject of GRO, I was wondering - it seems like the main reason why
drivers have LRO is so they can be more permissive than GRO's "must be
reversible" rules. (At least, that seems to be the case for sfc's LRO,
which is only in our out-of-tree driver.) Maybe instead of each driver
having its own LRO code (with hacks to avoid breaking Slow Start and
suchlike by breaking ACK stats), there should be per-device options to
control how permissive GRO is (e.g. don't check IP IDs), so that a user who
wants LRO-like behaviour can get it from GRO.
Any obvious reason why I couldn't do such a thing?
(I realise LRO might not go away entirely, if other drivers are doing
various hardware-assisted things. But in our case it really is all in
software.)
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 20:17 ` Edward Cree
@ 2016-03-24 21:50 ` Alexander Duyck
2016-03-24 23:00 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-24 21:50 UTC (permalink / raw)
To: Edward Cree
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 24/03/16 18:43, Alexander Duyck wrote:
>> On Thu, Mar 24, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> For UDP header, we look to see if the current checksum field is zero. If
>>> so, we leave it as zero, fold our edits into csum_edit and return the
>>> result. Otherwise, we fold our edits and csum_edit into our checksum
>>> field, and return zero.
>> This would require changing how we handle partial checksums so that in
>> the case of UDP we don't allow 0 as a valid value. Currently we do.
>> It isn't till we get to the final checksum that we take care of the
>> bit flip in the case of 0.
> No, the UDP checksum will have been filled in by LCO and thus have been bit-
> flipped already if it was zero. Only the innermost L4 header will have a
> partial checksum, and that's TCP so the checksum is required. (Alternatively:
> whichever header has the partial checksum - and there is at most one - is
> identified by skb->csum_start, and by definition the checksum must be enabled
> for that header, so we can skip the 'check for zero' heuristic there.)
No, recheck the code. If the skb is GSO when we call udp_set_csum()
we only have populated the partial checksum in the UDP header. It
isn't until we actually perform the segmentation that we call
lco_csum().
> (Besides, I thought it was impossible for the partial checksum to be zero
> anyway because at least one of the inputs must be nonzero, and the end-
> around carry can never produce a zero. But maybe I'm getting confused here.)
I forgot about that bit. I think you are right. We end up inverting
the output from csum_fold, so we are back to 0x1 through 0xFFFF as possible
values.
>>> This should even be a worthwhile simplification of the non-nested case,
>>> because (if I understand correctly) it means GSO partial doesn't need the
>>> various gso_type flags we already have to specify tunnel type and checksum
>>> status; it just figures it out as it goes.
>> Yes, but doing packet inspection can get to be expensive as we crawl
>> through the headers. In addition it gets into the whole software
>> versus hardware offloads thing.
> The headers should already be in cache, I thought, and this is only happening
> once per superframe. We're already going to have to crawl through the headers
> anyway to edit the lengths, I don't think it should cost much more to also
> inspect things like GRE csum bit or the UDP checksum field. And by
> identifying the 'next header' from this method as well, we don't need to add a
> new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to
> the kernel.
Right, but this gets back into the hardware flags issue. The whole
reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware
typically supports one but not the other.
The other thing to keep in mind is it is much easier to figure out
what offloads we can make use of when we are only dealing with at most
1 layer of tunnels. When we start getting into stacked tunnels it
really becomes impossible to figure out what features in the hardware
we can still make use of. Throw everything and the kitchen sink at it
and we have to give up and do everything in software. At least if we
restrict the features being requested, the hardware has a chance of
being useful.
> (And remember that this is separate from - and doesn't impact - the existing
> GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from
> the foo_gso_segment() ones; and it's about preparing a superframe for hardware
> offload rather than doing the segmentation in software. I think it's
> preferable to have some preparation happen in software if that allows hardware
> to be totally dumb. (Apologies if that was already all obvious.))
This is where you and I differ. I think there are things you are overlooking.
For example one of the reasons why I know the outer UDP checksum works
the way it does with tunnel GSO is because the i40e and i40evf drivers
already have hardware that supports UDP_TUNNEL_CSUM. As such, in order
to do what you are proposing we would likely have to rip that code out
and completely rewrite it, since it would change the checksum value
given to the device.
>> Honestly I think tunnel-in-tunnel is not going to be doable due to the
>> fact that we would have to increment multiple layers of IP ID in order
>> to do it correctly. The more I look into the whole DF on outer
>> headers thing the more I see RFCs such as RFC 2784 that say not to do
>> it unless you want to implement PMTU discovery, and PMTU discovery is
>> inherently flawed since it would require ICMP messages to be passed
>> which may be filtered out by firewalls.
> If PMTU discovery really is "inherently flawed", then TCP is broken and so
> is IPv6. I think the Right Thing here is for us to translate and forward
> ICMP errors to the originator of the traffic, as RFC 2784 vaguely suggests.
> It should also be possible to configure the tunnel endpoint's MTU to a
> sufficiently low value that in practice it will fit within the path MTU;
> then the sender will discover the tunnel's MTU restriction and stay within
> that. (I am assuming here that ICMP won't be filtered on the overlay
> network - which should be under the control of the tunnel administrator -
> even if it might be on the underlay network.)
I'm not going to get into an argument over the merits of PMTU. The
fact is I don't believe it is worth the effort just to enable the
tunnel-in-tunnel case you envision.
Also another thing to keep in mind is you would have to find a way to
identify that a tunnel requesting to be segmented is setting the DF
bit in the outer headers.
>> On top of that it occurred to me that GRE cannot be screened in GRO
>> for the outer-IP-ID check. Basically what can happen is on devices
>> that don't parse inner headers for GRE we can end up with two TCP
>> flows from the same tunnel essentially stomping on each other and
>> causing one another to get evicted for having an outer IP-ID that
>> doesn't match up with the expected value.
> Yes, that does seem problematic - you'd need to save away the IPID check
> result and only evict once you'd ascertained they were the same flow. All
> the more reason to try to make your tunnels use DF *grin*.
Feel free to solve the problem if you believe it is that easy. There
is nothing in my patches to prevent that.
> On the subject of GRO, I was wondering - it seems like the main reason why
> drivers have LRO is so they can be more permissive than GRO's "must be
> reversible" rules. (At least, that seems to be the case for sfc's LRO,
> which is only in our out-of-tree driver.) Maybe instead of each driver
> having its own LRO code (with hacks to avoid breaking Slow Start and
> suchlike by breaking ACK stats), there should be per-device options to
> control how permissive GRO is (e.g. don't check IP IDs), so that a user who
> wants LRO-like behaviour can get it from GRO.
I don't really see the point. Odds are the LRO in the driver isn't
much more efficient than the in-kernel GRO. If anything, I would
think the main motivation for maintaining LRO in your out-of-tree
driver is to support kernels that predate GRO.
> Any obvious reason why I couldn't do such a thing?
>
> (I realise LRO might not go away entirely, if other drivers are doing
> various hardware-assisted things. But in our case it really is all in
> software.)
I don't know. I'm thinking it all depends on what kind of stuff you
are talking about trying to change. Most of the flush conditions in
GRO have very good reasons for being there. I know in the case of the
Intel drivers I actually saw improvements after copying most of those
conditions into the LRO code in our out-of-tree driver. More often
than not, more aggressive LRO ends up causing more harm than good,
because it spends too much time aggregating and starves the other side
by withholding ACKs.
- Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 21:50 ` Alexander Duyck
@ 2016-03-24 23:00 ` Edward Cree
2016-03-24 23:35 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Edward Cree @ 2016-03-24 23:00 UTC (permalink / raw)
To: Alexander Duyck
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On 24/03/16 21:50, Alexander Duyck wrote:
> On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote:
>> (Besides, I thought it was impossible for the partial checksum to be zero
>> anyway because at least one of the inputs must be nonzero, and the end-
>> around carry can never produce a zero. But maybe I'm getting confused here.)
>
> I forgot about that bit. I think you are right. We end up inverting
> the output from csum fold so we are back to 0x1 - 0xFFFF as possible
> values.
Yes, it seems csum_fold inverts it and then the relevant callers invert it
again. (I hadn't spotted either inversion, so I panicked for a moment when
I thought it was getting inverted just once.)
>> The headers should already be in cache, I thought, and this is only happening
>> once per superframe. We're already going to have to crawl through the headers
>> anyway to edit the lengths, I don't think it should cost much more to also
>> inspect things like GRE csum bit or the UDP checksum field. And by
>> identifying the 'next header' from this method as well, we don't need to add a
>> new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to
>> the kernel.
>
> Right, but this gets back into the hardware flags issue. The whole
> reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware
> typically supports one but not the other.
Right, and for hardware supporting a specific GSO_FOO[_CSUM], we'll still
need the flags - but in that case we'll be doing "old style" TSO anyway.
GSO partial only really makes sense for a device that isn't (or at least can
be prevented from) parsing the tunnel headers, and such hardware will support
GSO_PARTIAL only, not a profusion of GSO_FOO (with or without _CSUM).
So: in the initial transmit path we build a coherent superframe; when we get
to the device, we say either "oh, device doesn't support offloads at all,
call GSO", or "oh, device supports this particular GSO type for TSO, so give
it the superframe SKB to do what it wants with", or "oh, device supports
GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I
was describing in previous email] and then give the non-coherent SKB to the
device".
And in that last case it doesn't matter what the GSO type is, because the
hardware can handle anything as long as we transform it into GSO partial.
> The other thing to keep in mind is it is much easier to figure out
> what offloads we can make use of when we are only dealing with at most
> 1 layer of tunnels. When we start getting into stacked tunnels it
> really becomes impossible to figure out what features in the hardware
> we can still make use of. You throw everything and the kitchen sink
> on it and we have to give up and would be doing everything in
> software. At least if we restrict the features being requested the
> hardware has a chance to perhaps be useful.
I think all the protocol-specific offloads can just say "don't even try
if you've got stacked tunnels"; it's not something I expect current
hardware to support, and if GSO partial works out then a hw vendor would
have to be crazy to add such support in the future. But GSO partial
doesn't care about stacked tunnels, it still works. Unless of course
someone in the middle wants to change their IP IDs or GRE Counter or
whatever, in which case they can mark the skb as "not gso_partial-safe"
when they first build their header, or their gso_partial callback can
return "nope", and then we fall back on a protocol-specific offload (if
one's available, which it won't be for stacked tunnels) or software
segmentation.
>> (And remember that this is separate from - and doesn't impact - the existing
>> GSO code; this is a bunch of new foo_gso_partial() callbacks, distinct from
>> the foo_gso_segment() ones; and it's about preparing a superframe for hardware
>> offload rather than doing the segmentation in software. I think it's
>> preferable to have some preparation happen in software if that allows hardware
>> to be totally dumb. (Apologies if that was already all obvious.))
>
> This is where you and I differ. I think there are things you are overlooking.
>
> For example one of the reasons why I know the outer UDP checksum works
> the way it does with tunnel GSO is because the i40e and i40evf drivers
> already have hardware that supports UDP_TUNNEL_CSUM. As such in order
> to do what you are proposing we would likely have to rip that code out
> and completely rewrite it since it would change out the checksum value
> given to the device.
I'm not sure I understand, so bear with me. AIUI, UDP_TUNNEL_CSUM means
that the device will fill in both inner and outer L4 checksums. But it can
only do that if it knows where the inner and outer L4 headers *are*, and I
thought you were going to tell it not to treat it as a tunnel at all, but
instead as a giant L3 header. So it won't touch the outer L4 checksum at
all (and, in the case of stacked tunnels, it wouldn't touch the 'middle'
one either - it only sees the innermost L4 header).
So while the driver would need to change to add GSO_partial support, the
hardware wouldn't.
Where am I going wrong?
> I'm not going to get into an argument over the merits of PMTU. The
> fact is I don't believe it is worth the effort in order to enable the
> tunnel-in-tunnel case you envision.
I agree it's not worth the effort to implement it right now. But some day
it will be, and at that point, if we've designed GSO partial the way I'm
arguing for, it will just magically start working when we disable
fragmentation on inner tunnels. And in the meantime we'll be using a more
generic interface that hopefully won't encourage driver (and hardware)
designers to see the gso_type enum as a list of protocols to support.
Again, tunnel-in-tunnel isn't important in itself, but as a canary for
"can we support whatever encapsulation protocol comes down the pike";
arguing that you won't be able to set DF on your inner tunnel because
operational consideration XYZ is missing the point IMHO.
> Also another thing to keep in mind is you would have to find a way to
> identify that a tunnel requesting to be segmented is setting the DF
> bit in the outer headers.
But that's easy, as above: either foo_xmit sets a "don't GSO-partial me"
flag in the SKB, or foo_gso_partial returns -EMIGHTBEFRAGMENTED. The
former is probably preferable for performance reasons.
>>> On top of that it occurred to me that GRE cannot be screened in GRO
>>> for the outer-IP-ID check. Basically what can happen is on devices
>>> that don't parse inner headers for GRE we can end up with two TCP
>>> flows from the same tunnel essentially stomping on each other and
>>> causing one another to get evicted for having an outer IP-ID that
>>> doesn't match up with the expected value.
>
>> Yes, that does seem problematic - you'd need to save away the IPID check
>> result and only evict once you'd ascertained they were the same flow. All
>> the more reason to try to make your tunnels use DF *grin*.
>
> Feel free to solve the problem if you believe it is that easy. There
> is nothing in my patches to prevent that.
I'm not saying it's easy. The GRO side of your patches looks great to me
and I think you're more likely to solve it than I am. But I will have a
go at it too, if I have time.
Though I must admit, I don't quite understand why we have to restrict to
(incrementing or constant) in GRO, rather than just saying "it's DF, we
can ignore the IP IDs and just aggregate it all together". We won't be
producing the original IP IDs anyway if we resegment, because we don't
know whether they were originally incrementing or constant. (In other
words, "RFC6864 compliant GRO" as you've defined it can reverse either
GSO or GSO partial, but GSO (in either form) can't always reverse
RFC6864 compliant GRO.)
-Ed
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 23:00 ` Edward Cree
@ 2016-03-24 23:35 ` Alexander Duyck
2016-03-25 0:37 ` Edward Cree
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2016-03-24 23:35 UTC (permalink / raw)
To: Edward Cree
Cc: Or Gerlitz, Alexander Duyck, Netdev, David Miller, Tom Herbert
On Thu, Mar 24, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 24/03/16 21:50, Alexander Duyck wrote:
>> On Thu, Mar 24, 2016 at 1:17 PM, Edward Cree <ecree@solarflare.com> wrote:
>>> (Besides, I thought it was impossible for the partial checksum to be zero
>>> anyway because at least one of the inputs must be nonzero, and the end-
>>> around carry can never produce a zero. But maybe I'm getting confused here.)
>>
>> I forgot about that bit. I think you are right. We end up inverting
>> the output from csum fold so we are back to 0x1 - 0xFFFF as possible
>> values.
> Yes, it seems csum_fold inverts it and then the relevant callers invert it
> again. (I hadn't spotted either inversion, so I panicked for a moment when
> I thought it was getting inverted just once.)
>
>>> The headers should already be in cache, I thought, and this is only happening
>>> once per superframe. We're already going to have to crawl through the headers
>>> anyway to edit the lengths, I don't think it should cost much more to also
>>> inspect things like GRE csum bit or the UDP checksum field. And by
>>> identifying the 'next header' from this method as well, we don't need to add a
>>> new SKB_GSO_FOO bit or two every time we add another kind of encapsulation to
>>> the kernel.
>>
>> Right, but this gets back into the hardware flags issue. The whole
>> reason for SKB_GSO_FOO and SKB_GSO_FOO_CSUM is because hardware
>> typically supports one but not the other.
> Right, and for hardware supporting a specific GSO_FOO[_CSUM], we'll still
> need the flags - but in that case we'll be doing "old style" TSO anyway.
> GSO partial only really makes sense for a device that isn't (or at least can
> be prevented from) parsing the tunnel headers, and such hardware will support
> GSO_PARTIAL only, not a profusion of GSO_FOO (with or without _CSUM).
>
> So: in the initial transmit path we build a coherent superframe; when we get
> to the device, we say either "oh, device doesn't support offloads at all,
> call GSO", or "oh, device supports this particular GSO type for TSO, so give
> it the superframe SKB to do what it wants with", or "oh, device supports
> GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I
> was describing in previous email] and then give the non-coherent SKB to the
> device".
What you are proposing is essentially forking GSO. I really don't
like that. I prefer the approach we have now where essentially GSO
partial is something checked for at the very end and we have a few
checks when putting together the headers so we do the right thing.
> And in that last case it doesn't matter what the GSO type is, because the
> hardware can handle anything as long as we transform it into GSO partial.
Yes and no. The problem is trying to figure out what the device
supports. The hw_enc_features flags are only meant to be used with
one layer of tunnels. Are you suggesting we add some code to make
sure the GSO tunnel types are kept mutually exclusive and add an extra
offload that forces GSO/GSO_PARTIAL if they aren't?
This is sounding very complicated. I think we would be better off if
you took the time to try and implement some of this yourself so you
could see how feasible it is.
>> The other thing to keep in mind is it is much easier to figure out
>> what offloads we can make use of when we are only dealing with at most
>> 1 layer of tunnels. When we start getting into stacked tunnels it
>> really becomes impossible to figure out what features in the hardware
>> we can still make use of. You throw everything and the kitchen sink
>> on it and we have to give up and would be doing everything in
>> software. At least if we restrict the features being requested the
>> hardware has a chance to perhaps be useful.
> I think all the protocol-specific offloads can just say "don't even try
> if you've got stacked tunnels"; it's not something I expect current
> hardware to support, and if GSO partial works out then a hw vendor would
> have to be crazy to add such support in the future. But GSO partial
> doesn't care about stacked tunnels, it still works. Unless of course
> someone in the middle wants to change their IP IDs or GRE Counter or
> whatever, in which case they can mark the skb as "not gso_partial-safe"
> when they first build their header, or their gso_partial callback can
> return "nope", and then we fall back on a protocol-specific offload (if
> one's available, which it won't be for stacked tunnels) or software
> segmentation.
That is basically what happens now. When the tunnel gets configured,
if there are any options that would prevent GSO from working, the
tunnel disables TSO support. That way the frames are segmented
further up the stack, and the tunnel headers are then built around the
already-segmented frames.
>> For example one of the reasons why I know the outer UDP checksum works
>> the way it does with tunnel GSO is because the i40e and i40evf drivers
>> already have hardware that supports UDP_TUNNEL_CSUM. As such in order
>> to do what you are proposing we would likely have to rip that code out
>> and completely rewrite it since it would change out the checksum value
>> given to the device.
> I'm not sure I understand, so bear with me. AIUI, UDP_TUNNEL_CSUM means
> that the device will fill in both inner and outer L4 checksums. But it can
> only do that if it knows where the inner and outer L4 headers *are*, and I
> thought you were going to tell it not to treat it as a tunnel at all, but
> instead as a giant L3 header. So it won't touch the outer L4 checksum at
> all (and, in the case of stacked tunnels, it wouldn't touch the 'middle'
> one either - it only sees the innermost L4 header).
There is hardware that supports this already without GSO partial.
Specifically, the i40e driver supports an XL722 device that can do
the outer UDP tunnel checksum in addition to the inner TCP checksum.
I suspect there will be hardware from other vendors in the near
future with support as well.
The ixgbe and igb drivers will be taking the giant-L3-header approach.
> So while the driver would need to change to add GSO_partial support, the
> hardware wouldn't.
> Where am I going wrong?
>
>> I'm not going to get into an argument over the merits of PMTU. The
>> fact is I don't believe it is worth the effort in order to enable the
>> tunnel-in-tunnel case you envision.
> I agree it's not worth the effort to implement it right now. But some day
> it will be, and at that point, if we've designed GSO partial the way I'm
> arguing for, it will just magically start working when we disable
> fragmentation on inner tunnels. And in the meantime we'll be using a more
> generic interface that hopefully won't encourage driver (and hardware)
> designers to see the gso_type enum as a list of protocols to support.
> Again, tunnel-in-tunnel isn't important in itself, but as a canary for
> "can we support whatever encapsulation protocol comes down the pike";
> arguing that you won't be able to set DF on your inner tunnel because
> operational consideration XYZ is missing the point IMHO.
It turns out that UDP tunnels seem to be the only real problem child
in terms of the DF bit. I was looking over the GRE code and it
already sets the DF bit and supports PMTU discovery. Odds are we will
need to figure out how to borrow code from there and apply it to the
UDP tunnels.
>> Also another thing to keep in mind is you would have to find a way to
>> identify that a tunnel requesting to be segmented is setting the DF
>> bit in the outer headers.
> But that's easy, as above: either foo_xmit sets a "don't GSO-partial me"
> flag in the SKB, or foo_gso_partial returns -EMIGHTBEFRAGMENTED. The
> former is probably preferable for performance reasons.
Still ugly.
Honestly, I don't know if it is worth all this trouble, because we are
just as likely to run out of headroom, which would probably be a more
expensive operation than giving up on the segmentation offloads and
only supporting HW_CSUM anyway.
>>>> On top of that it occurred to me that GRE cannot be screened in GRO
>>>> for the outer-IP-ID check. Basically what can happen is on devices
>>>> that don't parse inner headers for GRE we can end up with two TCP
>>>> flows from the same tunnel essentially stomping on each other and
>>>> causing one another to get evicted for having an outer IP-ID that
>>>> doesn't match up with the expected value.
>>
>>> Yes, that does seem problematic - you'd need to save away the IPID check
>>> result and only evict once you'd ascertained they were the same flow. All
>>> the more reason to try to make your tunnels use DF *grin*.
>>
>> Feel free to solve the problem if you believe it is that easy. There
>> is nothing in my patches to prevent that.
> I'm not saying it's easy. The GRO side of your patches look great to me
> and I think you're more likely to solve it than I am. But I will have a
> go at it too, if I have time.
My concern is that what you are talking about ends up being a pretty
significant change to the transmit path itself. Really the stack
doesn't want things segmented like that until you hit the GSO layer
and I think that might be where you run into issues.
> Though I must admit, I don't quite understand why we have to restrict to
> (incrementing or constant) in GRO, rather than just saying "it's DF, we
> can ignore the IP IDs and just aggregate it all together". We won't be
> producing the original IP IDs anyway if we resegment, because we don't
> know whether they were originally incrementing or constant. (In other
> words, "RFC6864 compliant GRO" as you've defined it can reverse either
> GSO or GSO partial, but GSO (in either form) can't always reverse
> RFC6864 compliant GRO.)
Actually for v2 of this GRO code I am taking the "it's DF, just ignore
the IP ID entirely" approach. Basically, IPv4 with DF set will behave
the same as IPv6: we set flush_id to 0 instead of computing the
offset. I realized I was still deriving information from the IP ID,
and RFC 6864 says we cannot do that for atomic datagrams.
- Alex
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 23:35 ` Alexander Duyck
@ 2016-03-25 0:37 ` Edward Cree
0 siblings, 0 replies; 51+ messages in thread
From: Edward Cree @ 2016-03-25 0:37 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Or Gerlitz, Alexander Duyck, Netdev, David Miller,
Tom Herbert
On Thu, 24 Mar 2016, Alexander Duyck wrote:
> On Thu, Mar 24, 2016 at 4:00 PM, Edward Cree <ecree@solarflare.com> wrote:
>> So: in the initial transmit path we build a coherent superframe; when we get
>> to the device, we say either "oh, device doesn't support offloads at all,
>> call GSO", or "oh, device supports this particular GSO type for TSO, so give
>> it the superframe SKB to do what it wants with", or "oh, device supports
>> GSO_PARTIAL, let's call skb_mac_gso_partial [i.e. chain of new callbacks I
>> was describing in previous email] and then give the non-coherent SKB to the
>> device".
>
> What you are proposing is essentially forking GSO.
In a way, I suppose it is.
> I really don't
> like that. I prefer the approach we have now where essentially GSO
> partial is something checked for at the very end and we have a few
> checks when putting together the headers so we do the right thing.
I guess both can exist, and I'll name mine something other than "GSO
partial"...
> This is sounding very complicated. I think we would be better off if
> you took the time to try and implement some of this yourself so you
> could see how feasible it is.
Sure, I was already intending to do that before I saw your patch series
and thought it might be able to do what I had in mind. As it now seems
like we're envisaging different things, I'll go back to implementing
mine and wish you luck with yours.
-Ed
* Re: [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID
2016-03-24 2:21 ` Alexander Duyck
@ 2016-03-28 4:57 ` Jesse Gross
0 siblings, 0 replies; 51+ messages in thread
From: Jesse Gross @ 2016-03-28 4:57 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David Miller, Tom Herbert
On Wed, Mar 23, 2016 at 7:21 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 6:43 PM, Jesse Gross <jesse@kernel.org> wrote:
>> On Fri, Mar 18, 2016 at 4:24 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>> In RFC 6864 it is stated that we can essentially ignore the IPv4 ID field
>>> if we have not and will not use fragmentation. Such a frame is defined
>>> as having the DF flag set to 1, and the MF and frag_offset as 0. Currently
>>> for GRO we were requiring that the inner header always have an increasing
>>> IPv4 ID, but we are ignoring the outer value.
>>>
>>> This patch is a first step in trying to reverse some of that. Specifically
>>> what this patch does is allow us to coalesce frames that have a static IPv4
>>> ID value. So for example if we had a series of frames where the DF flag
>>> was set we would allow the same IPv4 ID value to be used for all the frames
>>> belonging to that flow. This would become the standard behavior for TCP so
>>> it would support either a fixed IPv4 ID value, or one in which the value
>>> increments.
>>>
>>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>>
>> One thing that is a bit odd is that the TSO output procedure stays the
>> same. So that means that if we get a stream of packets with the DF bit
>> set and a constant IP ID, aggregate them with GRO, and the retransmit
>> with GSO/TSO then we'll get packets with IDs that increment for each
>> burst and then start back again to the original value. I guess it
>> doesn't matter in practice since the IDs are supposed to be ignored
>> but it does seem a little strange - especially because the new packets
>> will now be violating the rules of the same GRO implementation that
>> produced them.
>
> Yes and no. The rule for GRO with this patch is that the IP ID has to
> be either incrementing or if DF is set it has the option to be a fixed
> value for a given grouping of packets. In that regard either GSO
> partial or standard GSO are still both reversible via GRO so that you
> can aggregate to get back to the original frame (ignoring the IP ID)
> that GSO segmented. The bit I am still trying to work out is what to
> do about the case where we GRO 2 frames out of one GSO segment. I
> wonder if I should just totally ignore the IP ID value since it ends
> up creating an artificial boundary between the two frames if they are
> segmented using the incrementing GSO versus the fixed IP ID GSO.
Yeah, I agree that it should work in practice, it just seems a bit odd
to have the IP IDs skip around like that. It does also mean that GRO
will no longer be completely, transparently reversible.
I guess that's fine though as long as we fully embrace the idea that
the DF bit means that IP IDs are not used. In that case, it seems best
to allow any ID when DF is set so we are at least self-consistent.
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-24 2:53 ` Alexander Duyck
@ 2016-03-28 5:35 ` Jesse Gross
0 siblings, 0 replies; 51+ messages in thread
From: Jesse Gross @ 2016-03-28 5:35 UTC (permalink / raw)
To: Alexander Duyck
Cc: Tom Herbert, Alexander Duyck, Edward Cree,
Linux Kernel Network Developers, David S. Miller
On Wed, Mar 23, 2016 at 7:53 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 6:37 PM, Jesse Gross <jesse@kernel.org> wrote:
>> That being said, I actually think that it is good to have the DF bit
>> on by default for encapsulation headers being added. Unintentional
>> (and perhaps multiple layers of) fragmentation usually results in
> unusably bad performance, so it is best to try to correct it,
>> hopefully automatically in most cases. And, of course, this is the
>> direction that IPv6 has already gone. If we can assume that this is
>> the most common case then in practice we can keep the outer headers
>> constant for the high performance path.
>
> I'm still weighting my options on how to address the DF issue. I'm
> not really sure having DF always on for outer headers is the best way
> to go either. I kind of like the idea that if DF is set for the inner
> headers then we set it for the outer headers. I just have to see how
> hard something like that would be to implement.
I don't think it would be too hard to implement. We already have code
that copies things like ECN between the inner and outer headers, so it
shouldn't be an issue to add this if that's what we decide is right.
My first concern was that we have a way to turn off the DF bit in the
event that it breaks things. Otherwise, I think it is a performance
win both from the perspective of avoiding fragmentation and allowing
blind copying of headers through TSO by avoiding the need to increment
the ID. In practice, there's probably not too much difference between
turning it on by default and inheriting from the inner frame other
than it might affect what is considered to be the 'fast path'.
>> To me, incrementing the inner IP really seems the best choice. The
>> inner header is most likely someone else's traffic so it best to not
>> mess with that whereas the outer headers are likely ours and we know
>> the parameters for them (and can set the DF bit as we believe is
>> correct). Also, if you are looking forward to the future as far as
>> stacking multiple layers of tunnels, I think the only consistent thing
>> to do is have the inner ID increment and all of the tunnel headers
>> stay fixed - it is hard to justify why the first tunnel header should
>> increment but not the second one. And finally, as a nice bonus, this
>> is what the GRO code has been expecting already so you won't cause any
>> performance regressions with existing systems.
>
> Agreed. However in the case of the igb and ixgbe incrementing the
> inner isn't really going to work well as it introduces a number of
> issues. I've considered not enabling GSO partial by default on those
> parts, but leaving it as an available feature. In the case of i40e we
> will not have any of those issues as both the inner and outer IP IDs
> will increment.
>
> The regressions issue shouldn't be all that big. In most cases
> tunnels are point to point links. Odds are if someone is running a
> network they should have similar devices on both ends. So in the case
> of the Intel drivers I have patched here updating i40e isn't that big
> of a deal since it retains the original GSO behavior. In the case of
> ixgbe it didn't support any tunnel offloads so there is no GRO support
> without checksums being enabled in the tunnel, and Tx checksum support
> via HW_CSUM has yet to be pushed upstream from Jeff's tree so things
> are currently throttled by the Tx side.
>
> GSO partial is going to end up being very driver specific in terms of
> what increments and what doesn't as we are having to really hack our
> way around to make it do things the spec sheets say it cannot do.
> Some implementations are going to end up working out better than
> others in terms of interoperability and what can make use of other
> advanced features such as GRO on legacy systems.
I have to admit that I'm a little concerned about the fact that
different drivers will end up doing different things as a result of
this series. Generally speaking, it doesn't seem all that good for
drivers to be setting policy for what packets look like.
That being said, if we are OK with IP IDs jumping around when DF is
set as a result of the first patch, then this shouldn't really be a
problem as the effect is similar. We just obviously need to make sure
that we don't let a packet without DF set and non-incrementing IP IDs
escape out in the wild.
Anyways, I think you've already covered a lot of this ground in the
other thread, so I don't think there is any real reason to rehash it
here.
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
2016-03-22 17:00 ` Edward Cree
2016-03-23 17:09 ` Tom Herbert
@ 2016-03-28 5:36 ` Jesse Gross
2016-03-28 16:25 ` Alexander Duyck
2 siblings, 1 reply; 51+ messages in thread
From: Jesse Gross @ 2016-03-28 5:36 UTC (permalink / raw)
To: Alexander Duyck
Cc: Edward Cree, Linux Kernel Network Developers, David Miller,
Alexander Duyck, Tom Herbert
On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index edb7179bc051..666cf427898b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
[...]
> + /* Only report GSO partial support if it will enable us to
> + * support segmentation on this frame without needing additional
> + * work.
> + */
> + if (features & NETIF_F_GSO_PARTIAL) {
> + netdev_features_t partial_features;
> + struct net_device *dev = skb->dev;
> +
> + partial_features = dev->features & dev->gso_partial_features;
> + if (!skb_gso_ok(skb, features | partial_features))
> + features &= ~NETIF_F_GSO_PARTIAL;
I think we need to add NETIF_F_GSO_ROBUST into the skb_gso_ok() check
- otherwise packets coming from VMs fail this test and we lose GSO
partial. It's totally safe to expose this feature, since we'll compute
gso_segs anyways.
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f044f970f1a6..bdcba77e164c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3281,6 +3291,15 @@ perform_csum_check:
> */
> segs->prev = tail;
>
> + /* Update GSO info on first skb in partial sequence. */
> + if (partial_segs) {
> + skb_shinfo(segs)->gso_size = mss / partial_segs;
One small thing: this gso_size is the same as the original MSS, right?
It seems like we could trivially stick it in a local variable and
avoid the extra division.
> + skb_shinfo(segs)->gso_segs = partial_segs;
> + skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
> + SKB_GSO_PARTIAL;
Since we're computing the gso_segs ourselves, it might be nice to
strip out SKB_GSO_DODGY when we set the type.
I just wanted to say that this is really nice work - I was expecting
it to turn out to be really messy and unmaintainable but this is very
clean. Thanks!
* Re: [RFC PATCH 7/9] GSO: Support partial segmentation offload
2016-03-28 5:36 ` Jesse Gross
@ 2016-03-28 16:25 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2016-03-28 16:25 UTC (permalink / raw)
To: Jesse Gross
Cc: Alexander Duyck, Edward Cree, Linux Kernel Network Developers,
David Miller, Tom Herbert
On Sun, Mar 27, 2016 at 10:36 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Mar 18, 2016 at 4:25 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index edb7179bc051..666cf427898b 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
> [...]
>> + /* Only report GSO partial support if it will enable us to
>> + * support segmentation on this frame without needing additional
>> + * work.
>> + */
>> + if (features & NETIF_F_GSO_PARTIAL) {
>> + netdev_features_t partial_features;
>> + struct net_device *dev = skb->dev;
>> +
>> + partial_features = dev->features & dev->gso_partial_features;
>> + if (!skb_gso_ok(skb, features | partial_features))
>> + features &= ~NETIF_F_GSO_PARTIAL;
>
> I think we need to add NETIF_F_GSO_ROBUST into the skb_gso_ok() check
> - otherwise packets coming from VMs fail this test and we lose GSO
> partial. It's totally safe to expose this feature, since we'll compute
> gso_segs anyways.
Good point. I will update that before submitting for net-next.
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f044f970f1a6..bdcba77e164c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3281,6 +3291,15 @@ perform_csum_check:
>> */
>> segs->prev = tail;
>>
>> + /* Update GSO info on first skb in partial sequence. */
>> + if (partial_segs) {
>> + skb_shinfo(segs)->gso_size = mss / partial_segs;
>
> One small thing: this gso_size is the same as the original MSS, right?
> It seems like we could trivially stick it in a local variable and
> avoid the extra division.
Nice catch. I was a bit too in the mindset of having to use the same
variable throughout.
>> + skb_shinfo(segs)->gso_segs = partial_segs;
>> + skb_shinfo(segs)->gso_type = skb_shinfo(head_skb)->gso_type |
>> + SKB_GSO_PARTIAL;
>
> Since we're computing the gso_segs ourselves, it might be nice to
> strip out SKB_GSO_DODGY when we set the type.
I will have that fixed for the version I submit for net-next.
> I just wanted to say that this is really nice work - I was expecting
> it to turn out to be really messy and unmaintainable but this is very
> clean. Thanks!
Thanks.
- Alex
Thread overview: 51+ messages
2016-03-18 23:24 [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 1/9] ipv4/GRO: Allow multiple frames to use the same IP ID Alexander Duyck
2016-03-24 1:43 ` Jesse Gross
2016-03-24 2:21 ` Alexander Duyck
2016-03-28 4:57 ` Jesse Gross
2016-03-18 23:24 ` [RFC PATCH 2/9] gre: Enforce IP ID verification on outer headers Alexander Duyck
2016-03-18 23:24 ` [RFC PATCH 3/9] geneve: " Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 4/9] vxlan: " Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 5/9] gue: " Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 6/9] ethtool: Add support for toggling any of the GSO offloads Alexander Duyck
2016-03-19 0:18 ` Ben Hutchings
2016-03-19 0:30 ` Alexander Duyck
2016-03-19 1:42 ` Ben Hutchings
2016-03-19 2:01 ` Jesse Gross
2016-03-19 2:43 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 7/9] GSO: Support partial segmentation offload Alexander Duyck
2016-03-22 17:00 ` Edward Cree
2016-03-22 17:47 ` Alexander Duyck
2016-03-22 19:40 ` Edward Cree
2016-03-22 20:11 ` Jesse Gross
2016-03-22 20:17 ` David Miller
2016-03-22 21:38 ` Alexander Duyck
2016-03-23 16:27 ` Edward Cree
2016-03-23 18:06 ` Alexander Duyck
2016-03-23 21:05 ` Edward Cree
2016-03-23 22:36 ` Alexander Duyck
2016-03-23 23:00 ` Edward Cree
2016-03-23 23:15 ` Alexander Duyck
2016-03-24 17:12 ` Edward Cree
2016-03-24 18:43 ` Alexander Duyck
2016-03-24 20:17 ` Edward Cree
2016-03-24 21:50 ` Alexander Duyck
2016-03-24 23:00 ` Edward Cree
2016-03-24 23:35 ` Alexander Duyck
2016-03-25 0:37 ` Edward Cree
2016-03-23 17:09 ` Tom Herbert
2016-03-23 18:19 ` Alexander Duyck
2016-03-24 1:37 ` Jesse Gross
2016-03-24 2:53 ` Alexander Duyck
2016-03-28 5:35 ` Jesse Gross
2016-03-28 5:36 ` Jesse Gross
2016-03-28 16:25 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 8/9] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM Alexander Duyck
2016-03-23 19:35 ` Jesse Gross
2016-03-23 20:21 ` Alexander Duyck
2016-03-18 23:25 ` [RFC PATCH 9/9] ixgbe/ixgbevf: Add support for GSO partial Alexander Duyck
2016-03-19 2:05 ` Jesse Gross
2016-03-19 2:42 ` Alexander Duyck
2016-03-21 18:50 ` [RFC PATCH 0/9] RFC6864 compliant GRO and GSO partial offload David Miller
2016-03-21 19:46 ` Alexander Duyck
2016-03-21 20:10 ` Jesse Gross