Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v2.24] datapath: Add basic MPLS support to kernel
From: Simon Horman @ 2013-04-03  5:38 UTC (permalink / raw)
  To: dev, netdev; +Cc: Ravi K, Isaku Yamahata, Jesse Gross, Ben Pfaff

Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman and Ravi K.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Reviewed-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: Simon Horman <horms@verge.net.au>

---

This is the remaining patch of the series "MPLS actions and matches".
It is available in git at:

	git://github.com/horms/openvswitch.git devel/mpls-v2.24

v2.24
* Use skb_mac_header() in set_ethertype()
* Set skb->encapsulation in set_ethertype() to support MPLS GSO.
  Also add a note about the other requirements for MPLS GSO.
  MPLS GSO support will be posted as a patch net-next (Linux mainline)
  "MPLS: Add limited GSO support"
* Do not add ETH_TYPE_MIN, it is no longer used

v2.23
* As suggested by Jesse Gross:
  - Verify the current ethernet type when validating sample actions
    both for the taken and not-taken path if the sample action.
  - Document that the OVS_KEY_ATTR_MPLS attribute accepts a list of
    struct ovs_key_mpls but that an implementation may restrict
    the length it accepts.
  - Restrict the array length of the OVS_KEY_ATTR_MPLS to one.
    + Don't add ovs_flow_verify_key_len as it was added to
      handle attributes whose values are arrays but there are
      no attributes with values that are arrays (of length greater than one).

v2.22
* As suggested by Jesse Gross:
  - Fix sparse warning in validate_and_copy_actions()
    I have no idea why sparse doesn't show this up this on my system.
  - Remove call to skb_cow_head() from push_mpls() as it
    is already covered by a call to make_writable()
  - Check (key_type > OVS_KEY_ATTR_MAX) in ovs_flow_verify_key_len()
  - Disallow set actions on l2.5+ data and MPLS push and pop actions
    after an MPLS pop action as there is no verification that the packet
    is actually of the new ethernet type. This may later be supported
    using recirculation or by other means.
  - Do not add spurious debuging message to ovs_flow_cmd_new_or_set()

v2.21
* As suggested by Jesse Gross:
  - Verify that l3 and l4 actions always always occur prior to
    a push_mpls action and use the network header pointer of an skb
    to track the top of the MPLS stack. This avoids adding an l2_size
    element to the skb callback.

v2.20
* As suggested by Jesse Gross:
  - Do not add ovs_dp_ioctl_hook
    + This appears to be garbage from a rebase
  - Do not add skb_cb_set_l2_size. Instead set OVS_CB(skb)->l2_size
    in ovs_flow_extract().
  - Do not free skb on error in push_mpls(), it is freed in the caller
  - Call skb_reset_mac_len() in pop_mpls() and push_mpls()
  - Update checksums in pop_mpls(), push_mpls() and set_mpls().
  - Rename skb_cb_mpls_bos() as skb_cb_mpls_stack().
    It returns the top not the bottom of the stack.
  - Track the current eth_type in validate_and_copy_actions
    which is initially the eth_type of the flow and may be modified
    by push_mpls and pop_mpls actions. Use this to correctly validate
    mpls_set actions. This is to allow mpls_set actions to be applied
    to a non-MPLS frame after an mpls_push action (although ovs-vswitchd
    doesn't currently do that).
    Also:
    + Remove the check of the eth_type in set_mpls() as the new validation
      scheme should ensure it cannot be incorrect.
    + Use the current eth_type to validate mpls_pop actions and remove
      the eth_type check from pop_mpls().
  - Move OVS_KEY_ATTR_MPLS to non-upstream group in ovs_key_lens
  - Remove unnecessary memset of mpls_key in ovs_flow_to_nlattrs()
  - Make a union of the mpls and ip elements of struct sw_flow_key.
    Currently the code stops parsing after an MPLS header so it is
    not possible for the ip and mpls elements to be used simultaneously
    and some space can be saved by using a union.
  - Allow an array of MPLS key attributes
    + Currently all but the first element is ignored
    + User-space needs to be updated to accept more than one element,
      currently it will treat their presence as an error
  - Do not update network header in ovs_flow_extract() for after parsing
    the MPLS stack as it is never used because no l3+ processing
    occurs on MPLS frames.
  - Allow multiple MPLS entries in a match by allowing the OVS_KEY_ATTR_MPLS
    to be an array of struct ovs_key_mpls with at least one entry.
    Currently only one entry is used which is byte-for-byte compatible with
    the previous scheme of having OVS_KEY_ATTR_MPLS as a struct
    ovs_key_mpls.
* Make skb writable in pop_mpls(), push_mpls() and set_mpls().

v2.18 - v2.19
* No change

v2.17
* As suggested by Ben Pfaff
  - Use consistent terminology for MPLS.
    + Consistently refer to the MPLS component of a packet as the
      MPLS label stack and entries in the stack as MPLS label stack entries
      (LSE).  An MPLS label is a component of an MPLS label stack entry.
      The other components are the traffic class (TC), time to live (TTL)
      and bottom of stack (BoS) bit.
  - Rename compose_.*mpls_ functions as execute_.*mpls_

v2.16
* No change

v2.15
* As suggested by Ben Pfaff
  - Use OVS_ACTION_SET to set OVS_KEY_ATTR_MPLS instead of
    OVS_ACTION_ATTR_SET_MPLS

v2.14
* Remove include/linux/openvswitch.h portion which added add
  new key and action attributes. This
  now present in "User-Space MPLS actions and matches"
  which is now a dependency of this patch

v2.13
* As suggested by Jarno Rajahalme
  - Rename mpls_bos element of ovs_skb_cb as l2_size as it is set and used
    regardless of if an MPLS stack is present or not. Update the name of
    helper functions and documentation accordingly.
  - Ensure that skb_cb_mpls_bos() never returns NULL
* Correct endieness in eth_p_mpls()

v2.12
* Update skb and network header on MPLS extraction in ovs_flow_extract()
* Use NULL in skb_cb_mpls_bos()
* Add eth_p_mpls helper

v2.10 - v2.11
* No change

v2.9
* datapath: Always update the mpls bos if  vlan_pop is successful

  Regardless of the details of how a successful
  vlan_pop is achieved, the mpls bos needs to be updated.

  Without this fix it has been observed that the following
  results in malformed packets

v2.8
* No change

v2.7
* Rebase

v2.6
* As suggested by Yamahata-san
  - Do not guard against label == 0 for
    OVS_ACTION_ATTR_SET_MPLS in validate_actions().
    A label of 0 is valid
  - Remove comment stupulating that if
    the top_label element of struct sw_flow_key is 0 then
    there is no MPLS label. An MPLS label of 0 is valid
    and the correct check if ethertype is
    ntohs(ETH_TYPE_MPLS) or ntohs(ETH_TYPE_MPLS_MCAST)

v2.4 - v2.5
* No change

v2.3
* s/mpls_stack/mpls_bos/
  This is in keeping with the naming used in the OpenFlow 1.3 specification

v2.2
* Call skb_reset_mac_header() in skb_cb_set_mpls_stack()
  eth_hdr(skb) is non-NULL when called in skb_cb_set_mpls_stack().
* Add a call to skb_cb_set_mpls_stack() in ovs_packet_cmd_execute().
  I apologise that I have mislaid my notes on this but
  it avoids a kernel panic. I can investigate again if necessary.
* Use struct ovs_action_push_mpls instead of
  __be16 to decode OVS_ACTION_ATTR_PUSH_MPLS in validate_actions(). This is
  consistent with the data format for the attribute.
* Indentation fix in skb_cb_mpls_stack(). [cosmetic]

v2.1
* Manual rebase
---
 datapath/actions.c                           |  106 ++++++++++++++++++
 datapath/datapath.c                          |  148 +++++++++++++++++++++-----
 datapath/datapath.h                          |    2 +
 datapath/flow.c                              |   28 +++++
 datapath/flow.h                              |   25 +++--
 datapath/linux/compat/include/linux/skbuff.h |   12 +++
 include/linux/openvswitch.h                  |    6 +-
 lib/odp-util.c                               |    8 +-
 8 files changed, 297 insertions(+), 38 deletions(-)

diff --git a/datapath/actions.c b/datapath/actions.c
index 0dac658..e9634fe 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -48,6 +48,97 @@ static int make_writable(struct sk_buff *skb, int write_len)
 	return pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
 }
 
+static void set_ethertype(struct sk_buff *skb, const __be16 ethertype)
+{
+	struct ethhdr *hdr = (struct ethhdr *)skb_mac_header(skb);
+	if (hdr->h_proto == ethertype)
+		return;
+	if (!eth_p_mpls(hdr->h_proto)) {
+		/* This implies we are pushing adding an MPLS label stack
+		 * to a previously non-MPLS packet. Set the encapsulation
+		 * bit to allow MPLS GSO segmentation. It will make use
+		 * of hdr->h_proto, set to the new MPLS ethertype and
+		 * skb->protocol which is set to the old non-MPLS ethertype. */
+		skb_set_encapsulation(skb);
+	}
+	hdr->h_proto = ethertype;
+	if (get_ip_summed(skb) == OVS_CSUM_COMPLETE) {
+		__be16 diff[] = { ~hdr->h_proto, ethertype };
+		skb->csum = ~csum_partial((char *)diff, sizeof(diff),
+					  ~skb->csum);
+	}
+}
+
+static int push_mpls(struct sk_buff *skb, const struct ovs_action_push_mpls *mpls)
+{
+	__be32 *new_mpls_lse;
+	int err;
+
+	err = make_writable(skb, skb->mac_len + MPLS_HLEN);
+	if (unlikely(err))
+		return err;
+
+	skb_push(skb, MPLS_HLEN);
+	memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
+		skb->mac_len);
+	skb_reset_mac_header(skb);
+	skb_set_network_header(skb, skb->mac_len);
+
+	new_mpls_lse = (__be32 *)skb_network_header(skb);
+	*new_mpls_lse = mpls->mpls_lse;
+
+	if (get_ip_summed(skb) == OVS_CSUM_COMPLETE)
+		skb->csum = csum_add(skb->csum, csum_partial(new_mpls_lse,
+							     MPLS_HLEN, 0));
+
+	set_ethertype(skb, mpls->mpls_ethertype);
+	return 0;
+}
+
+static int pop_mpls(struct sk_buff *skb, const __be16 *ethertype)
+{
+	int err;
+
+	err = make_writable(skb, skb->mac_len + MPLS_HLEN);
+	if (unlikely(err))
+		return err;
+
+	if (get_ip_summed(skb) == OVS_CSUM_COMPLETE)
+		skb->csum = csum_sub(skb->csum,
+				     csum_partial(skb_network_header(skb),
+						  MPLS_HLEN, 0));
+
+	memmove(skb_mac_header(skb) + MPLS_HLEN, skb_mac_header(skb),
+		skb->mac_len);
+
+	skb_pull(skb, MPLS_HLEN);
+	skb_reset_mac_header(skb);
+	skb_set_network_header(skb, skb->mac_len);
+
+	set_ethertype(skb, *ethertype);
+	return 0;
+}
+
+static int set_mpls(struct sk_buff *skb, const __be32 *mpls_lse)
+{
+	__be32 *stack = (__be32 *)skb_network_header(skb);
+	int err;
+
+	err = make_writable(skb, skb->mac_len + MPLS_HLEN);
+	if (unlikely(err))
+		return err;
+
+	if (get_ip_summed(skb) == OVS_CSUM_COMPLETE) {
+		__be32 diff[] = { ~(*stack), *mpls_lse };
+		skb->csum = ~csum_partial((char *)diff, sizeof(diff),
+					  ~skb->csum);
+	}
+
+	*stack = *mpls_lse;
+
+	return 0;
+}
+
 /* remove VLAN header from packet and update csum accordingly. */
 static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
 {
@@ -115,6 +206,9 @@ static int push_vlan(struct sk_buff *skb, const struct ovs_action_push_vlan *vla
 		if (!__vlan_put_tag(skb, current_tag))
 			return -ENOMEM;
 
+		/* update mac_len for MPLS functions */
+		skb_reset_mac_len(skb);
+
 		if (get_ip_summed(skb) == OVS_CSUM_COMPLETE)
 			skb->csum = csum_add(skb->csum, csum_partial(skb->data
 					+ (2 * ETH_ALEN), VLAN_HLEN, 0));
@@ -459,6 +553,10 @@ static int execute_set_action(struct sk_buff *skb,
 	case OVS_KEY_ATTR_UDP:
 		err = set_udp(skb, nla_data(nested_attr));
 		break;
+
+	case OVS_KEY_ATTR_MPLS:
+		err = set_mpls(skb, nla_data(nested_attr));
+		break;
 	}
 
 	return err;
@@ -494,6 +592,14 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 			output_userspace(dp, skb, a);
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_MPLS:
+			err = push_mpls(skb, nla_data(a));
+			break;
+
+		case OVS_ACTION_ATTR_POP_MPLS:
+			err = pop_mpls(skb, nla_data(a));
+			break;
+
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 			err = push_vlan(skb, nla_data(a));
 			if (unlikely(err)) /* skb already freed. */
diff --git a/datapath/datapath.c b/datapath/datapath.c
index 9cd4b0e..e8be795 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -528,13 +528,26 @@ static inline void add_nested_action_end(struct sw_flow_actions *sfa, int st_off
 	a->nla_len = sfa->actions_len - st_offset;
 }
 
-static int validate_and_copy_actions(const struct nlattr *attr,
+struct eth_types {
+	size_t depth;
+	__be16 types[SAMPLE_ACTION_DEPTH];
+};
+
+static void eth_types_set(struct eth_types *types, size_t depth, __be16 type)
+{
+	types->depth = depth;
+	types->types[depth] = type;
+}
+
+static int validate_and_copy_actions__(const struct nlattr *attr,
 				const struct sw_flow_key *key, int depth,
-				struct sw_flow_actions **sfa);
+				struct sw_flow_actions **sfa,
+				struct eth_types *eth_types);
 
 static int validate_and_copy_sample(const struct nlattr *attr,
 			   const struct sw_flow_key *key, int depth,
-			   struct sw_flow_actions **sfa)
+			   struct sw_flow_actions **sfa,
+			   struct eth_types *eth_types)
 {
 	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
 	const struct nlattr *probability, *actions;
@@ -570,7 +583,8 @@ static int validate_and_copy_sample(const struct nlattr *attr,
 	if (st_acts < 0)
 		return st_acts;
 
-	err = validate_and_copy_actions(actions, key, depth + 1, sfa);
+	err = validate_and_copy_actions__(actions, key, depth + 1, sfa,
+					  eth_types);
 	if (err)
 		return err;
 
@@ -580,12 +594,12 @@ static int validate_and_copy_sample(const struct nlattr *attr,
 	return 0;
 }
 
-static int validate_tp_port(const struct sw_flow_key *flow_key)
+static int validate_tp_port(const struct sw_flow_key *flow_key, __be16 eth_type)
 {
-	if (flow_key->eth.type == htons(ETH_P_IP)) {
+	if (eth_type == htons(ETH_P_IP)) {
 		if (flow_key->ipv4.tp.src || flow_key->ipv4.tp.dst)
 			return 0;
-	} else if (flow_key->eth.type == htons(ETH_P_IPV6)) {
+	} else 	if (eth_type == htons(ETH_P_IPV6)) {
 		if (flow_key->ipv6.tp.src || flow_key->ipv6.tp.dst)
 			return 0;
 	}
@@ -616,7 +630,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 static int validate_set(const struct nlattr *a,
 			const struct sw_flow_key *flow_key,
 			struct sw_flow_actions **sfa,
-			bool *set_tun)
+			bool *set_tun, struct eth_types *eth_types)
 {
 	const struct nlattr *ovs_key = nla_data(a);
 	int key_type = nla_type(ovs_key);
@@ -653,9 +667,12 @@ static int validate_set(const struct nlattr *a,
 			return err;
 		break;
 
-	case OVS_KEY_ATTR_IPV4:
-		if (flow_key->eth.type != htons(ETH_P_IP))
-			return -EINVAL;
+	case OVS_KEY_ATTR_IPV4: {
+		size_t i;
+
+		for (i = 0; i < eth_types->depth; i++)
+			if (eth_types->types[i] != htons(ETH_P_IP))
+				return -EINVAL;
 
 		if (!flow_key->ip.proto)
 			return -EINVAL;
@@ -668,10 +685,14 @@ static int validate_set(const struct nlattr *a,
 			return -EINVAL;
 
 		break;
+	}
 
-	case OVS_KEY_ATTR_IPV6:
-		if (flow_key->eth.type != htons(ETH_P_IPV6))
-			return -EINVAL;
+	case OVS_KEY_ATTR_IPV6: {
+		size_t i;
+
+		for (i = 0; i < eth_types->depth; i++)
+			if (eth_types->types[i] != htons(ETH_P_IPV6))
+				return -EINVAL;
 
 		if (!flow_key->ip.proto)
 			return -EINVAL;
@@ -687,18 +708,37 @@ static int validate_set(const struct nlattr *a,
 			return -EINVAL;
 
 		break;
+	}
+
+	case OVS_KEY_ATTR_TCP: {
+		size_t i;
 
-	case OVS_KEY_ATTR_TCP:
 		if (flow_key->ip.proto != IPPROTO_TCP)
 			return -EINVAL;
 
-		return validate_tp_port(flow_key);
+		for (i = 0; i < eth_types->depth; i++)
+			if (validate_tp_port(flow_key, eth_types->types[i]))
+				return -EINVAL;
+	}
 
-	case OVS_KEY_ATTR_UDP:
+	case OVS_KEY_ATTR_UDP: {
+		size_t i;
 		if (flow_key->ip.proto != IPPROTO_UDP)
 			return -EINVAL;
 
-		return validate_tp_port(flow_key);
+		for (i = 0; i < eth_types->depth; i++)
+			if (validate_tp_port(flow_key, eth_types->types[i]))
+				return -EINVAL;
+	}
+
+	case OVS_KEY_ATTR_MPLS: {
+		size_t i;
+
+		for (i = 0; i < eth_types->depth; i++)
+			if (!eth_p_mpls(eth_types->types[i]))
+				return -EINVAL;
+		break;
+	}
 
 	default:
 		return -EINVAL;
@@ -742,10 +782,10 @@ static int copy_action(const struct nlattr *from,
 	return 0;
 }
 
-static int validate_and_copy_actions(const struct nlattr *attr,
-				const struct sw_flow_key *key,
-				int depth,
-				struct sw_flow_actions **sfa)
+static int validate_and_copy_actions__(const struct nlattr *attr,
+				const struct sw_flow_key *key, int depth,
+				struct sw_flow_actions **sfa,
+				struct eth_types *eth_types)
 {
 	const struct nlattr *a;
 	int rem, err;
@@ -753,11 +793,29 @@ static int validate_and_copy_actions(const struct nlattr *attr,
 	if (depth >= SAMPLE_ACTION_DEPTH)
 		return -EOVERFLOW;
 
+	/* Due to the sample action there may be more than one possibility
+	 * for the current ethernet type. They all need to be verified.
+	 *
+	 * This is handled by tracking a stack of ethernet types, one for
+	 * each (sample) depth of validation. Here the ethernet type for
+	 * the current depth is pushed onto the stack. It may be modified
+	 * as by actions are validated. When a modification occurs the
+	 * ethernet types for higher stack-depths are popped off the stack.
+	 * All entries on the stack are checked when validating the
+	 * ethernet type required by an action.
+	 */
+	if (!depth)
+		eth_types_set(eth_types, 0, key->eth.type);
+	else
+		eth_types_set(eth_types, depth, eth_types->types[depth - 1]);
+
 	nla_for_each_nested(a, attr, rem) {
 		/* Expected argument lengths, (u32)-1 for variable length. */
 		static const u32 action_lens[OVS_ACTION_ATTR_MAX + 1] = {
 			[OVS_ACTION_ATTR_OUTPUT] = sizeof(u32),
 			[OVS_ACTION_ATTR_USERSPACE] = (u32)-1,
+			[OVS_ACTION_ATTR_PUSH_MPLS] = sizeof(struct ovs_action_push_mpls),
+			[OVS_ACTION_ATTR_POP_MPLS] = sizeof(__be16),
 			[OVS_ACTION_ATTR_PUSH_VLAN] = sizeof(struct ovs_action_push_vlan),
 			[OVS_ACTION_ATTR_POP_VLAN] = 0,
 			[OVS_ACTION_ATTR_SET] = (u32)-1,
@@ -788,6 +846,35 @@ static int validate_and_copy_actions(const struct nlattr *attr,
 				return -EINVAL;
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_MPLS: {
+			const struct ovs_action_push_mpls *mpls = nla_data(a);
+			if (!eth_p_mpls(mpls->mpls_ethertype))
+				return -EINVAL;
+			eth_types_set(eth_types, depth, mpls->mpls_ethertype);
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_MPLS: {
+			size_t i;
+
+			for (i = 0; i < eth_types->depth; i++)
+				if (eth_types->types[i] != htons(ETH_P_IP))
+					return -EINVAL;
+
+			/* Disallow subsequent l2.5+ set and mpls_pop actions
+			 * as there is no check here to ensure that the new
+			 * eth_type is valid and thus set actions could
+			 * write off the end of the packet or otherwise
+			 * corrupt it.
+			 *
+			 * Support for these actions that after mpls_pop
+			 * using packet recirculation is planned.
+			 * are planned to be supported using using packet
+			 * recirculation.
+			 */
+			eth_types_set(eth_types, depth, ntohs(0));
+			break;
+		}
 
 		case OVS_ACTION_ATTR_POP_VLAN:
 			break;
@@ -801,13 +888,14 @@ static int validate_and_copy_actions(const struct nlattr *attr,
 			break;
 
 		case OVS_ACTION_ATTR_SET:
-			err = validate_set(a, key, sfa, &skip_copy);
+			err = validate_set(a, key, sfa, &skip_copy, eth_types);
 			if (err)
 				return err;
 			break;
 
 		case OVS_ACTION_ATTR_SAMPLE:
-			err = validate_and_copy_sample(a, key, depth, sfa);
+			err = validate_and_copy_sample(a, key, depth, sfa,
+						       eth_types);
 			if (err)
 				return err;
 			skip_copy = true;
@@ -829,6 +917,14 @@ static int validate_and_copy_actions(const struct nlattr *attr,
 	return 0;
 }
 
+static int validate_and_copy_actions(const struct nlattr *attr,
+				const struct sw_flow_key *key,
+				struct sw_flow_actions **sfa)
+{
+	struct eth_types eth_type;
+	return validate_and_copy_actions__(attr, key, 0, sfa, &eth_type);
+}
+
 static void clear_stats(struct sw_flow *flow)
 {
 	flow->used = 0;
@@ -893,7 +989,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	if (IS_ERR(acts))
 		goto err_flow_free;
 
-	err = validate_and_copy_actions(a[OVS_PACKET_ATTR_ACTIONS], &flow->key, 0, &acts);
+	err = validate_and_copy_actions(a[OVS_PACKET_ATTR_ACTIONS], &flow->key, &acts);
 	rcu_assign_pointer(flow->sf_acts, acts);
 	if (err)
 		goto err_flow_free;
@@ -1231,7 +1327,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 		if (IS_ERR(acts))
 			goto error;
 
-		error = validate_and_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &key,  0, &acts);
+		error = validate_and_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &key,  &acts);
 		if (error)
 			goto err_kfree;
 	} else if (info->genlhdr->cmd == OVS_FLOW_CMD_NEW) {
diff --git a/datapath/datapath.h b/datapath/datapath.h
index 9bc98fb..7665742 100644
--- a/datapath/datapath.h
+++ b/datapath/datapath.h
@@ -189,4 +189,6 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport *, u32 portid, u32 seq,
 					 u8 cmd);
 
 int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb);
+
+unsigned char *skb_cb_mpls_stack(const struct sk_buff *skb);
 #endif /* datapath.h */
diff --git a/datapath/flow.c b/datapath/flow.c
index 841e8be..8a4cb84 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -648,6 +648,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 		return -ENOMEM;
 
 	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
 	__skb_push(skb, skb->data - skb_mac_header(skb));
 
 	/* Network layer. */
@@ -730,6 +731,13 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 			memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
 			key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
 		}
+	} else if (eth_p_mpls(key->eth.type)) {
+		error = check_header(skb, MPLS_HLEN);
+		if (unlikely(error))
+			goto out;
+
+		key_len = SW_FLOW_KEY_OFFSET(mpls.top_lse);
+		memcpy(&key->mpls.top_lse, skb_network_header(skb), MPLS_HLEN);
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
@@ -848,6 +856,9 @@ const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ARP] = sizeof(struct ovs_key_arp),
 	[OVS_KEY_ATTR_ND] = sizeof(struct ovs_key_nd),
 	[OVS_KEY_ATTR_TUNNEL] = -1,
+
+	/* Not upstream. */
+	[OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
 };
 
 static int ipv4_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len,
@@ -1254,6 +1265,15 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 		swkey->ip.proto = ntohs(arp_key->arp_op);
 		memcpy(swkey->ipv4.arp.sha, arp_key->arp_sha, ETH_ALEN);
 		memcpy(swkey->ipv4.arp.tha, arp_key->arp_tha, ETH_ALEN);
+	} else if (eth_p_mpls(swkey->eth.type)) {
+		const struct ovs_key_mpls *mpls_key;
+		if (!(attrs & (1ULL << OVS_KEY_ATTR_MPLS)))
+			return -EINVAL;
+		attrs &= ~(1ULL << OVS_KEY_ATTR_MPLS);
+
+		key_len = SW_FLOW_KEY_OFFSET(mpls.top_lse);
+		mpls_key = nla_data(a[OVS_KEY_ATTR_MPLS]);
+		swkey->mpls.top_lse = mpls_key->mpls_lse;
 	}
 
 	if (attrs)
@@ -1420,6 +1440,14 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
 		arp_key->arp_op = htons(swkey->ip.proto);
 		memcpy(arp_key->arp_sha, swkey->ipv4.arp.sha, ETH_ALEN);
 		memcpy(arp_key->arp_tha, swkey->ipv4.arp.tha, ETH_ALEN);
+	} else if (eth_p_mpls(swkey->eth.type)) {
+		struct ovs_key_mpls *mpls_key;
+
+		nla = nla_reserve(skb, OVS_KEY_ATTR_MPLS, sizeof(*mpls_key));
+		if (!nla)
+			goto nla_put_failure;
+		mpls_key = nla_data(nla);
+		mpls_key->mpls_lse = swkey->mpls.top_lse;
 	}
 
 	if ((swkey->eth.type == htons(ETH_P_IP) ||
diff --git a/datapath/flow.h b/datapath/flow.h
index dba66cf..2e3083b 100644
--- a/datapath/flow.h
+++ b/datapath/flow.h
@@ -72,12 +72,17 @@ struct sw_flow_key {
 		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
 		__be16 type;		/* Ethernet frame type. */
 	} eth;
-	struct {
-		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
-		u8     tos;		/* IP ToS. */
-		u8     ttl;		/* IP TTL/hop limit. */
-		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
-	} ip;
+	union {
+		struct {
+			__be32 top_lse;		/* top label stack entry */
+		} mpls;
+		struct {
+			u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
+			u8     tos;		/* IP ToS. */
+			u8     ttl;		/* IP TTL/hop limit. */
+			u8     frag;		/* One of OVS_FRAG_TYPE_*. */
+		} ip;
+	};
 	union {
 		struct {
 			struct {
@@ -143,6 +148,8 @@ struct arp_eth_header {
 	unsigned char       ar_tip[4];		/* target IP address        */
 } __packed;
 
+#define MPLS_HLEN 4
+
 int ovs_flow_init(void);
 void ovs_flow_exit(void);
 
@@ -204,4 +211,10 @@ int ipv4_tun_from_nlattr(const struct nlattr *attr,
 int ipv4_tun_to_nlattr(struct sk_buff *skb,
 			const struct ovs_key_ipv4_tunnel *tun_key);
 
+static inline bool eth_p_mpls(__be16 eth_type)
+{
+	return eth_type == htons(ETH_P_MPLS_UC) ||
+		eth_type == htons(ETH_P_MPLS_MC);
+}
+
 #endif /* flow.h */
diff --git a/datapath/linux/compat/include/linux/skbuff.h b/datapath/linux/compat/include/linux/skbuff.h
index d485b39..938d6e4 100644
--- a/datapath/linux/compat/include/linux/skbuff.h
+++ b/datapath/linux/compat/include/linux/skbuff.h
@@ -251,4 +251,16 @@ static inline void skb_reset_mac_len(struct sk_buff *skb)
 	skb->mac_len = skb->network_header - skb->mac_header;
 }
 #endif
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,0)
+static inline void skb_set_encapsulation(struct sk_buff *skb)
+{
+	skb->encapsulation = 1;
+}
+#else
+/* Encapsulation didn't exist before 3.7.0 but that is ok
+ * because neither did MPLS GSO the only reason the bit is set
+ * by Open vSwtich */
+static inline void skb_set_encapsulation(struct sk_buff *skb) { }
+#endif
 #endif
diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index bd2f05f..e890fd8 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -287,7 +287,9 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_IPV4_TUNNEL,  /* struct ovs_key_ipv4_tunnel */
 #endif
 
-	OVS_KEY_ATTR_MPLS = 62, /* struct ovs_key_mpls */
+	OVS_KEY_ATTR_MPLS = 62, /* array of struct ovs_key_mpls.
+				 * The implementation may restrict
+				 * the accepted length of the array. */
 	__OVS_KEY_ATTR_MAX
 };
 
@@ -330,7 +332,7 @@ struct ovs_key_ethernet {
 };
 
 struct ovs_key_mpls {
-	__be32 mpls_top_lse;
+	__be32 mpls_lse;
 };
 
 struct ovs_key_ipv4 {
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 751c1c9..3206dc9 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -906,7 +906,7 @@ format_odp_key_attr(const struct nlattr *a, struct ds *ds)
     case OVS_KEY_ATTR_MPLS: {
         const struct ovs_key_mpls *mpls_key = nl_attr_get(a);
         ds_put_char(ds, '(');
-        format_mpls_lse(ds, mpls_key->mpls_top_lse);
+        format_mpls_lse(ds, mpls_key->mpls_lse);
         ds_put_char(ds, ')');
         break;
     }
@@ -1231,7 +1231,7 @@ parse_odp_key_attr(const char *s, const struct simap *port_names,
 
             mpls = nl_msg_put_unspec_uninit(key, OVS_KEY_ATTR_MPLS,
                                             sizeof *mpls);
-            mpls->mpls_top_lse = mpls_lse_from_components(label, tc, ttl, bos);
+            mpls->mpls_lse = mpls_lse_from_components(label, tc, ttl, bos);
             return n;
         }
     }
@@ -1594,7 +1594,7 @@ odp_flow_key_from_flow(struct ofpbuf *buf, const struct flow *flow,
 
         mpls_key = nl_msg_put_unspec_uninit(buf, OVS_KEY_ATTR_MPLS,
                                             sizeof *mpls_key);
-        mpls_key->mpls_top_lse = flow->mpls_lse;
+        mpls_key->mpls_lse = flow->mpls_lse;
     }
 
     if (is_ip_any(flow) && !(flow->nw_frag & FLOW_NW_FRAG_LATER)) {
@@ -2250,7 +2250,7 @@ commit_mpls_action(const struct flow *flow, struct flow *base,
     } else {
         struct ovs_key_mpls mpls_key;
 
-        mpls_key.mpls_top_lse = flow->mpls_lse;
+        mpls_key.mpls_lse = flow->mpls_lse;
         commit_set_action(odp_actions, OVS_KEY_ATTR_MPLS,
                           &mpls_key, sizeof(mpls_key));
     }
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] MPLS: Add limited GSO support
From: Simon Horman @ 2013-04-03  5:24 UTC (permalink / raw)
  To: dev, netdev; +Cc: Jesse Gross, Simon Horman, Pravin B Shelar

In the case where a non-MPLS packet is recieved and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb_mac_header(skb)->protocol to the new MPLS ethertype.
  That is, either ETH_P_MPLS_UC or ETH_P_MPLS_MC.

* Leave skb->protocol as the old non-MPLS ethertype.

* Set skb->encapsulation = 1.

  This may not strictly be necessary as I believe that checking
  skb_mac_header(skb)->protocol and skb->protocol should be necessary and
  sufficient.

  However, it does seem to fit nicely with the current implementation of
  dev_hard_start_xmit() where the more expensive check of
  skb_mac_header(skb)->protocol may be guarded by an existing check of
  skb->encapsulation.

One aspect of this patch that I am unsure about is the modification I have
made to skb_segment(). This seems to be necessary as checskum accelearation
may no longer be possible as the packet has changed to be MPLS from some
other packet type which may have been supported by the hardware in use.

I will post a patch, "[PATCH v3.24] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in
datapath/actions.c:set_ethertype() and was used to exercise the MPLS GSO
code. The datapath patch is against the Open vSwtich tree but it is
intended that it be added to the Open vSwtich code present in the mainline
Linux kernel at some point.

Suggested by Jesse Gross. Based heavily on "v4 GRE: Add TCP segmentation
offload for GRE" by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/linux/netdev_features.h |    2 +
 include/linux/skbuff.h          |   15 +++--
 net/Kconfig                     |    1 +
 net/Makefile                    |    1 +
 net/core/dev.c                  |    9 ++-
 net/core/ethtool.c              |    1 +
 net/core/skbuff.c               |    3 +-
 net/ipv4/af_inet.c              |    1 +
 net/ipv4/tcp.c                  |    1 +
 net/ipv4/udp.c                  |    2 +-
 net/ipv6/ip6_offload.c          |    1 +
 net/ipv6/udp_offload.c          |    3 +-
 net/mpls/Kconfig                |    9 +++
 net/mpls/Makefile               |    4 ++
 net/mpls/mpls_gso.c             |  125 +++++++++++++++++++++++++++++++++++++++
 15 files changed, 170 insertions(+), 8 deletions(-)
 create mode 100644 net/mpls/Kconfig
 create mode 100644 net/mpls/Makefile
 create mode 100644 net/mpls/mpls_gso.c

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index d6ee2d0..e7bffa8 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -43,6 +43,7 @@ enum {
 	NETIF_F_FSO_BIT,		/* ... FCoE segmentation */
 	NETIF_F_GSO_GRE_BIT,		/* ... GRE with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
+	NETIF_F_GSO_MPLS_BIT,		/* ... MPLS segmentation */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
 		NETIF_F_GSO_UDP_TUNNEL_BIT,
 
@@ -104,6 +105,7 @@ enum {
 #define NETIF_F_RXALL		__NETIF_F(RXALL)
 #define NETIF_F_GSO_GRE		__NETIF_F(GSO_GRE)
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
+#define NETIF_F_GSO_MPLS	__NETIF_F(GSO_MPLS)
 
 /* Features valid for ethtool to change */
 /* = all defined minus driver/device-class-related */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fa88b96..e46af4c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -319,6 +319,8 @@ enum {
 	SKB_GSO_GRE = 1 << 6,
 
 	SKB_GSO_UDP_TUNNEL = 1 << 7,
+
+	SKB_GSO_MPLS = 1 << 8,
 };
 
 #if BITS_PER_LONG > 32
@@ -2789,12 +2791,17 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
 }
 #endif
 
-/* Keeps track of mac header offset relative to skb->head.
- * It is useful for TSO of Tunneling protocol. e.g. GRE.
- * For non-tunnel skb it points to skb_mac_header() and for
- * tunnel skb it points to outer mac header. */
 struct skb_gso_cb {
+	/* Keeps track of mac header offset relative to skb->head.
+	 * It is useful for TSO of Tunneling protocol. e.g. GRE.
+	 * For non-tunnel skb it points to skb_mac_header() and for
+	 * tunnel skb it points to outer mac header. */
 	int mac_offset;
+
+	/* Keeps track of the ethernet type of an encapsualted
+	 * packet where it is otherwise unknown. This is the case
+	 * if an MPLS stack is added to a non-MPLS packet. */
+	__be16 encap_eth_type;
 };
 #define SKB_GSO_CB(skb) ((struct skb_gso_cb *)(skb)->cb)
 
diff --git a/net/Kconfig b/net/Kconfig
index 2ddc904..d32a7fe 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -218,6 +218,7 @@ source "net/batman-adv/Kconfig"
 source "net/openvswitch/Kconfig"
 source "net/vmw_vsock/Kconfig"
 source "net/netlink/Kconfig"
+source "net/mpls/Kconfig"
 
 config RPS
 	boolean
diff --git a/net/Makefile b/net/Makefile
index 091e7b04..9492e8c 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -70,3 +70,4 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
+obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
diff --git a/net/core/dev.c b/net/core/dev.c
index 2db88df..8d82a24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2494,8 +2494,15 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		 * hardware encapsulation features instead of standard
 		 * features for the netdev
 		 */
-		if (skb->encapsulation)
+		if (skb->encapsulation) {
+			struct ethhdr *hdr = (struct ethhdr *)skb_mac_header(skb);
 			features &= dev->hw_enc_features;
+			if (hdr->h_proto == htons(ETH_P_MPLS_UC) ||
+			    hdr->h_proto == htons(ETH_P_MPLS_MC)) {
+				SKB_GSO_CB(skb)->encap_eth_type = skb->protocol;
+				skb->protocol = hdr->h_proto;
+			}
+		}
 
 		if (netif_needs_gso(skb, features)) {
 			if (unlikely(dev_gso_segment(skb, features)))
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index adc1351..f6d5d06 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -79,6 +79,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
 	[NETIF_F_GSO_GRE_BIT] =		 "tx-gre-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_BIT] =	 "tx-udp_tnl-segmentation",
+	[NETIF_F_GSO_MPLS_BIT] =	 "tx-mpls-segmentation",
 
 	[NETIF_F_FCOE_CRC_BIT] =         "tx-checksum-fcoe-crc",
 	[NETIF_F_SCTP_CSUM_BIT] =        "tx-checksum-sctp",
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba64614..d63292b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2824,7 +2824,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 doffset + tnl_hlen);
 
 		if (fskb != skb_shinfo(skb)->frag_list)
-			continue;
+			goto csum;
 
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
@@ -2888,6 +2888,7 @@ skip_fraglist:
 		nskb->len += nskb->data_len;
 		nskb->truesize += nskb->data_len;
 
+csum:
 		if (!csum) {
 			nskb->csum = skb_checksum(nskb, doffset,
 						  nskb->len - doffset, 0);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 93824c5..f5b4c9a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1291,6 +1291,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_TCP_ECN |
 		       SKB_GSO_GRE |
 		       SKB_GSO_UDP_TUNNEL |
+		       SKB_GSO_MPLS |
 		       0)))
 		goto out;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a96f7b5..425c933 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2914,6 +2914,7 @@ struct sk_buff *tcp_tso_segment(struct sk_buff *skb,
 			       SKB_GSO_TCP_ECN |
 			       SKB_GSO_TCPV6 |
 			       SKB_GSO_GRE |
+			       SKB_GSO_MPLS |
 			       SKB_GSO_UDP_TUNNEL |
 			       0) ||
 			     !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))))
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7117d14..0b5b57f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2371,7 +2371,7 @@ struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 
 		if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
 				      SKB_GSO_UDP_TUNNEL |
-				      SKB_GSO_GRE) ||
+				      SKB_GSO_GRE | SKB_GSO_MPLS) ||
 			     !(type & (SKB_GSO_UDP))))
 			goto out;
 
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 71b766e..a263b99 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -98,6 +98,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_TCP_ECN |
 		       SKB_GSO_GRE |
 		       SKB_GSO_UDP_TUNNEL |
+		       SKB_GSO_MPLS |
 		       SKB_GSO_TCPV6 |
 		       0)))
 		goto out;
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 3bb3a89..76d401a 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -63,7 +63,8 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 		if (unlikely(type & ~(SKB_GSO_UDP |
 				      SKB_GSO_DODGY |
 				      SKB_GSO_UDP_TUNNEL |
-				      SKB_GSO_GRE) ||
+				      SKB_GSO_GRE |
+				      SKB_GSO_MPLS) ||
 			     !(type & (SKB_GSO_UDP))))
 			goto out;
 
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
new file mode 100644
index 0000000..37421db
--- /dev/null
+++ b/net/mpls/Kconfig
@@ -0,0 +1,9 @@
+#
+# MPLS configuration
+#
+config NET_MPLS_GSO
+	tristate "MPLS: GSO support"
+	help
+	 This is helper module to allow segmentation of non-MPLS GSO packets
+	 that have had MPLS stack entries pushed onto them and thus
+	 become MPLS GSO packets.
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
new file mode 100644
index 0000000..0a3c171
--- /dev/null
+++ b/net/mpls/Makefile
@@ -0,0 +1,4 @@
+#
+# Makefile for MPLS.
+#
+obj-y += mpls_gso.o
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
new file mode 100644
index 0000000..583de9e
--- /dev/null
+++ b/net/mpls/mpls_gso.c
@@ -0,0 +1,125 @@
+/*
+ *	MPLS GSO Support
+ *
+ *	Authors: Simon Horman (horms@verge.net.au)
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ *	Based on: GSO portions of net/ipv4/gre.c
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/netdev_features.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+
+#define MPLS_BOS_MASK	0x00000100	/* Bottom of Stack bit */
+#define MPLS_HLEN	(sizeof(__be32))
+
+static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
+				       netdev_features_t features)
+{
+	struct sk_buff *segs = ERR_PTR(-EINVAL);
+	netdev_features_t enc_features;
+	int stack_len = 0;
+	int mac_len = skb->mac_len;
+
+	if (unlikely(skb_shinfo(skb)->gso_type &
+				~(SKB_GSO_TCPV4 |
+				  SKB_GSO_TCPV6 |
+				  SKB_GSO_UDP |
+				  SKB_GSO_DODGY |
+				  SKB_GSO_TCP_ECN |
+				  SKB_GSO_GRE |
+				  SKB_GSO_MPLS)))
+		goto out;
+
+	while (1) {
+		__be32 lse = *(__be32 *)(skb_network_header(skb) + stack_len);
+
+		stack_len += MPLS_HLEN;
+		if (unlikely(!pskb_may_pull(skb, MPLS_HLEN)))
+			goto out;
+		__skb_pull(skb, MPLS_HLEN);
+
+		if (lse & htonl(MPLS_BOS_MASK))
+			break;
+	}
+
+	/* setup inner skb. */
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb->mac_len = 0;
+	skb->protocol = SKB_GSO_CB(skb)->encap_eth_type;
+	skb->encapsulation = 0;
+
+	/* segment inner packet. */
+	enc_features = skb->dev->hw_enc_features & netif_skb_features(skb) &
+		~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
+	segs = skb_mac_gso_segment(skb, enc_features);
+	if (IS_ERR_OR_NULL(segs))
+		goto out;
+
+	skb = segs;
+	do {
+		__skb_push(skb,  stack_len + mac_len);
+		skb_reset_mac_header(skb);
+		skb_set_network_header(skb, mac_len);
+		skb->mac_len = mac_len;
+	} while ((skb = skb->next));
+out:
+	return segs;
+}
+
+static int mpls_gso_send_check(struct sk_buff *skb)
+{
+	if (!skb->encapsulation)
+		return -EINVAL;
+	return 0;
+}
+
+static struct packet_offload mpls_mc_offload = {
+	.type = cpu_to_be16(ETH_P_MPLS_UC),
+	.callbacks = {
+		.gso_send_check =	mpls_gso_send_check,
+		.gso_segment    =	mpls_gso_segment,
+	},
+};
+
+static struct packet_offload mpls_uc_offload = {
+	.type = cpu_to_be16(ETH_P_MPLS_UC),
+	.callbacks = {
+		.gso_send_check =	mpls_gso_send_check,
+		.gso_segment    =	mpls_gso_segment,
+	},
+};
+
+static int __init mpls_gso_init(void)
+{
+	pr_info("MPLS GSO support\n");
+
+	dev_add_offload(&mpls_uc_offload);
+	dev_add_offload(&mpls_mc_offload);
+
+	return 0;
+}
+
+static void __exit mpls_gso_exit(void)
+{
+	dev_remove_offload(&mpls_uc_offload);
+	dev_remove_offload(&mpls_mc_offload);
+}
+
+module_init(mpls_gso_init);
+module_exit(mpls_gso_exit);
+
+MODULE_DESCRIPTION("MPLS GSO support");
+MODULE_AUTHOR("Simon Horman (horms@verge.net.au)");
+MODULE_LICENSE("GPL");
+
-- 
1.7.10.4

^ permalink raw reply related

* Re: TCP: snd_cwnd is nul, please report this bug.
From: Eric Dumazet @ 2013-04-03  5:12 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev, Yuchung Cheng, Neal Cardwell
In-Reply-To: <20130403035428.GA31709@redhat.com>

On Tue, 2013-04-02 at 23:54 -0400, Dave Jones wrote:
> Just had this warning on 3.9-rc5
> 
> Not sure what else to report.  Nothing really odd going on,
> just some ssh traffic and firefox over wifi (iwlwifi)
> 
> anything else I can provide ?

Thanks for the report.

I guess we'll need to resurrect a patch I did to track the (buggy)
setting to 0.

^ permalink raw reply

* TCP: snd_cwnd is nul, please report this bug.
From: Dave Jones @ 2013-04-03  3:54 UTC (permalink / raw)
  To: netdev

Just had this warning on 3.9-rc5

Not sure what else to report.  Nothing really odd going on,
just some ssh traffic and firefox over wifi (iwlwifi)

anything else I can provide ?

	Dave

^ permalink raw reply

* RE: [PATCH v2 net-next 6/8] r8169: add a new chip for RTL8111G
From: hayeswang @ 2013-04-03  3:03 UTC (permalink / raw)
  To: 'Francois Romieu'; +Cc: netdev, linux-kernel
In-Reply-To: <20130402222728.GA16612@electric-eye.fr.zoreil.com>

Francois Romieu [mailto:romieu@fr.zoreil.com] 
> Sent: Wednesday, April 03, 2013 6:27 AM
> To: Hayeswang
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 net-next 6/8] r8169: add a new chip 
> for RTL8111G
> 
> Hayes Wang <hayeswang@realtek.com> :
> > Add a new chip for RTL8111G series.
> 
> It does not need any of the workarounds in patch #5, right ?

Excuse me, I don't sure what is your question. Do you mean if the patch "[PATCH
v2 net-next 5/8] r8169: Update the RTL8111G parameters" is necessary before the
current patch? According to the settings from our hw engineers, some settings of
the new chip are the same with the previous 8111g chip using the new parameters.
It would work without patch #5. However, it is necessary to be updated finally,
so I add this patch after patch #5.

Best Regards,
Hayes

^ permalink raw reply

* Re: [Linux-zigbee-devel] [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: Alan Ott @ 2013-04-03  2:57 UTC (permalink / raw)
  To: David Miller; +Cc: werner, netdev, linux-zigbee-devel, linux-kernel
In-Reply-To: <20130402.223036.1483788180825619960.davem@davemloft.net>

On 04/02/2013 10:30 PM, David Miller wrote:
> From: Alan Ott <alan@signal11.us>
> Date: Tue, 02 Apr 2013 22:25:28 -0400
>
>> The workqueue in mac802154 is only needed because the current mac802154
>> xmit() function is designed to be blocking and synchronous.
>>
>> Prior to my patch (#3/6), that very same workqueue would actually queue
>> up packets (without bound). That's what my patch fixes.
>>
>> The workqueue in mac802154 also serializes the access to the device for
>> other functions like setting the channel, ensuring that in the driver
>> code, one doesn't have to mutex everything. I'm not sure if that's the
>> "right" way to do it, but that's the way it was when I got here.
> This is entirely duplicating existing facilities.
>
> Your desire to allow blockability during xmit() on the basis of mutual
> exclusion is not well founded.

I'm not sure it's my desire, but rather a statement of the way it
currently is. To be clear, .ndo_start_xmit() does not block, but queues
a workqueue item which then calls ieee802154_ops->xmit() which does block.

This patch series centers around putting netif_stop_queue() and
netif_wake_queue() in the mac802154 layer. I've sent emails about this
before[1], and gotten no real suggestions about the issue, so I
proceeded with Solution #1 (as described at [1]). If you want to skip
this and go straight to solution #2, then let's talk about what that
might look like. I still think though, that there is benefit in getting
solution #1 in because it fixes some current usability problems
(including the buffer (workqueue) growing without bound).

All that said, I'm not sure I've answered your question or concern.
Please let me know if I'm still not getting it.

Alan.

[1] http://thread.gmane.org/gmane.linux.network/242495/focus=262869

^ permalink raw reply

* Re: [Linux-zigbee-devel] [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: Werner Almesberger @ 2013-04-03  2:38 UTC (permalink / raw)
  To: Alan Ott; +Cc: netdev, David S. Miller, linux-zigbee-devel, linux-kernel
In-Reply-To: <515B84EB.8020006@signal11.us>

Alan Ott wrote:
> 1. Most supported devices have only single packet output buffer, so
> blocking in the driver is the most straight-forward way to handle it.
> The alternative is to make each driver have a workqueue for xmit() (to
> lift the blocking out from atomic context). This makes each driver simpler.

It does make following the program flow a little easier, but
the difference isn't all that large if you think of it,
particularly if you have to wait not only for I/O to finish
but also for the device to send the packet.

The latter will usually be signaled by some form of interrupt,
so you're already in a situation where a callback to the higher
layers of the stack would be very natural.

> Maybe at some point this will be done. Right now we have a ton of
> pressing issues (in my opinion).

Agreed on having no shortage of nasty issues :-) And I'd like
to echo Dave's comment regarding netdev. Those ieee802154_dev
always struck me as peculiar, with flow control just being one
issue.

And things get worse when you have a complex bus underneath
your driver. For example, my USB-using atusb driver (*) has to
do a great many things usbnet already does. And any other
USB-based WPAN driver would be more or less in the same boat.
Of course, one could reinvent that wheel as well and make a
usbwpan, but ... :)

(*) Sneak preview, still with a number of issues, not only style:
    https://github.com/wpwrak/ben-wpan-linux/blob/master/drivers/net/ieee802154/atusb.c

- Werner

^ permalink raw reply

* Re: [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: David Miller @ 2013-04-03  2:30 UTC (permalink / raw)
  To: alan-yzvJWuRpmD1zbRFIqnYvSA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <515B9318.8090101-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>

From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
Date: Tue, 02 Apr 2013 22:25:28 -0400

> The workqueue in mac802154 is only needed because the current mac802154
> xmit() function is designed to be blocking and synchronous.
> 
> Prior to my patch (#3/6), that very same workqueue would actually queue
> up packets (without bound). That's what my patch fixes.
> 
> The workqueue in mac802154 also serializes the access to the device for
> other functions like setting the channel, ensuring that in the driver
> code, one doesn't have to mutex everything. I'm not sure if that's the
> "right" way to do it, but that's the way it was when I got here.

This is entirely duplicating existing facilities.

Your desire to allow blockability during xmit() on the basis of mutual
exclusion is not well founded.

------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire 
the most talented Cisco Certified professionals. Visit the 
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html

^ permalink raw reply

* Re: [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: Alan Ott @ 2013-04-03  2:25 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <20130402.220315.1782012687105065631.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On 04/02/2013 10:03 PM, David Miller wrote:
> From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
> Date: Tue, 02 Apr 2013 21:59:37 -0400
>
>> On 04/02/2013 09:56 PM, David Miller wrote:
>>> From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
>>> Date: Tue, 02 Apr 2013 21:24:59 -0400
>>>
>>>> I like it for a couple of reasons.
>>>> 1. Most supported devices have only single packet output buffer, so
>>>> blocking in the driver is the most straight-forward way to handle it.
>>>> The alternative is to make each driver have a workqueue for xmit() (to
>>>> lift the blocking out from atomic context). This makes each driver simpler.
>>>>
>>>> 2. All of the flow control can be handled one time in the mac802154 layer.
>>> We have a perfectly working flow control mechanism in the generic
>>> networking queuing layer.  Please use it instead of inventing things.
>> I'm pretty sure that's what I'm doing in [1]. When I say "flow control
>> can be handled," I mean managing calls to netif_stop_queue() and
>> netif_wake_queue(). Is there something else I should be doing instead?
> Then you shouldn't need workqueues if the generic netdev facilities
> can do the flow control properly.

The workqueue in mac802154 is only needed because the current mac802154
xmit() function is designed to be blocking and synchronous.

Prior to my patch (#3/6), that very same workqueue would actually queue
up packets (without bound). That's what my patch fixes.

The workqueue in mac802154 also serializes the access to the device for
other functions like setting the channel, ensuring that in the driver
code, one doesn't have to mutex everything. I'm not sure if that's the
"right" way to do it, but that's the way it was when I got here.

> There are several ethernet devices that have a single transmit buffer
> and function just fine, and optimally, solely using the transmit queue
> stop/start/wake infrastructure.

Yes, that does work. enc28j60 works like this. However, since it's an
SPI device (and can sleep), its ndo_start_xmit() _does_ use a workqueue
(to remove it from atomic context), the same as ours (mac802154) does.
The difference is that we do it at the mac802154 layer, while Ethernet
devices do it in the driver.

I guess one advantage to the way it currently is in mac802154, with the
synchronous xmit(), is that a return code can be had from the driver for
each packet. With my new idea that we don't need to retransmit on
failure, I'm not sure we need this return code at all.

Alan.

------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire 
the most talented Cisco Certified professionals. Visit the 
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html

^ permalink raw reply

* Re: [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: David Miller @ 2013-04-03  2:03 UTC (permalink / raw)
  To: alan-yzvJWuRpmD1zbRFIqnYvSA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <515B8D09.9050304-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>

From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
Date: Tue, 02 Apr 2013 21:59:37 -0400

> On 04/02/2013 09:56 PM, David Miller wrote:
>> From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
>> Date: Tue, 02 Apr 2013 21:24:59 -0400
>>
>>> I like it for a couple of reasons.
>>> 1. Most supported devices have only single packet output buffer, so
>>> blocking in the driver is the most straight-forward way to handle it.
>>> The alternative is to make each driver have a workqueue for xmit() (to
>>> lift the blocking out from atomic context). This makes each driver simpler.
>>>
>>> 2. All of the flow control can be handled one time in the mac802154 layer.
>> We have a perfectly working flow control mechanism in the generic
>> networking queuing layer.  Please use it instead of inventing things.
> 
> I'm pretty sure that's what I'm doing in [1]. When I say "flow control
> can be handled," I mean managing calls to netif_stop_queue() and
> netif_wake_queue(). Is there something else I should be doing instead?

Then you shouldn't need workqueues if the generic netdev facilities
can do the flow control properly.

There are several ethernet devices that have a single transmit buffer
and function just fine, and optimally, solely using the transmit queue
stop/start/wake infrastructure.

------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire 
the most talented Cisco Certified professionals. Visit the 
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html

^ permalink raw reply

* Re: [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: Alan Ott @ 2013-04-03  1:59 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <20130402.215625.1555279506975246223.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On 04/02/2013 09:56 PM, David Miller wrote:
> From: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
> Date: Tue, 02 Apr 2013 21:24:59 -0400
>
>> I like it for a couple of reasons.
>> 1. Most supported devices have only single packet output buffer, so
>> blocking in the driver is the most straight-forward way to handle it.
>> The alternative is to make each driver have a workqueue for xmit() (to
>> lift the blocking out from atomic context). This makes each driver simpler.
>>
>> 2. All of the flow control can be handled one time in the mac802154 layer.
> We have a perfectly working flow control mechanism in the generic
> networking queuing layer.  Please use it instead of inventing things.

I'm pretty sure that's what I'm doing in [1]. When I say "flow control
can be handled," I mean managing calls to netif_stop_queue() and
netif_wake_queue(). Is there something else I should be doing instead?

> If it does not meet your needs, fix it, rather than go off and do
> your own thing.  That way everyone benfits, not just you.

Fully agreed.

Alan.

[1] http://www.spinics.net/lists/netdev/msg231483.html


------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire 
the most talented Cisco Certified professionals. Visit the 
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html

^ permalink raw reply

* Re: [Linux-zigbee-devel] [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: David Miller @ 2013-04-03  1:56 UTC (permalink / raw)
  To: alan; +Cc: werner, netdev, linux-zigbee-devel, linux-kernel
In-Reply-To: <515B84EB.8020006@signal11.us>

From: Alan Ott <alan@signal11.us>
Date: Tue, 02 Apr 2013 21:24:59 -0400

> I like it for a couple of reasons.
> 1. Most supported devices have only single packet output buffer, so
> blocking in the driver is the most straight-forward way to handle it.
> The alternative is to make each driver have a workqueue for xmit() (to
> lift the blocking out from atomic context). This makes each driver simpler.
> 
> 2. All of the flow control can be handled one time in the mac802154 layer.

We have a perfectly working flow control mechanism in the generic
networking queuing layer.  Please use it instead of inventing things.

If it does not meet your needs, fix it, rather than go off and do
your own thing.  That way everyone benfits, not just you.

^ permalink raw reply

* Re: [PATCH Resend] core: should call pskb_expand_head if skb header is cloned in skb_gso_segment in rx path
From: RongQing Li @ 2013-04-03  1:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, netdev, jesse, pshelar, Herbert Xu, edumazet
In-Reply-To: <1364951990.5113.192.camel@edumazet-glaptop>

OK.

2013/4/3 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2013-04-03 at 08:51 +0800, RongQing Li wrote:
>> No, I just read and analyze it, and think it is bogus.
>
> Please rewrite the changelog, because I read it several times and could
> not understand it.
>
>
>
>

^ permalink raw reply

* Re: [PATCH 1/6] mac802154: Immediately retry sending failed packets
From: Alan Ott @ 2013-04-03  1:24 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller,
	linux-zigbee-devel, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20130402231319.GD28141@ws>

On 04/02/2013 07:13 PM, Werner Almesberger wrote:
> Alan Ott wrote:
>> it's now my opinion that we should _not_ try to retransmit at
>> all in mac802154/tx.c.
> I think the currently blocking workqueue design is ugly and
> quite contrary to how most the rest of the stack works. So
> anything that kills it has my blessing :-)

I like it for a couple of reasons.
1. Most supported devices have only single packet output buffer, so
blocking in the driver is the most straight-forward way to handle it.
The alternative is to make each driver have a workqueue for xmit() (to
lift the blocking out from atomic context). This makes each driver simpler.

2. All of the flow control can be handled one time in the mac802154 layer.

> I do wonder though why it was done like this in the first place.
> Just for convenience ?
>
> If we want to move towards an asynchronous interface, it could
> exist in parallel with the current one. That way, drivers could
> be migrated one by one.

Maybe at some point this will be done. Right now we have a ton of
pressing issues (in my opinion).

> Having said that, the errors you get there may not be failed
> single transmissions on the air but some form of congestion in
> the driver or a problem with the device. But I don't think
> that's a valid reason for retrying the transmission at that
> level.

Agreed. Like I said, I'm now of the opinion that the mac802154 layer
should _not_ retry at all.

Alan.


------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire 
the most talented Cisco Certified professionals. Visit the 
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html

^ permalink raw reply

* Re: [PATCH Resend] core: should call pskb_expand_head if skb header is cloned in skb_gso_segment in rx path
From: Eric Dumazet @ 2013-04-03  1:19 UTC (permalink / raw)
  To: RongQing Li; +Cc: Cong Wang, netdev, jesse, pshelar, herbert, edumazet
In-Reply-To: <CAJFZqHxn77yUpGkOUzywVR7wg2_L8tFaLNrFyaauuLWPJsQUHQ@mail.gmail.com>

On Wed, 2013-04-03 at 08:51 +0800, RongQing Li wrote:
> No, I just read and analyze it, and think it is bogus.

Please rewrite the changelog, because I read it several times and could
not understand it.

^ permalink raw reply

* Re: [PATCH Resend] core: should call pskb_expand_head if skb header is cloned in skb_gso_segment in rx path
From: RongQing Li @ 2013-04-03  0:51 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, jesse, pshelar, herbert, edumazet
In-Reply-To: <1364905967.22165.4.camel@cr0>

 No, I just read and analyze it, and think it is bogus.

2013/4/2 Cong Wang <amwang@redhat.com>:
> On Mon, 2013-04-01 at 10:37 +0800, roy.qing.li@gmail.com wrote:
>> From: Li RongQing <roy.qing.li@gmail.com>
>>
>> 12b0004d1d1 (adjust skb_gso_segment() for calling in rx path) tries to kill warnings
>> by checking if ip_summed is CHECK_NONE or not in rx path, since if skb_gso_segment()
>> is called on rx path, and ->ip_summed has different meaning.
>>
>> but this maybe break skb if skb header is cloned, and not expand the header, since when
>> step into skb_mac_gso_segment(), which will still check ip_summed with CHECKSUM_PARTIAL,
>> then do gso_send_check(). and after __skb_gso_segment() in queue_gso_packets() of
>> openvswitch, queue_userspace_packet() still checks ip_summed with CHECKSUM_PARTIAL,
>> and do checksum.
>>
>> so I think it is enough to ignore the warning in rx path.
>>
>
> Did you see any bogus warning triggered by it?
>
> BTW, please Cc all the people involved in the original commit you
> mentioned above.
>
>

^ permalink raw reply

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
From: Hannes Frederic Sowa @ 2013-04-03  0:43 UTC (permalink / raw)
  To: Huang, Xiong
  Cc: Ben Hutchings, Anders Boström, netdev@vger.kernel.org,
	565404@bugs.debian.org
In-Reply-To: <157393863283F442885425D2C45428564F202BB6@nasanexd02f.na.qualcomm.com>

On Wed, Apr 03, 2013 at 12:12:12AM +0000, Huang, Xiong wrote:
> > > Hannes,  Thanks for your testing !
> > >
> > >  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX
> > configuration...
> > > I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....
> > 
> > I tested both values with multi-gigabyte nfsv4 traffic and both values are ok.
> > If I understand you correctly 0x4000 is a safe limit?
> 
> Since Win7 driver uses 15000 bytes as its max packet length for TSO, I think 0x3C00 is more safer than 0x4000. :)

Thanks again for helping to resolve this issue. I just submitted a patch
but accidently killed the cc-line.

Greetings,

  Hannes

^ permalink raw reply

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
From: Hannes Frederic Sowa @ 2013-04-03  0:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström,
	netdev@vger.kernel.org, 565404@bugs.debian.org
In-Reply-To: <1364942093.5113.189.camel@edumazet-glaptop>

On Tue, Apr 02, 2013 at 03:34:53PM -0700, Eric Dumazet wrote:
> Really I don't understand why people use u16 instead of u32.
> 
> u16 is slower most of the time, and more prone to overflows.
> 
> diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> index 7e0a822..48ac487 100644
> --- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> +++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> @@ -1569,18 +1569,17 @@ static u16 atl1e_cal_tdp_req(const struct sk_buff *skb)
>  {
>  	int i = 0;
>  	u16 tpd_req = 1;
> -	u16 fg_size = 0;
> -	u16 proto_hdr_len = 0;
>  
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> -		fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
> +		u32 fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
> +
>  		tpd_req += ((fg_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT);
>  	}
>  
>  	if (skb_is_gso(skb)) {
>  		if (skb->protocol == htons(ETH_P_IP) ||
>  		   (skb_shinfo(skb)->gso_type == SKB_GSO_TCPV6)) {
> -			proto_hdr_len = skb_transport_offset(skb) +
> +			u32 proto_hdr_len = skb_transport_offset(skb) +
>  					tcp_hdrlen(skb);
>  			if (proto_hdr_len < skb_headlen(skb)) {
>  				tpd_req += ((skb_headlen(skb) - proto_hdr_len +
> @@ -1670,7 +1669,7 @@ static void atl1e_tx_map(struct atl1e_adapter *adapter,
>  {
>  	struct atl1e_tpd_desc *use_tpd = NULL;
>  	struct atl1e_tx_buffer *tx_buffer = NULL;
> -	u16 buf_len = skb_headlen(skb);
> +	u32 buf_len = skb_headlen(skb);
>  	u16 map_len = 0;
>  	u16 mapped_len = 0;
>  	u16 hdr_len = 0;
> 

I tested this patch ontop of the patch which reduces gso max size to 0x3c00.
If you want to submit the patch you could add my acked-by.

Thanks,

  Hannes

^ permalink raw reply

* [PATCH net] atl1e: limit gso segment size to prevent generation of wrong ip length fields
From: Hannes Frederic Sowa @ 2013-04-03  0:36 UTC (permalink / raw)
  To: netdev

The limit of 0x3c00 is taken from the windows driver.

Suggested-by: Huang, Xiong <xiong@qca.qualcomm.com>
Cc: Huang, Xiong <xiong@qca.qualcomm.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
I think this is stable material, too.

 drivers/net/ethernet/atheros/atl1e/atl1e.h      | 2 +-
 drivers/net/ethernet/atheros/atl1e/atl1e_main.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e.h b/drivers/net/ethernet/atheros/atl1e/atl1e.h
index edfdf6b..b5fd934 100644
--- a/drivers/net/ethernet/atheros/atl1e/atl1e.h
+++ b/drivers/net/ethernet/atheros/atl1e/atl1e.h
@@ -186,7 +186,7 @@ struct atl1e_tpd_desc {
 /* how about 0x2000 */
 #define MAX_TX_BUF_LEN      0x2000
 #define MAX_TX_BUF_SHIFT    13
-/*#define MAX_TX_BUF_LEN  0x3000 */
+#define MAX_TSO_SEG_SIZE    0x3c00
 
 /* rrs word 1 bit 0:31 */
 #define RRS_RX_CSUM_MASK	0xFFFF
diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
index 7e0a822..d058d00 100644
--- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
+++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
@@ -2327,6 +2327,7 @@ static int atl1e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	INIT_WORK(&adapter->reset_task, atl1e_reset_task);
 	INIT_WORK(&adapter->link_chg_task, atl1e_link_chg_task);
+	netif_set_gso_max_size(netdev, MAX_TSO_SEG_SIZE);
 	err = register_netdev(netdev);
 	if (err) {
 		netdev_err(netdev, "register netdevice failed\n");
-- 
1.8.1.4

^ permalink raw reply related

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
From: Huang, Xiong @ 2013-04-03  0:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Ben Hutchings, Anders Boström, netdev@vger.kernel.org,
	565404@bugs.debian.org
In-Reply-To: <20130403000020.GI4924@order.stressinduktion.org>

> > Hannes,  Thanks for your testing !
> >
> >  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX
> configuration...
> > I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....
> 
> I tested both values with multi-gigabyte nfsv4 traffic and both values are ok.
> If I understand you correctly 0x4000 is a safe limit?

Since Win7 driver uses 15000 bytes as its max packet length for TSO, I think 0x3C00 is more safer than 0x4000. :)

Thanks
Xiong

^ permalink raw reply

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
From: Hannes Frederic Sowa @ 2013-04-03  0:00 UTC (permalink / raw)
  To: Huang, Xiong
  Cc: Ben Hutchings, Anders Boström, netdev@vger.kernel.org,
	565404@bugs.debian.org
In-Reply-To: <157393863283F442885425D2C45428564F202B2A@nasanexd02f.na.qualcomm.com>

On Tue, Apr 02, 2013 at 10:23:54PM +0000, Huang, Xiong wrote:
> 
> > 
> > On Tue, Apr 02, 2013 at 09:51:12PM +0000, Huang, Xiong wrote:
> > > > The error vanishes as soon as I put a gso size limit of
> > > > MAX_TX_BUF_LEN in the driver. MAX_TX_BUF_LEN seems to be
> > arbitrary
> > > > set to 0x2000. I can even raise it to 0x3000 and don't see any tcp
> > > > retransmits. Do you have an advice on how to size this value (e.g. should
> > we switch to the windows values)?
> > > >
> > >
> > > Would you try 0x4000 ? because the buffer-length in TX descriptor is 14bits,
> > 0x4000 exceeds max value.
> > > Do you find any bug/issue on the code that calculate the length for each TX
> > descriptor ?
> > 
> > Setting MAX_TX_BUF_LEN to 0x4000
> > 
> > [ 8949.833750] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full
> > Duplex> [ 8949.833783] IPv6: ADDRCONF(NETDEV_CHANGE): p33p1: link
> > becomes ready [ 8960.861557] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error
> > (status = 0x5000400) [ 8960.866879] ATL1E 0000:04:00.0 p33p1: NIC Link is Up
> > <100 Mbps Full Duplex> [ 8961.095266] ATL1E 0000:04:00.0 p33p1: PCIE DMA
> > RW error (status = 0x5000400) [ 8961.100791] ATL1E 0000:04:00.0 p33p1: NIC
> > Link is Up <100 Mbps Full Duplex>
> > 
> Hannes,  Thanks for your testing !
> 
>  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX configuration...
> I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....

I tested both values with multi-gigabyte nfsv4 traffic and both values are ok.
If I understand you correctly 0x4000 is a safe limit?

^ permalink raw reply

* question about kernel patch: ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Ani Sinha @ 2013-04-02 23:49 UTC (permalink / raw)
  To: netdev, Eric Dumazet, David S. Miller
  Cc: Francesco Ruggeri, Bob Gilligan, Prasad Koya

hello :

I was looking at the former netdev thread :
http://www.spinics.net/lists/netdev/msg193411.html

It seems the following patch was applied in Linux 3.5 to fix the oops
in ip_route_output_slow :

commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date:   Wed May 23 15:39:45 2012 +0000

    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

We hit into a very similar oops in the same place but with 2.6.38
kernel. We are thinking of backporting this patch to our 2.6.38 kernel
tree. Since I am not very familier with this code, I wanted to seek
the opinion of netdev folks and ask whether this will be safe to do
so. We did notice that this patch was backported to longterm supported
3.4 tree as well. We also notice that this patch was committed almost
one year back and has remained intact in Torvalds' latest and greatest
tip.

The patch itself looks simple enough but obviously we do not know the
code path well enough to understand all the impacts of this change.

Any suggestions will be greatly appreciated.

thanks,
ani

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Don't insert empty OVS_VPORT_ATTR_OPTIONS attribute
From: Jesse Gross @ 2013-04-02 23:36 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev@openvswitch.org, netdev
In-Reply-To: <8f6704668f7f795658b48a7b0219c88d0ca088a3.1364941760.git.tgraf@suug.ch>

On Tue, Apr 2, 2013 at 3:30 PM, Thomas Graf <tgraf@suug.ch> wrote:
> The port specific options are currently unused resulting in an
> empty OVS_VPORT_ATTR_OPTIONS nested attribute being inserted
> into every OVS_VPORT_CMD_GET message.
>
> Don't insert OVS_VPORT_ATTR_OPTIONS if no options are present.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Provide OVS_DP_ATTR_UPCALL_PID in datapath messages
From: Jesse Gross @ 2013-04-02 23:30 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev@openvswitch.org, netdev
In-Reply-To: <e7d0e105e879fb78ff3e5628147e41d52fb0cddd.1364941614.git.tgraf@suug.ch>

On Tue, Apr 2, 2013 at 3:28 PM, Thomas Graf <tgraf@suug.ch> wrote:
> The upcall port configured when adding a new datapath is currently
> only provided to user space as part of the vport message. Therefore
> user space has to request two separate messages which is prone to
> race conditions.
>
> Provide the upcall port of the local port (0) of a data path in the
> datapath message to gain symmetry between the SET and GET command.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Can you describe the race condition some more?  The kernel doesn't
change the port ID on its own so even needing to request the value
seems rare.

Assigning the local ports upcall PID through datapath creation is
really somewhat of a hack since it's port state.  I don't disagree
that it's somewhat asymmetric now but it seems better to move away
from the current model if possible.

^ permalink raw reply

* Re: [PATCH net-next 5/7] r8169: add a new chip for RTL8111G
From: David Miller @ 2013-04-02 23:26 UTC (permalink / raw)
  To: romieu; +Cc: hayeswang, netdev, linux-kernel
In-Reply-To: <20130402232008.GB16612@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 3 Apr 2013 01:20:08 +0200

> hayeswang <hayeswang@realtek.com> :
>> Francois Romieu [mailto:romieu@fr.zoreil.com] 
> [...]
>> > There is close to zero added value for this stuff in the kernel.
>> > You may as well move it completely into the firmware.
>> 
>> Do you mean all of the phy settings ? I have checked these settings with
>> our hw engineers. These are not firmware. 
> 
> Undocumented configuration data which is subject to change over time ?
> 
> No one outside of Realtek can make any sense of this opaque pile of data.
> There is no point in me or anybody else rubber stamping it for inclusion.

Right, so either document all of these indirect registers being programmed
in the PHY or move it to firmware.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox