Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] net/smsc911x: Fix delays in the PHY enable/disable routines
From: David Miller @ 2014-11-13 19:38 UTC (permalink / raw)
  To: al.kochet; +Cc: netdev, steve.glendinning
In-Reply-To: <1415841980-14250-2-git-send-email-al.kochet@gmail.com>

From: Alexander Kochetkov <al.kochet@gmail.com>
Date: Thu, 13 Nov 2014 05:26:20 +0400

> Increased delay in the smsc911x_phy_disable_energy_detect (from 1ms to 2ms).
> Dropped delays in the smsc911x_phy_enable_energy_detect (100ms and 1ms).
> 
> The patch affect SMSC LAN generation 4 chips with integrated PHY (LAN9221).
> 
> I saw problems with soft reset due to wrong udelay timings.
> After I fixed udelay, I measured the time needed to bring integrated PHY
> from power-down to operational mode (the time beetween clearing EDPWRDOWN
> bit and soft reset complete event). I got 1ms (measured using ktime_get).
> The value is equal to the current value (1ms) used in the
> smsc911x_phy_disable_energy_detect. It is near the upper bound and in order
> to avoid rare soft reset faults it is doubled (2ms).
> 
> I don't know official timing for bringing up integrated PHY as specs doesn't
> clarify this (or may be I didn't found).
> 
> It looks safe to drop delays before and after setting EDPWRDOWN bit
> (enable PHY power-down mode). I didn't saw any regressions with the patch.
> 
> The patch was reviewed by Steve Glendinning and Microchip Team.
> 
> Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
> Acked-by: Steve Glendinning <steve.glendinning@shawell.net>

Applied.

^ permalink raw reply

* Re: [PATCH 1/2] net/smsc911x: Fix rare soft reset timeout issue due to PHY power-down mode
From: David Miller @ 2014-11-13 19:38 UTC (permalink / raw)
  To: al.kochet; +Cc: netdev, steve.glendinning
In-Reply-To: <1415841980-14250-1-git-send-email-al.kochet@gmail.com>

From: Alexander Kochetkov <al.kochet@gmail.com>
Date: Thu, 13 Nov 2014 05:26:19 +0400

> The patch affect SMSC LAN generation 4 chips with integrated PHY (LAN9221).
> 
> It is possible that PHY could enter power-down mode (ENERGYON clear),
> between ENERGYON bit check in smsc911x_phy_disable_energy_detect and SRST
> bit set in smsc911x_soft_reset. This could happen, for example, if someone
> disconnect ethernet cable between the checks. The PHY in a power-down mode
> would prevent the MAC portion of chip to be software reseted.
> 
> Initially found by code review, confirmed later using test case.
> 
> This is low probability issue, and in order to reproduce it you have to
> run the script:
> 
> while true; do
> 	ifconfig eth0 down
> 	ifconfig eth0 up || break
> done
> 
> While the script is running you have to plug/unplug ethernet cable many
> times (using gpio controlled ethernet switch, for example) until get:
> 
> [ 4516.477783] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 4516.512207] smsc911x smsc911x.0: eth0: SMSC911x/921x identified at 0xce006000, IRQ: 336
> [ 4516.524658] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 4516.559082] smsc911x smsc911x.0: eth0: SMSC911x/921x identified at 0xce006000, IRQ: 336
> [ 4516.571990] ADDRCONF(NETDEV_UP): eth0: link is not ready
> ifconfig: SIOCSIFFLAGS: Input/output error
> 
> The patch was reviewed by Steve Glendinning and Microchip Team.
> 
> Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
> Acked-by: Steve Glendinning <steve.glendinning@shawell.net>

Applied.

^ permalink raw reply

* [PATCHv10 ovs 12/15] datapath: Add support for unique flow identifiers.
From: Joe Stringer @ 2014-11-13 19:17 UTC (permalink / raw)
  To: dev; +Cc: Pravin B Shelar, netdev
In-Reply-To: <1415906275-3172-1-git-send-email-joestringer@nicira.com>

Previously, flows were manipulated by userspace specifying a full,
unmasked flow key. This adds significant burden onto flow
serialization/deserialization, particularly when dumping flows.

This patch adds an alternative way to refer to flows using a
variable-length "unique flow identifier" (UFID). At flow setup time,
userspace may specify a UFID for a flow, which is stored with the flow
and inserted into a separate table for lookup, in addition to the
standard flow table. Flows created using a UFID must be fetched or
deleted using the UFID.

All flow dump operations may now be made more terse with OVS_UFID_F_*
flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to
omit the flow key from a datapath operation if the flow has a
corresponding UFID. This significantly reduces the time spent assembling
and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags
enabled, the datapath only returns the UFID and statistics for each flow
during flow dump, increasing ovs-vswitchd revalidator performance by up
to 50%.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
CC: Pravin B Shelar <pshelar@nicira.com>
CC: netdev@vger.kernel.org
---
v10: Ignore flow_key in requests if UFID is specified.
     Only allow UFID flows to be indexed by UFID.
     Only allow non-UFID flows to be indexed by unmasked flow key.
     Unite the unmasked_key and ufid+ufid_hash in 'struct sw_flow'.
     Don't periodically rehash the UFID table.
     Resize the UFID table independently from the flow table.
     Modify table_destroy() to iterate once and delete from both tables.
     Fix UFID memory leak in flow_free().
     Remove kernel-only UFIDs for non-UFID cases.
     Rename "OVS_UFID_F_SKIP_*" -> "OVS_UFID_F_OMIT_*"
     Update documentation.
     Rebase.
v9: No change.
v8: Rename UID -> UFID "unique flow identifier".
    Fix null dereference when adding flow without uid or mask.
    If UFID and not match are specified, and lookup fails, return ENOENT.
    Rebase.
v7: Remove OVS_DP_F_INDEX_BY_UID.
    Rework UID serialisation for variable-length UID.
    Log error if uid not specified and OVS_UID_F_SKIP_KEY is set.
    Rebase against "probe" logging changes.
v6: Fix documentation for supporting UIDs between 32-128 bits.
    Minor style fixes.
    Rebase.
v5: No change.
v4: Fix memory leaks.
    Log when triggering the older userspace issue above.
v3: Initial post.
---
 datapath/README.md                                |   13 ++
 datapath/datapath.c                               |  248 +++++++++++++++------
 datapath/flow.h                                   |   20 +-
 datapath/flow_netlink.c                           |   35 +++
 datapath/flow_netlink.h                           |    1 +
 datapath/flow_table.c                             |  214 ++++++++++++++----
 datapath/flow_table.h                             |    8 +
 datapath/linux/compat/include/linux/openvswitch.h |   30 +++
 8 files changed, 453 insertions(+), 116 deletions(-)

diff --git a/datapath/README.md b/datapath/README.md
index a8effa3..9c03a2b 100644
--- a/datapath/README.md
+++ b/datapath/README.md
@@ -131,6 +131,19 @@ performs best-effort detection of overlapping wildcarded flows and may reject
 some but not all of them. However, this behavior may change in future versions.
 
 
+Unique flow identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+
 Basic rule for evolving flow keys
 ---------------------------------
 
diff --git a/datapath/datapath.c b/datapath/datapath.c
index a898584..c14d834 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -671,11 +671,18 @@ static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
 	}
 }
 
-static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
+static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,
+				    const struct sw_flow_id *sfid)
 {
+	size_t sfid_len = 0;
+
+	if (sfid && sfid->ufid_len)
+		sfid_len = nla_total_size(sfid->ufid_len);
+
 	return NLMSG_ALIGN(sizeof(struct ovs_header))
 		+ nla_total_size(ovs_key_attr_size()) /* OVS_FLOW_ATTR_KEY */
 		+ nla_total_size(ovs_key_attr_size()) /* OVS_FLOW_ATTR_MASK */
+		+ sfid_len /* OVS_FLOW_ATTR_UFID */
 		+ nla_total_size(sizeof(struct ovs_flow_stats)) /* OVS_FLOW_ATTR_STATS */
 		+ nla_total_size(1) /* OVS_FLOW_ATTR_TCP_FLAGS */
 		+ nla_total_size(8) /* OVS_FLOW_ATTR_USED */
@@ -684,33 +691,43 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
 
 /* Called with ovs_mutex or RCU read lock. */
 static int ovs_flow_cmd_fill_match(const struct sw_flow *flow,
-				   struct sk_buff *skb)
+				   struct sk_buff *skb, u32 ufid_flags)
 {
 	struct nlattr *nla;
 	int err;
 
-	/* Fill flow key. */
-	nla = nla_nest_start(skb, OVS_FLOW_ATTR_KEY);
-	if (!nla)
-		return -EMSGSIZE;
-
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb,
-			       false);
-	if (err)
-		return err;
-
-	nla_nest_end(skb, nla);
+	/* Fill flow key. If userspace didn't specify a UFID, then ignore the
+	 * OMIT_KEY flag. */
+	if (!(ufid_flags & OVS_UFID_F_OMIT_KEY) ||
+	    !flow->index_by_ufid) {
+		nla = nla_nest_start(skb, OVS_FLOW_ATTR_KEY);
+		if (!nla)
+			return -EMSGSIZE;
+
+		if (flow->index_by_ufid)
+			err = ovs_nla_put_flow(&flow->mask->key, &flow->key,
+					       skb, false);
+		else
+			err = ovs_nla_put_flow(&flow->index.unmasked_key,
+					       &flow->index.unmasked_key, skb,
+					       false);
+		if (err)
+			return err;
+		nla_nest_end(skb, nla);
+	}
 
 	/* Fill flow mask. */
-	nla = nla_nest_start(skb, OVS_FLOW_ATTR_MASK);
-	if (!nla)
-		return -EMSGSIZE;
+	if (!(ufid_flags & OVS_UFID_F_OMIT_MASK)) {
+		nla = nla_nest_start(skb, OVS_FLOW_ATTR_MASK);
+		if (!nla)
+			return -EMSGSIZE;
 
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb, true);
-	if (err)
-		return err;
+		err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb, true);
+		if (err)
+			return err;
+		nla_nest_end(skb, nla);
+	}
 
-	nla_nest_end(skb, nla);
 	return 0;
 }
 
@@ -740,6 +757,32 @@ static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
 }
 
 /* Called with ovs_mutex or RCU read lock. */
+static int ovs_flow_cmd_fill_ufid(const struct sw_flow *flow,
+				  struct sk_buff *skb)
+{
+	struct nlattr *start;
+	const struct sw_flow_id *sfid;
+
+	if (!flow->index_by_ufid)
+		return 0;
+
+	sfid = &flow->index.ufid;
+	start = nla_nest_start(skb, OVS_FLOW_ATTR_UFID);
+	if (start) {
+		int err;
+
+		err = nla_put(skb, OVS_UFID_ATTR_ID, sfid->ufid_len,
+			      sfid->ufid);
+		if (err)
+			return err;
+		nla_nest_end(skb, start);
+	} else
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+/* Called with ovs_mutex or RCU read lock. */
 static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
 				     struct sk_buff *skb, int skb_orig_len)
 {
@@ -782,7 +825,7 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
 /* Called with ovs_mutex or RCU read lock. */
 static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 				  struct sk_buff *skb, u32 portid,
-				  u32 seq, u32 flags, u8 cmd)
+				  u32 seq, u32 flags, u8 cmd, u32 ufid_flags)
 {
 	const int skb_orig_len = skb->len;
 	struct ovs_header *ovs_header;
@@ -795,18 +838,24 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 
 	ovs_header->dp_ifindex = dp_ifindex;
 
-	err = ovs_flow_cmd_fill_match(flow, skb);
+	err = ovs_flow_cmd_fill_match(flow, skb, ufid_flags);
 	if (err)
 		goto error;
 
-	err = ovs_flow_cmd_fill_stats(flow, skb);
+	err = ovs_flow_cmd_fill_ufid(flow, skb);
 	if (err)
 		goto error;
 
-	err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
+	err = ovs_flow_cmd_fill_stats(flow, skb);
 	if (err)
 		goto error;
 
+	if (!(ufid_flags & OVS_UFID_F_OMIT_ACTIONS)) {
+		err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
+		if (err)
+			goto error;
+	}
+
 	return genlmsg_end(skb, ovs_header);
 
 error:
@@ -816,6 +865,7 @@ error:
 
 /* May not be called with RCU read lock. */
 static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *acts,
+					       const struct sw_flow_id *sfid,
 					       struct genl_info *info,
 					       bool always)
 {
@@ -825,30 +875,36 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
 					GROUP_ID(&ovs_dp_flow_multicast_group)))
 		return NULL;
 
-	skb = genlmsg_new_unicast(ovs_flow_cmd_msg_size(acts), info, GFP_KERNEL);
+	skb = genlmsg_new_unicast(ovs_flow_cmd_msg_size(acts, sfid), info,
+				  GFP_KERNEL);
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
 	return skb;
 }
 
+static const struct sw_flow_id *flow_get_ufid(const struct sw_flow *flow)
+{
+	return flow->index_by_ufid ? &flow->index.ufid : NULL;
+}
+
 /* Called with ovs_mutex. */
 static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
 					       int dp_ifindex,
 					       struct genl_info *info, u8 cmd,
-					       bool always)
+					       bool always, u32 ufid_flags)
 {
 	struct sk_buff *skb;
 	int retval;
 
-	skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts), info,
-				      always);
+	skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts),
+				      flow_get_ufid(flow), info, always);
 	if (IS_ERR_OR_NULL(skb))
 		return skb;
 
 	retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
 					info->snd_portid, info->snd_seq, 0,
-					cmd);
+					cmd, ufid_flags);
 	BUG_ON(retval < 0);
 	return skb;
 }
@@ -863,6 +919,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	struct datapath *dp;
 	struct sw_flow_actions *acts;
 	struct sw_flow_match match;
+	struct sw_flow_id sfid;
+	u32 ufid_flags;
 	int error;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
@@ -887,13 +945,20 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	/* Extract key. */
-	ovs_match_init(&match, &new_flow->unmasked_key, &mask);
+	ovs_match_init(&match, &new_flow->index.unmasked_key, &mask);
 	error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
 				  a[OVS_FLOW_ATTR_MASK], log);
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &new_flow->unmasked_key, &mask);
+	ovs_flow_mask_key(&new_flow->key, &new_flow->index.unmasked_key, &mask);
+
+	/* Extract ufid. */
+	error = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &sfid, &ufid_flags);
+	if (!error)
+		error = ovs_flow_ufid(new_flow, &sfid);
+	if (error)
+		goto err_kfree_flow;
 
 	/* Validate actions. */
 	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
@@ -903,7 +968,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		goto err_kfree_flow;
 	}
 
-	reply = ovs_flow_cmd_alloc_info(acts, info, false);
+	reply = ovs_flow_cmd_alloc_info(acts, &sfid, info, false);
 	if (IS_ERR(reply)) {
 		error = PTR_ERR(reply);
 		goto err_kfree_acts;
@@ -915,8 +980,9 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		error = -ENODEV;
 		goto err_unlock_ovs;
 	}
+
 	/* Check if this is a duplicate flow */
-	flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->unmasked_key);
+	flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->key);
 	if (likely(!flow)) {
 		rcu_assign_pointer(new_flow->sf_acts, acts);
 
@@ -932,7 +998,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
+						       OVS_FLOW_CMD_NEW,
+						       ufid_flags);
 			BUG_ON(error < 0);
 		}
 		ovs_unlock();
@@ -950,14 +1017,16 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 			error = -EEXIST;
 			goto err_unlock_ovs;
 		}
-		/* The unmasked key has to be the same for flow updates. */
-		if (unlikely(!ovs_flow_cmp_unmasked_key(flow, &match))) {
-			/* Look for any overlapping flow. */
+		/* The flow identifier has to be the same for flow updates. */
+		if (sfid.ufid_len) {
+			if (unlikely(!ovs_flow_cmp_ufid(flow, &sfid)))
+				flow = NULL;
+		} else if (unlikely(!ovs_flow_cmp_unmasked_key(flow, &match))) {
 			flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
-			if (!flow) {
-				error = -ENOENT;
-				goto err_unlock_ovs;
-			}
+		}
+		if (unlikely(!flow)) {
+			error = -ENOENT;
+			goto err_unlock_ovs;
 		}
 		/* Update actions. */
 		old_acts = ovsl_dereference(flow->sf_acts);
@@ -968,7 +1037,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
+						       OVS_FLOW_CMD_NEW,
+						       ufid_flags);
 			BUG_ON(error < 0);
 		}
 		ovs_unlock();
@@ -1018,30 +1088,36 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
-	struct sw_flow *flow;
+	struct sw_flow *flow = NULL;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply = NULL;
 	struct datapath *dp;
 	struct sw_flow_actions *old_acts = NULL, *acts = NULL;
 	struct sw_flow_match match;
+	struct sw_flow_id ufid;
+	u32 ufid_flags;
 	int error;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	/* Extract key. */
-	error = -EINVAL;
-	if (!a[OVS_FLOW_ATTR_KEY]) {
-		OVS_NLERR(log, "Flow key attribute not present in set flow.");
+	/* Extract ufid/key. */
+	error = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &ufid,
+				 &ufid_flags);
+	if (error)
 		goto error;
+	if (a[OVS_FLOW_ATTR_KEY]) {
+		ovs_match_init(&match, &key, &mask);
+		error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
+					  a[OVS_FLOW_ATTR_MASK], log);
+	} else if (!ufid.ufid_len) {
+		OVS_NLERR(log, "Flow key attribute not present in set flow.\n");
+		error = -EINVAL;
 	}
-
-	ovs_match_init(&match, &key, &mask);
-	error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
-				  a[OVS_FLOW_ATTR_MASK], log);
 	if (error)
 		goto error;
 
 	/* Validate actions. */
-	if (a[OVS_FLOW_ATTR_ACTIONS]) {
+	if (a[OVS_FLOW_ATTR_ACTIONS] && a[OVS_FLOW_ATTR_KEY] &&
+	    a[OVS_FLOW_ATTR_MASK]) {
 		acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, &mask,
 					log);
 		if (IS_ERR(acts)) {
@@ -1050,7 +1126,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 		}
 
 		/* Can allocate before locking if have acts. */
-		reply = ovs_flow_cmd_alloc_info(acts, info, false);
+		reply = ovs_flow_cmd_alloc_info(acts, &ufid, info, false);
 		if (IS_ERR(reply)) {
 			error = PTR_ERR(reply);
 			goto err_kfree_acts;
@@ -1064,7 +1140,10 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_ovs;
 	}
 	/* Check that the flow exists. */
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid.ufid_len)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (unlikely(!flow)) {
 		error = -ENOENT;
 		goto err_unlock_ovs;
@@ -1080,13 +1159,15 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
+						       OVS_FLOW_CMD_NEW,
+						       ufid_flags);
 			BUG_ON(error < 0);
 		}
 	} else {
 		/* Could not alloc without acts before locking. */
 		reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
-						info, OVS_FLOW_CMD_NEW, false);
+						info, OVS_FLOW_CMD_NEW, false,
+						ufid_flags);
 		if (unlikely(IS_ERR(reply))) {
 			error = PTR_ERR(reply);
 			goto err_unlock_ovs;
@@ -1123,17 +1204,22 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	struct sw_flow *flow;
 	struct datapath *dp;
 	struct sw_flow_match match;
+	struct sw_flow_id ufid;
+	u32 ufid_flags;
 	int err;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	if (!a[OVS_FLOW_ATTR_KEY]) {
+	err = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &ufid, &ufid_flags);
+	if (err)
+		return err;
+	if (a[OVS_FLOW_ATTR_KEY]) {
+		ovs_match_init(&match, &key, NULL);
+		err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL, log);
+	} else if (!ufid.ufid_len) {
 		OVS_NLERR(log,
-			  "Flow get message rejected, Key attribute missing.");
-		return -EINVAL;
+			  "Flow get message rejected, Key attribute missing.\n");
+		err = -EINVAL;
 	}
-
-	ovs_match_init(&match, &key, NULL);
-	err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL, log);
 	if (err)
 		return err;
 
@@ -1144,14 +1230,17 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid.ufid_len)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (!flow) {
 		err = -ENOENT;
 		goto unlock;
 	}
 
 	reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
-					OVS_FLOW_CMD_NEW, true);
+					OVS_FLOW_CMD_NEW, true, ufid_flags);
 	if (IS_ERR(reply)) {
 		err = PTR_ERR(reply);
 		goto unlock;
@@ -1170,13 +1259,18 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct sw_flow *flow = NULL;
 	struct datapath *dp;
 	struct sw_flow_match match;
+	struct sw_flow_id ufid;
+	u32 ufid_flags;
 	int err;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	if (likely(a[OVS_FLOW_ATTR_KEY])) {
+	err = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &ufid, &ufid_flags);
+	if (err)
+		return err;
+	if (a[OVS_FLOW_ATTR_KEY]) {
 		ovs_match_init(&match, &key, NULL);
 		err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL,
 					log);
@@ -1191,12 +1285,15 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
-	if (unlikely(!a[OVS_FLOW_ATTR_KEY])) {
+	if (unlikely(!a[OVS_FLOW_ATTR_KEY] && !ufid.ufid_len)) {
 		err = ovs_flow_tbl_flush(&dp->table);
 		goto unlock;
 	}
 
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid.ufid_len)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (unlikely(!flow)) {
 		err = -ENOENT;
 		goto unlock;
@@ -1206,7 +1303,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ovs_unlock();
 
 	reply = ovs_flow_cmd_alloc_info(rcu_dereference_raw(flow->sf_acts),
-					info, false);
+					flow_get_ufid(flow), info, false);
 
 	if (likely(reply)) {
 		if (likely(!IS_ERR(reply))) {
@@ -1214,7 +1311,8 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 			err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
 						     reply, info->snd_portid,
 						     info->snd_seq, 0,
-						     OVS_FLOW_CMD_DEL);
+						     OVS_FLOW_CMD_DEL,
+						     ufid_flags);
 			rcu_read_unlock();
 			BUG_ON(err < 0);
 			ovs_notify(&dp_flow_genl_family, &ovs_dp_flow_multicast_group, reply, info);
@@ -1235,8 +1333,15 @@ unlock:
 static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh));
+	struct nlattr *nla, *ufid;
 	struct table_instance *ti;
 	struct datapath *dp;
+	u32 ufid_flags = 0;
+
+	nla = nlmsg_attrdata(cb->nlh, sizeof(*ovs_header));
+	ufid = nla_find_nested(nla, OVS_FLOW_ATTR_UFID);
+	if (ufid && ovs_nla_get_ufid(ufid, NULL, &ufid_flags))
+		OVS_NLERR(true, "Error occurred parsing UFID flags on dump");
 
 	rcu_read_lock();
 	dp = get_dp_rcu(sock_net(skb->sk), ovs_header->dp_ifindex);
@@ -1259,7 +1364,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
 					   NETLINK_CB(cb->skb).portid,
 					   cb->nlh->nlmsg_seq, NLM_F_MULTI,
-					   OVS_FLOW_CMD_NEW) < 0)
+					   OVS_FLOW_CMD_NEW, ufid_flags) < 0)
 			break;
 
 		cb->args[0] = bucket;
@@ -1275,6 +1380,7 @@ static const struct nla_policy flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 	[OVS_FLOW_ATTR_ACTIONS] = { .type = NLA_NESTED },
 	[OVS_FLOW_ATTR_CLEAR] = { .type = NLA_FLAG },
 	[OVS_FLOW_ATTR_PROBE] = { .type = NLA_FLAG },
+	[OVS_FLOW_ATTR_UFID] = { .type = NLA_NESTED },
 };
 
 static struct genl_ops dp_flow_genl_ops[] = {
diff --git a/datapath/flow.h b/datapath/flow.h
index 2bbf789..736e0eb 100644
--- a/datapath/flow.h
+++ b/datapath/flow.h
@@ -196,6 +196,13 @@ struct sw_flow_match {
 	struct sw_flow_mask *mask;
 };
 
+struct sw_flow_id {
+	struct hlist_node node[2];
+	u32 hash;
+	u8 *ufid;
+	u8 ufid_len;
+};
+
 struct sw_flow_actions {
 	struct rcu_head rcu;
 	u32 actions_len;
@@ -212,13 +219,20 @@ struct flow_stats {
 
 struct sw_flow {
 	struct rcu_head rcu;
-	struct hlist_node hash_node[2];
-	u32 hash;
+	struct {
+		struct hlist_node node[2];
+		u32 hash;
+	} flow_hash;
 	int stats_last_writer;		/* NUMA-node id of the last writer on
 					 * 'stats[0]'.
 					 */
 	struct sw_flow_key key;
-	struct sw_flow_key unmasked_key;
+	bool index_by_ufid;		/* Which of the below that userspace
+					   uses to index this flow. */
+	union {
+		struct sw_flow_key unmasked_key;
+		struct sw_flow_id ufid;
+	} index;
 	struct sw_flow_mask *mask;
 	struct sw_flow_actions __rcu *sf_acts;
 	struct flow_stats __rcu *stats[]; /* One for each NUMA node.  First one
diff --git a/datapath/flow_netlink.c b/datapath/flow_netlink.c
index c1c29f5..7927462 100644
--- a/datapath/flow_netlink.c
+++ b/datapath/flow_netlink.c
@@ -1094,6 +1094,41 @@ free_newmask:
 	return err;
 }
 
+int ovs_nla_get_ufid(const struct nlattr *attr, struct sw_flow_id *sfid,
+		     u32 *flags)
+{
+	static const struct nla_policy ovs_ufid_policy[OVS_UFID_ATTR_MAX + 1] = {
+		[OVS_UFID_ATTR_FLAGS] = { .type = NLA_U32 },
+		[OVS_UFID_ATTR_ID] = { .len = sizeof(u32) },
+	};
+	const struct nlattr *a[OVS_UFID_ATTR_MAX + 1];
+	int err;
+
+	if (sfid) {
+		sfid->ufid = NULL;
+		sfid->ufid_len = 0;
+	}
+	if (flags)
+		*flags = 0;
+	if (!attr)
+		return 0;
+
+	err = nla_parse_nested((struct nlattr **)a, OVS_UFID_ATTR_MAX, attr,
+			       ovs_ufid_policy);
+	if (err)
+		return err;
+
+	if (sfid && a[OVS_UFID_ATTR_ID]) {
+		sfid->ufid = nla_data(a[OVS_UFID_ATTR_ID]);
+		sfid->ufid_len = nla_len(a[OVS_UFID_ATTR_ID]);
+	}
+
+	if (flags && a[OVS_UFID_ATTR_FLAGS])
+		*flags = nla_get_u32(a[OVS_UFID_ATTR_FLAGS]);
+
+	return 0;
+}
+
 /**
  * ovs_nla_get_flow_metadata - parses Netlink attributes into a flow key.
  * @key: Receives extracted in_port, priority, tun_key and skb_mark.
diff --git a/datapath/flow_netlink.h b/datapath/flow_netlink.h
index fde7616..9a14ad9 100644
--- a/datapath/flow_netlink.h
+++ b/datapath/flow_netlink.h
@@ -52,6 +52,7 @@ int ovs_nla_get_match(struct sw_flow_match *, const struct nlattr *key,
 		      const struct nlattr *mask, bool log);
 int ovs_nla_put_egress_tunnel_key(struct sk_buff *,
 				  const struct ovs_tunnel_info *);
+int ovs_nla_get_ufid(const struct nlattr *, struct sw_flow_id *, u32 *flags);
 
 int ovs_nla_copy_actions(const struct nlattr *attr,
 			 const struct sw_flow_key *key,
diff --git a/datapath/flow_table.c b/datapath/flow_table.c
index ad410fd..03e7040 100644
--- a/datapath/flow_table.c
+++ b/datapath/flow_table.c
@@ -92,6 +92,7 @@ struct sw_flow *ovs_flow_alloc(void)
 
 	flow->sf_acts = NULL;
 	flow->mask = NULL;
+	flow->index_by_ufid = false;
 	flow->stats_last_writer = NUMA_NO_NODE;
 
 	/* Initialize the default stat node. */
@@ -146,6 +147,8 @@ static void flow_free(struct sw_flow *flow)
 {
 	int node;
 
+	if (flow->index_by_ufid)
+		kfree(flow->index.ufid.ufid);
 	kfree(rcu_dereference_raw(flow->sf_acts));
 	for_each_node(node)
 		if (flow->stats[node])
@@ -265,7 +268,7 @@ static int tbl_mask_array_realloc(struct flow_table *tbl, int size)
 
 int ovs_flow_tbl_init(struct flow_table *table)
 {
-	struct table_instance *ti;
+	struct table_instance *ti, *ufid_ti;
 	struct mask_array *ma;
 
 	table->mask_cache = __alloc_percpu(sizeof(struct mask_cache_entry) *
@@ -277,16 +280,24 @@ int ovs_flow_tbl_init(struct flow_table *table)
 	if (!ma)
 		goto free_mask_cache;
 
+	ufid_ti = table_instance_alloc(TBL_MIN_BUCKETS);
+	if (!ufid_ti)
+		goto free_mask_array;
+
 	ti = table_instance_alloc(TBL_MIN_BUCKETS);
 	if (!ti)
-		goto free_mask_array;
+		goto free_ti;
 
 	rcu_assign_pointer(table->ti, ti);
+	rcu_assign_pointer(table->ufid_ti, ufid_ti);
 	rcu_assign_pointer(table->mask_array, ma);
-	table->last_rehash = jiffies;
 	table->count = 0;
+	table->ufid_count = 0;
+	table->last_rehash = jiffies;
 	return 0;
 
+free_ti:
+	__table_instance_destroy(ufid_ti);
 free_mask_array:
 	kfree(ma);
 free_mask_cache:
@@ -301,13 +312,16 @@ static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu)
 	__table_instance_destroy(ti);
 }
 
-static void table_instance_destroy(struct table_instance *ti, bool deferred)
+static void table_instance_destroy(struct table_instance *ti,
+				   struct table_instance *ufid_ti,
+				   bool deferred)
 {
 	int i;
 
 	if (!ti)
 		return;
 
+	BUG_ON(!ufid_ti);
 	if (ti->keep_flows)
 		goto skip_flows;
 
@@ -316,18 +330,24 @@ static void table_instance_destroy(struct table_instance *ti, bool deferred)
 		struct hlist_head *head = flex_array_get(ti->buckets, i);
 		struct hlist_node *n;
 		int ver = ti->node_ver;
+		int ufid_ver = ufid_ti->node_ver;
 
-		hlist_for_each_entry_safe(flow, n, head, hash_node[ver]) {
-			hlist_del_rcu(&flow->hash_node[ver]);
+		hlist_for_each_entry_safe(flow, n, head, flow_hash.node[ver]) {
+			hlist_del_rcu(&flow->flow_hash.node[ver]);
+			if (flow->index_by_ufid)
+				hlist_del_rcu(&flow->index.ufid.node[ufid_ver]);
 			ovs_flow_free(flow, deferred);
 		}
 	}
 
 skip_flows:
-	if (deferred)
+	if (deferred) {
 		call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
-	else
+		call_rcu(&ufid_ti->rcu, flow_tbl_destroy_rcu_cb);
+	} else {
 		__table_instance_destroy(ti);
+		__table_instance_destroy(ufid_ti);
+	}
 }
 
 /* No need for locking this function is called from RCU callback or
@@ -336,10 +356,11 @@ skip_flows:
 void ovs_flow_tbl_destroy(struct flow_table *table)
 {
 	struct table_instance *ti = rcu_dereference_raw(table->ti);
+	struct table_instance *ufid_ti = rcu_dereference_raw(table->ufid_ti);
 
 	free_percpu(table->mask_cache);
-	kfree(rcu_dereference_raw(table->mask_array));
-	table_instance_destroy(ti, false);
+	kfree((struct mask_array __force *)table->mask_array);
+	table_instance_destroy(ti, ufid_ti, false);
 }
 
 struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
@@ -354,7 +375,7 @@ struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
 	while (*bucket < ti->n_buckets) {
 		i = 0;
 		head = flex_array_get(ti->buckets, *bucket);
-		hlist_for_each_entry_rcu(flow, head, hash_node[ver]) {
+		hlist_for_each_entry_rcu(flow, head, flow_hash.node[ver]) {
 			if (i < *last) {
 				i++;
 				continue;
@@ -380,12 +401,21 @@ static void table_instance_insert(struct table_instance *ti, struct sw_flow *flo
 {
 	struct hlist_head *head;
 
-	head = find_bucket(ti, flow->hash);
-	hlist_add_head_rcu(&flow->hash_node[ti->node_ver], head);
+	head = find_bucket(ti, flow->flow_hash.hash);
+	hlist_add_head_rcu(&flow->flow_hash.node[ti->node_ver], head);
+}
+
+static void ufid_table_instance_insert(struct table_instance *ti,
+				       struct sw_flow *flow)
+{
+	struct hlist_head *head;
+
+	head = find_bucket(ti, flow->index.ufid.hash);
+	hlist_add_head_rcu(&flow->index.ufid.node[ti->node_ver], head);
 }
 
 static void flow_table_copy_flows(struct table_instance *old,
-				  struct table_instance *new)
+				  struct table_instance *new, bool ufid)
 {
 	int old_ver;
 	int i;
@@ -400,42 +430,69 @@ static void flow_table_copy_flows(struct table_instance *old,
 
 		head = flex_array_get(old->buckets, i);
 
-		hlist_for_each_entry(flow, head, hash_node[old_ver])
-			table_instance_insert(new, flow);
+		if (ufid)
+			hlist_for_each_entry(flow, head,
+					     index.ufid.node[old_ver])
+				ufid_table_instance_insert(new, flow);
+		else
+			hlist_for_each_entry(flow, head, flow_hash.node[old_ver])
+				table_instance_insert(new, flow);
 	}
 
 	old->keep_flows = true;
 }
 
-static struct table_instance *table_instance_rehash(struct table_instance *ti,
-					    int n_buckets)
+static int flow_table_instance_alloc(struct table_instance **ti,
+				     int n_buckets)
 {
 	struct table_instance *new_ti;
 
 	new_ti = table_instance_alloc(n_buckets);
 	if (!new_ti)
-		return NULL;
+		return -ENOMEM;
 
-	flow_table_copy_flows(ti, new_ti);
+	*ti = new_ti;
+	return 0;
+}
+
+static struct table_instance *flow_table_rehash(struct table_instance *old_ti,
+						int n_buckets, bool ufid)
+{
+	struct table_instance *new_ti;
+	int err;
+
+	err = flow_table_instance_alloc(&new_ti, n_buckets);
+	if (err)
+		return NULL;
+	flow_table_copy_flows(old_ti, new_ti, ufid);
 
 	return new_ti;
 }
 
 int ovs_flow_tbl_flush(struct flow_table *flow_table)
 {
-	struct table_instance *old_ti;
-	struct table_instance *new_ti;
+	struct table_instance *old_ti, *new_ti, *old_ufid_ti;
+	struct table_instance *new_ufid_ti = NULL;
+	int err;
 
 	old_ti = ovsl_dereference(flow_table->ti);
-	new_ti = table_instance_alloc(TBL_MIN_BUCKETS);
-	if (!new_ti)
-		return -ENOMEM;
+	old_ufid_ti = ovsl_dereference(flow_table->ufid_ti);
+	err = flow_table_instance_alloc(&new_ti, TBL_MIN_BUCKETS);
+	if (err)
+		return err;
+	err = flow_table_instance_alloc(&new_ufid_ti, TBL_MIN_BUCKETS);
+	if (err) {
+		__table_instance_destroy(new_ti);
+		return err;
+	}
 
 	rcu_assign_pointer(flow_table->ti, new_ti);
+	rcu_assign_pointer(flow_table->ufid_ti, new_ufid_ti);
 	flow_table->last_rehash = jiffies;
 	flow_table->count = 0;
+	flow_table->ufid_count = 0;
 
-	table_instance_destroy(old_ti, true);
+	table_instance_destroy(old_ti, old_ufid_ti, true);
 	return 0;
 }
 
@@ -489,7 +546,8 @@ bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 	int key_start = flow_key_start(key);
 	int key_end = match->range.end;
 
-	return cmp_key(&flow->unmasked_key, key, key_start, key_end);
+	BUG_ON(flow->index_by_ufid);
+	return cmp_key(&flow->index.unmasked_key, key, key_start, key_end);
 }
 
 static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
@@ -508,10 +566,9 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
 	(*n_mask_hit)++;
-	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
-		if (flow->mask == mask && flow->hash == hash &&
-		    flow_cmp_masked_key(flow, &masked_key,
-					  key_start, key_end))
+	hlist_for_each_entry_rcu(flow, head, flow_hash.node[ti->node_ver]) {
+		if (flow->mask == mask && flow->flow_hash.hash == hash &&
+		    flow_cmp_masked_key(flow, &masked_key, key_start, key_end))
 			return flow;
 	}
 	return NULL;
@@ -644,7 +701,40 @@ struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
 		if (!mask)
 			continue;
 		flow = masked_flow_lookup(ti, match->key, mask, &n_mask_hit);
-		if (flow && ovs_flow_cmp_unmasked_key(flow, match))
+		if (flow && !flow->index_by_ufid &&
+		    ovs_flow_cmp_unmasked_key(flow, match))
+			return flow;
+	}
+	return NULL;
+}
+
+static u32 ufid_hash(const struct sw_flow_id *sfid)
+{
+	return arch_fast_hash(sfid->ufid, sfid->ufid_len, 0);
+}
+
+bool ovs_flow_cmp_ufid(const struct sw_flow *flow,
+		       const struct sw_flow_id *sfid)
+{
+	if (flow->index.ufid.ufid_len != sfid->ufid_len)
+		return false;
+
+	return !memcmp(flow->index.ufid.ufid, sfid->ufid, sfid->ufid_len);
+}
+
+struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
+					 const struct sw_flow_id *ufid)
+{
+	struct table_instance *ti = rcu_dereference_ovsl(tbl->ufid_ti);
+	struct sw_flow *flow;
+	struct hlist_head *head;
+	u32 hash;
+
+	hash = ufid_hash(ufid);
+	head = find_bucket(ti, hash);
+	hlist_for_each_entry_rcu(flow, head, index.ufid.node[ti->node_ver]) {
+		if (flow->index.ufid.hash == hash &&
+		    ovs_flow_cmp_ufid(flow, ufid))
 			return flow;
 	}
 	return NULL;
@@ -658,9 +748,10 @@ int ovs_flow_tbl_num_masks(const struct flow_table *table)
 	return ma->count;
 }
 
-static struct table_instance *table_instance_expand(struct table_instance *ti)
+static struct table_instance *flow_table_expand(struct table_instance *old_ti,
+						bool ufid)
 {
-	return table_instance_rehash(ti, ti->n_buckets * 2);
+	return flow_table_rehash(old_ti, old_ti->n_buckets * 2, ufid);
 }
 
 static void tbl_mask_array_delete_mask(struct mask_array *ma,
@@ -710,10 +801,15 @@ static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
 void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
 {
 	struct table_instance *ti = ovsl_dereference(table->ti);
+	struct table_instance *ufid_ti = ovsl_dereference(table->ufid_ti);
 
 	BUG_ON(table->count == 0);
-	hlist_del_rcu(&flow->hash_node[ti->node_ver]);
+	hlist_del_rcu(&flow->flow_hash.node[ti->node_ver]);
 	table->count--;
+	if (flow->index_by_ufid) {
+		hlist_del_rcu(&flow->index.ufid.node[ufid_ti->node_ver]);
+		table->ufid_count--;
+	}
 
 	/* RCU delete the mask. 'flow->mask' is not NULLed, as it should be
 	 * accessible as long as the RCU read lock is held.
@@ -818,31 +914,65 @@ static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
 int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
 			const struct sw_flow_mask *mask)
 {
-	struct table_instance *new_ti = NULL;
-	struct table_instance *ti;
+	struct table_instance *new_ti = NULL, *new_ufid_ti = NULL;
+	struct table_instance *ti, *ufid_ti = NULL;
 	int err;
 
 	err = flow_mask_insert(table, flow, mask);
 	if (err)
 		return err;
 
-	flow->hash = flow_hash(&flow->key, flow->mask->range.start,
-			flow->mask->range.end);
+	flow->flow_hash.hash = flow_hash(&flow->key, flow->mask->range.start,
+					 flow->mask->range.end);
 	ti = ovsl_dereference(table->ti);
 	table_instance_insert(ti, flow);
 	table->count++;
+	if (flow->index_by_ufid) {
+		flow->index.ufid.hash = ufid_hash(&flow->index.ufid);
+		ufid_ti = ovsl_dereference(table->ufid_ti);
+		ufid_table_instance_insert(ufid_ti, flow);
+		table->ufid_count++;
+	}
 
 	/* Expand table, if necessary, to make room. */
 	if (table->count > ti->n_buckets)
-		new_ti = table_instance_expand(ti);
+		new_ti = flow_table_expand(ti, false);
 	else if (time_after(jiffies, table->last_rehash + REHASH_INTERVAL))
-		new_ti = table_instance_rehash(ti, ti->n_buckets);
+		new_ti = flow_table_rehash(ti, ti->n_buckets, false);
+	if (ufid_ti && table->ufid_count > ufid_ti->n_buckets)
+		new_ufid_ti = flow_table_expand(ufid_ti, true);
 
 	if (new_ti) {
 		rcu_assign_pointer(table->ti, new_ti);
-		table_instance_destroy(ti, true);
+		call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
 		table->last_rehash = jiffies;
 	}
+	if (new_ufid_ti) {
+		rcu_assign_pointer(table->ufid_ti, new_ufid_ti);
+		call_rcu(&ufid_ti->rcu, flow_tbl_destroy_rcu_cb);
+	}
+	return 0;
+}
+
+/* Initializes 'flow->ufid'. */
+int ovs_flow_ufid(struct sw_flow *flow, const struct sw_flow_id *src)
+{
+	if (src->ufid_len) {
+		struct sw_flow_id *dst = &flow->index.ufid;
+
+		/* Use userspace-specified flow-id. */
+		dst->ufid = kmalloc(src->ufid_len, GFP_KERNEL);
+		if (!dst->ufid)
+			return -ENOMEM;
+
+		memcpy(dst->ufid, src->ufid, src->ufid_len);
+		dst->ufid_len = src->ufid_len;
+		flow->index_by_ufid = true;
+	} else {
+		/* Don't index by UFID. */
+		flow->index_by_ufid = false;
+	}
+
 	return 0;
 }
 
diff --git a/datapath/flow_table.h b/datapath/flow_table.h
index 80c01a3..e05b59c 100644
--- a/datapath/flow_table.h
+++ b/datapath/flow_table.h
@@ -60,8 +60,10 @@ struct flow_table {
 	struct table_instance __rcu *ti;
 	struct mask_cache_entry __percpu *mask_cache;
 	struct mask_array __rcu *mask_array;
+	struct table_instance __rcu *ufid_ti;
 	unsigned long last_rehash;
 	unsigned int count;
+	unsigned int ufid_count;
 };
 
 extern struct kmem_cache *flow_stats_cache;
@@ -91,9 +93,15 @@ struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *,
 				    const struct sw_flow_key *);
 struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
 					  const struct sw_flow_match *match);
+struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *,
+					 const struct sw_flow_id *);
+
 bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 			       const struct sw_flow_match *match);
+bool ovs_flow_cmp_ufid(const struct sw_flow *flow,
+		       const struct sw_flow_id *sfid);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		       const struct sw_flow_mask *mask);
+int ovs_flow_ufid(struct sw_flow *flow, const struct sw_flow_id *src);
 #endif /* flow_table.h */
diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index c8fa66e..7d759a4 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -471,6 +471,9 @@ struct ovs_key_nd {
  * a wildcarded match. Omitting attribute is treated as wildcarding all
  * corresponding fields. Optional for all requests. If not present,
  * all flow key bits are exact match bits.
+ * @OVS_FLOW_ATTR_UFID: Nested %OVS_UFID_ATTR_* attributes specifying unique
+ * identifiers for flows and providing alternative semantics for flow
+ * installation and retrieval.
  *
  * These attributes follow the &struct ovs_header within the Generic Netlink
  * payload for %OVS_FLOW_* commands.
@@ -486,12 +489,39 @@ enum ovs_flow_attr {
 	OVS_FLOW_ATTR_MASK,      /* Sequence of OVS_KEY_ATTR_* attributes. */
 	OVS_FLOW_ATTR_PROBE,     /* Flow operation is a feature probe, error
 				  * logging should be suppressed. */
+	OVS_FLOW_ATTR_UFID,      /* Unique flow identifier. */
 	__OVS_FLOW_ATTR_MAX
 };
 
 #define OVS_FLOW_ATTR_MAX (__OVS_FLOW_ATTR_MAX - 1)
 
 /**
+ * enum ovs_ufid_attr - Unique identifier types.
+ *
+ * @OVS_UFID_ATTR_FLAGS: A 32-bit value specifying changes to the behaviour of
+ * the current %OVS_FLOW_CMD_* request. Optional for all requests.
+ * @OVS_UFID_ATTR_ID: A unique identifier for a flow.
+ */
+enum ovs_ufid_attr {
+	OVS_UFID_ATTR_UNSPEC,
+	OVS_UFID_ATTR_FLAGS,     /* u32 of OVS_UFID_F_* */
+	OVS_UFID_ATTR_ID,        /* variable length identifier. */
+	__OVS_UFID_ATTR_MAX
+};
+
+#define OVS_UFID_ATTR_MAX (__OVS_UFID_ATTR_MAX - 1)
+
+/**
+ * Omit attributes for notifications.
+ *
+ * If a datapath request contains an OVS_UFID_F_OMIT_* flag, then the datapath
+ * may omit the corresponding 'ovs_flow_attr' from the response.
+ */
+#define OVS_UFID_F_OMIT_KEY      (1 << 0)
+#define OVS_UFID_F_OMIT_MASK     (1 << 1)
+#define OVS_UFID_F_OMIT_ACTIONS  (1 << 2)
+
+/**
  * enum ovs_sample_attr - Attributes for %OVS_ACTION_ATTR_SAMPLE action.
  * @OVS_SAMPLE_ATTR_PROBABILITY: 32-bit fraction of packets to sample with
  * @OVS_ACTION_ATTR_SAMPLE.  A value of 0 samples no packets, a value of
-- 
1.7.10.4

^ permalink raw reply related

* [PATCHv10 ovs 03/15] datapath: Add 'is_mask' to ovs_nla_put_flow().
From: Joe Stringer @ 2014-11-13 19:17 UTC (permalink / raw)
  To: dev; +Cc: netdev
In-Reply-To: <1415906275-3172-1-git-send-email-joestringer@nicira.com>

This function previously hid the 'is_mask' parameter from the callers,
which actually have better knowledge about whether it is serializing a
mask or not. Expose this parameter to the callers. This allows the same
function to be called to serialize masked keys as well as masked keys.

To serialize an unmasked key:
ovs_nla_put_flow(key, key, skb, false);

To serialize a masked key:
ovs_nla_put_flow(mask, key, skb, false);

To serialize a mask:
ovs_nla_put_flow(key, mask, skb, true);

Signed-off-by: Joe Stringer <joestringer@nicira.com>
CC: netdev@vger.kernel.org
---
v10: First post.
---
 datapath/datapath.c     |    7 ++++---
 datapath/flow_netlink.c |    4 ++--
 datapath/flow_netlink.h |    4 ++--
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/datapath/datapath.c b/datapath/datapath.c
index 3607170..a898584 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -468,7 +468,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 	upcall->dp_ifindex = dp_ifindex;
 
 	nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY);
-	err = ovs_nla_put_flow(key, key, user_skb);
+	err = ovs_nla_put_flow(key, key, user_skb, false);
 	BUG_ON(err);
 	nla_nest_end(user_skb, nla);
 
@@ -694,7 +694,8 @@ static int ovs_flow_cmd_fill_match(const struct sw_flow *flow,
 	if (!nla)
 		return -EMSGSIZE;
 
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
+	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb,
+			       false);
 	if (err)
 		return err;
 
@@ -705,7 +706,7 @@ static int ovs_flow_cmd_fill_match(const struct sw_flow *flow,
 	if (!nla)
 		return -EMSGSIZE;
 
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
+	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb, true);
 	if (err)
 		return err;
 
diff --git a/datapath/flow_netlink.c b/datapath/flow_netlink.c
index 503cf63..c1c29f5 100644
--- a/datapath/flow_netlink.c
+++ b/datapath/flow_netlink.c
@@ -1132,11 +1132,11 @@ int ovs_nla_get_flow_metadata(const struct nlattr *attr,
 }
 
 int ovs_nla_put_flow(const struct sw_flow_key *swkey,
-		     const struct sw_flow_key *output, struct sk_buff *skb)
+		     const struct sw_flow_key *output,
+		     struct sk_buff *skb, bool is_mask)
 {
 	struct ovs_key_ethernet *eth_key;
 	struct nlattr *nla, *encap;
-	bool is_mask = (swkey != output);
 
 	if (nla_put_u32(skb, OVS_KEY_ATTR_RECIRC_ID, output->recirc_id))
 		goto nla_put_failure;
diff --git a/datapath/flow_netlink.h b/datapath/flow_netlink.h
index 577f12b..fde7616 100644
--- a/datapath/flow_netlink.h
+++ b/datapath/flow_netlink.h
@@ -43,8 +43,8 @@ size_t ovs_key_attr_size(void);
 void ovs_match_init(struct sw_flow_match *match,
 		    struct sw_flow_key *key, struct sw_flow_mask *mask);
 
-int ovs_nla_put_flow(const struct sw_flow_key *,
-		     const struct sw_flow_key *, struct sk_buff *);
+int ovs_nla_put_flow(const struct sw_flow_key *, const struct sw_flow_key *,
+		     struct sk_buff *, bool is_mask);
 int ovs_nla_get_flow_metadata(const struct nlattr *, struct sw_flow_key *,
 			      bool log);
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCHv10 ovs 00/15] Revalidate flows with unique identifiers.
From: Joe Stringer @ 2014-11-13 19:17 UTC (permalink / raw)
  To: dev; +Cc: netdev

This series modifies the dpif interface for flow commands to use 128-bit unique
identifiers as an alternative to the netlink-formatted flow key, and caches the
mask/actions in the udpif_key. This significantly reduces the cost of
assembling messages between revalidators and the datapath, improving
revalidation performance by 40% or more. In a test environment of many
short-lived flows constantly being set up in the datapath, this increases the
number of flows that can be maintained in the linux datapath from around
130-140K up to 190-200K. For the userspace datapath, this decreases the time
spent revalidating 160K flows from 250ms to 150ms.

The core of the changes sits in the handler and revalidator code. Handlers take
responsibility for creating udpif_key cache entries which now include a copy of
the flow mask and actions. Revalidators request datapaths to dump flows using
only the unique identifier and stats, rather than the full set of
netlink-formatted flow key, mask and actions.

In cases where full revalidation is required, revalidators will use the
udpif_key cache of the key/mask/acts to validate the flow. The dpif will
detect datapath support for the unique identifer "UFID" feature, and omit flow
keys from netlink transactions if it is supported. For backwards compatibility,
flow keys will always be serialised if UFID support is not detected in the
datapath.

Patches 1,2,3,15 are unreviewed. Patch 12 needs further review.

This series is also made available here to assist review:
https://github.com/joestringer/openvswitch/tree/submit/ufid_v10

CC: netdev@vger.kernel.org

v10:
- New patch allowing datapath to serialize masked keys
- New patch providing commandline parsing of UFIDs
- New patch to fix IP fragment testsuite failure
- Simplify datapath interface by accepting UFID or flow_key, but not both
- Flows set up with UFID must be queried/deleted using UFID
- Reduce sw_flow memory usage for UFID
- Don't periodically rehash UFID table in linux datapath
- Remove kernel_only UFID in linux datapath
- Track whether UFIDs are present in datapath in udpif_key

v9:
- New patch to enable verbose flow-dumping in ovs-bugtool
- Don't print UFIDs by default in ovs-dpctl, ovs-appctl dump-flows output
- Userspace datapath performance and correctness improvements

v8:
- Rename UID -> UFID
- Clarify dpif interface descriptions
- Remove 'struct odputil_uidbuf'
- Simplify dpif-netlink UFID marshalling
- 32-bit build fix
- Fix null dereference in datapath when paired with older userspace
- Don't generate UFIDs for feature probes or ovs-dpctl usage
- Rebase
- All patches are reviewed/acked except datapath changes.

v7:
- Remove OVS_DP_F_INDEX_BY_UID
- Rework datapath UID serialization for variable length UIDs
- Create ukeys from revalidator threads in corner cases
- Hide "terse" flags from flow_get,flow_del dpif interface
- Scattered replacements of memcpy with u128_equal()
- Rebase

v6:
- Address feedback from Ben
- Split out "dpif: Add Unique flow identifiers." into three patches
- Reduce netlink conversions for all datapaths
- Reduce udpif_key footprint
- Added x64 version of murmurhash3
- Added hash function tests
- Various bugfixes
- Rebase

v5:
- Rebase
- Various bugfixes
- Improve logging

v4:
- Datapath memory leak fixes
- Enable UID-based terse dumping and deleting by default
- Shifted UID generation down to dpif
- Log flow UIDs in more places
- Various fixes

RFCv3:
- Add datapath implementation
- Minor fixes
- Rebased

RFCv2:
- Revised early patches from v1 feedback
- Add Acks from Ben
- Rebased

Joe Stringer (15):
  tests: Add command to purge revalidators of flows.
  ovs-bugtool: Log more detail for dumped flows.
  datapath: Add 'is_mask' to ovs_nla_put_flow().
  revalidator: Use 'cmap' for storing ukeys.
  revalidator: Protect ukeys with a mutex.
  udpif: Separate udpif_key maps from revalidators.
  upcall: Rename dump_op -> ukey_op.
  upcall: Create ukeys in handler threads.
  upcall: Revalidate using cache of mask, actions.
  hash: Add 128-bit murmurhash.
  dpif: Generate flow_hash for revalidators in dpif.
  datapath: Add support for unique flow identifiers.
  dpif: Index flows using unique identifiers.
  dpif: Minimize memory copy for revalidation.
  dpctl: Add support for using UFID to add/del flows.

 datapath/README.md                                |   13 +
 datapath/datapath.c                               |  249 ++++--
 datapath/flow.h                                   |   20 +-
 datapath/flow_netlink.c                           |   39 +-
 datapath/flow_netlink.h                           |    5 +-
 datapath/flow_table.c                             |  214 ++++-
 datapath/flow_table.h                             |    8 +
 datapath/linux/compat/include/linux/openvswitch.h |   30 +
 include/openvswitch/types.h                       |   14 +
 lib/dpctl.c                                       |   47 +-
 lib/dpif-netdev.c                                 |  180 +++--
 lib/dpif-netlink.c                                |  258 +++++-
 lib/dpif-provider.h                               |   13 +-
 lib/dpif.c                                        |   65 +-
 lib/dpif.h                                        |   44 +-
 lib/hash.c                                        |  266 ++++++-
 lib/hash.h                                        |   11 +-
 lib/odp-util.c                                    |   94 +++
 lib/odp-util.h                                    |    6 +
 ofproto/ofproto-dpif-upcall.c                     |  868 +++++++++++++++------
 ofproto/ofproto-dpif.c                            |   14 +-
 tests/dpif-netdev.at                              |    5 +
 tests/ofproto-dpif.at                             |   36 +-
 tests/ofproto-macros.at                           |    1 +
 tests/test-hash.c                                 |   83 ++
 utilities/bugtool/ovs-bugtool-ovs-appctl-dpif     |    4 +-
 utilities/bugtool/ovs-bugtool.in                  |    2 +-
 27 files changed, 2058 insertions(+), 531 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: [PATCH net-next] cxgb4i/cxgb4 : Refactor macros to conform to uniform standards
From: David Miller @ 2014-11-13 19:36 UTC (permalink / raw)
  To: anish
  Cc: netdev, linux-scsi, linux-rdma, hch, jbottomley, roland,
	hariprasad, kxie, swise
In-Reply-To: <1415841357-19700-1-git-send-email-anish@chelsio.com>

From: Anish Bhatt <anish@chelsio.com>
Date: Wed, 12 Nov 2014 17:15:57 -0800

> Refactored all macros used in cxgb4i as part of previously started cxgb4 macro
> names cleanup. Makes them more uniform and avoids namespace collision.
> Minor changes in other drivers where required as some of these macros are used
>  by multiple drivers, affected drivers are iw_cxgb4, cxgb4(vf) & csiostor
> 
> Signed-off-by: Anish Bhatt <anish@chelsio.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] tun: fix issues of iovec iterators using in tun_put_user()
From: David Miller @ 2014-11-13 19:35 UTC (permalink / raw)
  To: herbert; +Cc: jasowang, netdev, linux-kernel, mst
In-Reply-To: <20141113085838.GA3070@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 13 Nov 2014 16:58:38 +0800

> On Thu, Nov 13, 2014 at 04:54:14PM +0800, Jason Wang wrote:
>> This patch fixes two issues after using iovec iterators:
>> - vlan_offset should be initialized to zero, otherwise unexpected offset
>>   will be used in skb_copy_datagram_iter()
>> - advance iovec iterator when vnet_hdr_sz is greater than sizeof(gso), this
>>   is the case when mergeable rx buffer were enabled for a virt guest.
>> 
>> Fixes e0b46d0ee9c240c7430a47e9b0365674d4a04522 ("tun: Use iovec iterators")
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> Good catch.
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks everyone.

^ permalink raw reply

* Re: arm64 allmodconfig failures in nft_reject_bridge.c
From: David Miller @ 2014-11-13 19:35 UTC (permalink / raw)
  To: pablo
  Cc: linaro-kernel, coreteam, kernel-build-reports, netdev, bridge,
	stephen, broonie, netfilter-devel, kadlec, kaber, linux
In-Reply-To: <20141113114357.GA7857@salvia>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu, 13 Nov 2014 12:43:57 +0100

> On Thu, Nov 13, 2014 at 11:33:20AM +0000, Mark Brown wrote:
>> Since about -rc3 we've been seeing build failures in Linus' tree on
>> arm64 allmodconfig due to:
>> 
>> > 	arm64-allmodconfig
>> > ../net/bridge/netfilter/nft_reject_bridge.c:240:3: error: implicit declaration of function 'csum_ipv6_magic' [-Werror=implicit-function-declaration]
>> 
>> By the time I looked into this it was fixed in -next by c1207c049b204b0
>> (netfilter: nft_reject_bridge: Fix powerpc build error) from Guenter but
>> that doesn't seem to have made it into -rc4 so I just wanted to check
>> that this fix was intended to go to Linus before v3.18?
> 
> I can see this in David's tree:
> 
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=c1207c049b204b0a96535dc5416aee331b51e0e1
> 
> I think this was reported by when -rc4 was already out, so you'll see
> this by -rc5.

I hold changes in my tree for a week or more, because I want them to
"cook" there before they go to Linus.

So if it takes a week or two for a bug fix like this to propagate into
Linus's tree, that's just what sometimes happens.

In the mean time you can apply the fix locally if you absolutely have
to have it right at this moment, that is the freedom that everyone
has.

FWIW, I plan to push my tree to Linus some time today, so this will be
resolved in the next day or so.

Thanks.

^ permalink raw reply

* Re: [PATCH] FOU: Fix no return statement warning for !CONFIG_NET_FOU_IP_TUNNELS
From: David Miller @ 2014-11-13 19:32 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, therbert
In-Reply-To: <8cf12c98c1a9606f2bd69dd9c108d76f444ab79e.1415879256.git.tgraf@suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 13 Nov 2014 12:48:21 +0100

> net/ipv4/fou.c: In function ‘ip_tunnel_encap_del_fou_ops’:
> net/ipv4/fou.c:861:1: warning: no return statement in function returning non-void [-Wreturn-type]
> 
> Fixes: a8c5f90fb5 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied, thanks Thomas.

^ permalink raw reply

* [PATCH 2/3] r8169: Use load_acquire() and store_release() to reduce memory barrier overhead
From: Alexander Duyck @ 2014-11-13 19:27 UTC (permalink / raw)
  To: linux-arch, netdev, linux-kernel
  Cc: mikey, tony.luck, mathieu.desnoyers, donald.c.skidmore, peterz,
	benh, heiko.carstens, oleg, will.deacon, davem, michael,
	matthew.vick, nic_swsd, geert, jeffrey.t.kirsher, fweisbec,
	schwidefsky, linux, paulmck, torvalds, mingo
In-Reply-To: <20141113191250.12579.19694.stgit@ahduyck-server>

The r8169 use a pair of wmb() calls when setting up the descriptor rings.
The first is to synchronize the descriptor data with the descriptor status,
and the second is to synchronize the descriptor status with the use of the
MMIO doorbell to notify the device that descriptors are ready.  This can come
at a heavy price on some systems, and is not really necessary on systems such
as x86 as a simple barrier() would suffice to order store/store accesses.

In addition the r8169 uses a rmb() however I believe it is placed incorrectly
as I assume it supposed to be ordering descriptor reads after the check for
ownership.  As such I believe this is actually an acquire access pattern so
I have updated the code and removed the barrier using a load_acquire() call.

Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 drivers/net/ethernet/realtek/r8169.c |   23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index cf154f7..42b4645 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6601,14 +6601,13 @@ static inline void rtl8169_mark_to_asic(struct RxDesc *desc, u32 rx_buf_sz)
 {
 	u32 eor = le32_to_cpu(desc->opts1) & RingEnd;
 
-	desc->opts1 = cpu_to_le32(DescOwn | eor | rx_buf_sz);
+	store_release(&desc->opts1, cpu_to_le32(DescOwn | eor | rx_buf_sz));
 }
 
 static inline void rtl8169_map_to_asic(struct RxDesc *desc, dma_addr_t mapping,
 				       u32 rx_buf_sz)
 {
 	desc->addr = cpu_to_le64(mapping);
-	wmb();
 	rtl8169_mark_to_asic(desc, rx_buf_sz);
 }
 
@@ -7077,11 +7076,9 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	skb_tx_timestamp(skb);
 
-	wmb();
-
 	/* Anti gcc 2.95.3 bugware (sic) */
 	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
-	txd->opts1 = cpu_to_le32(status);
+	store_release(&txd->opts1, cpu_to_le32(status));
 
 	tp->cur_tx += frags + 1;
 
@@ -7183,16 +7180,15 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 	while (tx_left > 0) {
 		unsigned int entry = dirty_tx % NUM_TX_DESC;
 		struct ring_info *tx_skb = tp->tx_skb + entry;
-		u32 status;
+		__le32 status;
 
-		rmb();
-		status = le32_to_cpu(tp->TxDescArray[entry].opts1);
-		if (status & DescOwn)
+		status = load_acquire(&tp->TxDescArray[entry].opts1);
+		if (status & cpu_to_le32(DescOwn))
 			break;
 
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
-		if (status & LastFrag) {
+		if (status & cpu_to_le32(LastFrag)) {
 			pkts_compl++;
 			bytes_compl += tx_skb->skb->len;
 			dev_kfree_skb_any(tx_skb->skb);
@@ -7284,11 +7280,11 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 		struct RxDesc *desc = tp->RxDescArray + entry;
 		u32 status;
 
-		rmb();
-		status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
-
+		status = cpu_to_le32(load_acquire(&desc->opts1));
 		if (status & DescOwn)
 			break;
+
+		status &= tp->opts1_mask;
 		if (unlikely(status & RxRES)) {
 			netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
 				   status);
@@ -7350,7 +7346,6 @@ process_pkt:
 		}
 release_descriptor:
 		desc->opts2 = 0;
-		wmb();
 		rtl8169_mark_to_asic(desc, rx_buf_sz);
 	}
 

^ permalink raw reply related

* [PATCH 3/3] fm10k/igb/ixgbe: Use load_acquire on Rx descriptor
From: Alexander Duyck @ 2014-11-13 19:27 UTC (permalink / raw)
  To: linux-arch, netdev, linux-kernel
  Cc: mikey, tony.luck, mathieu.desnoyers, donald.c.skidmore, peterz,
	benh, heiko.carstens, oleg, will.deacon, davem, michael,
	matthew.vick, nic_swsd, geert, jeffrey.t.kirsher, fweisbec,
	schwidefsky, linux, paulmck, torvalds, mingo
In-Reply-To: <20141113191250.12579.19694.stgit@ahduyck-server>

This change makes it so that load_acquire is used when reading the Rx
descriptor.  The advantage of load_acquire is that it allows for a much
lower cost barrier on x86, ia64, powerpc, arm64, and s390 architectures
than a traditional memory barrier when dealing with reads that only have
to synchronize to system memory.

In addition I have updated the code so that it just checks to see if any
bits have been set instead of just the DD bit since the DD bit will always
be set as a part of a descriptor write-back so we just need to check for a
non-zero value being present at that memory location rather than just
checking for any specific bit.  This allows the code itself to appear much
cleaner and allows the compiler more room to optimize.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |    8 +++-----
 drivers/net/ethernet/intel/igb/igb_main.c     |    8 +++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   11 ++++-------
 3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e645af4..2d4923c 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -620,14 +620,12 @@ static bool fm10k_clean_rx_irq(struct fm10k_q_vector *q_vector,
 
 		rx_desc = FM10K_RX_DESC(rx_ring, rx_ring->next_to_clean);
 
-		if (!fm10k_test_staterr(rx_desc, FM10K_RXD_STATUS_DD))
-			break;
-
 		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
-		 * RXD_STATUS_DD bit is set
+		 * descriptor has been written back
 		 */
-		rmb();
+		if (!load_acquire(&rx_desc->d.staterr))
+			break;
 
 		/* retrieve a buffer from the ring */
 		skb = fm10k_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a2d72a8..a55cd54 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6918,14 +6918,12 @@ static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
 
 		rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);
 
-		if (!igb_test_staterr(rx_desc, E1000_RXD_STAT_DD))
-			break;
-
 		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
-		 * RXD_STAT_DD bit is set
+		 * descriptor has been written back
 		 */
-		rmb();
+		if (!load_acquire(&rx_desc->wb.upper.status_error))
+			break;
 
 		/* retrieve a buffer from the ring */
 		skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d2df4e3..007f174 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2003,15 +2003,12 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 
 		rx_desc = IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean);
 
-		if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
-			break;
-
-		/*
-		 * This memory barrier is needed to keep us from reading
+		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
-		 * RXD_STAT_DD bit is set
+		 * descriptor has been written back
 		 */
-		rmb();
+		if (!load_acquire(&rx_desc->wb.upper.status_error))
+			break;
 
 		/* retrieve a buffer from the ring */
 		skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);

^ permalink raw reply related

* [PATCH 1/3] arch: Introduce load_acquire() and store_release()
From: Alexander Duyck @ 2014-11-13 19:27 UTC (permalink / raw)
  To: linux-arch, netdev, linux-kernel
  Cc: mikey, tony.luck, mathieu.desnoyers, donald.c.skidmore, peterz,
	benh, heiko.carstens, oleg, will.deacon, davem, michael,
	matthew.vick, nic_swsd, geert, jeffrey.t.kirsher, fweisbec,
	schwidefsky, linux, paulmck, torvalds, mingo
In-Reply-To: <20141113191250.12579.19694.stgit@ahduyck-server>

It is common for device drivers to make use of acquire/release semantics
when dealing with descriptors stored in device memory.  On reviewing the
documentation and code for smp_load_acquire() and smp_store_release() as
well as reviewing an IBM website that goes over the use of PowerPC barriers
at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
occurred to me that the same code could likely be applied to device drivers.

As a result this patch introduces load_acquire() and store_release().  The
load_acquire() function can be used in the place of situations where a test
for ownership must be followed by a memory barrier.  The below example is
from ixgbe:

	if (!rx_desc->wb.upper.status_error)
		break;

	/* This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * descriptor has been written back
	 */
	rmb();

With load_acquire() this can be changed to:

	if (!load_acquire(&rx_desc->wb.upper.status_error))
		break;

A similar change can be made in the release path of many drivers.  For
example in the Realtek r8169 driver there are a number of flows that
consist of something like the following:

	wmb();

	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
	txd->opts1 = cpu_to_le32(status);

	tp->cur_tx += frags + 1;

	wmb();

With store_release() this can be changed to the following:

	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
	store_release(&txd->opts1, cpu_to_le32(status));

	tp->cur_tx += frags + 1;

	wmb();

The resulting assembler code generated as a result can be significantly
less expensive on architectures such as x86 and s390 that support strong
ordering.  On architectures that are able to use different primitives than
their rmb/wmb() such as powerpc, ia64, and arm64 we should see gains as we
are able to use less expensive barriers, and for other architectures we end
up using a mb() which may come at the same amount of overhead or more than
a rmb/wmb() as we must ensure Load/Store ordering.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 arch/arm/include/asm/barrier.h      |   15 +++++++++
 arch/arm64/include/asm/barrier.h    |   59 ++++++++++++++++++-----------------
 arch/ia64/include/asm/barrier.h     |    7 +++-
 arch/metag/include/asm/barrier.h    |   15 +++++++++
 arch/mips/include/asm/barrier.h     |   15 +++++++++
 arch/powerpc/include/asm/barrier.h  |   24 +++++++++++---
 arch/s390/include/asm/barrier.h     |    7 +++-
 arch/sparc/include/asm/barrier_64.h |    6 ++--
 arch/x86/include/asm/barrier.h      |   22 ++++++++++++-
 include/asm-generic/barrier.h       |   15 +++++++++
 10 files changed, 144 insertions(+), 41 deletions(-)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..bbdcd34 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
 #define smp_wmb()	dmb(ishst)
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..c91571c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,33 +32,7 @@
 #define rmb()		dsb(ld)
 #define wmb()		dsb(st)
 
-#ifndef CONFIG_SMP
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-
-#define smp_store_release(p, v)						\
-do {									\
-	compiletime_assert_atomic_type(*p);				\
-	barrier();							\
-	ACCESS_ONCE(*p) = (v);						\
-} while (0)
-
-#define smp_load_acquire(p)						\
-({									\
-	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
-	compiletime_assert_atomic_type(*p);				\
-	barrier();							\
-	___p1;								\
-})
-
-#else
-
-#define smp_mb()	dmb(ish)
-#define smp_rmb()	dmb(ishld)
-#define smp_wmb()	dmb(ishst)
-
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	switch (sizeof(*p)) {						\
@@ -73,7 +47,7 @@ do {									\
 	}								\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1;						\
 	compiletime_assert_atomic_type(*p);				\
@@ -90,6 +64,35 @@ do {									\
 	___p1;								\
 })
 
+#ifndef CONFIG_SMP
+#define smp_mb()	barrier()
+#define smp_rmb()	barrier()
+#define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
+#else
+
+#define smp_mb()	dmb(ish)
+#define smp_rmb()	dmb(ishld)
+#define smp_wmb()	dmb(ishst)
+
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..d7fe208 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -63,14 +63,14 @@
  * need for asm trickery!
  */
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)						\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -78,6 +78,9 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 /*
  * XXX check on this ---I suspect what Linus really wants here is
  * acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..9beb687 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -85,6 +85,21 @@ static inline void fence(void)
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..fc7323c 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,6 +180,21 @@
 #define nudge_writes() mb()
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..f2a0d73 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -37,6 +37,23 @@
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
+#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	___p1;								\
+})
+
 #ifdef CONFIG_SMP
 
 #ifdef __SUBARCH_HAS_LWSYNC
@@ -45,15 +62,12 @@
 #    define SMPWMB      eieio
 #endif
 
-#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
 
 #define smp_mb()	mb()
 #define smp_rmb()	__lwsync()
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
-#define __lwsync()	barrier()
-
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -72,7 +86,7 @@
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
@@ -80,7 +94,7 @@ do {									\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	___p1;								\
 })
 
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..637d7a9 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -35,14 +35,14 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -50,4 +50,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)		store_release(p, v)
+#define smp_load_acquire(p)		load_acquire(p)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..7de3c69 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,14 +53,14 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 
 #define smp_read_barrier_depends()	do { } while(0)
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -68,6 +68,8 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
 #define smp_mb__before_atomic()	barrier()
 #define smp_mb__after_atomic()	barrier()
 
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..3d2aa18 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -103,6 +103,21 @@
  * model and we should fall back to full barriers.
  */
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
@@ -120,14 +135,14 @@ do {									\
 
 #else /* regular x86 TSO memory ordering */
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -135,6 +150,9 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 #endif
 
 /* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c6e4b99 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,21 @@
 #define smp_mb__after_atomic()	smp_mb()
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\

^ permalink raw reply related

* [PATCH 0/3] Introduce load_acquire() and store_release()
From: Alexander Duyck @ 2014-11-13 19:27 UTC (permalink / raw)
  To: linux-arch, netdev, linux-kernel
  Cc: mikey, tony.luck, mathieu.desnoyers, donald.c.skidmore, peterz,
	benh, heiko.carstens, oleg, will.deacon, davem, michael,
	matthew.vick, nic_swsd, geert, jeffrey.t.kirsher, fweisbec,
	schwidefsky, linux, paulmck, torvalds, mingo

These patches introduce uniprocessor or CPU<->device equivalents for
smp_load_acquire() and smp_store_release().  These two new primitives are:

	load_acquire()
	store_release()

The first patch adds the primitives for the applicable architectures and
asm-generic.

The second patch adds the primitives to r8169 which turns out to be a good
example of where the new primitives might be useful as they have memory
barriers ordering accesses to the descriptors and the DescOwn bit within the
descriptors which follow acquire/release style semantics.

The third patch adds support for load_acquire() to the Intel fm10k, igb,
and ixgbe drivers.  Testing with the ixgbe driver has shown a processing
time reduction of at least 7ns per 64B frame on a Core i7-4930K.

This patch series is essentially the v2 for:
	arch: Introduce read_acquire()

The key changes in this patch series versus that patch are:
	- Renamed read_acquire() to be consistent with smp_load_acquire()
	- Changed barrier used to be consistent with smp_load_acquire()
	- Updated PowerPC code to use __lwsync based on IBM article
	- Added store_release() as this is a viable use case for drivers
	- Added r8169 patch which is able to fully use primitives
	- Added fm10k/igb/ixgbe patch which is able to test performance

---

Alexander Duyck (3):
      arch: Introduce load_acquire() and store_release()
      r8169: Use load_acquire() and store_release() to reduce memory barrier overhead
      fm10k/igb/ixgbe: Use load_acquire on Rx descriptor


 arch/arm/include/asm/barrier.h                |   15 ++++++
 arch/arm64/include/asm/barrier.h              |   59 +++++++++++++------------
 arch/ia64/include/asm/barrier.h               |    7 ++-
 arch/metag/include/asm/barrier.h              |   15 ++++++
 arch/mips/include/asm/barrier.h               |   15 ++++++
 arch/powerpc/include/asm/barrier.h            |   24 ++++++++--
 arch/s390/include/asm/barrier.h               |    7 ++-
 arch/sparc/include/asm/barrier_64.h           |    6 ++-
 arch/x86/include/asm/barrier.h                |   22 ++++++++-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |    8 +--
 drivers/net/ethernet/intel/igb/igb_main.c     |    8 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   11 ++---
 drivers/net/ethernet/realtek/r8169.c          |   23 ++++------
 include/asm-generic/barrier.h                 |   15 ++++++
 14 files changed, 163 insertions(+), 72 deletions(-)

--

^ permalink raw reply

* tcp_fastretrans_alert warning in Linux kernel 3.10.
From: Vinson Lee @ 2014-11-13 19:26 UTC (permalink / raw)
  To: netdev

Hi.

We saw this networking tcp_fastretrans_alert warning in Linux kernel 3.10.

------------[ cut here ]------------
WARNING: at net/ipv4/tcp_input.c:2788 tcp_fastretrans_alert+0x7a1/0x7be()
Modules linked in: netconsole configfs xt_DSCP iptable_mangle
cpufreq_ondemand ipv6 ppdev parport_pc lp parport tcp_diag inet_diag
ipmi_si ipmi_devintf ipmi
_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp
kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode serio_raw
i2c_i801 lpc_ich mfd_core
t i2c_core ptp pps_core ioatdma dca wmi
CPU: 13 PID: 0 Comm: swapper/13 Not tainted 3.10.50 #1
 0000000000000000 ffff88085fce39b8 ffffffff814a4094 ffff88085fce39f0
 ffffffff8103cbf9 0000000000000000 000000000000470e 0000000000000000
 0000000000000000 ffff880824fe5e80 ffff88085fce3a00 ffffffff8103ccbf
Call Trace:
 <IRQ>  [<ffffffff814a4094>] dump_stack+0x19/0x1b
 [<ffffffff8103cbf9>] warn_slowpath_common+0x65/0x7d
 [<ffffffff8103ccbf>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff81438c56>] tcp_fastretrans_alert+0x7a1/0x7be
 [<ffffffff81439644>] tcp_ack+0x95e/0xb47
 [<ffffffff81439d46>] tcp_rcv_established+0x23e/0x442
 [<ffffffff81442cab>] tcp_v4_do_rcv+0x1e6/0x3e8
 [<ffffffff8144371e>] tcp_v4_rcv+0x293/0x52c
 [<ffffffff81425e6a>] ? NF_HOOK_THRESH.constprop.11+0x53/0x53
 [<ffffffff8142c668>] ? do_ip_setsockopt.isra.7+0xa95/0xb3f
 [<ffffffff81425f4a>] ip_local_deliver_finish+0xe0/0x155
 [<ffffffff81425e6a>] ? NF_HOOK_THRESH.constprop.11+0x53/0x53
 [<ffffffff81425e48>] NF_HOOK_THRESH.constprop.11+0x31/0x53
 [<ffffffff814260ef>] ip_local_deliver+0x40/0x52
 [<ffffffff81425d50>] ip_rcv_finish+0x274/0x2b6
 [<ffffffff81425adc>] ? inet_del_offload+0x3d/0x3d
 [<ffffffff81425e48>] NF_HOOK_THRESH.constprop.11+0x31/0x53
 [<ffffffff81426324>] ip_rcv+0x223/0x267
 [<ffffffff813f76b4>] __netif_receive_skb_core+0x435/0x4a7
 [<ffffffff8107bab8>] ? timekeeping_get_ns.constprop.10+0x11/0x36
 [<ffffffff813f773e>] __netif_receive_skb+0x18/0x5a
 [<ffffffff813f8813>] netif_receive_skb+0x40/0x75
 [<ffffffff813f8f94>] napi_gro_receive+0x3e/0x80
 [<ffffffffa004a097>] igb_clean_rx_irq+0x67e/0x69e [igb]
 [<ffffffffa004a425>] igb_poll+0x36e/0x5b0 [igb]
 [<ffffffff8107bab8>] ? timekeeping_get_ns.constprop.10+0x11/0x36
 [<ffffffff8107bf32>] ? ktime_get+0x68/0x76
 [<ffffffff813f8a61>] net_rx_action+0xcf/0x196
 [<ffffffff8106870c>] ? sched_clock_cpu+0x42/0xc7
 [<ffffffff8104371d>] __do_softirq+0xd5/0x1f4
 [<ffffffff814b01bc>] call_softirq+0x1c/0x30
 [<ffffffff81003ecd>] do_softirq+0x33/0x6e
 [<ffffffff81043931>] irq_exit+0x51/0x93
 [<ffffffff814b086e>] do_IRQ+0x8e/0xa5
 [<ffffffff814a85aa>] common_interrupt+0x6a/0x6a
 <EOI>  [<ffffffff813c6807>] ? cpuidle_enter_state+0x52/0xa3
 [<ffffffff813c6800>] ? cpuidle_enter_state+0x4b/0xa3
 [<ffffffff813c693d>] cpuidle_idle_call+0xe5/0x13f
 [<ffffffff81009408>] arch_cpu_idle+0xe/0x1d
 [<ffffffff8107a872>] cpu_startup_entry+0x128/0x17a
 [<ffffffff814956be>] start_secondary+0x246/0x248
---[ end trace 804e605f7757900d ]---


net/ipv4/tcp_input.c
  2775  static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
  2776                                    int prior_sacked, int prior_packets,
  2777                                    bool is_dupack, int flag)
  2778  {
  2779          struct inet_connection_sock *icsk = inet_csk(sk);
  2780          struct tcp_sock *tp = tcp_sk(sk);
  2781          int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
  2782                                      (tcp_fackets_out(tp) >
tp->reordering));
  2783          int newly_acked_sacked = 0;
  2784          int fast_rexmit = 0;
  2785
  2786          if (WARN_ON(!tp->packets_out && tp->sacked_out))
  2787                  tp->sacked_out = 0;
  2788          if (WARN_ON(!tp->sacked_out && tp->fackets_out))
  2789                  tp->fackets_out = 0;


Cheers,
Vinson

^ permalink raw reply

* [RFC PATCH 00/16] Replace smp_read_barrier_depends() with lockless_derefrence()
From: Pranith Kumar @ 2014-11-13 19:24 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Cristian Stoica, Horia Geanta,
	Ruchika Gupta, Michael Neuling, Wolfram Sang,
	open list:CRYPTO API, open list, Vinod Koul, Dan Williams,
	Bartlomiej Zolnierkiewicz, Kyungmin Park, Manuel Schölling,
	Dave Jiang, Rashika, open list:DMA GENERIC OFFLO...,
	K. Y. Srinivasan, Haiyang Zhang
  Cc: paulmck

Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends(). 

http://lkml.iu.edu/hypermail/linux/kernel/1410.3/04561.html

The following series tries to do this.

There are still some hard-coded locations which I was not sure how to replace
with. I will send in separate patches/questions regarding them.

Pranith Kumar (16):
  crypto: caam - Remove unnecessary smp_read_barrier_depends()
  doc: memory-barriers.txt: Document use of lockless_dereference()
  drivers: dma: Replace smp_read_barrier_depends() with
    lockless_dereference()
  dcache: Replace smp_read_barrier_depends() with lockless_dereference()
  overlayfs: Replace smp_read_barrier_depends() with
    lockless_dereference()
  assoc_array: Replace smp_read_barrier_depends() with
    lockless_dereference()
  hyperv: Replace smp_read_barrier_depends() with lockless_dereference()
  rcupdate: Replace smp_read_barrier_depends() with
    lockless_dereference()
  percpu: Replace smp_read_barrier_depends() with lockless_dereference()
  perf: Replace smp_read_barrier_depends() with lockless_dereference()
  seccomp: Replace smp_read_barrier_depends() with
    lockless_dereference()
  task_work: Replace smp_read_barrier_depends() with
    lockless_dereference()
  ksm: Replace smp_read_barrier_depends() with lockless_dereference()
  slab: Replace smp_read_barrier_depends() with lockless_dereference()
  netfilter: Replace smp_read_barrier_depends() with
    lockless_dereference()
  rxrpc: Replace smp_read_barrier_depends() with lockless_dereference()

 Documentation/memory-barriers.txt |  2 +-
 drivers/crypto/caam/jr.c          |  3 ---
 drivers/dma/ioat/dma_v2.c         |  3 +--
 drivers/dma/ioat/dma_v3.c         |  3 +--
 fs/dcache.c                       |  7 ++-----
 fs/overlayfs/super.c              |  4 +---
 include/linux/assoc_array_priv.h  | 11 +++++++----
 include/linux/hyperv.h            |  9 ++++-----
 include/linux/percpu-refcount.h   |  4 +---
 include/linux/rcupdate.h          | 10 +++++-----
 kernel/events/core.c              |  3 +--
 kernel/events/uprobes.c           |  8 ++++----
 kernel/seccomp.c                  |  7 +++----
 kernel/task_work.c                |  3 +--
 lib/assoc_array.c                 |  7 -------
 mm/ksm.c                          |  7 +++----
 mm/slab.h                         |  6 +++---
 net/ipv4/netfilter/arp_tables.c   |  3 +--
 net/ipv4/netfilter/ip_tables.c    |  3 +--
 net/ipv6/netfilter/ip6_tables.c   |  3 +--
 net/rxrpc/ar-ack.c                | 22 +++++++++-------------
 security/keys/keyring.c           |  6 ------
 22 files changed, 50 insertions(+), 84 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [PATCH 16/16] rxrpc: Replace smp_read_barrier_depends() with lockless_dereference()
From: Pranith Kumar @ 2014-11-13 19:24 UTC (permalink / raw)
  To: David S. Miller, David Howells, Dan Carpenter,
	open list:NETWORKING [GENERAL], open list
  Cc: paulmck
In-Reply-To: <1415906662-4576-1-git-send-email-bobby.prani@gmail.com>

Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends(). The following PATCH makes the change.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 net/rxrpc/ar-ack.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/net/rxrpc/ar-ack.c b/net/rxrpc/ar-ack.c
index c6be17a..9237448 100644
--- a/net/rxrpc/ar-ack.c
+++ b/net/rxrpc/ar-ack.c
@@ -234,8 +234,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
 	     loop != call->acks_head || stop;
 	     loop = (loop + 1) &  (call->acks_winsz - 1)
 	     ) {
-		p_txb = call->acks_window + loop;
-		smp_read_barrier_depends();
+		p_txb = lockless_dereference(call)->acks_window + loop;
 		if (*p_txb & 1)
 			continue;
 
@@ -303,8 +302,7 @@ static void rxrpc_resend_timer(struct rxrpc_call *call)
 	     loop != call->acks_head;
 	     loop = (loop + 1) &  (call->acks_winsz - 1)
 	     ) {
-		p_txb = call->acks_window + loop;
-		smp_read_barrier_depends();
+		p_txb = lockless_dereference(call)->acks_window + loop;
 		txb = (struct sk_buff *) (*p_txb & ~1);
 		sp = rxrpc_skb(txb);
 
@@ -354,9 +352,10 @@ static int rxrpc_process_soft_ACKs(struct rxrpc_call *call,
 	resend = 0;
 	resend_at = 0;
 	for (loop = 0; loop < ack->nAcks; loop++) {
-		p_txb = call->acks_window;
-		p_txb += (call->acks_tail + loop) & (call->acks_winsz - 1);
-		smp_read_barrier_depends();
+		struct rxrpc_call *callp = lockless_dereference(call);
+
+		p_txb = callp->acks_window;
+		p_txb += (callp->acks_tail + loop) & (callp->acks_winsz - 1);
 		txb = (struct sk_buff *) (*p_txb & ~1);
 		sp = rxrpc_skb(txb);
 
@@ -385,8 +384,7 @@ static int rxrpc_process_soft_ACKs(struct rxrpc_call *call,
 	     loop != call->acks_head;
 	     loop = (loop + 1) &  (call->acks_winsz - 1)
 	     ) {
-		p_txb = call->acks_window + loop;
-		smp_read_barrier_depends();
+		p_txb = lockless_dereference(call)->acks_window + loop;
 		txb = (struct sk_buff *) (*p_txb & ~1);
 		sp = rxrpc_skb(txb);
 
@@ -432,8 +430,7 @@ static void rxrpc_rotate_tx_window(struct rxrpc_call *call, u32 hard)
 	ASSERTCMP(hard - call->acks_hard, <=, win);
 
 	while (call->acks_hard < hard) {
-		smp_read_barrier_depends();
-		_skb = call->acks_window[tail] & ~1;
+		_skb = lockless_dereference(call)->acks_window[tail] & ~1;
 		rxrpc_free_skb((struct sk_buff *) _skb);
 		old_tail = tail;
 		tail = (tail + 1) & (call->acks_winsz - 1);
@@ -577,8 +574,7 @@ static void rxrpc_zap_tx_window(struct rxrpc_call *call)
 	call->acks_window = NULL;
 
 	while (CIRC_CNT(call->acks_head, call->acks_tail, winsz) > 0) {
-		tail = call->acks_tail;
-		smp_read_barrier_depends();
+		tail = lockless_dereference(call)->acks_tail;
 		_skb = acks_window[tail] & ~1;
 		smp_mb();
 		call->acks_tail = (call->acks_tail + 1) & (winsz - 1);
-- 
1.9.1

^ permalink raw reply related

* [PATCH 15/16] netfilter: Replace smp_read_barrier_depends() with lockless_dereference()
From: Pranith Kumar @ 2014-11-13 19:24 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, netfilter-devel, coreteam,
	open list:NETWORKING [IPv4/..., open list
  Cc: paulmck
In-Reply-To: <1415906662-4576-1-git-send-email-bobby.prani@gmail.com>

Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends(). The following PATCH makes the change.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 net/ipv4/netfilter/arp_tables.c | 3 +--
 net/ipv4/netfilter/ip_tables.c  | 3 +--
 net/ipv6/netfilter/ip6_tables.c | 3 +--
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index f95b6f9..fc7533d 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -270,12 +270,11 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
 	/*
 	 * Ensure we load private-> members after we've fetched the base
 	 * pointer.
 	 */
-	smp_read_barrier_depends();
+	private = lockless_dereference(table->private);
 	table_base = private->entries[smp_processor_id()];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 99e810f..e0fd044 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -325,13 +325,12 @@ ipt_do_table(struct sk_buff *skb,
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
 	cpu        = smp_processor_id();
 	/*
 	 * Ensure we load private-> members after we've fetched the base
 	 * pointer.
 	 */
-	smp_read_barrier_depends();
+	private = lockless_dereference(table->private);
 	table_base = private->entries[cpu];
 	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 	stackptr   = per_cpu_ptr(private->stackptr, cpu);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index e080fbb..0459d6a 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -348,12 +348,11 @@ ip6t_do_table(struct sk_buff *skb,
 
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
 	/*
 	 * Ensure we load private-> members after we've fetched the base
 	 * pointer.
 	 */
-	smp_read_barrier_depends();
+	private = lockless_dereference(table->private);
 	cpu        = smp_processor_id();
 	table_base = private->entries[cpu];
 	jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] net: skb_fclone_busy() needs to detect orphaned skb
From: Luis Henriques @ 2014-11-13 19:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Joseph Salisbury
In-Reply-To: <1414690354.9028.9.camel@edumazet-glaptop2.roam.corp.google.com>

Hi Eric,

On Thu, Oct 30, 2014 at 10:32:34AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Some drivers are unable to perform TX completions in a bound time.
> They instead call skb_orphan()
> 
> Problem is skb_fclone_busy() has to detect this case, otherwise
> we block TCP retransmits and can freeze unlucky tcp sessions on
> mostly idle hosts.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues")
> ---
>  This is a stable candidate.
>  This problem is known to hurt users of linux-3.16 kernels used by guests kernels.
>  David, I can provide backports if you want.
>  Thanks !
> 

We got a bug report[0] where a backport for 3.16 was provided.  Since
I couldn't find the original backport post, I'm not sure who's the
actual author.  Could you please confirm if this backport is correct?
(I'm copying the patch below).

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1390604

Cheers,
--
Luís


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4e4932b5079b..a8794367cd20 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2082,7 +2082,8 @@ static bool skb_still_in_host_queue(const struct sock *sk,
 	const struct sk_buff *fclone = skb + 1;
 
 	if (unlikely(skb->fclone == SKB_FCLONE_ORIG &&
-		     fclone->fclone == SKB_FCLONE_CLONE)) {
+		     fclone->fclone == SKB_FCLONE_CLONE &&
+		     fclone->sk == sk)) {
 		NET_INC_STATS_BH(sock_net(sk),
 				 LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
 		return true;

>  include/linux/skbuff.h |    8 ++++++--
>  net/ipv4/tcp_output.c  |    2 +-
>  net/xfrm/xfrm_policy.c |    2 +-
>  3 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 5884f95ff0e9..6c8b6f604e76 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -799,15 +799,19 @@ struct sk_buff_fclones {
>   *	@skb: buffer
>   *
>   * Returns true is skb is a fast clone, and its clone is not freed.
> + * Some drivers call skb_orphan() in their ndo_start_xmit(),
> + * so we also check that this didnt happen.
>   */
> -static inline bool skb_fclone_busy(const struct sk_buff *skb)
> +static inline bool skb_fclone_busy(const struct sock *sk,
> +				   const struct sk_buff *skb)
>  {
>  	const struct sk_buff_fclones *fclones;
>  
>  	fclones = container_of(skb, struct sk_buff_fclones, skb1);
>  
>  	return skb->fclone == SKB_FCLONE_ORIG &&
> -	       fclones->skb2.fclone == SKB_FCLONE_CLONE;
> +	       fclones->skb2.fclone == SKB_FCLONE_CLONE &&
> +	       fclones->skb2.sk == sk;
>  }
>  
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 3af21296d967..a3d453b94747 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2126,7 +2126,7 @@ bool tcp_schedule_loss_probe(struct sock *sk)
>  static bool skb_still_in_host_queue(const struct sock *sk,
>  				    const struct sk_buff *skb)
>  {
> -	if (unlikely(skb_fclone_busy(skb))) {
> +	if (unlikely(skb_fclone_busy(sk, skb))) {
>  		NET_INC_STATS_BH(sock_net(sk),
>  				 LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
>  		return true;
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 4c4e457e7888..88bf289abdc9 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1962,7 +1962,7 @@ static int xdst_queue_output(struct sock *sk, struct sk_buff *skb)
>  	struct xfrm_policy *pol = xdst->pols[0];
>  	struct xfrm_policy_queue *pq = &pol->polq;
>  
> -	if (unlikely(skb_fclone_busy(skb))) {
> +	if (unlikely(skb_fclone_busy(sk, skb))) {
>  		kfree_skb(skb);
>  		return 0;
>  	}
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH net-next] tcp: limit GSO packets to half cwnd
From: Neal Cardwell @ 2014-11-13 18:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati
In-Reply-To: <1415900722.17262.22.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Nov 13, 2014 at 12:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
> big, as sk_pacing_rate is high.
>
> When network is congested, cwnd can be smaller than the GSO packets
> found in socket write queue. tcp_write_xmit() splits GSO packets
> using the available cwnd, and we end up sending a single GSO packet,
> consuming all available cwnd.
>
> With GRO aggregation on the receiver, we might handle a single GRO
> packet, sending back a single ACK.
>
> 1) This single ACK might be lost
>    TLP or RTO are forced to attempt a retransmit.
> 2) This ACK releases a full cwnd, sender sends another big GSO packet,
>    in a ping pong mode.
>
> This behavior does not fill the pipes in the best way, because of
> scheduling artifacts.
>
> Make sure we always have at least two GSO packets in flight.
>
> This allows us to safely increase GRO efficiency without risking
> spurious retransmits.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp_output.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks, Eric!

neal

^ permalink raw reply

* Re: [PATCH net-next] tcp: limit GSO packets to half cwnd
From: Dave Taht @ 2014-11-13 18:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng,
	Nandita Dukkipati
In-Reply-To: <1415900722.17262.22.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Nov 13, 2014 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
> big, as sk_pacing_rate is high.
>
> When network is congested, cwnd can be smaller than the GSO packets
> found in socket write queue. tcp_write_xmit() splits GSO packets
> using the available cwnd, and we end up sending a single GSO packet,
> consuming all available cwnd.

My take on this is that this will also help reduce inter-flow latency
on devices that are running GSO at 100Mbit or lower speeds.  (?)

I have really liked many of the patches entering netdev in this cycle!
(could this be a tunable instead to split on 1/4th for example?)

That does not prevent me from wishing, abstractly, to strand all the
tcp development orgs of the world on a tropic island with a 10mbit
uplink, or on a gbus, endlessly circling san francisco... until more
low bandwidth with high latency problems are resolved.

:)

> With GRO aggregation on the receiver, we might handle a single GRO
> packet, sending back a single ACK.
>
> 1) This single ACK might be lost
>    TLP or RTO are forced to attempt a retransmit.
> 2) This ACK releases a full cwnd, sender sends another big GSO packet,
>    in a ping pong mode.
>
> This behavior does not fill the pipes in the best way, because of
> scheduling artifacts.
>
> Make sure we always have at least two GSO packets in flight.
>
> This allows us to safely increase GRO efficiency without risking
> spurious retransmits.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp_output.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0b88158dd4a70d5007e79f0d8251fee2f7c6c7f8..eb73a1dccf56b823a45c0ca034e40dc50fc48068 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1562,7 +1562,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
>  static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
>                                          const struct sk_buff *skb)
>  {
> -       u32 in_flight, cwnd;
> +       u32 in_flight, cwnd, halfcwnd;
>
>         /* Don't be strict about the congestion window for the final FIN.  */
>         if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
> @@ -1571,10 +1571,14 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
>
>         in_flight = tcp_packets_in_flight(tp);
>         cwnd = tp->snd_cwnd;
> -       if (in_flight < cwnd)
> -               return (cwnd - in_flight);
> +       if (in_flight >= cwnd)
> +               return 0;
>
> -       return 0;
> +       /* For better scheduling, ensure we have at least
> +        * 2 GSO packets in flight.
> +        */
> +       halfcwnd = max(cwnd >> 1, 1U);
> +       return min(halfcwnd, cwnd - in_flight);
>  }
>
>  /* Initialize TSO state of a skb.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply

* [PATCH net-next] tcp: limit GSO packets to half cwnd
From: Eric Dumazet @ 2014-11-13 17:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
big, as sk_pacing_rate is high.

When network is congested, cwnd can be smaller than the GSO packets
found in socket write queue. tcp_write_xmit() splits GSO packets
using the available cwnd, and we end up sending a single GSO packet,
consuming all available cwnd.

With GRO aggregation on the receiver, we might handle a single GRO
packet, sending back a single ACK.

1) This single ACK might be lost
   TLP or RTO are forced to attempt a retransmit.
2) This ACK releases a full cwnd, sender sends another big GSO packet,
   in a ping pong mode.

This behavior does not fill the pipes in the best way, because of
scheduling artifacts.

Make sure we always have at least two GSO packets in flight.

This allows us to safely increase GRO efficiency without risking
spurious retransmits.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0b88158dd4a70d5007e79f0d8251fee2f7c6c7f8..eb73a1dccf56b823a45c0ca034e40dc50fc48068 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1562,7 +1562,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
 					 const struct sk_buff *skb)
 {
-	u32 in_flight, cwnd;
+	u32 in_flight, cwnd, halfcwnd;
 
 	/* Don't be strict about the congestion window for the final FIN.  */
 	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
@@ -1571,10 +1571,14 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
 
 	in_flight = tcp_packets_in_flight(tp);
 	cwnd = tp->snd_cwnd;
-	if (in_flight < cwnd)
-		return (cwnd - in_flight);
+	if (in_flight >= cwnd)
+		return 0;
 
-	return 0;
+	/* For better scheduling, ensure we have at least
+	 * 2 GSO packets in flight.
+	 */
+	halfcwnd = max(cwnd >> 1, 1U);
+	return min(halfcwnd, cwnd - in_flight);
 }
 
 /* Initialize TSO state of a skb.

^ permalink raw reply related

* Re: [E1000-devel] [PATCH] ixgbe: make VLAN filter conditional in SR-IOV case
From: Jeff Kirsher @ 2014-11-13 17:08 UTC (permalink / raw)
  To: Hiroshi Shimamoto
  Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	Choi, Sy Jong, Hayato Momma, linux-kernel@vger.kernel.org
In-Reply-To: <7F861DC0615E0C47A872E6F3C5FCDDBD05D9D336@BPXM14GP.gisp.nec.co.jp>

[-- Attachment #1: Type: text/plain, Size: 1271 bytes --]

On Thu, 2014-11-13 at 08:28 +0000, Hiroshi Shimamoto wrote:
> From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> 
> Disable hardware VLAN filtering if netdev->features VLAN flag is
> dropped.
> 
> In SR-IOV case, there is a use case which needs to disable VLAN
> filter.
> For example, we need to make a network function with VF in virtualized
> environment. That network function may be a software switch, a router
> or etc. It means that that network function will be an end point which
> terminates many VLANs.
> 
> In the current implementation, VLAN filtering always be turned on and
> VF can receive only 63 VLANs. It means that only 63 VLANs can be used
> and it's not enough at all for building a virtual router.
> 
> With this patch, if the user turns VLAN filtering off on the host, VF
> can receive every VLAN packet.
> The behavior is changed only if VLAN filtering is turned off by
> ethtool.
> 
> Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> CC: Choi, Sy Jong <sy.jong.choi@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 10 ++++++++++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |  4 ++++
>  2 files changed, 14 insertions(+)

Thanks Hiroshi, I will add your patch to my queue.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [patch net-next v2 4/4] sched: introduce vlan action
From: Cong Wang @ 2014-11-13 17:07 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David Miller, Jamal Hadi Salim, Pravin B Shelar,
	Tom Herbert, Eric Dumazet, Willem de Bruijn, Daniel Borkmann, mst,
	Florian Westphal, Paul.Durrant, Thomas Graf
In-Reply-To: <1415803950-9838-4-git-send-email-jiri@resnulli.us>

On Wed, Nov 12, 2014 at 6:52 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> This tc action allows to work with vlan tagged skbs. Two supported
> sub-actions are header pop and header push.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

Reviewed-by: Cong Wang <cwang@twopensource.com>

^ permalink raw reply

* Re: [patch net-next 2/2] sched: introduce vlan action
From: Cong Wang @ 2014-11-13 17:06 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, netdev, David Miller, Pravin B Shelar, Tom Herbert,
	Eric Dumazet, Willem de Bruijn, Daniel Borkmann, mst,
	Florian Westphal, Paul.Durrant, Thomas Graf
In-Reply-To: <5463522C.1030104@mojatatu.com>

On Wed, Nov 12, 2014 at 4:27 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Cong,
> I think it is better to have all these "tunneling" activities
> as a separate action each. It is cleaner from a usability perspective.
> [e.g. it is not hard to express nat action with pedit action or take
> checksum out of nat since we have a csum action), but makes sense to
> have it separate)].
>

Sounds good.

^ permalink raw reply

* Re: [PATCH net-next 1/1] ipvlan: Initial check-in of the IPVLAN driver.
From: Pavel Emelyanov @ 2014-11-13 15:57 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: netdev, Eric Dumazet, Maciej Zenczykowski, Laurent Chavey,
	Tim Hockin, David Miller, Brandon Philips
In-Reply-To: <CAF2d9jjmFymJdZmnfDCCUTRSSm-63RPaEFdB3TeYOq9S3jcPOA@mail.gmail.com>


>> Maybe introduce some "lock" call for ipvlan device after which no new IP addresses
>> can be assigned? And the configuration would look like
>>
>> 1. create ipvlan
>> 2. move to proper net namespace
>> 3. add addresses
>> 4. lock
>>
>> ?
> Yes. Exporting this "locked" property on the master device so that it
> can be controlled from masters' net-ns. Only thing we have to ensure
> is that both possibilities are allowed i.e. trusted ns where config do
> not need to be locked as well as untrusted/hostile ns where one can
> lock it down. However this is a future enhancement and if your
> implementation idea is different than this concept; we can discuss it
> at the time of implementation.

Sure. It's not a wish for the very first version of the set, it can
be added later.

Thanks,
Pavel

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox