Netdev List
* pull-request: mac80211 2016-11-18
From: Johannes Berg @ 2016-11-18  7:52 UTC
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

Hi Dave,

Due to travel/vacation, this is a bit late, but there aren't
that many fixes either. Most interesting/important are the
fixes from Felix and perhaps the scan entry limit.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 269ebce4531b8edc4224259a02143181a1c1d77c:

  xen-netfront: cast grant table reference first to type int (2016-11-02 15:33:36 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2016-11-18

for you to fetch changes up to 9853a55ef1bb66d7411136046060bbfb69c714fa:

  cfg80211: limit scan results cache size (2016-11-18 08:44:44 +0100)

----------------------------------------------------------------
A few more bugfixes:
 * limit # of scan results stored in memory - this is a long-standing bug
   Jouni and I only noticed while discussing other things in Santa Fe
 * revert AP_LINK_PS patch that was causing issues (Felix)
 * various A-MSDU/A-MPDU fixes for TXQ code (Felix)
 * interoperability workaround for peers with broken VHT capabilities
   (Filip Matusiak)
 * add bitrate definition for a VHT MCS that's supposed to be invalid
   but gets used by some hardware anyway (Thomas Pedersen)
 * beacon timer fix in hwsim (Benjamin Beichler)

----------------------------------------------------------------
Benjamin Beichler (1):
      mac80211_hwsim: fix beacon delta calculation

Felix Fietkau (4):
      Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"
      mac80211: update A-MPDU flag on tx dequeue
      mac80211: remove bogus skb vif assignment
      mac80211: fix A-MSDU aggregation with fast-xmit + txq

Filip Matusiak (1):
      mac80211: Ignore VHT IE from peer with wrong rx_mcs_map

Johannes Berg (1):
      cfg80211: limit scan results cache size

Pedersen, Thomas (1):
      cfg80211: add bitrate for 20MHz MCS 9

 drivers/net/wireless/mac80211_hwsim.c |  2 +-
 net/mac80211/sta_info.c               |  2 +-
 net/mac80211/tx.c                     | 14 +++++--
 net/mac80211/vht.c                    | 16 ++++++++
 net/wireless/core.h                   |  1 +
 net/wireless/scan.c                   | 69 +++++++++++++++++++++++++++++++++++
 net/wireless/util.c                   |  3 +-
 7 files changed, 100 insertions(+), 7 deletions(-)


* [PATCH net-next] bridge: add igmpv3 and mldv2 query support
From: Hangbin Liu @ 2016-11-18  7:32 UTC
  To: netdev
  Cc: Hannes Frederic Sowa, Nikolay Aleksandrov, linus.luessing,
	Hangbin Liu

Add IGMPv3 and MLDv2 query support to the bridge. Until it is considered
stable enough, it is only enabled when version 3 (IGMP) or version 2 (MLD)
is explicitly set via force_igmp_version/force_mld_version.
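
For illustration, the IGMPv3 query layout this patch emits can be sketched in
userspace C. This is a minimal sketch under stated assumptions, not the bridge
code: build_igmpv3_query() and inet_csum() are hypothetical helper names, the
field layout follows RFC 3376 section 4.1, and the Max Resp Code and QQIC are
taken as already-encoded values rather than derived from the bridge timers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Wire size of an IGMPv3 query with zero sources (RFC 3376, section 4.1). */
#define IGMPV3_QUERY_LEN 12

/* RFC 1071 Internet checksum over len bytes; result is in host order. */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Fill out[] with an IGMPv3 membership query (hypothetical helper).
 * group_be holds the group address in network byte order; all zero
 * means a general query. max_resp_code, qrv and qqic are assumed to
 * be already encoded by the caller.
 */
static void build_igmpv3_query(uint8_t out[IGMPV3_QUERY_LEN],
			       const uint8_t group_be[4],
			       uint8_t max_resp_code, uint8_t qrv,
			       uint8_t qqic)
{
	uint16_t csum;

	memset(out, 0, IGMPV3_QUERY_LEN);
	out[0] = 0x11;			/* type: membership query */
	out[1] = max_resp_code;
	memcpy(&out[4], group_be, 4);
	out[8] = qrv & 0x7;		/* resv = 0, S = 0, QRV in low 3 bits */
	out[9] = qqic;
	/* out[10..11]: number of sources, left at 0 */

	/* Checksum is computed with the checksum field still zero. */
	csum = inet_csum(out, IGMPV3_QUERY_LEN);
	out[2] = csum >> 8;
	out[3] = csum & 0xff;
}
```

Running inet_csum() over the finished 12 bytes should yield 0, which is a
quick way to check that the checksum field was filled in correctly.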

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/bridge/br_multicast.c | 203 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 194 insertions(+), 9 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 2136e45..9fb47f3 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -35,6 +35,10 @@
 
 #include "br_private.h"
 
+#define IGMP_V3_SEEN(in_dev) \
+	(IPV4_DEVCONF_ALL(dev_net(in_dev->dev), FORCE_IGMP_VERSION) == 3 || \
+	 IN_DEV_CONF_GET((in_dev), FORCE_IGMP_VERSION) == 3)
+
 static void br_multicast_start_querier(struct net_bridge *br,
 				       struct bridge_mcast_own_query *query);
 static void br_multicast_add_router(struct net_bridge *br,
@@ -360,9 +364,8 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max,
 	return 0;
 }
 
-static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
-						    __be32 group,
-						    u8 *igmp_type)
+static struct sk_buff *br_ip4_alloc_query_v2(struct net_bridge *br,
+					     __be32 group, u8 *igmp_type)
 {
 	struct sk_buff *skb;
 	struct igmphdr *ih;
@@ -428,10 +431,82 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
 	return skb;
 }
 
+static struct sk_buff *br_ip4_alloc_query_v3(struct net_bridge *br,
+					     __be32 group, u8 *igmp_type)
+{
+	struct sk_buff *skb;
+	struct igmpv3_query *ih3;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+
+	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*iph) +
+						 sizeof(*ih3) + 4);
+	if (!skb)
+		goto out;
+
+	skb->protocol = htons(ETH_P_IP);
+
+	skb_reset_mac_header(skb);
+	eth = eth_hdr(skb);
+
+	ether_addr_copy(eth->h_source, br->dev->dev_addr);
+	eth->h_dest[0] = 1;
+	eth->h_dest[1] = 0;
+	eth->h_dest[2] = 0x5e;
+	eth->h_dest[3] = 0;
+	eth->h_dest[4] = 0;
+	eth->h_dest[5] = 1;
+	eth->h_proto = htons(ETH_P_IP);
+	skb_put(skb, sizeof(*eth));
+
+	skb_set_network_header(skb, skb->len);
+	iph = ip_hdr(skb);
+
+	iph->version = 4;
+	iph->ihl = 6;
+	iph->tos = 0xc0;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*ih3) + 4);
+	iph->id = 0;
+	iph->frag_off = htons(IP_DF);
+	iph->ttl = 1;
+	iph->protocol = IPPROTO_IGMP;
+	iph->saddr = br->multicast_query_use_ifaddr ?
+		     inet_select_addr(br->dev, 0, RT_SCOPE_LINK) : 0;
+	iph->daddr = htonl(INADDR_ALLHOSTS_GROUP);
+	((u8 *)&iph[1])[0] = IPOPT_RA;
+	((u8 *)&iph[1])[1] = 4;
+	((u8 *)&iph[1])[2] = 0;
+	((u8 *)&iph[1])[3] = 0;
+	ip_send_check(iph);
+	skb_put(skb, 24);
+
+	skb_set_transport_header(skb, skb->len);
+	ih3 = igmpv3_query_hdr(skb);
+
+	*igmp_type = IGMP_HOST_MEMBERSHIP_QUERY;
+	ih3->type = IGMP_HOST_MEMBERSHIP_QUERY;
+	ih3->code = (group ? br->multicast_last_member_interval :
+			    br->multicast_query_response_interval) /
+		   (HZ / IGMP_TIMER_SCALE);
+	ih3->csum = 0;
+	ih3->group = group;
+	ih3->resv = 0;
+	ih3->suppress = 0;
+	ih3->qrv = 2;
+	ih3->qqic = br->multicast_query_interval / HZ;
+	ih3->nsrcs = 0;
+	ih3->csum = ip_compute_csum((void *)ih3, sizeof(struct igmpv3_query));
+	skb_put(skb, sizeof(*ih3));
+
+	__skb_pull(skb, sizeof(*eth));
+
+out:
+	return skb;
+}
 #if IS_ENABLED(CONFIG_IPV6)
-static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
-						    const struct in6_addr *grp,
-						    u8 *igmp_type)
+static struct sk_buff *br_ip6_alloc_query_v1(struct net_bridge *br,
+					     const struct in6_addr *grp,
+					     u8 *igmp_type)
 {
 	struct sk_buff *skb;
 	struct ipv6hdr *ip6h;
@@ -514,19 +589,129 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 out:
 	return skb;
 }
+
+static struct sk_buff *br_ip6_alloc_query_v2(struct net_bridge *br,
+					     const struct in6_addr *grp,
+					     u8 *igmp_type)
+{
+	struct sk_buff *skb;
+	struct ipv6hdr *ip6h;
+	struct mld2_query *mld2q;
+	struct ethhdr *eth;
+	u8 *hopopt;
+	unsigned long interval;
+
+	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*ip6h) +
+						 8 + sizeof(*mld2q));
+	if (!skb)
+		goto out;
+
+	skb->protocol = htons(ETH_P_IPV6);
+
+	/* Ethernet header */
+	skb_reset_mac_header(skb);
+	eth = eth_hdr(skb);
+
+	ether_addr_copy(eth->h_source, br->dev->dev_addr);
+	eth->h_proto = htons(ETH_P_IPV6);
+	skb_put(skb, sizeof(*eth));
+
+	/* IPv6 header + HbH option */
+	skb_set_network_header(skb, skb->len);
+	ip6h = ipv6_hdr(skb);
+
+	*(__force __be32 *)ip6h = htonl(0x60000000);
+	ip6h->payload_len = htons(8 + sizeof(*mld2q));
+	ip6h->nexthdr = IPPROTO_HOPOPTS;
+	ip6h->hop_limit = 1;
+	ipv6_addr_set(&ip6h->daddr, htonl(0xff020000), 0, 0, htonl(1));
+	if (ipv6_dev_get_saddr(dev_net(br->dev), br->dev, &ip6h->daddr, 0,
+			       &ip6h->saddr)) {
+		kfree_skb(skb);
+		br->has_ipv6_addr = 0;
+		return NULL;
+	}
+
+	br->has_ipv6_addr = 1;
+	ipv6_eth_mc_map(&ip6h->daddr, eth->h_dest);
+
+	hopopt = (u8 *)(ip6h + 1);
+	hopopt[0] = IPPROTO_ICMPV6;		/* next hdr */
+	hopopt[1] = 0;				/* length of HbH */
+	hopopt[2] = IPV6_TLV_ROUTERALERT;	/* Router Alert */
+	hopopt[3] = 2;				/* Length of RA Option */
+	hopopt[4] = 0;				/* Type = 0x0000 (MLD) */
+	hopopt[5] = 0;
+	hopopt[6] = IPV6_TLV_PAD1;		/* Pad1 */
+	hopopt[7] = IPV6_TLV_PAD1;		/* Pad1 */
+
+	skb_put(skb, sizeof(*ip6h) + 8);
+
+	/* ICMPv6 */
+	skb_set_transport_header(skb, skb->len);
+	mld2q = (struct mld2_query *) icmp6_hdr(skb);
+
+	interval = ipv6_addr_any(grp) ?
+			br->multicast_query_response_interval :
+			br->multicast_last_member_interval;
+
+	*igmp_type = ICMPV6_MGM_QUERY;
+	mld2q->mld2q_type = ICMPV6_MGM_QUERY;
+	mld2q->mld2q_code = 0;
+	mld2q->mld2q_cksum = 0;
+	mld2q->mld2q_mrc = htons((u16)jiffies_to_msecs(interval));
+	mld2q->mld2q_resv1 = 0;
+	mld2q->mld2q_mca = *grp;
+	mld2q->mld2q_resv2 = 0;
+	mld2q->mld2q_suppress = 0;
+	mld2q->mld2q_qrv = 2;
+	mld2q->mld2q_qqic = br->multicast_query_interval / HZ;
+	mld2q->mld2q_nsrcs = 0;
+
+	/* checksum */
+	mld2q->mld2q_cksum = csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+					  sizeof(*mld2q), IPPROTO_ICMPV6,
+					  csum_partial(mld2q,
+						       sizeof(*mld2q), 0));
+	skb_put(skb, sizeof(*mld2q));
+
+	__skb_pull(skb, sizeof(*eth));
+
+out:
+	return skb;
+}
 #endif
 
+static int mld_force_mld_version(const struct inet6_dev *idev)
+{
+	if (dev_net(idev->dev)->ipv6.devconf_all->force_mld_version != 0)
+		return dev_net(idev->dev)->ipv6.devconf_all->force_mld_version;
+	else
+		return idev->cnf.force_mld_version;
+}
+
 static struct sk_buff *br_multicast_alloc_query(struct net_bridge *br,
 						struct br_ip *addr,
 						u8 *igmp_type)
 {
+	struct in_device *in_dev = __in_dev_get_rcu(br->dev);
+	struct inet6_dev *idev = __in6_dev_get(br->dev);
 	switch (addr->proto) {
 	case htons(ETH_P_IP):
-		return br_ip4_multicast_alloc_query(br, addr->u.ip4, igmp_type);
+		if (IGMP_V3_SEEN(in_dev))
+			return br_ip4_alloc_query_v3(br, addr->u.ip4,
+						     igmp_type);
+		else
+			return br_ip4_alloc_query_v2(br, addr->u.ip4,
+						     igmp_type);
 #if IS_ENABLED(CONFIG_IPV6)
 	case htons(ETH_P_IPV6):
-		return br_ip6_multicast_alloc_query(br, &addr->u.ip6,
-						    igmp_type);
+		if (mld_force_mld_version(idev) == 2)
+			return br_ip6_alloc_query_v2(br, &addr->u.ip6,
+						     igmp_type);
+		else
+			return br_ip6_alloc_query_v1(br, &addr->u.ip6,
+						     igmp_type);
 #endif
 	}
 	return NULL;
-- 
2.5.5


* [PATCH net-next 4/4] geneve: Optimize geneve device lookup.
From: Pravin B Shelar @ 2016-11-18  7:10 UTC
  To: netdev; +Cc: Pravin B Shelar
In-Reply-To: <1479453029-29619-1-git-send-email-pshelar@ovn.org>

Rather than comparing the 64-bit tunnel ID, compare the tunnel VNI,
which is a 24-bit ID. This also saves the conversion from VNI
to tunnel ID on each received tunnel packet.
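
The idea can be sketched in userspace C as follows. This is only an
illustration under stated assumptions, not the driver code: the helper names
are hypothetical, and the tunnel ID is modelled as eight bytes in network
order with the VNI in bytes 5..7, which is the layout the little-endian
branch of the patch compares against.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Store a 24-bit VNI in the low-order three bytes (5..7) of an
 * 8-byte network-order tunnel ID and zero the remaining bytes.
 * Hypothetical stand-in for the VNI to tunnel ID conversion.
 */
static void vni_to_tun_id_bytes(const uint8_t vni[3], uint8_t tun_id[8])
{
	memset(tun_id, 0, 8);
	memcpy(&tun_id[5], vni, 3);
}

/* Old style: widen the received VNI first, then compare all 8 bytes. */
static bool match_by_tun_id(const uint8_t tun_id[8], const uint8_t vni[3])
{
	uint8_t id[8];

	vni_to_tun_id_bytes(vni, id);
	return memcmp(id, tun_id, sizeof(id)) == 0;
}

/* New style: compare only the three bytes that carry the VNI. */
static bool match_by_vni(const uint8_t tun_id[8], const uint8_t vni[3])
{
	return memcmp(vni, &tun_id[5], 3) == 0;
}
```

Both helpers agree as long as the upper five bytes of the stored tunnel ID
are zero, which holds in this sketch because the ID was itself derived from
a 24-bit VNI.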

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
---
 drivers/net/geneve.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8e6c659..21f3270 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -103,6 +103,17 @@ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
 #endif
 }
 
+static bool cmp_tunnel_id_and_vni(u8 *tun_id, u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+	return (vni[0] == tun_id[2]) &&
+	       (vni[1] == tun_id[1]) &&
+	       (vni[2] == tun_id[0]);
+#else
+	return !memcmp(vni, &tun_id[5], 3);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
 	return gs->sock->sk->sk_family;
@@ -111,7 +122,6 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 					__be32 addr, u8 vni[])
 {
-	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -120,7 +130,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
 		    addr == geneve->info.key.u.ipv4.dst)
 			return geneve;
 	}
@@ -131,7 +141,6 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 					 struct in6_addr addr6, u8 vni[])
 {
-	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -140,7 +149,7 @@ static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
 		    ipv6_addr_equal(&addr6, &geneve->info.key.u.ipv6.dst))
 			return geneve;
 	}
-- 
1.8.3.1


* [PATCH net-next 3/4] geneve: Remove redundant socket checks.
From: Pravin B Shelar @ 2016-11-18  7:10 UTC
  To: netdev; +Cc: Pravin B Shelar
In-Reply-To: <1479453029-29619-1-git-send-email-pshelar@ovn.org>

Geneve already checks for the device socket in the route
lookup function, so there is no need to check it again in
the xmit function.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
---
 drivers/net/geneve.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 9a4351c..8e6c659 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -785,14 +785,11 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
 	const struct ip_tunnel_key *key = &info->key;
 	struct rtable *rt;
-	int err = -EINVAL;
 	struct flowi4 fl4;
 	__u8 tos, ttl;
 	__be16 sport;
 	__be16 df;
-
-	if (!gs4)
-		return err;
+	int err;
 
 	rt = geneve_get_v4_rt(skb, dev, &fl4, info);
 	if (IS_ERR(rt))
@@ -828,13 +825,10 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
 	const struct ip_tunnel_key *key = &info->key;
 	struct dst_entry *dst = NULL;
-	int err = -EINVAL;
 	struct flowi6 fl6;
 	__u8 prio, ttl;
 	__be16 sport;
-
-	if (!gs6)
-		return err;
+	int err;
 
 	dst = geneve_get_v6_dst(skb, dev, &fl6, info);
 	if (IS_ERR(dst))
-- 
1.8.3.1


* [PATCH net-next 1/4] geneve: Unify LWT and netdev handling.
From: Pravin B Shelar @ 2016-11-18  7:10 UTC
  To: netdev; +Cc: Pravin B Shelar
In-Reply-To: <1479453029-29619-1-git-send-email-pshelar@ovn.org>

The current geneve implementation has two separate cases to handle:
1. netdev xmit
2. LWT xmit

For a netdev, the geneve configuration is stored in various
struct geneve_dev members, for example geneve_addr, ttl, tos,
label, flags, dst_cache, etc. For LWT, the configuration is passed
to the device in ip_tunnel_info.

This patch uses struct ip_tunnel_info to store almost all of the
configuration of a geneve netdevice, which allows most of the
geneve driver code to be unified around struct ip_tunnel_info.
This dramatically simplifies the geneve code, since it no longer
needs to handle two different configuration cases. It also removes
duplicate code: a single code path can handle either type
of geneve device.
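
A minimal sketch of the unified selection step, in userspace C. All names
here (struct tun_info, struct tun_dev, select_info(), TUN_INFO_TX) are
hypothetical stand-ins, and the non-metadata branch is an assumption about
how the device's own stored info would be chosen; it is not taken from the
driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TUN_INFO_TX 0x01	/* hypothetical stand-in for IP_TUNNEL_INFO_TX */

/* Hypothetical, trimmed-down stand-in for struct ip_tunnel_info. */
struct tun_info {
	uint8_t  mode;
	uint8_t  tos;
	uint8_t  ttl;
	uint16_t dst_port;
};

/* Hypothetical stand-in for struct geneve_dev. */
struct tun_dev {
	struct tun_info info;	/* static per-device configuration */
	bool collect_md;	/* true: use per-packet metadata instead */
};

/*
 * Pick the configuration for one packet. Both device types end up
 * with a pointer to the same struct type, so the rest of the transmit
 * path does not need to know which kind of device it is serving.
 * Returns NULL when a metadata device gets no usable TX metadata.
 */
static const struct tun_info *select_info(const struct tun_dev *dev,
					  const struct tun_info *pkt_info)
{
	if (!dev->collect_md)
		return &dev->info;
	if (!pkt_info || !(pkt_info->mode & TUN_INFO_TX))
		return NULL;
	return pkt_info;
}
```

With this shape, a caller would treat a NULL result as a transmit error and
otherwise hand the returned pointer to a single build-and-send routine.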

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
---
 drivers/net/geneve.c | 611 ++++++++++++++++++++++-----------------------------
 1 file changed, 262 insertions(+), 349 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 85a423a..622cf3b 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -45,41 +45,22 @@ struct geneve_net {
 
 static int geneve_net_id;
 
-union geneve_addr {
-	struct sockaddr_in sin;
-	struct sockaddr_in6 sin6;
-	struct sockaddr sa;
-};
-
-static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, };
-
 /* Pseudo network device */
 struct geneve_dev {
 	struct hlist_node  hlist;	/* vni hash table */
 	struct net	   *net;	/* netns for packet i/o */
 	struct net_device  *dev;	/* netdev for geneve tunnel */
+	struct ip_tunnel_info info;
 	struct geneve_sock __rcu *sock4;	/* IPv4 socket used for geneve tunnel */
 #if IS_ENABLED(CONFIG_IPV6)
 	struct geneve_sock __rcu *sock6;	/* IPv6 socket used for geneve tunnel */
 #endif
-	u8                 vni[3];	/* virtual network ID for tunnel */
-	u8                 ttl;		/* TTL override */
-	u8                 tos;		/* TOS override */
-	union geneve_addr  remote;	/* IP address for link partner */
 	struct list_head   next;	/* geneve's per namespace list */
-	__be32		   label;	/* IPv6 flowlabel override */
-	__be16		   dst_port;
-	bool		   collect_md;
 	struct gro_cells   gro_cells;
-	u32		   flags;
-	struct dst_cache   dst_cache;
+	bool		   collect_md;
+	bool		   use_udp6_rx_checksums;
 };
 
-/* Geneve device flags */
-#define GENEVE_F_UDP_ZERO_CSUM_TX	BIT(0)
-#define GENEVE_F_UDP_ZERO_CSUM6_TX	BIT(1)
-#define GENEVE_F_UDP_ZERO_CSUM6_RX	BIT(2)
-
 struct geneve_sock {
 	bool			collect_md;
 	struct list_head	list;
@@ -87,7 +68,6 @@ struct geneve_sock {
 	struct rcu_head		rcu;
 	int			refcnt;
 	struct hlist_head	vni_list[VNI_HASH_SIZE];
-	u32			flags;
 };
 
 static inline __u32 geneve_net_vni_hash(u8 vni[3])
@@ -109,6 +89,20 @@ static __be64 vni_to_tunnel_id(const __u8 *vni)
 #endif
 }
 
+/* Convert 64 bit tunnel ID to 24 bit VNI. */
+static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+	vni[0] = (__force __u8)(tun_id >> 16);
+	vni[1] = (__force __u8)(tun_id >> 8);
+	vni[2] = (__force __u8)tun_id;
+#else
+	vni[0] = (__force __u8)((__force u64)tun_id >> 40);
+	vni[1] = (__force __u8)((__force u64)tun_id >> 48);
+	vni[2] = (__force __u8)((__force u64)tun_id >> 56);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
 	return gs->sock->sk->sk_family;
@@ -117,6 +111,7 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 					__be32 addr, u8 vni[])
 {
+	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -125,8 +120,8 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-		    addr == geneve->remote.sin.sin_addr.s_addr)
+		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		    addr == geneve->info.key.u.ipv4.dst)
 			return geneve;
 	}
 	return NULL;
@@ -136,6 +131,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 					 struct in6_addr addr6, u8 vni[])
 {
+	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -144,8 +140,8 @@ static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-		    ipv6_addr_equal(&addr6, &geneve->remote.sin6.sin6_addr))
+		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		    ipv6_addr_equal(&addr6, &geneve->info.key.u.ipv6.dst))
 			return geneve;
 	}
 	return NULL;
@@ -160,15 +156,12 @@ static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb)
 static struct geneve_dev *geneve_lookup_skb(struct geneve_sock *gs,
 					    struct sk_buff *skb)
 {
-	u8 *vni;
-	__be32 addr;
 	static u8 zero_vni[3];
-#if IS_ENABLED(CONFIG_IPV6)
-	static struct in6_addr zero_addr6;
-#endif
+	u8 *vni;
 
 	if (geneve_get_sk_family(gs) == AF_INET) {
 		struct iphdr *iph;
+		__be32 addr;
 
 		iph = ip_hdr(skb); /* outer IP header... */
 
@@ -183,6 +176,7 @@ static struct geneve_dev *geneve_lookup_skb(struct geneve_sock *gs,
 		return geneve_lookup(gs, addr, vni);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (geneve_get_sk_family(gs) == AF_INET6) {
+		static struct in6_addr zero_addr6;
 		struct ipv6hdr *ip6h;
 		struct in6_addr addr6;
 
@@ -305,13 +299,12 @@ static int geneve_init(struct net_device *dev)
 		return err;
 	}
 
-	err = dst_cache_init(&geneve->dst_cache, GFP_KERNEL);
+	err = dst_cache_init(&geneve->info.dst_cache, GFP_KERNEL);
 	if (err) {
 		free_percpu(dev->tstats);
 		gro_cells_destroy(&geneve->gro_cells);
 		return err;
 	}
-
 	return 0;
 }
 
@@ -319,7 +312,7 @@ static void geneve_uninit(struct net_device *dev)
 {
 	struct geneve_dev *geneve = netdev_priv(dev);
 
-	dst_cache_destroy(&geneve->dst_cache);
+	dst_cache_destroy(&geneve->info.dst_cache);
 	gro_cells_destroy(&geneve->gro_cells);
 	free_percpu(dev->tstats);
 }
@@ -368,7 +361,7 @@ static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 }
 
 static struct socket *geneve_create_sock(struct net *net, bool ipv6,
-					 __be16 port, u32 flags)
+					 __be16 port, bool ipv6_rx_csum)
 {
 	struct socket *sock;
 	struct udp_port_cfg udp_conf;
@@ -379,8 +372,7 @@ static struct socket *geneve_create_sock(struct net *net, bool ipv6,
 	if (ipv6) {
 		udp_conf.family = AF_INET6;
 		udp_conf.ipv6_v6only = 1;
-		udp_conf.use_udp6_rx_checksums =
-		    !(flags & GENEVE_F_UDP_ZERO_CSUM6_RX);
+		udp_conf.use_udp6_rx_checksums = ipv6_rx_csum;
 	} else {
 		udp_conf.family = AF_INET;
 		udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
@@ -491,7 +483,7 @@ static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb,
 
 /* Create new listen socket if needed */
 static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port,
-						bool ipv6, u32 flags)
+						bool ipv6, bool ipv6_rx_csum)
 {
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_sock *gs;
@@ -503,7 +495,7 @@ static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port,
 	if (!gs)
 		return ERR_PTR(-ENOMEM);
 
-	sock = geneve_create_sock(net, ipv6, port, flags);
+	sock = geneve_create_sock(net, ipv6, port, ipv6_rx_csum);
 	if (IS_ERR(sock)) {
 		kfree(gs);
 		return ERR_CAST(sock);
@@ -579,21 +571,22 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6)
 	struct net *net = geneve->net;
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_sock *gs;
+	__u8 vni[3];
 	__u32 hash;
 
-	gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET, geneve->dst_port);
+	gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET, geneve->info.key.tp_dst);
 	if (gs) {
 		gs->refcnt++;
 		goto out;
 	}
 
-	gs = geneve_socket_create(net, geneve->dst_port, ipv6, geneve->flags);
+	gs = geneve_socket_create(net, geneve->info.key.tp_dst, ipv6,
+				  geneve->use_udp6_rx_checksums);
 	if (IS_ERR(gs))
 		return PTR_ERR(gs);
 
 out:
 	gs->collect_md = geneve->collect_md;
-	gs->flags = geneve->flags;
 #if IS_ENABLED(CONFIG_IPV6)
 	if (ipv6)
 		rcu_assign_pointer(geneve->sock6, gs);
@@ -601,7 +594,8 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6)
 #endif
 		rcu_assign_pointer(geneve->sock4, gs);
 
-	hash = geneve_net_vni_hash(geneve->vni);
+	tunnel_id_to_vni(geneve->info.key.tun_id, vni);
+	hash = geneve_net_vni_hash(vni);
 	hlist_add_head_rcu(&geneve->hlist, &gs->vni_list[hash]);
 	return 0;
 }
@@ -609,7 +603,7 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6)
 static int geneve_open(struct net_device *dev)
 {
 	struct geneve_dev *geneve = netdev_priv(dev);
-	bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
+	bool ipv6 = !!(geneve->info.mode & IP_TUNNEL_INFO_IPV6);
 	bool metadata = geneve->collect_md;
 	int ret = 0;
 
@@ -653,12 +647,12 @@ static void geneve_build_header(struct genevehdr *geneveh,
 
 static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
 			    __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-			    u32 flags, bool xnet)
+			    bool xnet)
 {
+	bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
 	struct genevehdr *gnvh;
 	int min_headroom;
 	int err;
-	bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
 
 	skb_scrub_packet(skb, xnet);
 
@@ -686,12 +680,12 @@ static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
 #if IS_ENABLED(CONFIG_IPV6)
 static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
 			     __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-			     u32 flags, bool xnet)
+			     bool xnet)
 {
+	bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
 	struct genevehdr *gnvh;
 	int min_headroom;
 	int err;
-	bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM6_TX);
 
 	skb_scrub_packet(skb, xnet);
 
@@ -734,32 +728,22 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 	memset(fl4, 0, sizeof(*fl4));
 	fl4->flowi4_mark = skb->mark;
 	fl4->flowi4_proto = IPPROTO_UDP;
+	fl4->daddr = info->key.u.ipv4.dst;
+	fl4->saddr = info->key.u.ipv4.src;
 
-	if (info) {
-		fl4->daddr = info->key.u.ipv4.dst;
-		fl4->saddr = info->key.u.ipv4.src;
-		fl4->flowi4_tos = RT_TOS(info->key.tos);
-		dst_cache = &info->dst_cache;
-	} else {
-		tos = geneve->tos;
-		if (tos == 1) {
-			const struct iphdr *iip = ip_hdr(skb);
-
-			tos = ip_tunnel_get_dsfield(iip, skb);
-			use_cache = false;
-		}
-
-		fl4->flowi4_tos = RT_TOS(tos);
-		fl4->daddr = geneve->remote.sin.sin_addr.s_addr;
-		dst_cache = &geneve->dst_cache;
+	tos = info->key.tos;
+	if (!geneve->collect_md && (tos == 1)) {
+		tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
+		use_cache = false;
 	}
+	fl4->flowi4_tos = RT_TOS(tos);
 
+	dst_cache = &info->dst_cache;
 	if (use_cache) {
 		rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
 		if (rt)
 			return rt;
 	}
-
 	rt = ip_route_output_key(geneve->net, fl4);
 	if (IS_ERR(rt)) {
 		netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
@@ -795,34 +779,22 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 	memset(fl6, 0, sizeof(*fl6));
 	fl6->flowi6_mark = skb->mark;
 	fl6->flowi6_proto = IPPROTO_UDP;
-
-	if (info) {
-		fl6->daddr = info->key.u.ipv6.dst;
-		fl6->saddr = info->key.u.ipv6.src;
-		fl6->flowlabel = ip6_make_flowinfo(RT_TOS(info->key.tos),
-						   info->key.label);
-		dst_cache = &info->dst_cache;
-	} else {
-		prio = geneve->tos;
-		if (prio == 1) {
-			const struct iphdr *iip = ip_hdr(skb);
-
-			prio = ip_tunnel_get_dsfield(iip, skb);
-			use_cache = false;
-		}
-
-		fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
-						   geneve->label);
-		fl6->daddr = geneve->remote.sin6.sin6_addr;
-		dst_cache = &geneve->dst_cache;
+	fl6->daddr = info->key.u.ipv6.dst;
+	fl6->saddr = info->key.u.ipv6.src;
+	prio = info->key.tos;
+	if (!geneve->collect_md && (prio == 1)) {
+		prio = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
+		use_cache = false;
 	}
 
+	fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
+					   info->key.label);
+	dst_cache = &info->dst_cache;
 	if (use_cache) {
 		dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
 		if (dst)
 			return dst;
 	}
-
 	if (ipv6_stub->ipv6_dst_lookup(geneve->net, gs6->sock->sk, &dst, fl6)) {
 		netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
 		return ERR_PTR(-ENETUNREACH);
@@ -839,195 +811,129 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 }
 #endif
 
-/* Convert 64 bit tunnel ID to 24 bit VNI. */
-static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
-{
-#ifdef __BIG_ENDIAN
-	vni[0] = (__force __u8)(tun_id >> 16);
-	vni[1] = (__force __u8)(tun_id >> 8);
-	vni[2] = (__force __u8)tun_id;
-#else
-	vni[0] = (__force __u8)((__force u64)tun_id >> 40);
-	vni[1] = (__force __u8)((__force u64)tun_id >> 48);
-	vni[2] = (__force __u8)((__force u64)tun_id >> 56);
-#endif
-}
-
-static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
-				   struct ip_tunnel_info *info)
+static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
+			   struct geneve_dev *geneve, struct ip_tunnel_info *info)
 {
-	struct geneve_dev *geneve = netdev_priv(dev);
-	struct geneve_sock *gs4;
-	struct rtable *rt = NULL;
-	const struct iphdr *iip; /* interior IP header */
+	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
+	struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
+	const struct ip_tunnel_key *key = &info->key;
+	struct rtable *rt;
 	int err = -EINVAL;
 	struct flowi4 fl4;
+	u8 *opts = NULL;
 	__u8 tos, ttl;
 	__be16 sport;
 	__be16 df;
-	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
-	u32 flags = geneve->flags;
+	u8 vni[3];
 
-	gs4 = rcu_dereference(geneve->sock4);
 	if (!gs4)
-		goto tx_error;
-
-	if (geneve->collect_md) {
-		if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
-			netdev_dbg(dev, "no tunnel metadata\n");
-			goto tx_error;
-		}
-		if (info && ip_tunnel_info_af(info) != AF_INET)
-			goto tx_error;
-	}
+		return err;
 
 	rt = geneve_get_v4_rt(skb, dev, &fl4, info);
-	if (IS_ERR(rt)) {
-		err = PTR_ERR(rt);
-		goto tx_error;
-	}
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
-	skb_reset_mac_header(skb);
-
-	iip = ip_hdr(skb);
-
-	if (info) {
-		const struct ip_tunnel_key *key = &info->key;
-		u8 *opts = NULL;
-		u8 vni[3];
-
-		tunnel_id_to_vni(key->tun_id, vni);
-		if (info->options_len)
-			opts = ip_tunnel_info_opts(info);
-
-		if (key->tun_flags & TUNNEL_CSUM)
-			flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX;
-		else
-			flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
-
-		err = geneve_build_skb(rt, skb, key->tun_flags, vni,
-				       info->options_len, opts, flags, xnet);
-		if (unlikely(err))
-			goto tx_error;
-
-		tos = ip_tunnel_ecn_encap(key->tos, iip, skb);
+	if (geneve->collect_md) {
+		tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
 		ttl = key->ttl;
-		df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 	} else {
-		err = geneve_build_skb(rt, skb, 0, geneve->vni,
-				       0, NULL, flags, xnet);
-		if (unlikely(err))
-			goto tx_error;
-
-		tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, iip, skb);
-		ttl = geneve->ttl;
-		if (!ttl && IN_MULTICAST(ntohl(fl4.daddr)))
-			ttl = 1;
-		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
-		df = 0;
+		tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
+		ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
 	}
-	udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
-			    tos, ttl, df, sport, geneve->dst_port,
-			    !net_eq(geneve->net, dev_net(geneve->dev)),
-			    !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX));
+	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 
-	return NETDEV_TX_OK;
-
-tx_error:
-	dev_kfree_skb(skb);
+	tunnel_id_to_vni(key->tun_id, vni);
+	if (info->options_len)
+		opts = ip_tunnel_info_opts(info);
 
-	if (err == -ELOOP)
-		dev->stats.collisions++;
-	else if (err == -ENETUNREACH)
-		dev->stats.tx_carrier_errors++;
+	skb_reset_mac_header(skb);
+	err = geneve_build_skb(rt, skb, key->tun_flags, vni,
+			       info->options_len, opts, xnet);
+	if (unlikely(err))
+		return err;
 
-	dev->stats.tx_errors++;
-	return NETDEV_TX_OK;
+	udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
+			    tos, ttl, df, sport, geneve->info.key.tp_dst,
+			    !net_eq(geneve->net, dev_net(geneve->dev)),
+			    !(info->key.tun_flags & TUNNEL_CSUM));
+	return 0;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
-static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
-				    struct ip_tunnel_info *info)
+static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
+			    struct geneve_dev *geneve, struct ip_tunnel_info *info)
 {
-	struct geneve_dev *geneve = netdev_priv(dev);
+	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
+	struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
+	const struct ip_tunnel_key *key = &info->key;
 	struct dst_entry *dst = NULL;
-	const struct iphdr *iip; /* interior IP header */
-	struct geneve_sock *gs6;
 	int err = -EINVAL;
 	struct flowi6 fl6;
+	u8 *opts = NULL;
 	__u8 prio, ttl;
 	__be16 sport;
-	__be32 label;
-	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
-	u32 flags = geneve->flags;
+	u8 vni[3];
 
-	gs6 = rcu_dereference(geneve->sock6);
 	if (!gs6)
-		goto tx_error;
-
-	if (geneve->collect_md) {
-		if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
-			netdev_dbg(dev, "no tunnel metadata\n");
-			goto tx_error;
-		}
-	}
+		return err;
 
 	dst = geneve_get_v6_dst(skb, dev, &fl6, info);
-	if (IS_ERR(dst)) {
-		err = PTR_ERR(dst);
-		goto tx_error;
-	}
+	if (IS_ERR(dst))
+		return PTR_ERR(dst);
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
-	skb_reset_mac_header(skb);
-
-	iip = ip_hdr(skb);
+	if (geneve->collect_md) {
+		prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
+		ttl = key->ttl;
+	} else {
+		prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
+					   ip_hdr(skb), skb);
+		ttl = key->ttl ? : ip6_dst_hoplimit(dst);
+	}
+	tunnel_id_to_vni(key->tun_id, vni);
+	if (info->options_len)
+		opts = ip_tunnel_info_opts(info);
 
-	if (info) {
-		const struct ip_tunnel_key *key = &info->key;
-		u8 *opts = NULL;
-		u8 vni[3];
+	skb_reset_mac_header(skb);
+	err = geneve6_build_skb(dst, skb, key->tun_flags, vni,
+				info->options_len, opts, xnet);
+	if (unlikely(err))
+		return err;
 
-		tunnel_id_to_vni(key->tun_id, vni);
-		if (info->options_len)
-			opts = ip_tunnel_info_opts(info);
+	udp_tunnel6_xmit_skb(dst, gs6->sock->sk, skb, dev,
+			     &fl6.saddr, &fl6.daddr, prio, ttl,
+			     info->key.label, sport, geneve->info.key.tp_dst,
+			     !(info->key.tun_flags & TUNNEL_CSUM));
+	return 0;
+}
+#endif
 
-		if (key->tun_flags & TUNNEL_CSUM)
-			flags &= ~GENEVE_F_UDP_ZERO_CSUM6_TX;
-		else
-			flags |= GENEVE_F_UDP_ZERO_CSUM6_TX;
+static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct geneve_dev *geneve = netdev_priv(dev);
+	struct ip_tunnel_info *info = NULL;
+	int err;
 
-		err = geneve6_build_skb(dst, skb, key->tun_flags, vni,
-					info->options_len, opts,
-					flags, xnet);
-		if (unlikely(err))
+	if (geneve->collect_md) {
+		info = skb_tunnel_info(skb);
+		if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
+			netdev_dbg(dev, "no tunnel metadata\n");
 			goto tx_error;
-
-		prio = ip_tunnel_ecn_encap(key->tos, iip, skb);
-		ttl = key->ttl;
-		label = info->key.label;
+		}
 	} else {
-		err = geneve6_build_skb(dst, skb, 0, geneve->vni,
-					0, NULL, flags, xnet);
-		if (unlikely(err))
-			goto tx_error;
-
-		prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
-					   iip, skb);
-		ttl = geneve->ttl;
-		if (!ttl && ipv6_addr_is_multicast(&fl6.daddr))
-			ttl = 1;
-		ttl = ttl ? : ip6_dst_hoplimit(dst);
-		label = geneve->label;
+		info = &geneve->info;
 	}
 
-	udp_tunnel6_xmit_skb(dst, gs6->sock->sk, skb, dev,
-			     &fl6.saddr, &fl6.daddr, prio, ttl, label,
-			     sport, geneve->dst_port,
-			     !!(flags & GENEVE_F_UDP_ZERO_CSUM6_TX));
-	return NETDEV_TX_OK;
+#if IS_ENABLED(CONFIG_IPV6)
+	if (info->mode & IP_TUNNEL_INFO_IPV6)
+		err = geneve6_xmit_skb(skb, dev, geneve, info);
+	else
+#endif
+		err = geneve_xmit_skb(skb, dev, geneve, info);
 
+	if (likely(!err))
+		return NETDEV_TX_OK;
 tx_error:
 	dev_kfree_skb(skb);
 
@@ -1039,23 +945,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	dev->stats.tx_errors++;
 	return NETDEV_TX_OK;
 }
-#endif
-
-static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev)
-{
-	struct geneve_dev *geneve = netdev_priv(dev);
-	struct ip_tunnel_info *info = NULL;
-
-	if (geneve->collect_md)
-		info = skb_tunnel_info(skb);
-
-#if IS_ENABLED(CONFIG_IPV6)
-	if ((info && ip_tunnel_info_af(info) == AF_INET6) ||
-	    (!info && geneve->remote.sa.sa_family == AF_INET6))
-		return geneve6_xmit_skb(skb, dev, info);
-#endif
-	return geneve_xmit_skb(skb, dev, info);
-}
 
 static int geneve_change_mtu(struct net_device *dev, int new_mtu)
 {
@@ -1073,14 +962,11 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 {
 	struct ip_tunnel_info *info = skb_tunnel_info(skb);
 	struct geneve_dev *geneve = netdev_priv(dev);
-	struct rtable *rt;
-	struct flowi4 fl4;
-#if IS_ENABLED(CONFIG_IPV6)
-	struct dst_entry *dst;
-	struct flowi6 fl6;
-#endif
 
 	if (ip_tunnel_info_af(info) == AF_INET) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+
 		rt = geneve_get_v4_rt(skb, dev, &fl4, info);
 		if (IS_ERR(rt))
 			return PTR_ERR(rt);
@@ -1089,6 +975,9 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 		info->key.u.ipv4.src = fl4.saddr;
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (ip_tunnel_info_af(info) == AF_INET6) {
+		struct dst_entry *dst;
+		struct flowi6 fl6;
+
 		dst = geneve_get_v6_dst(skb, dev, &fl6, info);
 		if (IS_ERR(dst))
 			return PTR_ERR(dst);
@@ -1102,7 +991,7 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 
 	info->key.tp_src = udp_flow_src_port(geneve->net, skb,
 					     1, USHRT_MAX, true);
-	info->key.tp_dst = geneve->dst_port;
+	info->key.tp_dst = geneve->info.key.tp_dst;
 	return 0;
 }
 
@@ -1224,78 +1113,69 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[])
 }
 
 static struct geneve_dev *geneve_find_dev(struct geneve_net *gn,
-					  __be16 dst_port,
-					  union geneve_addr *remote,
-					  u8 vni[],
+					  const struct ip_tunnel_info *info,
 					  bool *tun_on_same_port,
 					  bool *tun_collect_md)
 {
-	struct geneve_dev *geneve, *t;
+	struct geneve_dev *geneve, *t = NULL;
 
 	*tun_on_same_port = false;
 	*tun_collect_md = false;
-	t = NULL;
 	list_for_each_entry(geneve, &gn->geneve_list, next) {
-		if (geneve->dst_port == dst_port) {
+		if (info->key.tp_dst == geneve->info.key.tp_dst) {
 			*tun_collect_md = geneve->collect_md;
 			*tun_on_same_port = true;
 		}
-		if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-		    !memcmp(remote, &geneve->remote, sizeof(geneve->remote)) &&
-		    dst_port == geneve->dst_port)
+		if (info->key.tun_id == geneve->info.key.tun_id &&
+		    info->key.tp_dst == geneve->info.key.tp_dst &&
+		    !memcmp(&info->key.u, &geneve->info.key.u, sizeof(info->key.u)))
 			t = geneve;
 	}
 	return t;
 }
 
+static bool is_all_zero(const u8 *fp, size_t size)
+{
+	int i;
+
+	for (i = 0; i < size; i++)
+		if (fp[i])
+			return false;
+	return true;
+}
+
+static bool is_tnl_info_zero(const struct ip_tunnel_info *info)
+{
+	if (info->key.tun_id || info->key.tun_flags || info->key.tos ||
+	    info->key.ttl || info->key.label || info->key.tp_src ||
+	    !is_all_zero((const u8 *)&info->key.u, sizeof(info->key.u)))
+		return false;
+	else
+		return true;
+}
+
 static int geneve_configure(struct net *net, struct net_device *dev,
-			    union geneve_addr *remote,
-			    __u32 vni, __u8 ttl, __u8 tos, __be32 label,
-			    __be16 dst_port, bool metadata, u32 flags)
+			    const struct ip_tunnel_info *info,
+			    bool metadata, bool ipv6_rx_csum)
 {
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_dev *t, *geneve = netdev_priv(dev);
 	bool tun_collect_md, tun_on_same_port;
 	int err, encap_len;
 
-	if (!remote)
-		return -EINVAL;
-	if (metadata &&
-	    (remote->sa.sa_family != AF_UNSPEC || vni || tos || ttl || label))
+	if (metadata && !is_tnl_info_zero(info))
 		return -EINVAL;
 
 	geneve->net = net;
 	geneve->dev = dev;
 
-	geneve->vni[0] = (vni & 0x00ff0000) >> 16;
-	geneve->vni[1] = (vni & 0x0000ff00) >> 8;
-	geneve->vni[2] =  vni & 0x000000ff;
-
-	if ((remote->sa.sa_family == AF_INET &&
-	     IN_MULTICAST(ntohl(remote->sin.sin_addr.s_addr))) ||
-	    (remote->sa.sa_family == AF_INET6 &&
-	     ipv6_addr_is_multicast(&remote->sin6.sin6_addr)))
-		return -EINVAL;
-	if (label && remote->sa.sa_family != AF_INET6)
-		return -EINVAL;
-
-	geneve->remote = *remote;
-
-	geneve->ttl = ttl;
-	geneve->tos = tos;
-	geneve->label = label;
-	geneve->dst_port = dst_port;
-	geneve->collect_md = metadata;
-	geneve->flags = flags;
-
-	t = geneve_find_dev(gn, dst_port, remote, geneve->vni,
-			    &tun_on_same_port, &tun_collect_md);
+	t = geneve_find_dev(gn, info, &tun_on_same_port, &tun_collect_md);
 	if (t)
 		return -EBUSY;
 
 	/* make enough headroom for basic scenario */
 	encap_len = GENEVE_BASE_HLEN + ETH_HLEN;
-	if (remote->sa.sa_family == AF_INET) {
+	if (ip_tunnel_info_af(info) == AF_INET) {
 		encap_len += sizeof(struct iphdr);
 		dev->max_mtu -= sizeof(struct iphdr);
 	} else {
@@ -1312,7 +1192,10 @@ static int geneve_configure(struct net *net, struct net_device *dev,
 			return -EPERM;
 	}
 
-	dst_cache_reset(&geneve->dst_cache);
+	dst_cache_reset(&geneve->info.dst_cache);
+	geneve->info = *info;
+	geneve->collect_md = metadata;
+	geneve->use_udp6_rx_checksums = ipv6_rx_csum;
 
 	err = register_netdevice(dev);
 	if (err)
@@ -1322,74 +1205,99 @@ static int geneve_configure(struct net *net, struct net_device *dev,
 	return 0;
 }
 
+static void init_tnl_info(struct ip_tunnel_info *info, __u16 dst_port)
+{
+	memset(info, 0, sizeof(*info));
+	info->key.tp_dst = htons(dst_port);
+}
+
 static int geneve_newlink(struct net *net, struct net_device *dev,
 			  struct nlattr *tb[], struct nlattr *data[])
 {
-	__be16 dst_port = htons(GENEVE_UDP_PORT);
-	__u8 ttl = 0, tos = 0;
+	bool use_udp6_rx_checksums = false;
+	struct ip_tunnel_info info;
 	bool metadata = false;
-	union geneve_addr remote = geneve_remote_unspec;
-	__be32 label = 0;
-	__u32 vni = 0;
-	u32 flags = 0;
+
+	init_tnl_info(&info, GENEVE_UDP_PORT);
 
 	if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6])
 		return -EINVAL;
 
 	if (data[IFLA_GENEVE_REMOTE]) {
-		remote.sa.sa_family = AF_INET;
-		remote.sin.sin_addr.s_addr =
+		info.key.u.ipv4.dst =
 			nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
+
+		if (IN_MULTICAST(ntohl(info.key.u.ipv4.dst))) {
+			netdev_dbg(dev, "multicast remote is unsupported\n");
+			return -EINVAL;
+		}
 	}
 
 	if (data[IFLA_GENEVE_REMOTE6]) {
-		if (!IS_ENABLED(CONFIG_IPV6))
-			return -EPFNOSUPPORT;
-
-		remote.sa.sa_family = AF_INET6;
-		remote.sin6.sin6_addr =
+ #if IS_ENABLED(CONFIG_IPV6)
+		info.mode = IP_TUNNEL_INFO_IPV6;
+		info.key.u.ipv6.dst =
 			nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]);
 
-		if (ipv6_addr_type(&remote.sin6.sin6_addr) &
+		if (ipv6_addr_type(&info.key.u.ipv6.dst) &
 		    IPV6_ADDR_LINKLOCAL) {
 			netdev_dbg(dev, "link-local remote is unsupported\n");
 			return -EINVAL;
 		}
+		if (ipv6_addr_is_multicast(&info.key.u.ipv6.dst)) {
+			netdev_dbg(dev, "multicast remote is unsupported\n");
+			return -EINVAL;
+		}
+		info.key.tun_flags |= TUNNEL_CSUM;
+		use_udp6_rx_checksums = true;
+#else
+		return -EPFNOSUPPORT;
+#endif
 	}
 
-	if (data[IFLA_GENEVE_ID])
+	if (data[IFLA_GENEVE_ID]) {
+		__u32 vni;
+		__u8 tvni[3];
+
 		vni = nla_get_u32(data[IFLA_GENEVE_ID]);
+		tvni[0] = (vni & 0x00ff0000) >> 16;
+		tvni[1] = (vni & 0x0000ff00) >> 8;
+		tvni[2] =  vni & 0x000000ff;
 
+		info.key.tun_id = vni_to_tunnel_id(tvni);
+	}
 	if (data[IFLA_GENEVE_TTL])
-		ttl = nla_get_u8(data[IFLA_GENEVE_TTL]);
+		info.key.ttl = nla_get_u8(data[IFLA_GENEVE_TTL]);
 
 	if (data[IFLA_GENEVE_TOS])
-		tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
+		info.key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
 
-	if (data[IFLA_GENEVE_LABEL])
-		label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
-			IPV6_FLOWLABEL_MASK;
+	if (data[IFLA_GENEVE_LABEL]) {
+		info.key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
+				  IPV6_FLOWLABEL_MASK;
+		if (info.key.label && (!(info.mode & IP_TUNNEL_INFO_IPV6)))
+			return -EINVAL;
+	}
 
 	if (data[IFLA_GENEVE_PORT])
-		dst_port = nla_get_be16(data[IFLA_GENEVE_PORT]);
+		info.key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
 
 	if (data[IFLA_GENEVE_COLLECT_METADATA])
 		metadata = true;
 
 	if (data[IFLA_GENEVE_UDP_CSUM] &&
 	    !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
-		flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
+		info.key.tun_flags |= TUNNEL_CSUM;
 
 	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
 	    nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
-		flags |= GENEVE_F_UDP_ZERO_CSUM6_TX;
+		info.key.tun_flags &= ~TUNNEL_CSUM;
 
 	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX] &&
 	    nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]))
-		flags |= GENEVE_F_UDP_ZERO_CSUM6_RX;
+		use_udp6_rx_checksums = false;
 
-	return geneve_configure(net, dev, &remote, vni, ttl, tos, label,
-				dst_port, metadata, flags);
+	return geneve_configure(net, dev, &info, metadata, use_udp6_rx_checksums);
 }
 
 static void geneve_dellink(struct net_device *dev, struct list_head *head)
@@ -1418,45 +1326,52 @@ static size_t geneve_get_size(const struct net_device *dev)
 static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
 {
 	struct geneve_dev *geneve = netdev_priv(dev);
+	struct ip_tunnel_info *info = &geneve->info;
+	__u8 tmp_vni[3];
 	__u32 vni;
 
-	vni = (geneve->vni[0] << 16) | (geneve->vni[1] << 8) | geneve->vni[2];
+	tunnel_id_to_vni(info->key.tun_id, tmp_vni);
+	vni = (tmp_vni[0] << 16) | (tmp_vni[1] << 8) | tmp_vni[2];
 	if (nla_put_u32(skb, IFLA_GENEVE_ID, vni))
 		goto nla_put_failure;
 
-	if (geneve->remote.sa.sa_family == AF_INET) {
+	if (ip_tunnel_info_af(info) == AF_INET) {
 		if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
-				    geneve->remote.sin.sin_addr.s_addr))
+				    info->key.u.ipv4.dst))
+			goto nla_put_failure;
+
+		if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
+			       !!(info->key.tun_flags & TUNNEL_CSUM)))
 			goto nla_put_failure;
+
 #if IS_ENABLED(CONFIG_IPV6)
 	} else {
 		if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
-				     &geneve->remote.sin6.sin6_addr))
+				     &info->key.u.ipv6.dst))
+			goto nla_put_failure;
+
+		if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
+			       !(info->key.tun_flags & TUNNEL_CSUM)))
+			goto nla_put_failure;
+
+		if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
+			       !geneve->use_udp6_rx_checksums))
 			goto nla_put_failure;
 #endif
 	}
 
-	if (nla_put_u8(skb, IFLA_GENEVE_TTL, geneve->ttl) ||
-	    nla_put_u8(skb, IFLA_GENEVE_TOS, geneve->tos) ||
-	    nla_put_be32(skb, IFLA_GENEVE_LABEL, geneve->label))
+	if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) ||
+	    nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) ||
+	    nla_put_be32(skb, IFLA_GENEVE_LABEL, info->key.label))
 		goto nla_put_failure;
 
-	if (nla_put_be16(skb, IFLA_GENEVE_PORT, geneve->dst_port))
+	if (nla_put_be16(skb, IFLA_GENEVE_PORT, info->key.tp_dst))
 		goto nla_put_failure;
 
 	if (geneve->collect_md) {
 		if (nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA))
 			goto nla_put_failure;
 	}
-
-	if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
-		       !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) ||
-	    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
-		       !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) ||
-	    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
-		       !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_RX)))
-		goto nla_put_failure;
-
 	return 0;
 
 nla_put_failure:
@@ -1480,6 +1395,7 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
 					u8 name_assign_type, u16 dst_port)
 {
 	struct nlattr *tb[IFLA_MAX + 1];
+	struct ip_tunnel_info info;
 	struct net_device *dev;
 	LIST_HEAD(list_kill);
 	int err;
@@ -1490,9 +1406,8 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
 	if (IS_ERR(dev))
 		return dev;
 
-	err = geneve_configure(net, dev, &geneve_remote_unspec,
-			       0, 0, 0, 0, htons(dst_port), true,
-			       GENEVE_F_UDP_ZERO_CSUM6_RX);
+	init_tnl_info(&info, dst_port);
+	err = geneve_configure(net, dev, &info, true, true);
 	if (err) {
 		free_netdev(dev);
 		return ERR_PTR(err);
@@ -1510,8 +1425,7 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
 		goto err;
 
 	return dev;
-
- err:
+err:
 	geneve_dellink(dev, &list_kill);
 	unregister_netdevice_many(&list_kill);
 	return ERR_PTR(err);
@@ -1594,7 +1508,6 @@ static int __init geneve_init_module(void)
 		goto out3;
 
 	return 0;
-
 out3:
 	unregister_netdevice_notifier(&geneve_notifier_block);
 out2:
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 2/4] geneve: Merge ipv4 and ipv6 geneve_build_skb()
From: Pravin B Shelar @ 2016-11-18  7:10 UTC (permalink / raw)
  To: netdev; +Cc: Pravin B Shelar
In-Reply-To: <1479453029-29619-1-git-send-email-pshelar@ovn.org>

There is minimal difference in building the Geneve header
between ipv4 and ipv6 geneve tunnels. The following patch
refactors the code to unify it.
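
As a rough illustration of the unification described above (a sketch only, not code from this patch): the headroom calculation is the same for both address families apart from the outer IP header length, so a single helper can take that length as a parameter. The names prefixed with sketch_/SKETCH_ are hypothetical, and the base length of 16 bytes assumes an 8-byte UDP header plus an 8-byte Geneve base header.

```c
#include <stddef.h>

/* Assumed sizes, for illustration only. */
#define SKETCH_GENEVE_BASE_HLEN 16 /* assumed: 8-byte UDP + 8-byte Geneve base header */
#define SKETCH_IPV4_HLEN 20        /* IPv4 header without options */
#define SKETCH_IPV6_HLEN 40        /* fixed IPv6 header */

/*
 * One headroom helper for both families: the only per-family input is
 * ip_hdr_len, which the caller supplies.
 */
static size_t sketch_min_headroom(size_t ll_reserved, size_t dst_header_len,
				  size_t options_len, size_t ip_hdr_len)
{
	return ll_reserved + dst_header_len + SKETCH_GENEVE_BASE_HLEN +
	       options_len + ip_hdr_len;
}
```

An IPv4 caller would pass SKETCH_IPV4_HLEN and an IPv6 caller SKETCH_IPV6_HLEN; everything else is shared.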

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
---
 drivers/net/geneve.c | 100 ++++++++++++++-------------------------------------
 1 file changed, 26 insertions(+), 74 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 622cf3b..9a4351c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -630,67 +630,34 @@ static int geneve_stop(struct net_device *dev)
 }
 
 static void geneve_build_header(struct genevehdr *geneveh,
-				__be16 tun_flags, u8 vni[3],
-				u8 options_len, u8 *options)
+				const struct ip_tunnel_info *info)
 {
 	geneveh->ver = GENEVE_VER;
-	geneveh->opt_len = options_len / 4;
-	geneveh->oam = !!(tun_flags & TUNNEL_OAM);
-	geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT);
+	geneveh->opt_len = info->options_len / 4;
+	geneveh->oam = !!(info->key.tun_flags & TUNNEL_OAM);
+	geneveh->critical = !!(info->key.tun_flags & TUNNEL_CRIT_OPT);
 	geneveh->rsvd1 = 0;
-	memcpy(geneveh->vni, vni, 3);
+	tunnel_id_to_vni(info->key.tun_id, geneveh->vni);
 	geneveh->proto_type = htons(ETH_P_TEB);
 	geneveh->rsvd2 = 0;
 
-	memcpy(geneveh->options, options, options_len);
+	ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
-static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
-			    __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-			    bool xnet)
-{
-	bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
-	struct genevehdr *gnvh;
-	int min_headroom;
-	int err;
-
-	skb_scrub_packet(skb, xnet);
-
-	min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
-			+ GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr);
-	err = skb_cow_head(skb, min_headroom);
-	if (unlikely(err))
-		goto free_rt;
-
-	err = udp_tunnel_handle_offloads(skb, udp_sum);
-	if (err)
-		goto free_rt;
-
-	gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-	geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
-	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
-	return 0;
-
-free_rt:
-	ip_rt_put(rt);
-	return err;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
-			     __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-			     bool xnet)
+static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
+			    const struct ip_tunnel_info *info,
+			    bool xnet, int ip_hdr_len)
 {
-	bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
+	bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM);
 	struct genevehdr *gnvh;
 	int min_headroom;
 	int err;
 
+	skb_reset_mac_header(skb);
 	skb_scrub_packet(skb, xnet);
 
-	min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
-			+ GENEVE_BASE_HLEN + opt_len + sizeof(struct ipv6hdr);
+	min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len +
+		       GENEVE_BASE_HLEN + info->options_len + ip_hdr_len;
 	err = skb_cow_head(skb, min_headroom);
 	if (unlikely(err))
 		goto free_dst;
@@ -699,9 +666,9 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
 	if (err)
 		goto free_dst;
 
-	gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-	geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
+	gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) +
+						   info->options_len);
+	geneve_build_header(gnvh, info);
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 	return 0;
 
@@ -709,12 +676,11 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
 	dst_release(dst);
 	return err;
 }
-#endif
 
 static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 				       struct net_device *dev,
 				       struct flowi4 *fl4,
-				       struct ip_tunnel_info *info)
+				       const struct ip_tunnel_info *info)
 {
 	bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
 	struct geneve_dev *geneve = netdev_priv(dev);
@@ -738,7 +704,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 	}
 	fl4->flowi4_tos = RT_TOS(tos);
 
-	dst_cache = &info->dst_cache;
+	dst_cache = (struct dst_cache *)&info->dst_cache;
 	if (use_cache) {
 		rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
 		if (rt)
@@ -763,7 +729,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 					   struct net_device *dev,
 					   struct flowi6 *fl6,
-					   struct ip_tunnel_info *info)
+					   const struct ip_tunnel_info *info)
 {
 	bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
 	struct geneve_dev *geneve = netdev_priv(dev);
@@ -789,7 +755,7 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 
 	fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
 					   info->key.label);
-	dst_cache = &info->dst_cache;
+	dst_cache = (struct dst_cache *)&info->dst_cache;
 	if (use_cache) {
 		dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
 		if (dst)
@@ -812,7 +778,8 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 #endif
 
 static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
-			   struct geneve_dev *geneve, struct ip_tunnel_info *info)
+			   struct geneve_dev *geneve,
+			   const struct ip_tunnel_info *info)
 {
 	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
 	struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
@@ -820,11 +787,9 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct rtable *rt;
 	int err = -EINVAL;
 	struct flowi4 fl4;
-	u8 *opts = NULL;
 	__u8 tos, ttl;
 	__be16 sport;
 	__be16 df;
-	u8 vni[3];
 
 	if (!gs4)
 		return err;
@@ -843,13 +808,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	}
 	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 
-	tunnel_id_to_vni(key->tun_id, vni);
-	if (info->options_len)
-		opts = ip_tunnel_info_opts(info);
-
-	skb_reset_mac_header(skb);
-	err = geneve_build_skb(rt, skb, key->tun_flags, vni,
-			       info->options_len, opts, xnet);
+	err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr));
 	if (unlikely(err))
 		return err;
 
@@ -862,7 +821,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
-			    struct geneve_dev *geneve, struct ip_tunnel_info *info)
+			    struct geneve_dev *geneve,
+			    const struct ip_tunnel_info *info)
 {
 	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
 	struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
@@ -870,10 +830,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct dst_entry *dst = NULL;
 	int err = -EINVAL;
 	struct flowi6 fl6;
-	u8 *opts = NULL;
 	__u8 prio, ttl;
 	__be16 sport;
-	u8 vni[3];
 
 	if (!gs6)
 		return err;
@@ -891,13 +849,7 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 					   ip_hdr(skb), skb);
 		ttl = key->ttl ? : ip6_dst_hoplimit(dst);
 	}
-	tunnel_id_to_vni(key->tun_id, vni);
-	if (info->options_len)
-		opts = ip_tunnel_info_opts(info);
-
-	skb_reset_mac_header(skb);
-	err = geneve6_build_skb(dst, skb, key->tun_flags, vni,
-				info->options_len, opts, xnet);
+	err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr));
 	if (unlikely(err))
 		return err;
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/4] geneve: Use LWT more effectively.
From: Pravin B Shelar @ 2016-11-18  7:10 UTC (permalink / raw)
  To: netdev; +Cc: Pravin B Shelar

The following patch series makes use of the geneve LWT code path for
geneve netdev-type devices.
This allows us to simplify the geneve module.

Pravin B Shelar (4):
  geneve: Unify LWT and netdev handling.
  geneve: Merge ipv4 and ipv6 geneve_build_skb()
  geneve: Remove redundant socket checks.
  geneve: Optimize geneve device lookup.

 drivers/net/geneve.c | 678 +++++++++++++++++++++------------------------------
 1 file changed, 273 insertions(+), 405 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* RE: [PATCH] net: fec: Detect and recover receive queue hangs
From: Andy Duan @ 2016-11-18  6:44 UTC (permalink / raw)
  To: Chris Lesiak
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jaccon Bastiaansen
In-Reply-To: <1479417282-15540-1-git-send-email-chris.lesiak@licor.com>

From: Chris Lesiak <chris.lesiak@licor.com> Sent: Friday, November 18, 2016 5:15 AM
 >To: Andy Duan <fugang.duan@nxp.com>
 >Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jaccon
 >Bastiaansen <jaccon.bastiaansen@gmail.com>; chris.lesiak@licor.com
 >Subject: [PATCH] net: fec: Detect and recover receive queue hangs
 >
 >This corrects a problem that appears to be similar to ERR006358.  But while
 >ERR006358 is a race when the tx queue transitions from empty to not empty,
 >this problem is a race when the rx queue transitions from full to not full.
 >
 >The symptom is a receive queue that is stuck.  The ENET_RDAR register will
 >read 0, indicating that there are no empty receive descriptors in the receive
 >ring.  Since no additional frames can be queued, no RXF interrupts occur.
 >
 >This problem can be triggered with a 1 Gb link and about 400 Mbps of traffic.
 >
 >This patch detects this condition, sets the work_rx bit, and reschedules the
 >poll method.
 >
 >Signed-off-by: Chris Lesiak <chris.lesiak@licor.com>
 >---
 > drivers/net/ethernet/freescale/fec_main.c | 31
 >+++++++++++++++++++++++++++++++
 > 1 file changed, 31 insertions(+)
 >
First, how can the issue be reproduced? Please list the steps to reproduce it. Thanks.
Second, please see the comments below.

 >diff --git a/drivers/net/ethernet/freescale/fec_main.c
 >b/drivers/net/ethernet/freescale/fec_main.c
 >index fea0f33..8a87037 100644
 >--- a/drivers/net/ethernet/freescale/fec_main.c
 >+++ b/drivers/net/ethernet/freescale/fec_main.c
 >@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
 > 	return ret;
 > }
 >
 >+static inline bool
 >+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id)
 >+{
 >+	int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
 >+
 >+	if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
If the rx ring really is empty under light throughput, RDAR is always cleared, so NAPI would always be rescheduled.

 >+		return false;
 >+
 >+	dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
 >+
 >+	fep->work_rx |= 1 << work_bit;
 >+
 >+	return true;
 >+}
 >+
 >+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
 >+{
 >+	unsigned int q;
 >+	bool ret = false;
 >+
 >+	for (q = 0; q < fep->num_rx_queues; q++) {
 >+		if (fec_enet_recover_rxq(fep, q))
 >+			ret = true;
 >+	}
 >+
 >+	return ret;
 >+}
 >+
 > static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 > {
 > 	struct net_device *ndev = napi->dev;
 >@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 > 	if (pkts < budget) {
 > 		napi_complete(napi);
 > 		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
 >+
 >+		if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
 >+			writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
 > 	}
 > 	return pkts;
 > }
 >--
 >2.5.5

^ permalink raw reply

* Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
From: Pravin Shelar @ 2016-11-18  5:30 UTC (permalink / raw)
  To: David Laight; +Cc: Jiri Benc, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0222355@AcuExch.aculab.com>

On Thu, Nov 17, 2016 at 2:17 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Of Jiri Benc
>> Sent: 15 November 2016 14:40
>> On Sun, 13 Nov 2016 20:43:55 -0800, Pravin B Shelar wrote:
>> > @@ -1929,8 +1951,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>> >     union vxlan_addr *src;
>> >     struct vxlan_metadata _md;
>> >     struct vxlan_metadata *md = &_md;
>> > -   struct dst_entry *ndst = NULL;
>> >     __be16 src_port = 0, dst_port;
>> > +   struct dst_entry *ndst = NULL;
>> >     __be32 vni, label;
>> >     __be16 df = 0;
>> >     __u8 tos, ttl;
>>
>> This looks kind of arbitrary. You might want to remove this hunk or
>> merge it to patch 3.
>
> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems.
>
OK. I will send out a patch.
But this is not a real issue in the vxlan module today ;)

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
From: Harini Katakam @ 2016-11-18  4:29 UTC (permalink / raw)
  To: Rafal Ozieblo
  Cc: Nicolas Ferre, harini.katakam@xilinx.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BN3PR07MB2516EBCDD8822FC5F8A9095FC9B10@BN3PR07MB2516.namprd07.prod.outlook.com>

Hi Rafal,

On Thu, Nov 17, 2016 at 7:05 PM, Rafal Ozieblo <rafalo@cadence.com> wrote:
> -----Original Message-----
> From: Nicolas Ferre [mailto:nicolas.ferre@atmel.com]
> Sent: 17 listopada 2016 14:29
> To: Harini Katakam; Rafal Ozieblo
> Cc: harini.katakam@xilinx.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
>
>> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
>> > Hi Rafal,
>> >
>> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo <rafalo@cadence.com> wrote:
>> >> Hello,
>> >> I think there could be a bug in your patch.
>> >>
>> >>> +
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +             dmacfg |= GEM_BIT(ADDR64);
>> >>> +#endif
>> >>
>> >> You enable 64 bit addressing (64b dma bus width) always when appropriate architecture config option is enabled.
>> >> But there are some legacy controllers which do not support that feature. According to the Cadence hardware team:
>> >> "64 bit addressing was added in July 2013. Earlier versions do not have it.
>> >> This feature was enhanced in release August 2014 to have separate upper address values for transmit and receive."
>> >>
>> >>> /* Bitfields in NSR */
>> >>> @@ -474,6 +479,10 @@
>> >>>  struct macb_dma_desc {
>> >>>       u32     addr;
>> >>>       u32     ctrl;
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +     u32     addrh;
>> >>> +     u32     resvd;
>> >>> +#endif
>> >>>  };
>> >>
>> >> It will not work for legacy hardware. Old descriptor is 2 words wide, the new one is 4 words wide.
>> >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but hardware doesn't
>> >> support it at all, you will miss every second descriptor.
>> >>
>> >
>> > True, this feature is not available in all of Cadence IP versions.
>> > In fact, the IP version Zynq does not support this. But the one in ZynqMP does.
>> > So, we enable kernel config for 64 bit DMA addressing for this SoC and
>> > hence the driver picks it up. My assumption was that if the legacy IP
>> > does not support
>> > 64 bit addressing, then this DMA option wouldn't be enabled.
>> >
>> > There is a design config register in Cadence IP which is being read to
>> > check for 64 bit address support - DMA mask is set based on that.
>> > But the addition of two descriptor words cannot be based on this runtime check.
>> > For this reason, all the static changes were placed under this check.
>>
>> We have quite a bunch of options in this driver to determine the real capabilities of the underlying hardware.
>> If the HW configuration registers are not appropriate, and it seems they are not, I would advise simply using the DT compatibility string.
>>
>> Best regards,
>> --
>> Nicolas Ferre
>
> HW configuration registers are appropriate. The issue is that this code doesn’t use the capability bit to switch between different dma descriptors (2 words vs. 4 words).
> DMA descriptor size is chosen based on kernel configuration, not based on hardware capabilities.

The HW configuration register does give appropriate information.
But the addition of two address words to the macb descriptor structure is
a static change.

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+       desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+       desc->addrh = (u32)(addr >> 32);
+#endif
+}

Even if the #ifdef condition here is changed to a HW config check, addr
and addrh are still different fields.
The "addrh" entry has to be present in the structure for the 64 bit
descriptor case to be handled separately.
Can you please tell me how you propose to change the DMA descriptor
structure from 4 to 2 or 2 to 4 words *after* reading the DCFG register?

Regards,
Harini
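
One possible direction for the question above, as a minimal sketch rather than a tested driver change: keep the two-word descriptor as the base type, describe the extra two words as a separate structure, and choose the distance between descriptors at probe time. All names here (desc_ring, desc_stride, desc_at, desc_set_addr, hw_64bit) are hypothetical, and hw_64bit is assumed to be filled in from the DCFG capability bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy two-word descriptor, laid out as in the quoted patch. */
struct dma_desc {
	uint32_t addr;
	uint32_t ctrl;
};

/* Extra two words that only exist on 64-bit capable hardware. */
struct dma_desc_64 {
	uint32_t addrh;
	uint32_t resvd;
};

/* Hypothetical ring state; hw_64bit would come from the capability bit. */
struct desc_ring {
	unsigned char *base;	/* start of descriptor memory */
	int hw_64bit;		/* nonzero if hardware uses four-word descriptors */
};

/* Distance in bytes between consecutive descriptors, decided at run time. */
static size_t desc_stride(const struct desc_ring *ring)
{
	size_t stride = sizeof(struct dma_desc);

	if (ring->hw_64bit)
		stride += sizeof(struct dma_desc_64);
	return stride;
}

/* Locate descriptor 'index' by byte offset instead of array indexing. */
static struct dma_desc *desc_at(const struct desc_ring *ring, unsigned int index)
{
	return (struct dma_desc *)(ring->base + (size_t)index * desc_stride(ring));
}

/* Store a DMA address; the upper word is written only when it exists. */
static void desc_set_addr(const struct desc_ring *ring, struct dma_desc *desc,
			  uint64_t addr)
{
	desc->addr = (uint32_t)addr;
	if (ring->hw_64bit) {
		struct dma_desc_64 *ext = (struct dma_desc_64 *)(desc + 1);

		ext->addrh = (uint32_t)(addr >> 32);
	}
}
```

With this layout the structure definitions stay static, and only the stride and the write to addrh depend on the run-time check, so legacy hardware still sees a two-word ring.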

^ permalink raw reply

* [iproute2] iproute2: fix the link group name getting error
From: Zhang Shengju @ 2016-11-18  1:12 UTC (permalink / raw)
  To: netdev

In the situation where more than one entry lives in the same hash bucket,
loop to get the correct one.

Before:
$ cat /etc/iproute2/group
0	default
256     test

$ sudo ip link set group test dummy1

$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 0 qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

After:
$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 lib/rt_names.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/rt_names.c b/lib/rt_names.c
index b665d3e..c66cb1e 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -559,8 +559,12 @@ const char *rtnl_group_n2a(int id, char *buf, int len)
 
 	for (i = 0; i < 256; i++) {
 		entry = rtnl_group_hash[i];
-		if (entry && entry->id == id)
-			return entry->name;
+
+		while (entry) {
+			if (entry->id == id)
+				return entry->name;
+			entry = entry->next;
+		}
 	}
 
 	snprintf(buf, len, "%d", id);
-- 
1.8.3.1
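
The fix above can be illustrated outside iproute2 with a small, self-contained sketch. The struct, the table size and the helper names are hypothetical stand-ins, and the bucket index is assumed to be id & 255, which would explain why ids 0 and 256 share a bucket in the Before output.

```c
#include <stddef.h>

#define TABLE_SIZE 256

/* Hypothetical stand-in for the entries kept in the group name table. */
struct name_entry {
	struct name_entry *next;	/* next entry in the same bucket */
	const char *name;
	unsigned int id;
};

/* Prepend an entry to its bucket; the index is assumed to be id & 255. */
static void table_insert(struct name_entry **table, struct name_entry *entry)
{
	unsigned int bucket = entry->id & (TABLE_SIZE - 1);

	entry->next = table[bucket];
	table[bucket] = entry;
}

/* Walk every chain in full, as the patched loop does; NULL if not found. */
static const char *lookup_name(struct name_entry **table, unsigned int id)
{
	const struct name_entry *entry;
	size_t i;

	for (i = 0; i < TABLE_SIZE; i++) {
		for (entry = table[i]; entry; entry = entry->next) {
			if (entry->id == id)
				return entry->name;
		}
	}
	return NULL;
}
```

If "default" (0) and "test" (256) are inserted in that order, both land in bucket 0 with "test" at the head. A check of the head entry alone resolves 256 but not 0, which is consistent with the "group 0" shown for dummy0 before the patch.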

^ permalink raw reply related

* Re: Netperf UDP issue with connected sockets
From: Rick Jones @ 2016-11-18  0:42 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <alpine.LFD.2.11.1611180145580.2135@ja.home.ssi.bg>

On 11/17/2016 04:37 PM, Julian Anastasov wrote:
> On Thu, 17 Nov 2016, Rick Jones wrote:
>
>> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
>> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>>
>> ...
>>
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
>> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
>> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
>> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
>> 16) = 0
>> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
>
> 	connected socket can benefit from dst cached in socket
> but not if SO_DONTROUTE is set. If we do not want to send packets
> via gateway this -l 1 should help but I don't see IP_TTL setsockopt
> in your first example with connect() to 127.0.0.1.
>
> 	Also, maybe there can be another default, if -l is used to
> specify TTL then SO_DONTROUTE should not be set. I.e. we should
> avoid SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.

I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took out a 
company's network (including security camera feeds to galactic HQ) but 
at this point I'm likely to keep it in there because there ended up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.

happy benchmarking,

rick jones
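
For readers who want to see the socket option in isolation, here is a minimal sketch of what the strace output shows for UDP_STREAM: a UDP socket with SO_DONTROUTE optionally set. The helper names are hypothetical and this is not netperf's own code; it only uses the standard sockets API.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket, optionally with SO_DONTROUTE set; -1 on failure. */
static int open_udp_socket(int dontroute)
{
	int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

	if (fd < 0)
		return -1;

	if (dontroute) {
		int on = 1;

		/* Restrict sends to directly connected destinations. */
		if (setsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &on, sizeof(on)) < 0) {
			close(fd);
			return -1;
		}
	}
	return fd;
}

/* Read the option back: 1 if set, 0 if clear, -1 on error. */
static int get_dontroute(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

Passing 0 for dontroute corresponds to what the test-specific -R option is described as doing for UDP_STREAM, namely leaving routing enabled.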

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Julian Anastasov @ 2016-11-18  0:37 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <d7d209a8-6142-2599-3303-19640d340b97@hpe.com>


	Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> 
> ...
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
> 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

	connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

	Also, maybe there can be another default, if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.

Regards

^ permalink raw reply

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Jarno Rajahalme @ 2016-11-18  0:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Paul E. McKenney, Cong Wang, Rolf Neugebauer,
	LKML, Linux Kernel Network Developers, Justin Cormack,
	Ian Campbell, Eric Dumazet
In-Reply-To: <1479164967.8455.87.camel@edumazet-glaptop3.roam.corp.google.com>


> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise by Eric Dumazet to add RCU grace period combining.
>>> 
>>> Eric, since people are hitting noticeable stalls because of the RCU grace
>>> period taking a long time, do you think you could look at this code path
>>> a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
> 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
> 		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
> 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
> 		napi_enable(&cell->napi);
> 	}
> 

This solved a ~20 second slowdown between OVS datapath unit tests for me.

  Jarno

^ permalink raw reply

* [Patch net v2] af_unix: conditionally use freezable blocking calls in read
From: Cong Wang @ 2016-11-17 23:55 UTC (permalink / raw)
  To: netdev
  Cc: dvyukov, Cong Wang, Tejun Heo, Colin Cross, Rafael J. Wysocki,
	Hannes Frederic Sowa

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converted schedule_timeout() to its freezable version. That was probably
correct at the time, but the later commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") broke
the strong requirement for a freezable sleep, as stated in
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use the freezable version only for the recvmsg call path, which avoids
any impact on Android.

Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/unix/af_unix.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..2358f26 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2199,7 +2199,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
  *	Sleep until more data has arrived. But check for races..
  */
 static long unix_stream_data_wait(struct sock *sk, long timeo,
-				  struct sk_buff *last, unsigned int last_len)
+				  struct sk_buff *last, unsigned int last_len,
+				  bool freezable)
 {
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
@@ -2220,7 +2221,10 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		timeo = freezable_schedule_timeout(timeo);
+		if (freezable)
+			timeo = freezable_schedule_timeout(timeo);
+		else
+			timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
@@ -2250,7 +2254,8 @@ struct unix_stream_read_state {
 	unsigned int splice_flags;
 };
 
-static int unix_stream_read_generic(struct unix_stream_read_state *state)
+static int unix_stream_read_generic(struct unix_stream_read_state *state,
+				    bool freezable)
 {
 	struct scm_cookie scm;
 	struct socket *sock = state->socket;
@@ -2330,7 +2335,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
 			mutex_unlock(&u->iolock);
 
 			timeo = unix_stream_data_wait(sk, timeo, last,
-						      last_len);
+						      last_len, freezable);
 
 			if (signal_pending(current)) {
 				err = sock_intr_errno(timeo);
@@ -2472,7 +2477,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
 		.flags = flags
 	};
 
-	return unix_stream_read_generic(&state);
+	return unix_stream_read_generic(&state, true);
 }
 
 static int unix_stream_splice_actor(struct sk_buff *skb,
@@ -2503,7 +2508,7 @@ static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
 	    flags & SPLICE_F_NONBLOCK)
 		state.flags = MSG_DONTWAIT;
 
-	return unix_stream_read_generic(&state);
+	return unix_stream_read_generic(&state, false);
 }
 
 static int unix_shutdown(struct socket *sock, int mode)
-- 
2.1.0
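
The shape of the change above, reduced to a userspace sketch: one shared read path takes a flag, and each entry point decides whether its sleep may be freezable. The wait functions here are hypothetical stand-ins that only record which variant was chosen; they do not sleep and are not the kernel primitives.

```c
#include <stdbool.h>

enum wait_kind { WAIT_NONE, WAIT_PLAIN, WAIT_FREEZABLE };

/* Records which wait variant the shared path picked last. */
static enum wait_kind last_wait = WAIT_NONE;

/* Stand-in for schedule_timeout(): safe while other locks are held. */
static long plain_wait(long timeo)
{
	last_wait = WAIT_PLAIN;
	return timeo;
}

/* Stand-in for freezable_schedule_timeout(): only valid with no locks held. */
static long freezable_wait(long timeo)
{
	last_wait = WAIT_FREEZABLE;
	return timeo;
}

/* Shared wait helper: the caller states whether freezing is allowed. */
static long data_wait(long timeo, bool freezable)
{
	return freezable ? freezable_wait(timeo) : plain_wait(timeo);
}

/* Shared read path, mirroring the extra parameter added by the patch. */
static long read_generic(long timeo, bool freezable)
{
	return data_wait(timeo, freezable);
}

/* recvmsg entry point: no extra lock is held, so it may freeze. */
static long stream_recvmsg(long timeo)
{
	return read_generic(timeo, true);
}

/* splice entry point: the pipe lock is held, so it must not freeze. */
static long stream_splice_read(long timeo)
{
	return read_generic(timeo, false);
}
```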

^ permalink raw reply related

* Re: Virtio_net support vxlan encapsulation package TSO offload discuss
From: Jarno Rajahalme @ 2016-11-17 23:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhangming (James, Euler), netdev@vger.kernel.org,
	Michael S. Tsirkin, Vlad Yasevic, Amnon Ilan
In-Reply-To: <0438366c-8b0e-270e-cbd1-334c1d655428@redhat.com>

I worked on the same issue a few months back. I rebased my proof-of-concept code to the current net-next and posted an RFC patch a moment ago.

I have zero experience with QEMU feature negotiation or extending the virtio_net spec. Since the virtio_net handling code is now all shared, this should work for macvtap as well, though I am not sure whether macvtap needs some control plane changes.

I posted a separate patch to make af_packet also use the shared infra for virtio_net handling yesterday. My RFC patch assumes that af_packet need not be touched, i.e., assumes the af_packet patch is applied, even though the patches apply to net-next in either order.

  Jarno

> On Nov 16, 2016, at 11:27 PM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 2016-11-17 09:31, Zhangming (James, Euler) wrote:
>> On 2016-11-15 11:28, Jason Wang wrote:
>>> On 2016-11-10 14:19, Zhangming (James, Euler) wrote:
>>>> On 2016-11-09 15:14, Jason Wang wrote:
>>>>> On 2016-11-08 19:58, Zhangming (James, Euler) wrote:
>>>>>> On 2016-11-08 19:17, Jason Wang wrote:
>>>>>> 
>>>>>>> On 2016-11-08 19:13, Jason Wang wrote:
>>>>>>>> Cc Michael
>>>>>>>> 
>>>>>>>> On 2016-11-08 16:34, Zhangming (James, Euler) wrote:
>>>>>>>>> In the container scenario, OVS is installed in the virtual machine,
>>>>>>>>> and all the containers connected to OVS communicate
>>>>>>>>> through VXLAN encapsulation.
>>>>>>>>> 
>>>>>>>>> Currently, virtio_net does not support TSO offload for VXLAN
>>>>>>>>> encapsulated packets. As a result, performance is
>>>>>>>>> poor and the sender is the bottleneck.
>>>>>>>>> 
>>>>>>>>> I googled this scenario, but I didn’t find any information. Will
>>>>>>>>> virtio_net support TSO offload for VXLAN encapsulated packets later?
>>>>>>>>> 
>>>>>>>> Yes and for both sender and receiver.
>>>>>>>> 
>>>>>>>>> My idea is for virtio_net to enable encapsulated TSO offload and
>>>>>>>>> pass the encapsulation info to TUN; TUN then parses the info and
>>>>>>>>> builds the skb with the encapsulation info.
>>>>>>>>> 
>>>>>>>>> OVS or the kernel on the host should be modified to support this.
>>>>>>>>> Using this method, TCP performance is more than 2x what it was before.
>>>>>>>>> 
>>>>>>>>> Any advice and suggestions for this idea or new idea will be
>>>>>>>>> greatly appreciated!
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>>      James zhang
>>>>>>>>> 
>>>>>>>> Sounds very good. And we may also need feature bits
>>>>>>>> (VIRTIO_NET_F_GUEST|HOST_GSO_X) for this.
>>>>>>>> 
>>>>>>>> This is in fact one of the items in the networking todo list. (See
>>>>>>>> http://www.linux-kvm.org/page/NetworkingTodo). While at it, we'd
>>>>>>>> better support not only VXLAN but also other tunnels.
>>>>>>> Cc Vlad who is working on extending virtio-net headers.
>>>>>>> 
>>>>>>>> We can start with the spec work, or if you've already had some
>>>>>>>> bits you can post them as RFC for early review.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>> Below is my demo code
>>>>>> Virtio_net.c
>>>>>> In static int virtnet_probe(struct virtio_device *vdev), add the code below:
>>>>>>           if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||				// avoid gso segment, it should be negotiation later, because in the demo I reuse num_buffers.
>>>>>>               virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>>>>>>                   dev->hw_enc_features |= NETIF_F_TSO;
>>>>>>                   dev->hw_enc_features |= NETIF_F_ALL_CSUM;
>>>>>>                   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
>>>>>>                   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   dev->hw_enc_features |= NETIF_F_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>                   dev->features |= NETIF_F_GSO_UDP_TUNNEL;
>>>>>>                   dev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   dev->features |= NETIF_F_GSO_TUNNEL_REMCSUM;
>>>>>>           }
>>>>>> 
>>>>>> In static int xmit_skb(struct send_queue *sq, struct sk_buff *skb), add
>>>>>> the two pieces of code below:
>>>>>> 
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL)
>>>>>>                           hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL;
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
>>>>>>                           hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL_CSUM;
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM)
>>>>>>                           hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>           if (skb->encapsulation && skb_is_gso(skb)) {
>>>>>>                   inner_mac_len = skb_inner_network_header(skb) - skb_inner_mac_header(skb);
>>>>>>                   tnl_len = skb_inner_mac_header(skb) - skb_mac_header(skb);
>>>>>>                   if ( !(inner_mac_len >> DATA_LEN_SHIFT) && !(tnl_len >> DATA_LEN_SHIFT) ) {
>>>>>>                           hdr->hdr.flags |= VIRTIO_NET_HDR_F_ENCAPSULATION;
>>>>>>                           hdr->num_buffers = (__virtio16)((inner_mac_len << DATA_LEN_SHIFT) | tnl_len);		//we reuse num_buffers for simple , we should add extend member for later.
>>>>>>                   }  else
>>>>>>                           hdr->num_buffers = 0;
>>>>>>           }
>>>>>> 
>>>>>> Tun.c
>>>>>>                   if (memcpy_fromiovecend((void *)&hdr, iv, offset, tun->vnet_hdr_sz))		//read header with negotiation length
>>>>>>                           return -EFAULT;
>>>>>> 
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL)					//set tunnel gso info
>>>>>>                           skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_CSUM)
>>>>>>                           skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM)
>>>>>>                           skb_shinfo(skb)->gso_type |= SKB_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>           if (hdr.flags & VIRTIO_NET_HDR_F_ENCAPSULATION) {						//read tunnel info from header and set to built skb.
>>>>>>                   tnl_len = tun16_to_cpu(tun, hdr.num_buffers) & TUN_TNL_LEN_MASK;
>>>>>>                   payload_mac_len = tun16_to_cpu(tun, hdr.num_buffers) >> TUN_DATA_LEN_SHIFT;
>>>>>>                   mac_len = skb_network_header(skb) - skb_mac_header(skb);
>>>>>>                   skb_set_inner_mac_header(skb, tnl_len - mac_len);
>>>>>>                   skb_set_inner_network_header(skb, tnl_len + payload_mac_len - mac_len);
>>>>>>                   skb->encapsulation = 1;
>>>>>>           }
>>>>>> 
>>>>>> 
>>>>> Something like this, and you probably need do something more:
>>>>> 
>>>>> - use net-next.git to generate the patch (for the latest code)
>>>>> - add feature negotiation
>>>>> - tun/macvtap/qemu patches for this, you can start with tun/macvtap
>>>>> patches
>>>>> - support for all other SKB_GSO_* types which are not yet supported
>>>>> - use a new field instead of num_buffers
>>>>> - a virtio spec patch to describe the support for encapsulation
>>>>> offload
>>>>> 
>>>>> Thanks
>>>> Thank you for your advice, I will start it right now.
>>>> 
>>>> Thanks
>>> Cool, one more question: while at it, I think you may want to add support for dpdk too?
>>> 
>>> Thanks
>> Do you mean that the patch should be compatible with virtio pmd, or give virtio pmd patch?
>> 
>> Thanks
> 
> I mean it's better to prepare patches for both virtio pmd and dpdk.
> 
> Thanks
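
The demo code quoted above squeezes two header lengths into the 16-bit num_buffers field. A minimal sketch of that packing follows, under the assumption that DATA_LEN_SHIFT is 8 (the thread never states its value); the macro and function names are hypothetical.

```c
#include <stdint.h>

#define LEN_SHIFT 8			/* assumed value of DATA_LEN_SHIFT */
#define LEN_MASK ((1u << LEN_SHIFT) - 1)

/*
 * Pack the inner MAC header length and the tunnel length into 16 bits.
 * Returns 0 on success, -1 if either value does not fit in LEN_SHIFT bits.
 */
static int pack_lens(unsigned int inner_mac_len, unsigned int tnl_len,
		     uint16_t *out)
{
	if ((inner_mac_len >> LEN_SHIFT) || (tnl_len >> LEN_SHIFT))
		return -1;

	*out = (uint16_t)((inner_mac_len << LEN_SHIFT) | tnl_len);
	return 0;
}

/* Reverse of pack_lens(), as the receiving side would do. */
static void unpack_lens(uint16_t packed, unsigned int *inner_mac_len,
			unsigned int *tnl_len)
{
	*tnl_len = packed & LEN_MASK;
	*inner_mac_len = packed >> LEN_SHIFT;
}
```

For VXLAN over IPv4 without options, the distance from the outer MAC header to the inner MAC header is 14 + 20 + 8 + 8 = 50 bytes, and an untagged inner Ethernet header is 14 bytes, so both values fit comfortably. The size limit is one reason the review asks for a new field instead of num_buffers.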

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Rick Jones @ 2016-11-17 23:08 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer; +Cc: netdev, Saeed Mahameed, Tariq Toukan
In-Reply-To: <1479419042.8455.280.camel@edumazet-glaptop3.roam.corp.google.com>

On 11/17/2016 01:44 PM, Eric Dumazet wrote:
> because netperf sends the same message
> over and over...

Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf 
-F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)    = 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, 
st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, 
st_size=230027, st_atime=2016/11/16-09:49:29, 
st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f3099f62000
read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = &intvl"..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n  LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, "      &dwBytes,  \n              "..., 4096) = 4096

...

rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
alarm(1)                                = 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n    -m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472

Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...
happy benchmarking,

rick jones
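
As a worked example of the default described above, here is a minimal sketch of the send ring sizing, assuming the division is integer division. The function names are hypothetical and this is not netperf's source.

```c
/* Default ring width: one more than SO_SNDBUF divided by the send size. */
static unsigned int default_send_width(unsigned int sndbuf,
				       unsigned int send_size)
{
	if (send_size == 0)
		return 0;	/* no meaningful width without a send size */

	return sndbuf / send_size + 1;
}

/* Total bytes of distinct data the ring can hold at that width. */
static unsigned long ring_bytes(unsigned int sndbuf, unsigned int send_size)
{
	return (unsigned long)default_send_width(sndbuf, send_size) * send_size;
}
```

With the values from the strace above (SO_SNDBUF of 212992 and -m 1472) this gives 145 buffers holding 213440 bytes in total, which is less than the 230027-byte file passed to -F, so a larger -W would be needed to cycle through the whole file.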

^ permalink raw reply

* [RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.
From: Jarno Rajahalme @ 2016-11-17 23:01 UTC (permalink / raw)
  To: netdev; +Cc: jarno, james.zhangming, mst, vyasevic, ailan

This patch is a proof-of-concept I did a few months ago for UDP tunnel
offload support in the virtio_net interface, rebased onto the current
net-next.

A real implementation needs to extend the virtio_net header rather than
piggy-backing on existing fields.  The inner MAC length (or inner network
offset) also needs to be passed as a new field.  The control plane (QEMU)
also needs to be updated.

All testing was done using Geneve, but this should work the same for all
UDP tunnels.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 drivers/net/tun.c               |  7 ++++-
 drivers/net/virtio_net.c        | 16 +++++++---
 include/linux/skbuff.h          |  5 ++++
 include/linux/virtio_net.h      | 66 ++++++++++++++++++++++++++++++-----------
 include/uapi/linux/virtio_net.h |  7 +++++
 5 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1588469..36f3219 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -198,7 +198,9 @@ struct tun_struct {
 	struct net_device	*dev;
 	netdev_features_t	set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
-			  NETIF_F_TSO6|NETIF_F_UFO)
+			   NETIF_F_TSO6|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL| \
+			   NETIF_F_GSO_UDP_TUNNEL_CSUM| \
+			   NETIF_F_GSO_TUNNEL_REMCSUM)
 
 	int			align;
 	int			vnet_hdr_sz;
@@ -1877,6 +1879,9 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
 
 		if (arg & TUN_F_UFO) {
 			features |= NETIF_F_UFO;
+#if 1
+			features |= NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
 			arg &= ~TUN_F_UFO;
 		}
 	}
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ca5239a..eb8d887 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1789,7 +1789,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
 			dev->hw_features |= NETIF_F_TSO | NETIF_F_UFO
-				| NETIF_F_TSO_ECN | NETIF_F_TSO6;
+				| NETIF_F_TSO_ECN | NETIF_F_TSO6
+				| NETIF_F_GSO_UDP_TUNNEL
+				| NETIF_F_GSO_UDP_TUNNEL_CSUM
+				| NETIF_F_GSO_TUNNEL_REMCSUM;
 		}
 		/* Individual feature bits: what can host handle? */
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4))
@@ -1798,13 +1801,18 @@ static int virtnet_probe(struct virtio_device *vdev)
 			dev->hw_features |= NETIF_F_TSO6;
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN))
 			dev->hw_features |= NETIF_F_TSO_ECN;
-		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO))
+		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) {
 			dev->hw_features |= NETIF_F_UFO;
-
+#if 1
+			dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
+			dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+			dev->hw_features |= NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
+		}
 		dev->features |= NETIF_F_GSO_ROBUST;
 
 		if (gso)
-			dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO);
+			dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM);
 		/* (!csum && gso) case will be fixed by register_netdev() */
 	}
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a4aeeca..992ad30 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2115,6 +2115,11 @@ static inline unsigned char *skb_inner_mac_header(const struct sk_buff *skb)
 	return skb->head + skb->inner_mac_header;
 }
 
+static inline int skb_inner_mac_offset(const struct sk_buff *skb)
+{
+	return skb_inner_mac_header(skb) - skb->data;
+}
+
 static inline void skb_reset_inner_mac_header(struct sk_buff *skb)
 {
 	skb->inner_mac_header = skb->data - skb->head;
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..17384d1 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -8,10 +8,19 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 					const struct virtio_net_hdr *hdr,
 					bool little_endian)
 {
-	unsigned short gso_type = 0;
+	u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+
+		if (!skb_partial_csum_set(skb, start, off))
+			return -EINVAL;
+	}
 
 	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		unsigned short gso_type = 0;
+
+		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_FLAGS) {
 		case VIRTIO_NET_HDR_GSO_TCPV4:
 			gso_type = SKB_GSO_TCPV4;
 			break;
@@ -27,23 +36,28 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 
 		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
 			gso_type |= SKB_GSO_TCP_ECN;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_UDP_TUNNEL)
+			gso_type |= SKB_GSO_UDP_TUNNEL;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM)
+			gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM) {
+			gso_type |= SKB_GSO_TUNNEL_REMCSUM;
+			skb->remcsum_offload = true;
+		}
+		if (gso_type & (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM)) {
+			u16 hdr_len = __virtio16_to_cpu(little_endian,
+							hdr->hdr_len);
+			skb->encapsulation = 1;
+			skb_set_inner_mac_header(skb, hdr_len);
+			skb_set_inner_network_header(skb, hdr_len + ETH_HLEN);
+			/* XXX: What if start is not set? */
+			skb_set_inner_transport_header(skb, start);
+		}
 
 		if (hdr->gso_size == 0)
 			return -EINVAL;
-	}
-
-	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
-		u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
-
-		if (!skb_partial_csum_set(skb, start, off))
-			return -EINVAL;
-	}
-
-	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
-
-		skb_shinfo(skb)->gso_size = gso_size;
+		skb_shinfo(skb)->gso_size = __virtio16_to_cpu(little_endian,
+							      hdr->gso_size);
 		skb_shinfo(skb)->gso_type = gso_type;
 
 		/* Header must be checked, and gso_segs computed. */
@@ -64,8 +78,8 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 		/* This is a hint as to how much should be linear. */
-		hdr->hdr_len = __cpu_to_virtio16(little_endian,
-						 skb_headlen(skb));
+		u16 hdr_len = skb_headlen(skb);
+
 		hdr->gso_size = __cpu_to_virtio16(little_endian,
 						  sinfo->gso_size);
 		if (sinfo->gso_type & SKB_GSO_TCPV4)
@@ -78,6 +92,22 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 			return -EINVAL;
 		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
 			hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+		if (sinfo->gso_type & SKB_GSO_UDP_TUNNEL)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_UDP_TUNNEL;
+		if (sinfo->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM;
+		if (sinfo->gso_type & SKB_GSO_TUNNEL_REMCSUM)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM;
+
+		if (sinfo->gso_type &
+		    (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM))
+			/* For encapsulated packets 'hdr_len' is the offset to
+			 * the beginning of the inner packet.  This way the
+			 * encapsulation can remain ignorant of the size of the
+			 * UDP tunnel header.
+			 */
+			hdr_len = skb_inner_mac_offset(skb);
+		hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len);
 	} else
 		hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..833950b 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -93,7 +93,14 @@ struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
 #define VIRTIO_NET_HDR_GSO_UDP		3	/* GSO frame, IPv4 UDP (UFO) */
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	/* GSO frame, IPv6 TCP */
+#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL	0x10	/* GSO frame, UDP tunnel */
+#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM 0x20	/* GSO frame, UDP tnl w CSUM */
+#define VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM 0x40	/* TUNNEL with TSO & REMCSUM */
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	/* TCP has ECN set */
+#define VIRTIO_NET_HDR_GSO_FLAGS (VIRTIO_NET_HDR_GSO_UDP_TUNNEL | \
+				  VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM | \
+				  VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM | \
+				  VIRTIO_NET_HDR_GSO_ECN)
 	__u8 gso_type;
 	__virtio16 hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
 	__virtio16 gso_size;	/* Bytes to append to hdr_len per frame */
-- 
2.1.4
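
The gso_type encoding in the patch above mixes a base type in the low bits with flag bits in the high nibble. The sketch below shows how the two are separated; the constant values are copied from the patch, while the two helper functions are hypothetical.

```c
#include <stdint.h>

/* Base GSO types (low bits), values as in the patch. */
#define VIRTIO_NET_HDR_GSO_NONE			0
#define VIRTIO_NET_HDR_GSO_TCPV4		1
#define VIRTIO_NET_HDR_GSO_UDP			3
#define VIRTIO_NET_HDR_GSO_TCPV6		4

/* Flag bits (high nibble), values as in the patch. */
#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL		0x10
#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM	0x20
#define VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM	0x40
#define VIRTIO_NET_HDR_GSO_ECN			0x80
#define VIRTIO_NET_HDR_GSO_FLAGS (VIRTIO_NET_HDR_GSO_UDP_TUNNEL | \
				  VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM | \
				  VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM | \
				  VIRTIO_NET_HDR_GSO_ECN)

/* Strip all flag bits, leaving the base type used by the switch statement. */
static uint8_t gso_base_type(uint8_t gso_type)
{
	return gso_type & (uint8_t)~VIRTIO_NET_HDR_GSO_FLAGS;
}

/* Keep only the flag bits (tunnel, tunnel csum, remcsum, ECN). */
static uint8_t gso_flag_bits(uint8_t gso_type)
{
	return gso_type & VIRTIO_NET_HDR_GSO_FLAGS;
}
```

A TCPv4 frame inside a UDP tunnel with ECN would carry 0x91: the base type is 1 and the flags are 0x90. Masking with only VIRTIO_NET_HDR_GSO_ECN, as the old code did, would leave the tunnel bit in the switch value and miss every case.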

^ permalink raw reply related

* Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Cong Wang @ 2016-11-17 22:47 UTC (permalink / raw)
  To: Colin Cross; +Cc: netdev, Dmitry Vyukov, Tejun Heo, Rafael J. Wysocki
In-Reply-To: <CAMbhsRSQ+eVid4OG_xVMDyuZGfOJ2gj6DvHakOH5vbu3ySHF3A@mail.gmail.com>

On Thu, Nov 17, 2016 at 2:30 PM, Colin Cross <ccross@android.com> wrote:
> On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
>> converts schedule_timeout() to its freezable version, it was probably
>> correct at that time, but later, commit 2b514574f7e8
>> ("net: af_unix: implement splice for stream af_unix sockets") breaks
>> the strong requirement for a freezable sleep, according to
>> commit 0f9548ca1091:
>>
>>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>>     deadlock if the lock is later acquired in the suspend or hibernate path
>>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>>     acquired by a process outside that group.
>>
>> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.
>
> On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
> revert of this patch will cause every one of those threads to wake up
> twice per suspend cycle, which can be multiple times a second.  How
> about adding a freezable flag to unix_stream_read_state so
> unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
> can be unfreezable?

Fair enough, I didn't know it could have such an impact.
I will send v2.

Thanks.


* Re: Netperf UDP issue with connected sockets
From: Alexander Duyck @ 2016-11-17 22:39 UTC (permalink / raw)
  To: David Laight
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Rick Jones,
	netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0222982@AcuExch.aculab.com>

On Thu, Nov 17, 2016 at 9:34 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>    Overhead  Command        Shared Object        Symbol
>>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>>    - _raw_spin_lock
>>       + 90.78% __dev_queue_xmit
>>       + 7.83% dev_queue_xmit
>>       + 1.30% ___slab_alloc
>>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
>>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
>>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
>>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spend on it, which corresponds to a cost of 215
>> cycles (2414*0.0892).  This cost is too high, thus something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
>         David
>

The problem isn't only the doorbell.  It is that a doorbell write
followed by a locked transaction on x86 results in a long wait until
the doorbell write has completed.
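
As a rough cross-check of the figures quoted above, the per-packet
arithmetic can be written out.  This is only a sketch: the helper names
are made up for illustration, and the inputs are the numbers from
Jesper's mail, not new measurements.

```c
#include <assert.h>

/* Hypothetical helpers, only for redoing the arithmetic in this thread. */

/* Nanoseconds spent per packet at a given packets-per-second rate. */
static double ns_per_packet(double pps)
{
	assert(pps > 0.0);
	return 1e9 / pps;
}

/* Cycles attributable to one symbol, given its share of the perf profile. */
static double cycles_for_share(double cycles_per_packet, double share)
{
	assert(share >= 0.0 && share <= 1.0);
	return cycles_per_packet * share;
}

/* Cycles left unexplained once the measured uncontended cost is subtracted. */
static double unexplained_cycles(double observed, double expected)
{
	return observed - expected;
}
```

With 1656017.55 pps this gives roughly 603.86 ns per packet, and
2414 * 0.0892 is about 215 cycles in _raw_spin_lock.  Subtracting the
32 cycles measured for an uncontended lock+unlock leaves about 183
cycles that the lock itself does not explain.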

You could batch a bunch of doorbell writes together and it wouldn't be
an issue.  But if you do something like writel(), wmb(), writel(),
wmb(), you will see the effect double, since the write memory barrier
is what forces the delays.
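
One way to picture the difference is a small userspace model that just
counts barriers.  Everything here is an assumption made for
illustration: fake_ring, ring_doorbell() and the counters are
hypothetical stand-ins, not the mlx5 driver or any kernel API, and the
"more" argument only imitates the idea behind xmit_more.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model only: fake_ring, ring_doorbell() and the counters are
 * hypothetical stand-ins, not the mlx5 or any other driver's API.
 */
struct fake_ring {
	unsigned int tail;       /* next descriptor slot to fill */
	unsigned int hw_tail;    /* last tail value made visible to "hardware" */
	unsigned int doorbells;  /* doorbell writes issued (writel() in a driver) */
	unsigned int barriers;   /* write barriers issued (wmb() in a driver) */
};

/* Stand-in for the writel() + wmb() pair described above. */
static void ring_doorbell(struct fake_ring *r)
{
	r->hw_tail = r->tail;
	r->doorbells++;
	r->barriers++;
}

/* Queue one descriptor; ring the doorbell only if no more packets follow. */
static void post_packet(struct fake_ring *r, int more)
{
	r->tail++;
	if (!more)
		ring_doorbell(r);
}

/* Send n packets, either one doorbell per packet or one per batch. */
static void send_burst(struct fake_ring *r, size_t n, int batch)
{
	assert(r != NULL);
	for (size_t i = 0; i < n; i++)
		post_packet(r, batch && i + 1 < n);
}
```

In this model a burst of 8 packets costs 8 barriers when sent one by
one and a single barrier when batched, while the tail seen by the
"hardware" ends up the same in both cases.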

- Alex


* Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Colin Cross @ 2016-11-17 22:30 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, dvyukov, Tejun Heo, Rafael J. Wysocki
In-Reply-To: <1479420564-15799-1-git-send-email-xiyou.wangcong@gmail.com>

On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably
> correct at that time, but later, commit 2b514574f7e8
> ("net: af_unix: implement splice for stream af_unix sockets") breaks
> the strong requirement for a freezable sleep, according to
> commit 0f9548ca1091:
>
>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>     deadlock if the lock is later acquired in the suspend or hibernate path
>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>     acquired by a process outside that group.
>
> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
revert of this patch will cause every one of those threads to wake up
twice per suspend cycle, which can be multiple times a second.  How
about adding a freezable flag to unix_stream_read_state so
unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
can be unfreezable?
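
Roughly what I have in mind, as a userspace model rather than a patch.
The struct only mirrors the idea of a flag in unix_stream_read_state;
the field name, the enum and the helpers are hypothetical, and the two
enum values merely stand in for schedule_timeout() and
freezable_schedule_timeout().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only. The struct mirrors the idea of adding a flag to
 * unix_stream_read_state; the field and helper names are hypothetical.
 */
enum sleep_kind {
	SLEEP_PLAIN,      /* stands in for schedule_timeout() */
	SLEEP_FREEZABLE,  /* stands in for freezable_schedule_timeout() */
};

struct read_state_model {
	bool freezable;   /* proposed flag: may this reader sleep freezably? */
};

/* Pick the sleep primitive the data-wait loop would use for this reader. */
static enum sleep_kind pick_sleep(const struct read_state_model *state)
{
	assert(state != NULL);
	return state->freezable ? SLEEP_FREEZABLE : SLEEP_PLAIN;
}

/* recvmsg path: no pipe_lock held, so a freezable sleep is acceptable. */
static struct read_state_model recvmsg_state(void)
{
	struct read_state_model state = { .freezable = true };
	return state;
}

/* splice_read path: pipe_lock is held, so the sleep must not be freezable. */
static struct read_state_model splice_read_state(void)
{
	struct read_state_model state = { .freezable = false };
	return state;
}
```

The point of keeping the decision in the per-call state is that the
shared wait loop does not need to know which entry point it was
reached from.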


* Re: net: BUG still has locks held in unix_stream_splice_read
From: Hannes Frederic Sowa @ 2016-11-17 22:27 UTC (permalink / raw)
  To: Cong Wang, Al Viro
  Cc: Dmitry Vyukov, David Miller, Eric Dumazet, netdev, LKML,
	syzkaller, Colin Cross, Mandeep Singh Baines
In-Reply-To: <CAM_iQpUEEDE+OcfX66YJDC6dA+b-URHhUtWtv+sn-t5Esk_FWw@mail.gmail.com>

On 17.11.2016 22:44, Cong Wang wrote:
> On Sun, Oct 9, 2016 at 8:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> E.g what will happen if some code does a read on AF_UNIX socket with
>> some local mutex held?  AFAICS, there are exactly two callers of
>> freezable_schedule_timeout() - this one and one in XFS; the latter is
>> in a kernel thread where we do have good warranties about the locking
>> environment, but here it's in the bleeding ->recvmsg/->splice_read and
>> for those assumption that caller doesn't hold any locks is pretty
>> strong, especially since it's not documented anywhere.
>>
>> What's going on there?
> 
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably correct
> at that time, but later, commit 2b514574f7e88c8498027ee366
> ("net: af_unix: implement splice for stream af_unix sockets") breaks its
> requirement for a freezable sleep:
> 
>     commit 0f9548ca10916dec166eaf74c816bded7d8e611d
> 
>     lockdep: check that no locks held at freeze time
> 
>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>     deadlock if the lock is later acquired in the suspend or hibernate path
>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>     acquired by a process outside that group.
> 
> So probably we just need to revert commit 2b15af6f95 now.
> 
> I am going to send a revert for at least -net and -stable, since Dmitry
> saw this warning again.

I am not an expert on freezing, but this looks about right judging from
the freezer code. Awesome, thanks a lot for spotting this one!


* Re: Debugging Ethernet issues
From: Måns Rullgård @ 2016-11-17 22:17 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sebastian Frias, Mason, Andrew Lunn, netdev, Sergei Shtylyov,
	Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
	Vince Bridgers, Balakumaran Kannan, David S. Miller,
	Kirill Kapranov
In-Reply-To: <8efc016a-3390-3bec-d74c-7c215c151b2f@gmail.com>

Florian Fainelli <f.fainelli@gmail.com> writes:

> On 11/14/2016 11:00 AM, Måns Rullgård wrote:
>> Florian Fainelli <f.fainelli@gmail.com> writes:
>> 
>>> On 11/14/2016 10:20 AM, Florian Fainelli wrote:
>>>> On 11/14/2016 09:59 AM, Sebastian Frias wrote:
>>>>> On 11/14/2016 06:32 PM, Florian Fainelli wrote:
>>>>>> On 11/14/2016 07:33 AM, Mason wrote:
>>>>>>> On 14/11/2016 15:58, Mason wrote:
>>>>>>>
>>>>>>>> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
>>>>>>>> vs
>>>>>>>> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
>>>>>>>>
>>>>>>>> I'm not sure whether "flow control" is relevant...
>>>>>>>
>>>>>>> Based on phy_print_status()
>>>>>>> phydev->pause ? "rx/tx" : "off"
>>>>>>> I added the following patch.
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> index defc22a15f67..4e758c1cfa4e 100644
>>>>>>> --- a/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> +++ b/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> @@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct net_device *dev)
>>>>>>>         struct phy_device *phydev = priv->phydev;
>>>>>>>         int change = 0;
>>>>>>>  
>>>>>>> +       printk("%s from %pf\n", __func__, __builtin_return_address(0));
>>>>>>> +
>>>>>>>         if (phydev->link) {
>>>>>>>                 if (phydev->speed != priv->speed) {
>>>>>>>                         priv->speed = phydev->speed;
>>>>>>> @@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev)
>>>>>>>         nb8800_writeb(priv, NB8800_PQ2, val & 0xff);
>>>>>>>  
>>>>>>>         /* Auto-negotiate by default */
>>>>>>> -       priv->pause_aneg = true;
>>>>>>> -       priv->pause_rx = true;
>>>>>>> -       priv->pause_tx = true;
>>>>>>> +       priv->pause_aneg = false;
>>>>>>> +       priv->pause_rx = false;
>>>>>>> +       priv->pause_tx = false;
>>>>>>>  
>>>>>>>         nb8800_mc_init(dev, 0);
>>>>>>>  
>>>>>>>
>> 
>> [...]
>> 
>>>>>> And the time difference is clearly accounted for auto-negotiation time
>>>>>> here, as you can see it takes about 3 seconds for Gigabit Ethernet to
>>>>>> auto-negotiate and that seems completely acceptable and normal to me
>>>>>> since it is a more involved process than lower speeds.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> OK, so now it works (by accident?) even on 100 Mbps switch, but it still
>>>>>>> prints "flow control rx/tx"...
>>>>>>
>>>>>> Because your link partner advertises flow control, and that's what
>>>>>> phydev->pause and phydev->asym_pause report (I know it's confusing, but
>>>>>> that's what it is at the moment).
>>>>>
>>>>> Thanks.
>>>>> Could you confirm that Mason's patch is correct and/or that it does not
>>>>> has negative side-effects?
>>>>
>>>> The patch is not correct nor incorrect per-se, it changes the default
>>>> policy of having pause frames advertised by default to not having them
>>>> advertised by default.
>> 
>> I was advised to advertise flow control by default back when I was
>> working on the driver, and I think it makes sense to do so.
>> 
>>>> This influences both your Ethernet MAC and the link partner in that
>>>> the result is either flow control is enabled (before) or it is not
>>>> (with the patch). There must be something amiss if you see packet
>>>> loss or some kind of problem like that with an early exchange such as
>>>> DHCP. Flow control tend to kick in under higher packet rates (at
>>>> least, that's what you expect).
>>>>
>>>>>
>>>>> Right now we know that Mason's patch makes this work, but we do not
>>>>> understand why nor its implications.
>>>>
>>>> You need to understand why, right now, the way this problem is
>>>> presented, you came up with a workaround, not with the root cause or the
>>>> solution. What does your link partner (switch?) reports, that is, what
>>>> is the ethtool output when you have a link up from  your nb8800 adapter?
>>>
>>> Actually, nb8800_pause_config() seems to be doing a complete MAC/DMA
>>> reconfiguration when pause frames get auto-negotiated while the link is
>>> UP,
>> 
>> This is due to a silly hardware limitation.  The register containing the
>> flow control bits can't be written while rx is enabled.
>
> You do a DMA stop, but you don't disable the MAC receiver unlike what
> nb8800_stop() does, why is not calling nb8800_mac_rx() necessary here?

Oh, right.  That's because the RXC_CR register (where the flow control
bits are) can't be modified when the RCR_EN bit (rx dma enable) is set.
The MAC core register controlled by nb8800_mac_rx() doesn't matter here.
There is no way of changing the flow control setting without briefly
stopping rx dma.
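
The constraint can be modelled in a few lines of userspace C.  This is
a sketch under stated assumptions: fake_mac and its helpers are
hypothetical and do not reproduce the nb8800 register layout or driver
functions; they only imitate "writes are ignored while rx dma is
enabled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model only: fake_mac and its helpers are hypothetical and do
 * not reproduce the nb8800 register layout.
 */
struct fake_mac {
	bool rx_dma_enabled;        /* models the RCR_EN bit */
	uint32_t rxc_cr;            /* models the register with the flow control bits */
	unsigned int rx_dma_stops;  /* how often rx dma had to be paused */
};

/* Models the hardware rule: writes are ignored while rx dma is enabled. */
static bool write_rxc_cr(struct fake_mac *mac, uint32_t val)
{
	if (mac->rx_dma_enabled)
		return false;
	mac->rxc_cr = val;
	return true;
}

/* Change the flow control bits, pausing rx dma only if it was running. */
static void set_flow_control(struct fake_mac *mac, uint32_t val)
{
	bool was_running = mac->rx_dma_enabled;
	bool ok;

	if (was_running) {
		mac->rx_dma_enabled = false;
		mac->rx_dma_stops++;
	}

	ok = write_rxc_cr(mac, val);
	assert(ok);
	(void)ok;	/* keeps NDEBUG builds free of unused-variable warnings */

	mac->rx_dma_enabled = was_running;
}
```

If the bits are written before rx dma is enabled for the first time,
no stop is needed at all, which is the case the model's idle path
covers.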

None of this should be relevant here though since everything should be
all set up before dma is enabled the first time.

>>> and it does not differentiate being called from
>>> ethtool::set_pauseparam or the PHYLIB adjust_link callback (which it
>>> probably should),
>> 
>> Differentiate how?
>
> Differentiate in that when you are called from adjust_link, why bother
> checking with netif_running() since you are only configuring the pause
> settings when phydev->link is set. Not that this matters much, but
> that's something the caller can tell you.

netif_running() can be true or false independently of the link state.

-- 
Måns Rullgård


* Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
From: Florian Fainelli @ 2016-11-17 22:16 UTC (permalink / raw)
  To: Woojung.Huh, davem, netdev; +Cc: andrew, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40966AA4@CHN-SV-EXMX02.mchp-main.com>

On 11/17/2016 02:10 PM, Woojung.Huh@microchip.com wrote:
> From: Woojung Huh <woojung.huh@microchip.com>
> 
> Relocate mdix code to phy driver to be called at config_init().
> 
> Signed-off-by: Woojung Huh <woojung.huh@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian


