Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net 0/4] ip_gre: a bunch of fixes for erspan
From: David Miller @ 2017-10-02  5:32 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, xeb, u9012063
In-Reply-To: <cover.1506866059.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Sun,  1 Oct 2017 22:00:52 +0800

> This patchset is to fix some issues that could cause 0 or low
> performance, and even unexpected truncated packets on erspan.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once
From: Michael Witten @ 2017-10-02  5:15 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, netdev, linux-kernel
In-Reply-To: <20171001175909.4b9e53d8@xeon-e3>

On Sun, 1 Oct 2017 17:59:09 -0700, Stephen Hemminger wrote:

> On Sun, 01 Oct 2017 22:19:20 -0000 Michael Witten wrote:
>
>> +	spin_lock_irqsave(&q->lock, flags);
>> +	skb = q->next;
>> +	__skb_queue_head_init(q);
>> +	spin_unlock_irqrestore(&q->lock, flags);
>
> Other code manipulating lists uses splice operation and
> a sk_buff_head temporary on the stack. That would be easier
> to understand.
>
> 	struct sk_buf_head head;
>
> 	__skb_queue_head_init(&head);
> 	spin_lock_irqsave(&q->lock, flags);
> 	skb_queue_splice_init(q, &head);
> 	spin_unlock_irqrestore(&q->lock, flags);
>
>
>> +	while (skb != head) {
>> +		next = skb->next;
>>  		kfree_skb(skb);
>> +		skb = next;
>> +	}
>
> It would be cleaner if you could use
> skb_queue_walk_safe rather than open coding the loop.
>
> 	skb_queue_walk_safe(&head, skb,  tmp)
> 		kfree_skb(skb);

I appreciate abstraction as much as anybody, but I do not believe
that such abstractions would actually be an improvement here.

* Splice-initing seems more like an idiom than an abstraction;
  at first blush, it wouldn't be clear to me what the intention
  is.

* Such abstractions are fairly unnecessary.

    * The function as written is already so short as to be
      easily digested.

    * More to the point, this function is not some generic,
      higher-level algorithm that just happens to employ the
      socket buffer interface; rather, it is a function that
      implements part of that very interface, and may thus
      twiddle the intimate bits of these data structures
      without being accused of abusing a leaky abstraction.

* Such abstractions add overhead, if only conceptually. In this
  case, a temporary socket buffer queue allocates *3* unnecessary
  struct members, including a whole `spinlock_t' member:
  
    prev
    qlen
    lock

  It's possible that the compiler will be smart enough to leave
  those out, but I have my suspicions that it won't, not only
  given that the interface contract requires that the temporary
  socket buffer queue be properly initialized before use, but
  also because splicing into the temporary will manipulate its
  `qlen'. Yet, why worry whether optimization happens? The whole
  issue can simply be avoided by exploiting the intimate details
  that are already philosophically available to us.

  Similarly, the function `skb_queue_walk_safe' is nice, but it
  loses value both because a temporary queue loses value (as just
  described), and because it ignores the fact that legitimate
  access to the internals of these data structures allows for
  setting up the requested loop in advance; that is to say, the
  two parts of the function that we are now debating can be woven
  together more tightly than `skb_queue_walk_safe' allows.

For these reasons, I stand by the way that the patch currently
implements this function; it does exactly what is desired, no more
or less.

Sincerely,
Michael Witten

^ permalink raw reply

* [PATCH net-next 3/3] bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports
From: Roopa Prabhu @ 2017-10-02  4:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, stephen, bridge
In-Reply-To: <1506919018-27875-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch avoids flooding and proxies ndisc packets
for BR_NEIGH_SUPPRESS ports.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/bridge/br_arp_nd_proxy.c | 246 +++++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_device.c       |  10 ++
 net/bridge/br_input.c        |  10 ++
 net/bridge/br_private.h      |   3 +
 4 files changed, 269 insertions(+)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 5153510..1028e5ab 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -211,3 +211,249 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
 		neigh_release(n);
 	}
 }
+
+#if IS_ENABLED(CONFIG_IPV6)
+struct nd_msg *br_is_nd_neigh_msg(struct sk_buff *skb, struct nd_msg *msg)
+{
+	struct nd_msg *m;
+
+	m = skb_header_pointer(skb, skb_network_offset(skb) +
+			       sizeof(struct ipv6hdr), sizeof(*msg), msg);
+	if (!m)
+		return NULL;
+
+	if (m->icmph.icmp6_code != 0 ||
+	    (m->icmph.icmp6_type != NDISC_NEIGHBOUR_SOLICITATION &&
+	     m->icmph.icmp6_type != NDISC_NEIGHBOUR_ADVERTISEMENT))
+		return NULL;
+
+	return m;
+}
+
+static void br_nd_send(struct net_bridge_port *p, struct sk_buff *request,
+		       struct neighbour *n, __be16 vlan_proto, u16 vlan_tci,
+		       struct nd_msg *ns)
+{
+	struct net_device *dev = request->dev;
+	struct sk_buff *reply;
+	struct nd_msg *na;
+	struct ipv6hdr *pip6;
+	u8 *daddr;
+	int na_olen = 8; /* opt hdr + ETH_ALEN for target */
+	int ns_olen;
+	int i, len;
+
+	if (!dev)
+		return;
+
+	len = LL_RESERVED_SPACE(dev) + sizeof(struct ipv6hdr) +
+		sizeof(*na) + na_olen + dev->needed_tailroom;
+
+	reply = alloc_skb(len, GFP_ATOMIC);
+	if (!reply)
+		return;
+
+	reply->protocol = htons(ETH_P_IPV6);
+	reply->dev = dev;
+	skb_reserve(reply, LL_RESERVED_SPACE(dev));
+	skb_push(reply, sizeof(struct ethhdr));
+	skb_set_mac_header(reply, 0);
+
+	daddr = eth_hdr(request)->h_source;
+
+	/* Do we need option processing ? */
+	ns_olen = request->len - (skb_network_offset(request) +
+				  sizeof(struct ipv6hdr)) - sizeof(*ns);
+	for (i = 0; i < ns_olen - 1; i += (ns->opt[i + 1] << 3)) {
+		if (ns->opt[i] == ND_OPT_SOURCE_LL_ADDR) {
+			daddr = ns->opt + i + sizeof(struct nd_opt_hdr);
+			break;
+		}
+	}
+
+	/* Ethernet header */
+	ether_addr_copy(eth_hdr(reply)->h_dest, daddr);
+	ether_addr_copy(eth_hdr(reply)->h_source, n->ha);
+	eth_hdr(reply)->h_proto = htons(ETH_P_IPV6);
+	reply->protocol = htons(ETH_P_IPV6);
+
+	skb_pull(reply, sizeof(struct ethhdr));
+	skb_set_network_header(reply, 0);
+	skb_put(reply, sizeof(struct ipv6hdr));
+
+	/* IPv6 header */
+	pip6 = ipv6_hdr(reply);
+	memset(pip6, 0, sizeof(struct ipv6hdr));
+	pip6->version = 6;
+	pip6->priority = ipv6_hdr(request)->priority;
+	pip6->nexthdr = IPPROTO_ICMPV6;
+	pip6->hop_limit = 255;
+	pip6->daddr = ipv6_hdr(request)->saddr;
+	pip6->saddr = *(struct in6_addr *)n->primary_key;
+
+	skb_pull(reply, sizeof(struct ipv6hdr));
+	skb_set_transport_header(reply, 0);
+
+	na = (struct nd_msg *)skb_put(reply, sizeof(*na) + na_olen);
+
+	/* Neighbor Advertisement */
+	memset(na, 0, sizeof(*na) + na_olen);
+	na->icmph.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT;
+	na->icmph.icmp6_router = 0; /* XXX: should be 1 ? */
+	na->icmph.icmp6_override = 1;
+	na->icmph.icmp6_solicited = 1;
+	na->target = ns->target;
+	ether_addr_copy(&na->opt[2], n->ha);
+	na->opt[0] = ND_OPT_TARGET_LL_ADDR;
+	na->opt[1] = na_olen >> 3;
+
+	na->icmph.icmp6_cksum = csum_ipv6_magic(&pip6->saddr,
+						&pip6->daddr,
+						sizeof(*na) + na_olen,
+						IPPROTO_ICMPV6,
+						csum_partial(na, sizeof(*na) + na_olen, 0));
+
+	pip6->payload_len = htons(sizeof(*na) + na_olen);
+
+	skb_push(reply, sizeof(struct ipv6hdr));
+	skb_push(reply, sizeof(struct ethhdr));
+
+	reply->ip_summed = CHECKSUM_UNNECESSARY;
+
+	if (p) {
+		struct net_bridge_vlan_group *vg;
+		u16 pvid;
+
+		vg = nbp_vlan_group_rcu(p);
+		pvid = br_get_pvid(vg);
+		if (pvid && pvid == vlan_tci)
+			vlan_tci = 0;
+	}
+
+	if (vlan_tci != 0) {
+		reply = vlan_insert_tag_set_proto(reply, vlan_proto, vlan_tci);
+		if (!reply) {
+			net_err_ratelimited("evpn: failed to insert VLAN tag\n");
+			return;
+		}
+	}
+
+	netdev_dbg(dev, "nd send dev %s dst %pI6 dst_hw %pM src %pI6 src_hw %pM\n",
+		   dev->name, &pip6->daddr, daddr, &pip6->saddr, n->ha);
+
+	dev_queue_xmit(reply);
+}
+
+static int br_chk_addr_ip6(struct net_device *dev, void *data)
+{
+	struct in6_addr *addr = (struct in6_addr *)data;
+
+	if (ipv6_chk_addr(dev_net(dev), addr, dev, 0))
+		return 1;
+
+	return 0;
+}
+
+static bool br_is_local_ip6(struct net_device *dev, struct in6_addr *addr)
+
+{
+	if (br_chk_addr_ip6(dev, addr))
+		return true;
+
+	/* check if ip is configured on upper dev */
+	if (netdev_walk_all_upper_dev_rcu(dev, br_chk_addr_ip6, addr))
+		return true;
+
+	return false;
+}
+
+void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
+		       u16 vid, struct net_bridge_port *p, struct nd_msg *msg)
+{
+	struct net_device *dev = br->dev;
+	struct net_device *vlandev = NULL;
+	struct in6_addr *saddr, *daddr;
+	struct ipv6hdr *iphdr;
+	struct inet6_dev *in6_dev;
+	struct neighbour *n;
+
+	BR_INPUT_SKB_CB(skb)->proxyarp_replied = false;
+
+	if (p && (p->flags & BR_NEIGH_SUPPRESS))
+		return;
+
+	if (msg->icmph.icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT &&
+	    !msg->icmph.icmp6_solicited) {
+		/* prevent flooding to neigh suppress ports */
+		BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		return;
+	}
+
+	if (msg->icmph.icmp6_type != NDISC_NEIGHBOUR_SOLICITATION)
+		return;
+
+	in6_dev = __in6_dev_get(dev);
+	if (!in6_dev)
+		return;
+
+	iphdr = ipv6_hdr(skb);
+	saddr = &iphdr->saddr;
+	daddr = &iphdr->daddr;
+
+	if (ipv6_addr_any(saddr) || !ipv6_addr_cmp(saddr, daddr)) {
+		/* prevent flooding to neigh suppress ports */
+		BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		return;
+	}
+
+	if (vid != 0) {
+		/* build neigh table lookup on the vlan device */
+		vlandev = __vlan_find_dev_deep_rcu(br->dev, skb->vlan_proto,
+						   vid);
+		if (!vlandev)
+			return;
+	} else {
+		vlandev = dev;
+	}
+
+	if (br_is_local_ip6(vlandev, &msg->target)) {
+		/* its our own ip, so don't proxy reply
+		 * and don't forward to arp suppress ports
+		 */
+		BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		return;
+	}
+
+	n = neigh_lookup(ipv6_stub->nd_tbl, &msg->target, vlandev);
+	if (n) {
+		struct net_bridge_fdb_entry *f;
+
+		if (!(n->nud_state & NUD_VALID)) {
+			neigh_release(n);
+			return;
+		}
+
+		f = br_fdb_find_rcu(br, n->ha, vid);
+		if (f) {
+			bool replied = false;
+
+			if (f->dst && (f->dst->flags & BR_NEIGH_SUPPRESS)) {
+				if (vid != 0)
+					br_nd_send(p, skb, n, skb->vlan_proto,
+						   skb_vlan_tag_get(skb), msg);
+				else
+					br_nd_send(p, skb, n, 0, 0, msg);
+				replied = true;
+			}
+
+			/* If we have replied or as long as we know the
+			 * mac, indicate to NEIGH_SUPPRESS ports that we
+			 * have replied
+			 */
+			if (replied || br->neigh_suppress_enabled)
+				BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		}
+		neigh_release(n);
+	}
+}
+#endif
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8961c25..89c5d01 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -68,6 +68,16 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	    (ntohs(eth->h_proto) == ETH_P_ARP ||
 	     ntohs(eth->h_proto) == ETH_P_RARP)) {
 		br_do_proxy_suppress_arp(skb, br, vid, NULL);
+	} else if (IS_ENABLED(CONFIG_IPV6) && br->neigh_suppress_enabled &&
+		   skb->protocol == htons(ETH_P_IPV6) &&
+		   pskb_may_pull(skb, sizeof(struct ipv6hdr) +
+				 sizeof(struct nd_msg)) &&
+		   ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
+			struct nd_msg *msg, _msg;
+
+			msg = br_is_nd_neigh_msg(skb, &_msg);
+			if (msg)
+				br_do_suppress_nd(skb, br, vid, NULL, msg);
 	}
 
 	dest = eth_hdr(skb)->h_dest;
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637a23..491e4dd 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -119,6 +119,16 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 	    (skb->protocol == htons(ETH_P_ARP) ||
 	     skb->protocol == htons(ETH_P_RARP))) {
 		br_do_proxy_suppress_arp(skb, br, vid, p);
+	} else if (IS_ENABLED(CONFIG_IPV6) && br->neigh_suppress_enabled &&
+		   skb->protocol == htons(ETH_P_IPV6) &&
+		   pskb_may_pull(skb, sizeof(struct ipv6hdr) +
+				 sizeof(struct nd_msg)) &&
+		   ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
+			struct nd_msg *msg, _msg;
+
+			msg = br_is_nd_neigh_msg(skb, &_msg);
+			if (msg)
+				br_do_suppress_nd(skb, br, vid, p, msg);
 	}
 
 	switch (pkt_type) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3006f0d..ff36891 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -1135,4 +1135,7 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 void br_recalculate_neigh_suppress_enabled(struct net_bridge *br);
 void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
 			      u16 vid, struct net_bridge_port *p);
+void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
+		       u16 vid, struct net_bridge_port *p, struct nd_msg *msg);
+struct nd_msg *br_is_nd_neigh_msg(struct sk_buff *skb, struct nd_msg *m);
 #endif
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next 2/3] bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports
From: Roopa Prabhu @ 2017-10-02  4:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, stephen, bridge
In-Reply-To: <1506919018-27875-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch avoids flooding and proxies arp packets
for BR_NEIGH_SUPPRESS ports.

Moves existing br_do_proxy_arp to br_do_proxy_suppress_arp
to support both proxy arp and neigh suppress.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/bridge/br_arp_nd_proxy.c | 186 +++++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_device.c       |   8 ++
 net/bridge/br_input.c        |  63 ++-------------
 net/bridge/br_private.h      |   3 +
 4 files changed, 202 insertions(+), 58 deletions(-)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index e191bb5..5153510 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -11,6 +11,13 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/neighbour.h>
+#include <net/arp.h>
+#include <linux/if_vlan.h>
+#include <linux/inetdevice.h>
+#include <net/addrconf.h>
 #include "br_private.h"
 
 void br_recalculate_neigh_suppress_enabled(struct net_bridge *br)
@@ -25,3 +32,182 @@ void br_recalculate_neigh_suppress_enabled(struct net_bridge *br)
 		}
 	}
 }
+
+static void br_arp_send(struct net_bridge_port *p, int type, int ptype,
+			__be32 dest_ip, struct net_device *dev,
+			__be32 src_ip, const unsigned char *dest_hw,
+			const unsigned char *src_hw,
+			const unsigned char *target_hw,
+			__be16 vlan_proto, u16 vlan_tci)
+{
+	struct sk_buff *skb;
+
+	netdev_dbg(dev, "arp send dev %s dst %pI4 dst_hw %pM src %pI4 src_hw %pM\n",
+		   dev->name, &dest_ip, dest_hw, &src_ip, src_hw);
+
+	if (!vlan_tci) {
+		arp_send(type, ptype, dest_ip, dev, src_ip,
+			 dest_hw, src_hw, target_hw);
+		return;
+	}
+
+	skb = arp_create(type, ptype, dest_ip, dev, src_ip,
+			 dest_hw, src_hw, target_hw);
+	if (!skb)
+		return;
+
+	if (p) {
+		struct net_bridge_vlan_group *vg;
+		u16 pvid;
+
+		vg = nbp_vlan_group_rcu(p);
+		pvid = br_get_pvid(vg);
+		if (pvid && pvid == vlan_tci)
+			vlan_tci = 0;
+	}
+
+	if (vlan_tci) {
+		skb = vlan_insert_tag_set_proto(skb, vlan_proto,
+						vlan_tci);
+		if (!skb) {
+			net_err_ratelimited("%s: failed to insert VLAN tag\n",
+					    __func__);
+			return;
+		}
+	}
+
+	arp_xmit(skb);
+}
+
+static int br_chk_addr_ip(struct net_device *dev, void *data)
+{
+	__be32 ip = *(__be32 *)data;
+	struct in_device *in_dev;
+	__be32 addr = 0;
+
+	in_dev = __in_dev_get_rcu(dev);
+	if (in_dev)
+		addr = inet_confirm_addr(dev_net(dev), in_dev, 0, ip,
+					 RT_SCOPE_HOST);
+
+	if (addr == ip)
+		return 1;
+
+	return 0;
+}
+
+static bool br_is_local_ip(struct net_device *dev, __be32 ip)
+{
+	if (br_chk_addr_ip(dev, &ip))
+		return true;
+
+	/* check if ip is configured on upper dev */
+	if (netdev_walk_all_upper_dev_rcu(dev, br_chk_addr_ip, &ip))
+		return true;
+
+	return false;
+}
+
+void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
+			      u16 vid, struct net_bridge_port *p)
+{
+	struct net_device *dev = br->dev;
+	struct net_device *vlandev = NULL;
+	struct neighbour *n;
+	struct arphdr *parp;
+	u8 *arpptr, *sha;
+	__be32 sip, tip;
+
+	BR_INPUT_SKB_CB(skb)->proxyarp_replied = false;
+
+	if ((dev->flags & IFF_NOARP) ||
+	    !pskb_may_pull(skb, arp_hdr_len(dev)))
+		return;
+
+	parp = arp_hdr(skb);
+
+	if (parp->ar_pro != htons(ETH_P_IP) ||
+	    parp->ar_hln != dev->addr_len ||
+	    parp->ar_pln != 4)
+		return;
+
+	arpptr = (u8 *)parp + sizeof(struct arphdr);
+	sha = arpptr;
+	arpptr += dev->addr_len;	/* sha */
+	memcpy(&sip, arpptr, sizeof(sip));
+	arpptr += sizeof(sip);
+	arpptr += dev->addr_len;	/* tha */
+	memcpy(&tip, arpptr, sizeof(tip));
+
+	if (ipv4_is_loopback(tip) ||
+	    ipv4_is_multicast(tip))
+		return;
+
+	if (br->neigh_suppress_enabled) {
+		if (p && (p->flags & BR_NEIGH_SUPPRESS))
+			return;
+		if (ipv4_is_zeronet(sip) || sip == tip) {
+			/* prevent flooding to neigh suppress ports */
+			BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+			return;
+		}
+	}
+
+	if (parp->ar_op != htons(ARPOP_REQUEST))
+		return;
+
+	if (vid != 0) {
+		vlandev = __vlan_find_dev_deep_rcu(br->dev, skb->vlan_proto,
+						   vid);
+		if (!vlandev)
+			return;
+	} else {
+		vlandev = dev;
+	}
+
+	if (br->neigh_suppress_enabled && br_is_local_ip(vlandev, tip)) {
+		/* its our local ip, so don't proxy reply
+		 * and don't forward to neigh suppress ports
+		 */
+		BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		return;
+	}
+
+	n = neigh_lookup(&arp_tbl, &tip, vlandev);
+	if (n) {
+		struct net_bridge_fdb_entry *f;
+
+		if (!(n->nud_state & NUD_VALID)) {
+			neigh_release(n);
+			return;
+		}
+
+		f = br_fdb_find_rcu(br, n->ha, vid);
+		if (f) {
+			bool replied = false;
+
+			if (f->dst && ((p->flags & BR_PROXYARP) ||
+				       (f->dst->flags & BR_PROXYARP_WIFI) ||
+				       (f->dst->flags & BR_NEIGH_SUPPRESS))) {
+				if (!vid)
+					br_arp_send(p, ARPOP_REPLY, ETH_P_ARP,
+						    sip, skb->dev, tip, sha,
+						    n->ha, sha, 0, 0);
+				else
+					br_arp_send(p, ARPOP_REPLY, ETH_P_ARP,
+						    sip, skb->dev, tip, sha,
+						    n->ha, sha, skb->vlan_proto,
+						    skb_vlan_tag_get(skb));
+				replied = true;
+			}
+
+			/* If we have replied or as long as we know the
+			 * mac, indicate to arp replied
+			 */
+			if (replied || br->neigh_suppress_enabled)
+				BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
+		}
+
+		neigh_release(n);
+	}
+}
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index f6b6a92..8961c25 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct pcpu_sw_netstats *brstats = this_cpu_ptr(br->stats);
 	const struct nf_br_ops *nf_ops;
 	const unsigned char *dest;
+	struct ethhdr *eth;
 	u16 vid = 0;
 
 	rcu_read_lock();
@@ -57,11 +58,18 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	BR_INPUT_SKB_CB(skb)->brdev = dev;
 
 	skb_reset_mac_header(skb);
+	eth = eth_hdr(skb);
 	skb_pull(skb, ETH_HLEN);
 
 	if (!br_allowed_ingress(br, br_vlan_group_rcu(br), skb, &vid))
 		goto out;
 
+	if (IS_ENABLED(CONFIG_INET) && br->neigh_suppress_enabled &&
+	    (ntohs(eth->h_proto) == ETH_P_ARP ||
+	     ntohs(eth->h_proto) == ETH_P_RARP)) {
+		br_do_proxy_suppress_arp(skb, br, vid, NULL);
+	}
+
 	dest = eth_hdr(skb)->h_dest;
 	if (is_broadcast_ether_addr(dest)) {
 		br_flood(br, skb, BR_PKT_BROADCAST, false, true);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637f58..7637a23 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -71,62 +71,6 @@ static int br_pass_frame_up(struct sk_buff *skb)
 		       br_netif_receive_skb);
 }
 
-static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
-			    u16 vid, struct net_bridge_port *p)
-{
-	struct net_device *dev = br->dev;
-	struct neighbour *n;
-	struct arphdr *parp;
-	u8 *arpptr, *sha;
-	__be32 sip, tip;
-
-	BR_INPUT_SKB_CB(skb)->proxyarp_replied = false;
-
-	if ((dev->flags & IFF_NOARP) ||
-	    !pskb_may_pull(skb, arp_hdr_len(dev)))
-		return;
-
-	parp = arp_hdr(skb);
-
-	if (parp->ar_pro != htons(ETH_P_IP) ||
-	    parp->ar_op != htons(ARPOP_REQUEST) ||
-	    parp->ar_hln != dev->addr_len ||
-	    parp->ar_pln != 4)
-		return;
-
-	arpptr = (u8 *)parp + sizeof(struct arphdr);
-	sha = arpptr;
-	arpptr += dev->addr_len;	/* sha */
-	memcpy(&sip, arpptr, sizeof(sip));
-	arpptr += sizeof(sip);
-	arpptr += dev->addr_len;	/* tha */
-	memcpy(&tip, arpptr, sizeof(tip));
-
-	if (ipv4_is_loopback(tip) ||
-	    ipv4_is_multicast(tip))
-		return;
-
-	n = neigh_lookup(&arp_tbl, &tip, dev);
-	if (n) {
-		struct net_bridge_fdb_entry *f;
-
-		if (!(n->nud_state & NUD_VALID)) {
-			neigh_release(n);
-			return;
-		}
-
-		f = br_fdb_find_rcu(br, n->ha, vid);
-		if (f && ((p->flags & BR_PROXYARP) ||
-			  (f->dst && (f->dst->flags & BR_PROXYARP_WIFI)))) {
-			arp_send(ARPOP_REPLY, ETH_P_ARP, sip, skb->dev, tip,
-				 sha, n->ha, sha);
-			BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
-		}
-
-		neigh_release(n);
-	}
-}
-
 /* note: already called with rcu_read_lock */
 int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
@@ -171,8 +115,11 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 
 	BR_INPUT_SKB_CB(skb)->brdev = br->dev;
 
-	if (IS_ENABLED(CONFIG_INET) && skb->protocol == htons(ETH_P_ARP))
-		br_do_proxy_arp(skb, br, vid, p);
+	if (IS_ENABLED(CONFIG_INET) &&
+	    (skb->protocol == htons(ETH_P_ARP) ||
+	     skb->protocol == htons(ETH_P_RARP))) {
+		br_do_proxy_suppress_arp(skb, br, vid, p);
+	}
 
 	switch (pkt_type) {
 	case BR_PKT_MULTICAST:
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 7dacd83..3006f0d 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -1131,5 +1131,8 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_SWITCHDEV */
 
+/* br_arp_nd_proxy.c */
 void br_recalculate_neigh_suppress_enabled(struct net_bridge *br);
+void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
+			      u16 vid, struct net_bridge_port *p);
 #endif
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next 1/3] bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood
From: Roopa Prabhu @ 2017-10-02  4:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, stephen, bridge
In-Reply-To: <1506919018-27875-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
suppress arp and nd flood on bridge ports. It implements
rfc7432, section 10.
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_ARP_PROXY flag but has a few semantic differences to conform
to EVPN standard. In case of EVPN, it is mainly used to
avoid flooding to tunnel ports like vxlan. Unlike the existing
flags it suppresses flood of all neigh discovery packets
(arp, nd) to tunnel ports.

This patch adds netlink and sysfs support to set this bridge port
flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/linux/if_bridge.h    |  1 +
 include/uapi/linux/if_link.h |  1 +
 net/bridge/Makefile          |  2 +-
 net/bridge/br_arp_nd_proxy.c | 27 +++++++++++++++++++++++++++
 net/bridge/br_forward.c      |  3 ++-
 net/bridge/br_if.c           |  5 +++++
 net/bridge/br_netlink.c      | 16 +++++++++++++++-
 net/bridge/br_private.h      |  2 ++
 net/bridge/br_sysfs_if.c     |  2 ++
 9 files changed, 56 insertions(+), 3 deletions(-)
 create mode 100644 net/bridge/br_arp_nd_proxy.c

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 3cd18ac..316ee11 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -49,6 +49,7 @@ struct br_ip_list {
 #define BR_MULTICAST_TO_UNICAST	BIT(12)
 #define BR_VLAN_TUNNEL		BIT(13)
 #define BR_BCAST_FLOOD		BIT(14)
+#define BR_NEIGH_SUPPRESS	BIT(15)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8d062c5..0882e86 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -325,6 +325,7 @@ enum {
 	IFLA_BRPORT_MCAST_TO_UCAST,
 	IFLA_BRPORT_VLAN_TUNNEL,
 	IFLA_BRPORT_BCAST_FLOOD,
+	IFLA_BRPORT_NEIGH_SUPPRESS,
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/Makefile b/net/bridge/Makefile
index 40b1ede..4aee55f 100644
--- a/net/bridge/Makefile
+++ b/net/bridge/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_BRIDGE) += bridge.o
 bridge-y	:= br.o br_device.o br_fdb.o br_forward.o br_if.o br_input.o \
 			br_ioctl.o br_stp.o br_stp_bpdu.o \
 			br_stp_if.o br_stp_timer.o br_netlink.o \
-			br_netlink_tunnel.o
+			br_netlink_tunnel.o br_arp_nd_proxy.o
 
 bridge-$(CONFIG_SYSFS) += br_sysfs_if.o br_sysfs_br.o
 
diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
new file mode 100644
index 0000000..e191bb5
--- /dev/null
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -0,0 +1,27 @@
+/*
+ *	Handle bridge arp/nd proxy/suppress
+ *
+ *	Authors:
+ *	Roopa Prabhu		<roopa@cumulusnetworks.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include "br_private.h"
+
+void br_recalculate_neigh_suppress_enabled(struct net_bridge *br)
+{
+	struct net_bridge_port *p;
+
+	br->neigh_suppress_enabled = false;
+	list_for_each_entry(p, &br->port_list, list) {
+		if (p->flags & BR_NEIGH_SUPPRESS) {
+			br->neigh_suppress_enabled = true;
+			break;
+		}
+	}
+}
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 48fb174..7a50dc5 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -204,7 +204,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
 		/* Do not flood to ports that enable proxy ARP */
 		if (p->flags & BR_PROXYARP)
 			continue;
-		if ((p->flags & BR_PROXYARP_WIFI) &&
+		if ((p->flags & BR_PROXYARP_WIFI ||
+		     p->flags & BR_NEIGH_SUPPRESS) &&
 		    BR_INPUT_SKB_CB(skb)->proxyarp_replied)
 			continue;
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f3aef22..8f615d4 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -310,6 +310,8 @@ void br_dev_delete(struct net_device *dev, struct list_head *head)
 		del_nbp(p);
 	}
 
+	br_recalculate_neigh_suppress_enabled(br);
+
 	br_fdb_delete_by_port(br, NULL, 0, 1);
 
 	cancel_delayed_work_sync(&br->gc_work);
@@ -653,4 +655,7 @@ void br_port_flags_change(struct net_bridge_port *p, unsigned long mask)
 
 	if (mask & BR_AUTO_MASK)
 		nbp_update_port_count(br);
+
+	if (mask & BR_NEIGH_SUPPRESS)
+		br_recalculate_neigh_suppress_enabled(br);
 }
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 3bc8907..abe1d8d 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -138,6 +138,7 @@ static inline size_t br_port_info_size(void)
 		+ nla_total_size(1)	/* IFLA_BRPORT_PROXYARP */
 		+ nla_total_size(1)	/* IFLA_BRPORT_PROXYARP_WIFI */
 		+ nla_total_size(1)	/* IFLA_BRPORT_VLAN_TUNNEL */
+		+ nla_total_size(1)	/* IFLA_BRPORT_NEIGH_SUPPRESS */
 		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_ROOT_ID */
 		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_BRIDGE_ID */
 		+ nla_total_size(sizeof(u16))	/* IFLA_BRPORT_DESIGNATED_PORT */
@@ -208,7 +209,9 @@ static int br_port_fill_attrs(struct sk_buff *skb,
 		       p->topology_change_ack) ||
 	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending) ||
 	    nla_put_u8(skb, IFLA_BRPORT_VLAN_TUNNEL, !!(p->flags &
-							BR_VLAN_TUNNEL)))
+							BR_VLAN_TUNNEL)) ||
+	    nla_put_u8(skb, IFLA_BRPORT_NEIGH_SUPPRESS, !!(p->flags &
+							BR_NEIGH_SUPPRESS)))
 		return -EMSGSIZE;
 
 	timerval = br_timer_value(&p->message_age_timer);
@@ -618,6 +621,9 @@ static int br_afspec(struct net_bridge *br,
 		}
 	}
 
+	if (p)
+		br_recalculate_neigh_suppress_enabled(p->br);
+
 	return err;
 }
 
@@ -689,6 +695,7 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
 {
 	unsigned long old_flags = p->flags;
 	bool br_vlan_tunnel_old = false;
+	int neigh_suppress_old = 0;
 	int err;
 
 	err = br_set_port_flag(p, tb, IFLA_BRPORT_MODE, BR_HAIRPIN_MODE);
@@ -773,6 +780,13 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
 			return err;
 	}
 #endif
+
+	neigh_suppress_old = (p->flags & BR_NEIGH_SUPPRESS);
+	br_set_port_flag(p, tb, IFLA_BRPORT_NEIGH_SUPPRESS,
+			 BR_NEIGH_SUPPRESS);
+	if (neigh_suppress_old != (p->flags & BR_NEIGH_SUPPRESS))
+		br_recalculate_neigh_suppress_enabled(p->br);
+
 	br_port_flags_change(p, old_flags ^ p->flags);
 	return 0;
 }
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e870cfc..7dacd83 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -396,6 +396,7 @@ struct net_bridge {
 #ifdef CONFIG_NET_SWITCHDEV
 	int offload_fwd_mark;
 #endif
+	bool				neigh_suppress_enabled;
 };
 
 struct br_input_skb_cb {
@@ -1130,4 +1131,5 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_SWITCHDEV */
 
+void br_recalculate_neigh_suppress_enabled(struct net_bridge *br);
 #endif
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 5d5d413a..2467213 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -174,6 +174,7 @@ BRPORT_ATTR_FLAG(proxyarp, BR_PROXYARP);
 BRPORT_ATTR_FLAG(proxyarp_wifi, BR_PROXYARP_WIFI);
 BRPORT_ATTR_FLAG(multicast_flood, BR_MCAST_FLOOD);
 BRPORT_ATTR_FLAG(broadcast_flood, BR_BCAST_FLOOD);
+BRPORT_ATTR_FLAG(neigh_suppress, BR_NEIGH_SUPPRESS);
 
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 static ssize_t show_multicast_router(struct net_bridge_port *p, char *buf)
@@ -223,6 +224,7 @@ static const struct brport_attribute *brport_attrs[] = {
 	&brport_attr_proxyarp_wifi,
 	&brport_attr_multicast_flood,
 	&brport_attr_broadcast_flood,
+	&brport_attr_neigh_suppress,
 	NULL
 };
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next 0/3] bridge: neigh msg proxy and flood suppression support
From: Roopa Prabhu @ 2017-10-02  4:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, stephen, bridge

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This series implements arp and nd suppression in the bridge
driver for ethernet vpns. It implements rfc7432, section 10
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_ARP_PROXY flag but has a few semantic differences to conform
to EVPN standard. In case of EVPN, it is mainly used to avoid flooding to
tunnel ports like vxlan/mpls. Unlike the existing flags it suppresses flood
of all neigh discovery packets (arp, nd) to tunnel ports.

Roopa Prabhu (3):
  bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd
    flood
  neigh arp suppress first
  bridge: suppress nd messages from going to BR_NEIGH_SUPPRESS ports

 include/linux/if_bridge.h    |   1 +
 include/uapi/linux/if_link.h |   1 +
 net/bridge/Makefile          |   2 +-
 net/bridge/br_arp_nd_proxy.c | 492 +++++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_device.c       |  18 ++
 net/bridge/br_forward.c      |   3 +-
 net/bridge/br_if.c           |   5 +
 net/bridge/br_input.c        |  73 ++-----
 net/bridge/br_netlink.c      |  16 +-
 net/bridge/br_private.h      |   9 +
 net/bridge/br_sysfs_if.c     |   2 +
 11 files changed, 561 insertions(+), 61 deletions(-)
 create mode 100644 net/bridge/br_arp_nd_proxy.c

-- 
2.1.4

^ permalink raw reply

* Re: [PATCH 2/6] ath9k: add a quirk to set use_msi automatically
From: Daniel Drake @ 2017-10-02  4:21 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Kalle Valo, Christoph Hellwig, QCA ath9k Development,
	linux-wireless, netdev, Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <CAFv23Qn4OMuHaSA6UG+Q7L1sTt+ZO8T-JYVaBM1FxZtGNxOhdQ@mail.gmail.com>

Hi AceLan,

On Thu, Sep 28, 2017 at 4:28 PM, AceLan Kao <acelan.kao@canonical.com> wrote:
> Hi Daniel,
>
> I've tried your patch, but it doesn't work for me.
> Wifi can scan AP, but can't get connected.

Can you please clarify which patch(es) you have tried?

This is the base patch which adds the infrastructure to request
specific MSI IRQ vectors:
https://marc.info/?l=linux-wireless&m=150631274108016&w=2

This is the ath9k MSI patch which makes use of that:
https://github.com/endlessm/linux/commit/739c7a924db8f4434a9617657

If you were already able to use ath9k MSI interrupts without specific
consideration for which MSI vector numbers were used, these are the
possible explanations that spring to mind:

1. You got lucky and it picked a vector number that is 4-aligned. You
can check this in the "lspci -vvv" output. You'll see something like:
Capabilities: [50] MSI: Enable+ Count=1/4 Maskable+ 64bit+
        Address: 00000000fee0300c  Data: 4142
The lower number is the vector number. In my example here 0x42 (66) is
not 4-aligned so the failure condition will be hit.

2. You are using interrupt remapping, which I suspect may provide a
high likelihood of MSI interrupt vectors being 4-aligned. See if
/proc/interrupts shows the IRQ type as IR-PCI-MSI
Unfortunately interrupt remapping is not available here,
https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023717.html

3. My assumption that all ath9k hardware corrupts the MSI vector
number could wrong. However we've seen this on different wifi modules
in laptops produced by different OEMs and ODMs, so it seems to be a
somewhat widespread problem at least.

4. My assumption that ath9k hardware is corrupting the MSI vector
number could be wrong; maybe another component is to blame, could it
be a BIOS issue? Admittedly I don't really know how I can debug the
layers inbetween seeing the MSI Message Data value disagree with the
vector number being handled inside do_IRQ().

Daniel

^ permalink raw reply

* Re: tc-ipt v0.2: Extension does not know id 1504083504
From: Corey Hickey @ 2017-10-02  4:16 UTC (permalink / raw)
  To: Sergey K., netdev
In-Reply-To: <CAGv8E+x_4mkFrKaoJgh8dnh6ZyN7oSHRCpT6GQZKFRECOn8EtQ@mail.gmail.com>

On 2017-10-01 07:25, Sergey K. wrote:
> Hi Corey.
> 
> I did it on your recommendation, replaced xtables.h file from my
> version of iptables 1.6.0, and replaced the file netfilter.h.
> 
> Now it's works, but new construction doesn't:
> 
> # tc filter add dev eth0 parent ffff: u32 match u32 0 0 action xt -j
> SET --map-set WORLD_QoS dst
> xt: unrecognized option '--map-set'
> failed to find target (null)

I looked into this for a bit, and it seems to be due to the SET 
extension not supporting the x6-style options (I don't know the proper 
name for it).

 From the iptables source:

$ grep -c x6 extensions/libxt_MARK.c
9

$ grep -c x6 extensions/libxt_SET.c
0

I got lost in gdb for a while and didn't come up with an easy answer. 
The options in are in the struct, just not where tc is looking.

(gdb) print *m->extra_opts
$3 = {name = 0x7ffff6f1dad8 "add-set", has_arg = 1, flag = 0x0, val = 49}

(gdb) print *(m->extra_opts+4)
$7 = {name = 0x7ffff6f1db03 "map-set", has_arg = 1, flag = 0x0, val = 53}


I don't know if the problem is in iptables or in tc. Somebody familiar 
with the code and its history could probably answer it easily, but I'm 
not at all familiar with it. Sorry. Hopefully somebody else will jump in 
and help you out.

-Corey

^ permalink raw reply

* Re: [PATCH net-next] vhost_net: do not stall on zerocopy depletion
From: Michael S. Tsirkin @ 2017-10-02  4:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Jason Wang, Koichiro Den,
	virtualization, Willem de Bruijn
In-Reply-To: <CAF=yD-KotdpHs96GomMKR-BqG3Gyrvo+to0sk2=a6E5BKjgpkg@mail.gmail.com>

On Fri, Sep 29, 2017 at 09:25:27PM -0400, Willem de Bruijn wrote:
> On Fri, Sep 29, 2017 at 3:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Sep 27, 2017 at 08:25:56PM -0400, Willem de Bruijn wrote:
> >> From: Willem de Bruijn <willemb@google.com>
> >>
> >> Vhost-net has a hard limit on the number of zerocopy skbs in flight.
> >> When reached, transmission stalls. Stalls cause latency, as well as
> >> head-of-line blocking of other flows that do not use zerocopy.
> >>
> >> Instead of stalling, revert to copy-based transmission.
> >>
> >> Tested by sending two udp flows from guest to host, one with payload
> >> of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
> >> large flow is redirected to a netem instance with 1MBps rate limit
> >> and deep 1000 entry queue.
> >>
> >>   modprobe ifb
> >>   ip link set dev ifb0 up
> >>   tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
> >>
> >>   tc qdisc add dev tap0 ingress
> >>   tc filter add dev tap0 parent ffff: protocol ip \
> >>       u32 match ip dport 8000 0xffff \
> >>       action mirred egress redirect dev ifb0
> >>
> >> Before the delay, both flows process around 80K pps. With the delay,
> >> before this patch, both process around 400. After this patch, the
> >> large flow is still rate limited, while the small reverts to its
> >> original rate. See also discussion in the first link, below.
> >>
> >> The limit in vhost_exceeds_maxpend must be carefully chosen. When
> >> vq->num >> 1, the flows remain correlated. This value happens to
> >> correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
> >> fractions and ensure correctness also for much smaller values of
> >> vq->num, by testing the min() of both explicitly. See also the
> >> discussion in the second link below.
> >>
> >> Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
> >> Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> >
> > I'd like to see the effect on the non rate limited case though.
> > If guest is quick won't we have lots of copies then?
> 
> Yes, but not significantly more than without this patch.
> 
> I ran 1, 10 and 100 flow tcp_stream throughput tests from a sender
> in the guest to a receiver in the host.
> 
> To answer the other benchmark question first, I did not see anything
> noteworthy when increasing vq->num from 256 to 1024.
> 
> With 1 and 10 flows without this patch all packets use zerocopy.
> With the patch, less than 1% eschews zerocopy.
> 
> With 100 flows, even without this patch, 90+% of packets are copied.
> Some zerocopy packets from vhost_net fail this test in tun.c
> 
>     if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
> 
> Generating packets with up to 21 frags. I'm not sure yet why or
> what the fraction of these packets is. But this in turn can
> disable zcopy_used in vhost_net_tx_select_zcopy for a
> larger share of packets:
> 
>         return !net->tx_flush &&
>                 net->tx_packets / 64 >= net->tx_zcopy_err;
> 
> Because the number of copied and zerocopy packets are the
> same before and after the patch, so are the overall throughput
> numbers.

OK, thanks!
Are you looking into new warnings that kbuild system reported
with this patch?

Thanks,

-- 
MST

^ permalink raw reply

* Re: [RFC net-next 1/5] net: dsa: Add infrastructure to support LAG
From: Andrew Lunn @ 2017-10-02  2:03 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, vivien.didelot, jiri, idosch, Woojung.Huh, john,
	sean.wang
In-Reply-To: <20171001194639.8647-2-f.fainelli@gmail.com>

On Sun, Oct 01, 2017 at 12:46:35PM -0700, Florian Fainelli wrote:
> Add the necessary logic to support network device events targetting LAG events,
> this is loosely inspired from mlxsw/spectrum.c.
> 
> In the process we change dsa_slave_changeupper() to be more generic and be called
> from both LAG events as well as normal bridge enslaving events paths.
> 
> The DSA layer takes care of managing the LAG group identifiers, how many LAGs
> may be supported by a switch, and how many members per LAG are supported by a
> switch device. When a LAG group is identified, the port is then configured to
> be a part of that group. When a LAG group no longer has any users, we remove it
> and we tell the drivers whether it is safe to disable trunking altogether.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  include/net/dsa.h  |  25 +++++++++
>  net/dsa/dsa2.c     |  12 ++++
>  net/dsa/dsa_priv.h |   7 +++
>  net/dsa/port.c     |  92 +++++++++++++++++++++++++++++++
>  net/dsa/slave.c    | 157 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  net/dsa/switch.c   |  30 ++++++++++
>  6 files changed, 312 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 10dceccd9ce8..247ea58add68 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -182,12 +182,20 @@ struct dsa_port {
>  	u8			stp_state;
>  	struct net_device	*bridge_dev;
>  	struct devlink_port	devlink_port;
> +	u8			lag_id;
> +	bool			lagged;
>  	/*
>  	 * Original copy of the master netdev ethtool_ops
>  	 */
>  	const struct ethtool_ops *orig_ethtool_ops;
>  };
>  
> +struct dsa_lag_group {
> +	/* Used to know when we can disable lag on the switch */
> +	unsigned int		ref_count;

Hi Florian

In what contexts is ref_count manipulated. Normally you use would
refcounf_t and the operations in linux/refcount.h. But if you know
there is some other protection, e.g. rtnl, an unsigned int is O.K.
Maybe scatter some assert_RTNL() in the code?

> +static bool dsa_slave_lag_check(struct net_device *dev, struct net_device *lag_dev,
> +				struct netdev_lag_upper_info *lag_upper_info)
> +{
> +	struct dsa_slave_priv *p = netdev_priv(dev);
> +	u8 lag_id;
> +
> +	/* No more lag identifiers available or already in use */
> +	if (dsa_switch_lag_get_index(p->dp->ds, lag_dev, &lag_id) != 0)
> +		return false;
> +
> +	if (lag_upper_info->tx_type != NETDEV_LAG_TX_TYPE_HASH)
> +		return false;

I wounder if the driver needs to decide this? Can different hardware
support different tx_types?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next] samples/bpf: fix warnings in xdp_monitor_user
From: Alexei Starovoitov @ 2017-10-02  1:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: ast, daniel, netdev, Stephen Hemminger, Jesper Dangaard Brouer
In-Reply-To: <20171001210734.27010-1-sthemmin@microsoft.com>

On Sun, Oct 01, 2017 at 02:07:34PM -0700, Stephen Hemminger wrote:
> Make local functions static to fix
> 
>   HOSTCC  samples/bpf/xdp_monitor_user.o
> samples/bpf/xdp_monitor_user.c:64:7: warning: no previous prototype for ‘gettime’ [-Wmissing-prototypes]
>  __u64 gettime(void)
>        ^~~~~~~
> samples/bpf/xdp_monitor_user.c:209:6: warning: no previous prototype for ‘print_bpf_prog_info’ [-Wmissing-prototypes]
>  void print_bpf_prog_info(void)
>       ^~~~~~~~~~~~~~~~~~~
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Fixes: 3ffab5460264 ("samples/bpf: xdp_monitor tool based on tracepoints")
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [BUG] bpf is broken in net-next
From: Alexei Starovoitov @ 2017-10-02  1:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Martin KaFai Lau, netdev
In-Reply-To: <20171001140230.6a494b48@xeon-e3>

On Sun, Oct 01, 2017 at 02:02:30PM -0700, Stephen Hemminger wrote:
> Recent regression in net-next building bpf.c in samples/bpf now broken.
> 
> $ make samples/bpf/
> 
> 
>   HOSTCC  samples/bpf/../../tools/lib/bpf/bpf.o
> samples/bpf/../../tools/lib/bpf/bpf.c: In function ‘bpf_create_map_node’:
> samples/bpf/../../tools/lib/bpf/bpf.c:76:13: error: ‘union bpf_attr’ has no member named ‘map_name’; did you mean ‘map_type’?
>   memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
>              ^
> samples/bpf/../../tools/lib/bpf/bpf.c:76:44: error: ‘BPF_OBJ_NAME_LEN’ undeclared (first use in this function)
>   memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));

everything works fine for me...
did you do 'make headers_install' ?

^ permalink raw reply

* Re: [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once
From: Stephen Hemminger @ 2017-10-02  0:59 UTC (permalink / raw)
  To: Michael Witten
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, netdev, linux-kernel
In-Reply-To: <ccfa23a0f76c4c9ebd07fc991db2f829-mfwitten@gmail.com>

On Sun, 01 Oct 2017 22:19:20 -0000
Michael Witten <mfwitten@gmail.com> wrote:

> +	spin_lock_irqsave(&q->lock, flags);
> +	skb = q->next;
> +	__skb_queue_head_init(q);
> +	spin_unlock_irqrestore(&q->lock, flags);

Other code manipulating lists uses splice operation and
a sk_buff_head temporary on the stack. That would be easier
to understand.

	struct sk_buf_head head;

	__skb_queue_head_init(&head);
	spin_lock_irqsave(&q->lock, flags);
	skb_queue_splice_init(q, &head);
	spin_unlock_irqrestore(&q->lock, flags);


> +	while (skb != head) {
> +		next = skb->next;
>  		kfree_skb(skb);
> +		skb = next;
> +	}

It would be cleaner if you could use
skb_queue_walk_safe rather than open coding the loop.

	skb_queue_walk_safe(&head, skb,  tmp)
		kfree_skb(skb);

^ permalink raw reply

* Re: [PATCH 00/18] use ARRAY_SIZE macro
From: Jérémy Lefaure @ 2017-10-02  0:52 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: alsa-devel, nouveau, dri-devel, dm-devel, brcm80211-dev-list,
	devel, linux-scsi, linux-rdma, amd-gfx, Jason Gunthorpe,
	linux-acpi, linux-video, intel-wired-lan, linux-media, intel-gfx,
	ecryptfs, linux-nfs, linux-raid, openipmi-developer,
	intel-gvt-dev, devel, brcm80211-dev-list.pdl, netdev, linux-usb,
	linux-wireless, linux-kernel, linux-integrity
In-Reply-To: <20171001220131.GA11812@eros>

On Mon, 2 Oct 2017 09:01:31 +1100
"Tobin C. Harding" <me@tobin.cc> wrote:

> > In order to reduce the size of the To: and Cc: lines, each patch of the
> > series is sent only to the maintainers and lists concerned by the patch.
> > This cover letter is sent to every list concerned by this series.  
> 
> Why don't you just send individual patches for each subsystem? I'm not a maintainer but I don't see
> how any one person is going to be able to apply this whole series, it is making it hard for
> maintainers if they have to pick patches out from among the series (if indeed any will bother
> doing that).
Yeah, maybe it would have been better to send individual patches.

From my point of view it's a series because the patches are related (I
did a git format-patch from my local branch). But for the maintainers
point of view, they are individual patches.

As each patch in this series is standalone, the maintainers can pick
the patches they want and apply them individually. So yeah, maybe I
should have sent individual patches. But I also wanted to tie all
patches together as it's the same change.

Anyway, I can tell to each maintainer that they can apply the patches
they're concerned about and next time I may send individual patches.

Thank you for your answer,
Jérémy
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply

* [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once
From: Michael Witten @ 2017-10-01 22:19 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Stephen Hemminger, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <60c8906b751d4915be456009c220516e-mfwitten@gmail.com>

Date: Sat, 9 Sep 2017 05:50:23 +0000
Hitherto, the queue's lock has been locked/unlocked every time
an item is dequeued; this seems not only inefficient, but also
incorrect, as the whole point of `skb_queue_purge()' is to clear
the queue, presumably without giving any other thread a chance to
manipulate the queue in the interim.

With this commit, the queue's lock is locked/unlocked only once
when `skb_queue_purge()' is called, and in a way that disables
the IRQs for only a minimal amount of time.

This is achieved by atomically re-initializing the queue (thereby
clearing it), and then freeing each of the items as though it were
enqueued in a private queue that doesn't require locking.

Signed-off-by: Michael Witten <mfwitten@gmail.com>
---
 net/core/skbuff.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 68065d7d383f..bd26b0bde784 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2825,18 +2825,28 @@ struct sk_buff *skb_dequeue_tail(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_dequeue_tail);
 
 /**
- *	skb_queue_purge - empty a list
- *	@list: list to empty
+ *	skb_queue_purge - empty a queue
+ *	@q: the queue to empty
  *
- *	Delete all buffers on an &sk_buff list. Each buffer is removed from
- *	the list and one reference dropped. This function takes the list
- *	lock and is atomic with respect to other list locking functions.
+ *	Dequeue and free each socket buffer that is in @q.
+ *
+ *	This function is atomic with respect to other queue-locking functions.
  */
-void skb_queue_purge(struct sk_buff_head *list)
+void skb_queue_purge(struct sk_buff_head *q)
 {
-	struct sk_buff *skb;
-	while ((skb = skb_dequeue(list)) != NULL)
+	unsigned long flags;
+	struct sk_buff *skb, *next, *head = (struct sk_buff *)q;
+
+	spin_lock_irqsave(&q->lock, flags);
+	skb = q->next;
+	__skb_queue_head_init(q);
+	spin_unlock_irqrestore(&q->lock, flags);
+
+	while (skb != head) {
+		next = skb->next;
 		kfree_skb(skb);
+		skb = next;
+	}
 }
 EXPORT_SYMBOL(skb_queue_purge);
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 2/3] net: inet_recvmsg(): Remove unnecessary bitwise operation
From: Michael Witten @ 2017-10-01 22:19 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Stephen Hemminger, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <14527e6c082e4ea282a3f833118c68df-mfwitten@gmail.com>

Date: Fri, 8 Sep 2017 00:47:49 +0000
The flag `MSG_DONTWAIT' is handled by passing an argument through
the dedicated parameter `nonblock' of the function `tcp_recvmsg()'.

Presumably because `MSG_DONTWAIT' is handled so explicitly, it is
unset in the collection of flags that are passed to `tcp_recvmsg()';
yet, this unsetting appears to be unnecessary, and so this commit
removes the bitwise operation that performs the unsetting.

Signed-off-by: Michael Witten <mfwitten@gmail.com>
---
 net/ipv4/af_inet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..2dbed042a412 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -791,7 +791,7 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	sock_rps_record_flow(sk);
 
 	err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-				   flags & ~MSG_DONTWAIT, &addr_len);
+				   flags, &addr_len);
 	if (err >= 0)
 		msg->msg_namelen = addr_len;
 	return err;
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 1/3] net: __sock_cmsg_send(): Remove unused parameter `msg'
From: Michael Witten @ 2017-10-01 22:19 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Stephen Hemminger, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <14527e6c082e4ea282a3f833118c68df-mfwitten@gmail.com>

Date: Thu, 7 Sep 2017 03:21:38 +0000
Signed-off-by: Michael Witten <mfwitten@gmail.com>
---
 include/net/sock.h     | 2 +-
 net/core/sock.c        | 4 ++--
 net/ipv4/ip_sockglue.c | 2 +-
 net/ipv6/datagram.c    | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a362568357..83373d7148a9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1562,7 +1562,7 @@ struct sockcm_cookie {
 	u16 tsflags;
 };
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..425e03fe1c56 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2091,7 +2091,7 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
 	u32 tsflags;
@@ -2137,7 +2137,7 @@ int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 			return -EINVAL;
 		if (cmsg->cmsg_level != SOL_SOCKET)
 			continue;
-		ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+		ret = __sock_cmsg_send(sk, cmsg, sockc);
 		if (ret)
 			return ret;
 	}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index e558e4f9597b..c79b7822b0b9 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -263,7 +263,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 		}
 #endif
 		if (cmsg->cmsg_level == SOL_SOCKET) {
-			err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
+			err = __sock_cmsg_send(sk, cmsg, &ipc->sockc);
 			if (err)
 				return err;
 			continue;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a1f918713006..1d1926a4cbe2 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -756,7 +756,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 		}
 
 		if (cmsg->cmsg_level == SOL_SOCKET) {
-			err = __sock_cmsg_send(sk, msg, cmsg, sockc);
+			err = __sock_cmsg_send(sk, cmsg, sockc);
 			if (err)
 				return err;
 			continue;
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 0/3] net: TCP/IP: A few minor cleanups
From: Michael Witten @ 2017-10-01 22:19 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Stephen Hemminger, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <45aab5effc0c424a992646a97cf2ec14-mfwitten@gmail.com>

The following patch series is an ad hoc "cleanup" that I made
while perusing the code (I'm not well versed in this code, so I
would not be surprised if there were objections to the changes):

  [1] net: __sock_cmsg_send(): Remove unused parameter `msg'
  [2] net: inet_recvmsg(): Remove unnecessary bitwise operation
  [3] net: skb_queue_purge(): lock/unlock the queue only once

Each patch will be sent as an individual reply to this email;
the total diff is appended below for your convenience.

You may also fetch these patches from GitHub:

  git checkout --detach 5969d1bb3082b41eba8fd2c826559abe38ccb6df
  git pull https://github.com/mfwitten/linux.git net/tcp-ip/01-cleanup/02

Overall:

  include/net/sock.h     |  2 +-
  net/core/skbuff.c      | 26 ++++++++++++++++++--------
  net/core/sock.c        |  4 ++--
  net/ipv4/af_inet.c     |  2 +-
  net/ipv4/ip_sockglue.c |  2 +-
  net/ipv6/datagram.c    |  2 +-
  6 files changed, 24 insertions(+), 14 deletions(-)

Sincerly,
Michael Witten

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a362568357..83373d7148a9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1562,7 +1562,7 @@ struct sockcm_cookie {
 	u16 tsflags;
 };
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..425e03fe1c56 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2091,7 +2091,7 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
 	u32 tsflags;
@@ -2137,7 +2137,7 @@ int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 			return -EINVAL;
 		if (cmsg->cmsg_level != SOL_SOCKET)
 			continue;
-		ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+		ret = __sock_cmsg_send(sk, cmsg, sockc);
 		if (ret)
 			return ret;
 	}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index e558e4f9597b..c79b7822b0b9 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -263,7 +263,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 		}
 #endif
 		if (cmsg->cmsg_level == SOL_SOCKET) {
-			err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
+			err = __sock_cmsg_send(sk, cmsg, &ipc->sockc);
 			if (err)
 				return err;
 			continue;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a1f918713006..1d1926a4cbe2 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -756,7 +756,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 		}
 
 		if (cmsg->cmsg_level == SOL_SOCKET) {
-			err = __sock_cmsg_send(sk, msg, cmsg, sockc);
+			err = __sock_cmsg_send(sk, cmsg, sockc);
 			if (err)
 				return err;
 			continue;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..2dbed042a412 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -791,7 +791,7 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	sock_rps_record_flow(sk);
 
 	err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-				   flags & ~MSG_DONTWAIT, &addr_len);
+				   flags, &addr_len);
 	if (err >= 0)
 		msg->msg_namelen = addr_len;
 	return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 68065d7d383f..bd26b0bde784 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2825,18 +2825,28 @@
 EXPORT_SYMBOL(skb_dequeue_tail);
 
 /**
- *	skb_queue_purge - empty a list
- *	@list: list to empty
+ *	skb_queue_purge - empty a queue
+ *	@q: the queue to empty
  *
- *	Delete all buffers on an &sk_buff list. Each buffer is removed from
- *	the list and one reference dropped. This function takes the list
- *	lock and is atomic with respect to other list locking functions.
+ *	Dequeue and free each socket buffer that is in @q.
+ *
+ *	This function is atomic with respect to other queue-locking functions.
  */
-void skb_queue_purge(struct sk_buff_head *list)
+void skb_queue_purge(struct sk_buff_head *q)
 {
-	struct sk_buff *skb;
-	while ((skb = skb_dequeue(list)) != NULL)
+	unsigned long flags;
+	struct sk_buff *skb, *next, *head = (struct sk_buff *)q;
+
+	spin_lock_irqsave(&q->lock, flags);
+	skb = q->next;
+	__skb_queue_head_init(q);
+	spin_unlock_irqrestore(&q->lock, flags);
+
+	while (skb != head) {
+		next = skb->next;
 		kfree_skb(skb);
+		skb = next;
+	}
 }
 EXPORT_SYMBOL(skb_queue_purge);
 
-- 
2.14.1

^ permalink raw reply related

* [RFC] compat SIOCADDRT problems
From: Al Viro @ 2017-10-01 22:13 UTC (permalink / raw)
  To: netdev; +Cc: Ralf Baechle

	Handling of SIOC{ADD,DEL}RT for 32bit is somewhat odd.  AFAICS,
the rules for native ioctl look so:

AF_APPLETALK, AF_INET, AF_IPX, AF_PACKET: take struct rtentry.  The last one
doesn't have ->compat_ioctl() and 32bit automatically hits routing_ioctl()
in net/socket.c, the rest have ->compat_ioctl() but it doesn't recognize
SIOC{ADD,DEL}RT, so it ends up handled by the same code.

AF_INET6: takes struct in6_rtmsg.  Hits routing_ioctl(), which recognizes ipv6
and does the right thing.

AF_X25: takes x25_route_struct.  Layout is apparently identical for 32bit and
64bit.  Has ->compat_ioctl(), which does the same thing as ->ioctl() on those
two.

AF_AX25: takes struct ax25_routes_struct.  Again, identical layout on 32bit
and 64bit.  Unfortunately, there's no ->compat_ioctl() in this one, so we
end up hitting routing_ioctl() and get screwed.
AF_NETROM: same as previous, except that it takes struct nr_route_struct.
Apparently broken.
AF_ROSE: ditto, with struct rose_route_struct.

AF_QIPCRTR: explicitly recognizes and fails with -EINVAL.  Odd (other protocol
families without SIOCADDRT support fail with -ENOTTY), but clearly not an issue
for compat code.

Everything else: fails with -ENOTTY.


	Are AF_{AX25,NETROM,ROSE} really broken for 32bit processes on biarch
hosts, or am I missing something subtle in there?

^ permalink raw reply

* Re: [PATCH 00/18] use ARRAY_SIZE macro
From: Tobin C. Harding @ 2017-10-01 22:01 UTC (permalink / raw)
  To: Jérémy Lefaure
  Cc: alsa-devel, nouveau, dri-devel, dm-devel, brcm80211-dev-list,
	devel, linux-scsi, linux-rdma, amd-gfx, Jason Gunthorpe,
	linux-acpi, linux-video, intel-wired-lan, linux-media, intel-gfx,
	ecryptfs, linux-nfs, linux-raid, openipmi-developer,
	intel-gvt-dev, devel, brcm80211-dev-list.pdl, netdev, linux-usb,
	linux-wireless, linux-kernel
In-Reply-To: <20171001193101.8898-1-jeremy.lefaure@lse.epita.fr>

On Sun, Oct 01, 2017 at 03:30:38PM -0400, Jérémy Lefaure wrote:
> Hi everyone,
> Using ARRAY_SIZE improves the code readability. I used coccinelle (I
> made a change to the array_size.cocci file [1]) to find several places
> where ARRAY_SIZE could be used instead of other macros or sizeof
> division.
> 
> I tried to divide the changes into a patch per subsystem (excepted for
> staging). If one of the patch should be split into several patches, let
> me know.
> 
> In order to reduce the size of the To: and Cc: lines, each patch of the
> series is sent only to the maintainers and lists concerned by the patch.
> This cover letter is sent to every list concerned by this series.

Why don't you just send individual patches for each subsystem? I'm not a maintainer but I don't see
how any one person is going to be able to apply this whole series, it is making it hard for
maintainers if they have to pick patches out from among the series (if indeed any will bother
doing that).

I get that this will be more work for you but AFAIK we are optimizing for maintainers.

Good luck,
Tobin.

^ permalink raw reply

* [PATCH net-next] samples/bpf: fix warnings in xdp_monitor_user
From: Stephen Hemminger @ 2017-10-01 21:07 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, Stephen Hemminger, Stephen Hemminger

Make local functions static to fix

  HOSTCC  samples/bpf/xdp_monitor_user.o
samples/bpf/xdp_monitor_user.c:64:7: warning: no previous prototype for ‘gettime’ [-Wmissing-prototypes]
 __u64 gettime(void)
       ^~~~~~~
samples/bpf/xdp_monitor_user.c:209:6: warning: no previous prototype for ‘print_bpf_prog_info’ [-Wmissing-prototypes]
 void print_bpf_prog_info(void)
      ^~~~~~~~~~~~~~~~~~~
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 samples/bpf/xdp_monitor_user.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c
index b51b4f5e3257..c5ab8b776973 100644
--- a/samples/bpf/xdp_monitor_user.c
+++ b/samples/bpf/xdp_monitor_user.c
@@ -61,7 +61,7 @@ static void usage(char *argv[])
 }
 
 #define NANOSEC_PER_SEC 1000000000 /* 10^9 */
-__u64 gettime(void)
+static __u64 gettime(void)
 {
 	struct timespec t;
 	int res;
@@ -206,7 +206,7 @@ static void stats_poll(int interval, bool err_only)
 	}
 }
 
-void print_bpf_prog_info(void)
+static void print_bpf_prog_info(void)
 {
 	int i;
 
-- 
2.11.0

^ permalink raw reply related

* [BUG] bpf is broken in net-next
From: Stephen Hemminger @ 2017-10-01 21:02 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev

Recent regression in net-next building bpf.c in samples/bpf now broken.

$ make samples/bpf/


  HOSTCC  samples/bpf/../../tools/lib/bpf/bpf.o
samples/bpf/../../tools/lib/bpf/bpf.c: In function ‘bpf_create_map_node’:
samples/bpf/../../tools/lib/bpf/bpf.c:76:13: error: ‘union bpf_attr’ has no member named ‘map_name’; did you mean ‘map_type’?
  memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
             ^
samples/bpf/../../tools/lib/bpf/bpf.c:76:44: error: ‘BPF_OBJ_NAME_LEN’ undeclared (first use in this function)
  memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
                                            ^
samples/bpf/../../tools/lib/bpf/bpf.c:49:27: note: in definition of macro ‘min’
 #define min(x, y) ((x) < (y) ? (x) : (y))
                           ^
samples/bpf/../../tools/lib/bpf/bpf.c:76:44: note: each undeclared identifier is reported only once for each function it appears in
  memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
                                            ^
samples/bpf/../../tools/lib/bpf/bpf.c:49:27: note: in definition of macro ‘min’
 #define min(x, y) ((x) < (y) ? (x) : (y))
                           ^
samples/bpf/../../tools/lib/bpf/bpf.c: In function ‘bpf_create_map_in_map_node’:
samples/bpf/../../tools/lib/bpf/bpf.c:116:13: error: ‘union bpf_attr’ has no member named ‘map_name’; did you mean ‘map_type’?
  memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
             ^
samples/bpf/../../tools/lib/bpf/bpf.c:116:44: error: ‘BPF_OBJ_NAME_LEN’ undeclared (first use in this function)
  memcpy(attr.map_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
                                            ^
samples/bpf/../../tools/lib/bpf/bpf.c:49:27: note: in definition of macro ‘min’
 #define min(x, y) ((x) < (y) ? (x) : (y))
                           ^
samples/bpf/../../tools/lib/bpf/bpf.c: In function ‘bpf_load_program_name’:
samples/bpf/../../tools/lib/bpf/bpf.c:154:13: error: ‘union bpf_attr’ has no member named ‘prog_name’; did you mean ‘prog_type’?
  memcpy(attr.prog_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
             ^
samples/bpf/../../tools/lib/bpf/bpf.c:154:45: error: ‘BPF_OBJ_NAME_LEN’ undeclared (first use in this function)
  memcpy(attr.prog_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
                                             ^
samples/bpf/../../tools/lib/bpf/bpf.c:49:27: note: in definition of macro ‘min’
 #define min(x, y) ((x) < (y) ? (x) : (y))
                           ^
scripts/Makefile.host:118: recipe for target 'samples/bpf/../../tools/lib/bpf/bpf.o' failed

^ permalink raw reply

* Re: [PATCH iproute2] ip xfrm: use correct key length for netlink message
From: Stephen Hemminger @ 2017-10-01 20:48 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: netdev
In-Reply-To: <20170929114105.2D971A0F6B@unicorn.suse.cz>

On Fri, 29 Sep 2017 13:41:05 +0200 (CEST)
Michal Kubecek <mkubecek@suse.cz> wrote:

> When SA is added manually using "ip xfrm state add", xfrm_state_modify()
> uses alg_key_len field of struct xfrm_algo for the length of key passed to
> kernel in the netlink message. However alg_key_len is bit length of the key
> while we need byte length here. This is usually harmless as kernel ignores
> the excess data but when the bit length of the key exceeds 512
> (XFRM_ALGO_KEY_BUF_SIZE), it can result in buffer overflow.
> 
> We can simply divide by 8 here as the only place setting alg_key_len is in
> xfrm_algo_parse() where it is always set to a multiple of 8 (and there are
> already multiple places using "algo->alg_key_len / 8").
> 
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

This looks correct applied.

^ permalink raw reply

* Re: [PATCH v2 iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths
From: Stephen Hemminger @ 2017-10-01 20:42 UTC (permalink / raw)
  To: Yulia Kartseva; +Cc: netdev, yulia.kartseva
In-Reply-To: <20171001031840.1252538-1-hex@fb.com>

On Sat, 30 Sep 2017 20:18:40 -0700
Yulia Kartseva <hex@fb.com> wrote:

> Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
> These are  /31, /63, /95 and /127 prefix lengths.
> 
> Example:
> ip6 dst face:b00f::/31
> filter parent b: protocol ipv6 pref 2307 u32
> filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
> filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
> key ht 800 bkt 0
>   match faceb00f/ffffffff at 24
> 
> v2: previous patch was made with a wrong repo
> 
> Signed-off-by: Yulia Kartseva <hex@fb.com>

That came through correctly. Applied.

^ permalink raw reply

* [PATCH 2/2] atm: Improve a size determination in adummy_init()
From: SF Markus Elfring @ 2017-10-01 19:50 UTC (permalink / raw)
  To: linux-atm-general, netdev, Chas Williams; +Cc: LKML, kernel-janitors
In-Reply-To: <8fe2ca8a-e1f7-d492-db69-c52b66143ac6@users.sourceforge.net>

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Sun, 1 Oct 2017 21:35:18 +0200

Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/atm/adummy.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/atm/adummy.c b/drivers/atm/adummy.c
index 94b73ddfb731..1ef2d8ee8d67 100644
--- a/drivers/atm/adummy.c
+++ b/drivers/atm/adummy.c
@@ -146,9 +146,7 @@ static int __init adummy_init(void)
 	int err = 0;
 
 	printk(KERN_ERR "adummy: version %s\n", DRV_VERSION);
-
-	adummy_dev = kzalloc(sizeof(struct adummy_dev),
-						   GFP_KERNEL);
+	adummy_dev = kzalloc(sizeof(*adummy_dev), GFP_KERNEL);
 	if (!adummy_dev) {
 		err = -ENOMEM;
 		goto out;
-- 
2.14.2

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox