Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
From: Daniel Mack @ 2017-09-20 16:51 UTC (permalink / raw)
  To: Craig Gallek, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller; +Cc: netdev
In-Reply-To: <20170920162247.63787-1-kraigatgoog@gmail.com>

Hi Craig,

Thanks, this looks much cleaner already :)

On 09/20/2017 06:22 PM, Craig Gallek wrote:
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9d58a576b2ae..b5a7d70ec8b5 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>  	struct lpm_trie_node __rcu **trim;
>  	struct lpm_trie_node *node;
>  	unsigned long irq_flags;
> -	unsigned int next_bit;
> +	unsigned int next_bit = 0;

This default assignment seems wrong, and I guess you only added it to
squelch a compiler warning?

[...]

> +	/* If the node has one child, we may be able to collapse the tree
> +	 * while removing this node if the node's child is in the same
> +	 * 'next bit' slot as this node was in its parent or if the node
> +	 * itself is the root.
> +	 */
> +	if (trim == &trie->root) {
> +		next_bit = node->child[0] ? 0 : 1;
> +		rcu_assign_pointer(trie->root, node->child[next_bit]);
> +		kfree_rcu(node, rcu);

I don't think you should treat this 'root' case special.

Instead, move the 'next_bit' assignment outside of the condition ...

> +	} else if (rcu_access_pointer(node->child[next_bit])) {
> +		rcu_assign_pointer(*trim, node->child[next_bit]);
> +		kfree_rcu(node, rcu);

... and then this branch would handle the case just fine. Correct?

Otherwise, looks good to me!



Thanks,
Daniel

^ permalink raw reply

* Re: [PATCHv3 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
From: Stephen Hemminger @ 2017-09-20 16:56 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Michal Kubecek, Phil Sutter
In-Reply-To: <1505871820-31580-2-git-send-email-liuhangbin@gmail.com>

On Wed, 20 Sep 2017 09:43:39 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

Thanks for keeping up on this.

> +realloc:
> +	bufp = realloc(buf, buf_len);
> +
> +	if (bufp == NULL) {

Minor personal style issue:
To me, blank lines are like paragraphs in writing.
Code reads better assignment and condition check are next to
each other.

> +recv:
> +	len = recvmsg(fd, msg, flag);
> +
> +	if (len < 0) {
> +		if (errno == EINTR || errno == EAGAIN)
> +			goto recv;
> +		fprintf(stderr, "netlink receive error %s (%d)\n",
> +			strerror(errno), errno);
> +		free(buf);
> +		return -errno;
> +	}
> +
> +	if (len == 0) {
> +		fprintf(stderr, "EOF on netlink\n");
> +		free(buf);
> +		return -ENODATA;
> +	}
> +
> +	if (len > buf_len) {
> +		buf_len = len;
> +		flag = 0;
> +		goto realloc;
> +	}
> +
> +	if (flag != 0) {
> +		flag = 0;
> +		goto recv;

Although I programmed in BASIC years ago. I never liked code
with loops via goto. To me it indicates the logic is not well thought
through.  Not sure exactly how to rearrange the control flow, but it
should be possible to rewrite this so that it reads cleaner.

Still think this needs to go through a few more review cycles
before applying.

^ permalink raw reply

* [PATCH net-next 0/5] net: introduce noref sk
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

This series introduce the infrastructure to store inside the skb a socket
pointer without carrying a refcount to the socket.

Such infrastructure is then used in the network receive path - and
specifically the early demux operation.

This allows the UDP early demux to perform a full lookup for UDP sockets,
with many benefits:

- the UDP early demux code is now much simpler
- the early demux does not hit any performance penalties in case of UDP hash
  table collision - previously the early demux performed a partial, unsuccesful,
  lookup
- early demux is now operational also for unconnected sockets.

This infrastrcture will be used in follow-up series to allow dst caching for
unconnected UDP sockets, and than to extend the same features to TCP listening
sockets.

Paolo Abeni (5):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux

 include/linux/skbuff.h           | 30 +++++++++++++++
 include/linux/udp.h              |  2 +
 include/net/dst.h                | 12 ++++++
 net/core/dst.c                   | 16 ++++++++
 net/core/sock.c                  |  6 +++
 net/ipv4/ip_input.c              | 12 ++++++
 net/ipv4/ipmr.c                  | 18 +++++++--
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 ++
 net/ipv4/udp.c                   | 80 ++++++++++++++++------------------------
 net/ipv6/ip6_input.c             |  7 +++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 ++
 net/ipv6/udp.c                   | 67 ++++++++++-----------------------
 net/netfilter/nf_queue.c         |  3 ++
 13 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.13.5

^ permalink raw reply

* [PATCH net-next 1/5] net: add support for noref skb->sk
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

Noref sk do not carry a socket refcount, are valid
only inside the current RCU section and must be
explicitly cleared before exiting such section.

They will be used in a later patch to allow early demux
without sock refcounting.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 30 ++++++++++++++++++++++++++++++
 net/core/sock.c        |  6 ++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 72299ef00061..459a5672811d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -922,6 +922,36 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+void sock_dummyfree(struct sk_buff *skb);
+
+/* only early demux can set noref socks
+ * noref socks do not carry any refcount and must be
+ * cleared before exiting the current RCU section
+ */
+static inline void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
+{
+	skb->sk = sk;
+	skb->destructor = sock_dummyfree;
+}
+
+static inline bool skb_has_noref_sk(struct sk_buff *skb)
+{
+	return skb->destructor == sock_dummyfree;
+}
+
+static inline struct sock *skb_clear_noref_sk(struct sk_buff *skb)
+{
+	struct sock *ret;
+
+	if (!skb_has_noref_sk(skb))
+		return NULL;
+
+	ret = skb->sk;
+	skb->sk = NULL;
+	skb->destructor = NULL;
+	return ret;
+}
+
 /* For mangling skb->pkt_type from user space side from applications
  * such as nft, tc, etc, we only allow a conservative subset of
  * possible pkt_types to be set.
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..3aa4950639bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,6 +1893,12 @@ void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* dummy destructor used by noref sockets */
+void sock_dummyfree(struct sk_buff *skb)
+{
+}
+EXPORT_SYMBOL(sock_dummyfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
 	kuid_t uid;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 2/5] net: allow early demux to fetch noref socket
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

We must be careful to avoid leaking such sockets outside
the RCU section containing the early demux call; we clear
them on nonlocal delivery.

For ipv4 we must take care of local mcast delivery, too,
since udp early demux works also for mcast addresses.

Also update all iptables/nftables extension that can
happen in the input chain and can transmit the skb outside
such patch, namely TEE, nft_dup and nfqueue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/ip_input.c              | 12 ++++++++++++
 net/ipv4/ipmr.c                  | 18 ++++++++++++++----
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 +++
 net/ipv6/ip6_input.c             |  7 ++++++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 +++
 net/netfilter/nf_queue.c         |  3 +++
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fa2dc8f692c6..e71abc8b698c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -349,6 +349,18 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 				__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
 			goto drop;
 		}
+
+		/* Since the sk has no reference to the socket, we must
+		 * clear it before escaping this RCU section.
+		 * The sk is just an hint and we know we are not going to use
+		 * it outside the input path.
+		 */
+		if (skb_dst(skb)->input != ip_local_deliver
+#ifdef CONFIG_IP_MROUTE
+		    && skb_dst(skb)->input != ip_mr_input
+#endif
+		    )
+			skb_clear_noref_sk(skb);
 	}
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c9b3e6e069ae..76642af79038 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1978,11 +1978,12 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
  */
 int ip_mr_input(struct sk_buff *skb)
 {
-	struct mfc_cache *cache;
-	struct net *net = dev_net(skb->dev);
 	int local = skb_rtable(skb)->rt_flags & RTCF_LOCAL;
-	struct mr_table *mrt;
+	struct net *net = dev_net(skb->dev);
+	struct mfc_cache *cache;
 	struct net_device *dev;
+	struct mr_table *mrt;
+	struct sock *sk;
 
 	/* skb->dev passed in is the loX master dev for vrfs.
 	 * As there are no vifs associated with loopback devices,
@@ -2052,6 +2053,9 @@ int ip_mr_input(struct sk_buff *skb)
 			skb = skb2;
 		}
 
+		/* avoid leaking the noref sk on forward path */
+		skb_clear_noref_sk(skb);
+
 		read_lock(&mrt_lock);
 		vif = ipmr_find_vif(mrt, dev);
 		if (vif >= 0) {
@@ -2065,12 +2069,18 @@ int ip_mr_input(struct sk_buff *skb)
 		return -ENODEV;
 	}
 
+	/* avoid leaking the noref sk on forward path... */
+	sk = skb_clear_noref_sk(skb);
 	read_lock(&mrt_lock);
 	ip_mr_forward(net, mrt, dev, skb, cache, local);
 	read_unlock(&mrt_lock);
 
-	if (local)
+	if (local) {
+		/* ... but preserve it for local delivery */
+		if (sk)
+			skb_set_noref_sk(skb, sk);
 		return ip_local_deliver(skb);
+	}
 
 	return 0;
 
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 39895b9ddeb9..bf8b78492fc8 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -71,6 +71,9 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	/*
 	 * If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
 	 * loops between two hosts.
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a348f5..9aa6baffd4b9 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -65,9 +65,14 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux)))
 			edemux(skb);
 	}
-	if (!skb_valid_dst(skb))
+	if (!skb_valid_dst(skb)) {
 		ip6_route_input(skb);
 
+		/* see comment on ipv4 edmux */
+		if (skb_dst(skb)->input != ip6_input)
+			skb_clear_noref_sk(skb);
+	}
+
 	return dst_input(skb);
 }
 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a7ddeddbaab..939f6a2238f9 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -60,6 +60,9 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	if (hooknum == NF_INET_PRE_ROUTING ||
 	    hooknum == NF_INET_LOCAL_IN) {
 		struct ipv6hdr *iph = ipv6_hdr(skb);
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index f7e21953b1de..100eff08cb51 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -145,6 +145,9 @@ static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
 		.size	= sizeof(*entry) + afinfo->route_key_size,
 	};
 
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	nf_queue_entry_get_refs(entry);
 	skb_dst_force(skb);
 	afinfo->saveroute(skb, entry);
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 3/5] udp: do not touch socket refcount in early demux
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

use noref sockets instead. This gives some small performance
improvements and will allow efficient early demux for unconnected
sockets in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 18 ++++++++++--------
 net/ipv6/udp.c | 10 ++++++----
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..ba49d5aa9f09 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2050,12 +2050,13 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
-	struct sock *sk;
-	struct udphdr *uh;
-	unsigned short ulen;
+	struct net *net = dev_net(skb->dev);
 	struct rtable *rt = skb_rtable(skb);
+	unsigned short ulen;
 	__be32 saddr, daddr;
-	struct net *net = dev_net(skb->dev);
+	struct udphdr *uh;
+	struct sock *sk;
+	bool noref_sk;
 
 	/*
 	 *  Validate the packet.
@@ -2081,6 +2082,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -2090,7 +2092,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp_sk_rx_dst_set(sk, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 		/* a return value > 0 means to resubmit the input, but
 		 * it wants the return to be -protocol, or 0
 		 */
@@ -2261,11 +2264,10 @@ void udp_v4_early_demux(struct sk_buff *skb)
 					     uh->source, iph->saddr, dif, sdif);
 	}
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e2ecfb137297..8f62392c4c35 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -787,6 +787,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
+	bool noref_sk;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -823,6 +824,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -832,7 +834,8 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input */
 		if (ret > 0)
@@ -948,11 +951,10 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	else
 		return;
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 4/5] net: add simple socket-like dst cache helpers
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

It will be used by later patches to reduce code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/dst.h | 12 ++++++++++++
 net/core/dst.c    | 16 ++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..a6a39357f19a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -485,6 +485,18 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 	return dst;
 }
 
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst);
+static inline struct dst_entry *dst_access(struct dst_entry **cache,
+					      u32 cookie)
+{
+	struct dst_entry *dst = READ_ONCE(*cache);
+
+	if (!dst)
+		return NULL;
+
+	return dst_check(dst, cookie);
+}
+
 /* Flags for xfrm_lookup flags argument. */
 enum {
 	XFRM_LOOKUP_ICMP = 1 << 0,
diff --git a/net/core/dst.c b/net/core/dst.c
index a6c47da7d0f8..6aff0a3e7ba3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -205,6 +205,22 @@ void dst_release_immediate(struct dst_entry *dst)
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
+/* 'dst' is not refcounted */
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+	if (likely(*cache == dst))
+		return false;
+
+	if (dst_hold_safe(dst)) {
+		struct dst_entry *old = xchg(cache, dst);
+
+		dst_release(old);
+		return old != dst;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(dst_update);
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
 	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 5/5] udp: perform full socket lookup in early demux
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

Since UDP early demux lookup fetches noref socket references,
we can safely be optimistic about it and set the sk reference
even if the skb is not going to land on such socket, avoiding
the rx dst cache usage for unconnected unicast sockets.

This avoids a second lookup for unconnected sockets, and clean
up a bit the whole udp early demux code.

After this change, on hosts not acting as routers, the UDP
early demux never affect negatively the receive performances,
while before this change UDP early demux caused measurable
performance impact for unconnected sockets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h |  2 ++
 net/ipv4/udp.c      | 62 +++++++++++++++++++----------------------------------
 net/ipv6/udp.c      | 57 ++++++++++++------------------------------------
 3 files changed, 38 insertions(+), 83 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..9c68b57543cc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -92,6 +92,8 @@ static inline struct udp_sock *udp_sk(const struct sock *sk)
 	return (struct udp_sock *)sk;
 }
 
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie);
+
 static inline void udp_set_no_check6_tx(struct sock *sk, bool val)
 {
 	udp_sk(sk)->no_check6_tx = val;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ba49d5aa9f09..5cbbd78024dc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2043,6 +2043,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 							 inet_compute_pseudo);
 }
 
+static bool udp_use_rx_dst_cache(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_state == TCP_ESTABLISHED || skb->pkt_type != PACKET_HOST;
+}
+
 /*
  *	All we need to do is get the socket, and then do a checksum.
  */
@@ -2088,8 +2093,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
-			udp_sk_rx_dst_set(sk, dst);
+		if (udp_use_rx_dst_cache(sk, skb))
+			dst_update(&sk->sk_rx_dst, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
 		if (!noref_sk)
@@ -2196,42 +2201,28 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	return result;
 }
 
-/* For unicast we should only early demux connected sockets or we can
- * break forwarding setups.  The chains here can be long so only check
- * if the first socket is an exact match and if not move on.
- */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
-					    __be16 loc_port, __be32 loc_addr,
-					    __be16 rmt_port, __be32 rmt_addr,
-					    int dif, int sdif)
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 {
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
+	struct dst_entry *dst = dst_access(&sk->sk_rx_dst, cookie);
 
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (INET_MATCH(sk, net, acookie, rmt_addr,
-			       loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
+	if (dst) {
+		/* set noref for now.
+		 * any place which wants to hold dst has to call
+		 * dst_hold_safe()
+		 */
+		skb_dst_set_noref(skb, dst);
 	}
-	return NULL;
 }
+EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+	int sdif = inet_sdif(skb);
 	const struct iphdr *iph;
 	const struct udphdr *uh;
 	struct sock *sk = NULL;
-	struct dst_entry *dst;
-	int dif = skb->dev->ifindex;
-	int sdif = inet_sdif(skb);
 	int ours;
 
 	/* validate the packet */
@@ -2260,25 +2251,16 @@ void udp_v4_early_demux(struct sk_buff *skb)
 						   uh->source, iph->saddr,
 						   dif, sdif);
 	} else if (skb->pkt_type == PACKET_HOST) {
-		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
-					     uh->source, iph->saddr, dif, sdif);
+		sk = __udp4_lib_lookup(net, iph->saddr, uh->source, iph->daddr,
+				       uh->dest, dif, sdif, &udp_table, skb);
 	}
 
 	if (!sk)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, 0);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp_use_rx_dst_cache(sk, skb))
+		udp_set_skb_rx_dst(sk, skb, 0);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8f62392c4c35..67d340679c3a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -773,13 +773,18 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 static void udp6_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	if (udp_sk_rx_dst_set(sk, dst)) {
+	if (unlikely(dst_update(&sk->sk_rx_dst, dst))) {
 		const struct rt6_info *rt = (const struct rt6_info *)dst;
 
 		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
+static bool udp6_use_rx_dst_cache(struct sock *sk)
+{
+	return sk->sk_state == TCP_ESTABLISHED;
+}
+
 int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
@@ -830,7 +835,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
+		if (udp6_use_rx_dst_cache(sk))
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
@@ -905,37 +910,13 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	return 0;
 }
 
-
-static struct sock *__udp6_lib_demux_lookup(struct net *net,
-			__be16 loc_port, const struct in6_addr *loc_addr,
-			__be16 rmt_port, const struct in6_addr *rmt_addr,
-			int dif, int sdif)
-{
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp6_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
-
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (sk->sk_state == TCP_ESTABLISHED &&
-		    INET6_MATCH(sk, net, rmt_addr, loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
-	}
-	return NULL;
-}
-
 static void udp_v6_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	const struct udphdr *uh;
-	struct sock *sk;
-	struct dst_entry *dst;
 	int dif = skb->dev->ifindex;
 	int sdif = inet6_sdif(skb);
+	const struct udphdr *uh;
+	struct sock *sk;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) +
 	    sizeof(struct udphdr)))
@@ -944,10 +925,9 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	uh = udp_hdr(skb);
 
 	if (skb->pkt_type == PACKET_HOST)
-		sk = __udp6_lib_demux_lookup(net, uh->dest,
-					     &ipv6_hdr(skb)->daddr,
-					     uh->source, &ipv6_hdr(skb)->saddr,
-					     dif, sdif);
+		sk = __udp6_lib_lookup(net, &ipv6_hdr(skb)->saddr, uh->source,
+				       &ipv6_hdr(skb)->daddr, uh->dest, dif,
+				       sdif, &udp_table, skb);
 	else
 		return;
 
@@ -955,17 +935,8 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp6_use_rx_dst_cache(sk))
+		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Peter Zijlstra @ 2017-09-20 17:26 UTC (permalink / raw)
  To: Yonghong Song; +Cc: rostedt, ast, daniel, netdev, kernel-team, David Miller
In-Reply-To: <20170920060935.1102268-2-yhs@fb.com>

On Tue, Sep 19, 2017 at 11:09:32PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..2d5bbe5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
>   *     will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +			  u64 *enabled, u64 *running)
>  {
>  	unsigned long flags;
>  	int ret = 0;
> +	u64 now;
>  
>  	/*
>  	 * Disabling interrupts avoids all counter scheduling (context
> @@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  		goto out;
>  	}
>  
> +	now = event->shadow_ctx_time + perf_clock();
> +	if (enabled)
> +		*enabled = now - event->tstamp_enabled;
>  	/*
>  	 * If the event is currently on this CPU, its either a per-task event,
>  	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
>  	 * oncpu == -1).
>  	 */
> -	if (event->oncpu == smp_processor_id())
> +	if (event->oncpu == smp_processor_id()) {
>  		event->pmu->read(event);
> -
> +		if (running)
> +			*running = now - event->tstamp_running;
> +	} else if (running) {
> +		*running = event->total_time_running;
> +	}
>  	*value = local64_read(&event->count);
>  out:
>  	local_irq_restore(flags);

Yeah, this looks about right.

Dave, could we have this in a topic tree of sorts, because I have a
pending series to rework all the timekeeping and it might be nice to not
have sfr run into all sorts of conflicts.

^ permalink raw reply

* Re: [PATCH RFC V1 net-next 0/6] Time based packet transmission
From: levipearson @ 2017-09-20 17:35 UTC (permalink / raw)
  To: rcochran
  Cc: netdev, linux-kernel, intel-wired-lan, vinicius.gomes,
	andre.guedes, john.stultz, jesus.sanchez-palencia, henrik, tglx,
	anna-maria, davem
In-Reply-To: <cover.1505719061.git.rcochran@linutronix.de>

> This series is an early RFC that introduces a new socket option
> allowing time based transmission of packets.  This option will be
> useful in implementing various real time protocols over Ethernet,
> including but not limited to P802.1Qbv, which is currently finding
> its way into 802.1Q.
> 
> * Open questions about SO_TXTIME semantics
> 
>   - What should the kernel do if the dialed Tx time is in the past?
>     Should the packet be sent ASAP, or should we throw an error?

Based on the i210 and latest NXP/Freescale FEC launch time behavior,
the hardware timestamps work over 1-second windows corresponding to
the time elapsed since the last PTP second began. When considering the
head-of-queue frame, the launch time is compared to the elapsed time
counter and if the elapsed time is between exactly the launch time and
half a second after the launch time, it is launched. If you enqueue a
frame with a scheduled launch time that ends up more than half a second
late, it is considered by the hardware to be scheduled *in the future*
at the offset belonging to the next second after the 1-second window
wraps around.

So *slightly* late (<<.5sec late) frames could be scheduled as normal,
but approaching .5sec late frames would have to either be dropped or 
have their schedule changed to avoid blocking the queue for a large
fraction of a second.

I don't like the idea of changing the scheduled time, and anything that
is close to half a second late is most likely useless. But it is also
reasonable to let barely-late frames go out ASAP--in the case of a Qav-
shaped stream, the bunching would get smoothed out downstream. A timed
launch schedule need not be used as an exact time, but a "don't send
before time X" flag. Both are useful in different circumstances.

A configurable parameter for allowable lateness, with the upper bound
set by the driver based on the hardware capabilities, seems ideal.
Barring that, I would suggest dropping frames with already-missed
launch times.

> 
>   - Should the kernel inform the user if it detects a missed deadline,
>     via the error queue for example?

I think some sort of counter for mis-scheduled/late-delivered frames
would be in keeping with the general 802.1 error handling strategy.

> 
>   - What should the timescale be for the dialed Tx time?  Should the
>     kernel select UTC when using the SW Qdisc and the HW time
>     otherwise?  Or should the socket option include a clockid_t?

When I implemented something like this, I left it relative to the HW
time for the sake of simplicity, but I don't have a strong opinion.

> 
> * Things todo
> 
>   - Design a Qdisc for purpose of configuring SO_TXTIME.  There should
>     be one option to dial HW offloading or SW best effort.

You seem focused on Qbv, but there is another aspect of the endpoint
requirements for Qav that this would provide a perfect use case for. A
bridge can treat all traffic in a Qav-shaped class equally, but an
endpoint must essentially run one credit-based shaper per active stream
feeding into the class--this is because a stream must adhere to its
frames-per-interval promise in its t-spec, and when the observation
interval is not an even multiple of the sample rate, it will occasionally
have an observation interval with no frame. This leaves extra bandwidth
in the class reservation, but it cannot be used by any other stream if
it would cause more than one frame per interval to be sent!

Even if a stream is not explicitly scheduled in userspace, a per-stream
Qdisc could apply a rough launch time that the class Qdisc (or hardware
shaping) would use to ensure the frames-per-interval aspect of the
reservation for the stream is adhered to. For example, each observation
interval could be assigned a launch time, and all streams would get a
number of frames corresponding to their frames-per-interval reservation
assigned that same launch time before being put into the class queue.
The i210's shaper would then only consider the current interval's set 
of frames ready to launch, and spread them evenly with its hardware
credit-based shaping.

For industrial and automotive control applications, a Qbv Qdisc based on
SO_TXTIME would be very interesting, but pro and automotive media uses
will most likely continue to use SRP + Qav, and these are becoming
increasingly common uses as you can see by the growing support for Qav in
automotive chips.

>   - Implement the SW best effort variant.  Here is my back of the
>     napkin sketch.  Each interface has its own timerqueue keeping the
>     TXTIME packets in order and a FIFO for all other traffic.  A guard
>     window starts at the earliest deadline minus the maximum MTU minus
>     a configurable fudge factor.  The Qdisc uses a hrtimer to transmit
>     the next packet in the timerqueue.  During the guard window, all
>     other traffic is defered unless the next packet can be transmitted
>     before the guard window expires.

This sounds plausible to me.

> 
> * Current limitations
> 
>   - The driver does not handle out of order packets.  If user space
>     sends a packet with an earlier Tx time, then the code should stop
>     the queue, reshuffle the descriptors accordingly, and then
>     restart the queue.

You might store the last scheduled timestamp in the driver private struct
and drop any frame with a timestamp not greater or equal to the last one.

> 
>   - The driver does not correctly queue up packets in the distant
>     future.  The i210 has a limited time window of +/- 0.5 seconds.
>     Packets with a Tx time greater than that should be deferred in
>     order to enqueue them later on.

The limit is not half a second in the future, but half a second from the
previous scheduled frame if one is enqueued. Another use case for the last
scheduled frame field. There are definitely cases that might need to be
deferred though.

> 
> * Performance measurements
> 
>   1. Prepared a PC and the Device Under Test (DUT) each with an Intel
>      i210 card connected with a crossover cable.
>   2. The DUT was a Pentium(R) D CPU 2.80GHz running PREEMPT_RT
>      4.9.40-rt30 with about 50 usec maximum latency under cyclictest.
>   3. Synchronized the DUT's PHC to the PC's PHC using ptp4l.
>   4. Synchronized the DUT's system clock to its PHC using phc2sys.
>   5. Started netperf to produce some network load.
>   6. Measured the arrival time of the packets at the PC's PHC using
>      hardware time stamping.
> 
>   I ran ten minute tests both with and without using the so_txtime
>   option, with a period was 1 millisecond.  I then repeated the
>   so_txtime case but with a 250 microsecond period.  The measured
>   offset from the expected period (in nanoseconds) is shown in the
>   following table.
> 
>   |         | plain preempt_rt |     so_txtime | txtime @ 250 us |
>   |---------+------------------+---------------+-----------------|
>   | min:    |    +1.940800e+04 | +4.720000e+02 |   +4.720000e+02 |
>   | max:    |    +7.556000e+04 | +5.680000e+02 |   +5.760000e+02 |
>   | pk-pk:  |    +5.615200e+04 | +9.600000e+01 |   +1.040000e+02 |
>   | mean:   |    +3.292776e+04 | +5.072274e+02 |   +5.073602e+02 |
>   | stddev: |    +6.514709e+03 | +1.310849e+01 |   +1.507144e+01 |
>   | count:  |           600000 |        600000 |         2400000 |
> 
>   Using so_txtime, the peak to peak jitter is about 100 nanoseconds,
>   independent of the period.  In contrast, plain preempt_rt shows a
>   jitter of of 56 microseconds.  The average delay of 507 nanoseconds
>   when using so_txtime is explained by the documented input and output
>   delays on the i210 cards.
> 
>   The test program is appended, below.  If anyone is interested in
>   reproducing this test, I can provide helper scripts.
> 
> Thanks,
> Richard
> 

< most of test program snipped >

> 
> 	/*
> 	 * We specify the transmission time in the CMSG.
> 	 */
> 	if (use_so_txtime) {
> 		msg.msg_control = u.buf;
> 		msg.msg_controllen = sizeof(u.buf);
> 		cmsg = CMSG_FIRSTHDR(&msg);
> 		cmsg->cmsg_level = SOL_SOCKET;
> 		cmsg->cmsg_type = SO_TXTIME;
> 		cmsg->cmsg_len = CMSG_LEN(sizeof(__u64));
> 		*((__u64 *) CMSG_DATA(cmsg)) = txtime;
> 	}
> 	cnt = sendmsg(fd, &msg, 0);

An interesting use case I have explored is to increase efficiency by batching
transmissions with sendmmsg. This is attractive when getting large chunks of
audio data from ALSA and scheduling them for transmit all at once.

Anyway, I am wholly in favor of this proposal--in fact, it is very similar to
a patch set I shared with Eric Mann and others at Intel in early Dec 2016 with
the intention to get some early feedback before submitting here. I never heard
back and got busy with other things. I only mention this since you said
elsewhere that you got this idea from Eric Mann yourself, and I am curious
whether Eric and I came up with it independently (which I would not be
surprised at).

Levi

^ permalink raw reply

* Re: [PATCH net-next 1/5] net: add support for noref skb->sk
From: Eric Dumazet @ 2017-09-20 17:41 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <39b55ec9584df7dfd6eb498a3d354cd2e08e3eaa.1505926196.git.pabeni@redhat.com>

On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> Noref sk do not carry a socket refcount, are valid
> only inside the current RCU section and must be
> explicitly cleared before exiting such section.
> 
> They will be used in a later patch to allow early demux
> without sock refcounting.




> +/* dummy destructor used by noref sockets */
> +void sock_dummyfree(struct sk_buff *skb)
> +{

BUG();

> +}
> +EXPORT_SYMBOL(sock_dummyfree);
> +


I do not see how you ensure we do not leave RCU section with an skb
destructor pointing to this sock_dummyfree()

This patch series looks quite dangerous to me.

Do we really have real applications using connected UDP sockets and
wanting very high pps throughput ?

I am pretty sure the bottleneck is the sender part.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Wei Wang @ 2017-09-20 17:46 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Eric Dumazet, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <273f49c7-58c0-0b41-10ff-6f891ffb72e9@itcare.pl>

>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>    */
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -       this_cpu_dec(*dev->pcpu_refcnt);
>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +       if (!pref) {
>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>> +               for (;;)
>> +                       cpu_relax();
>> +       }
>> +       this_cpu_dec(*pref);
>>   }
>>     /**
>>

Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm with you about the last good commit when you did bisection.
You mentioned:

> And the last one
>
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
> insertion into fib6 tree
>
> With this have kernel panic same as always
>
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?

And could you confirm that your config does not have any IPv6
addresses or routes configured?

Thanks.
Wei


6:03 +0200, Paweł Staszewski wrote:
>>>
>>> Nit much more after adding this patch
>>>
>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>
>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>    */
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -       this_cpu_dec(*dev->pcpu_refcnt);
>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +       if (!pref) {
>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>> +               for (;;)
>> +                       cpu_relax();
>> +       }
>> +       this_cpu_dec(*pref);
>>   }
>>     /**
>>
>>
>>
>
> Full panic
>
> https://bugzilla.kernel.org/attachment.cgi?id=258531
>
>
> I will change patch and apply but later today cause now cant use backup
> router as testlab - Internet rush hours if something happens this will be
> bed when second router will have bugged kernel :)
>
>

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 17:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <1505913075.29839.90.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>      dev_put(dst->dev); <<CRASH>>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?

Looks like a double dev_put() on some dev...

Pawel, do you have any idea how this is triggered? Does your
test try to remove some network device? If so which one?
I noticed you have at least multiple vlan, bond and ixgbe
devices.

^ permalink raw reply

* Re: mac80211: avoid allocating TXQs that won't be used
From: Johannes Berg @ 2017-09-20 17:51 UTC (permalink / raw)
  To: Colin Ian King
  Cc: David S. Miller,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <3592b0b1-0455-ca9a-9ca7-702d7cf421ff-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

On Wed, 2017-09-20 at 17:08 +0100, Colin Ian King wrote:
> Johannes,
> 
> Static analysis with CoverityScan on linux-next today detected a null
> pointer dereference issue on commit:
> 
> From 0fc4b3403d215ecd3c05505ec1f0028a227ed319 Mon Sep 17 00:00:00
> 2001
> From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date: Thu, 22 Jun 2017 12:20:29 +0200
> Subject: [PATCH] mac80211: avoid allocating TXQs that won't be used
> 
> Issue: sdata is null when the sdata is dereferenced by:
> 
>                    sdata->vif.type != NL80211_IFTYPE_AP_VLAN &&
>                    sdata->vif.type != NL80211_IFTYPE_MONITOR)
> 
> note that sdata is assigned a non-null much later with the statement
> sdata = netdev_priv(ndev).

Yeah, umm, that should be checking just 'type'. Thanks, will fix.

johannes

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Eric Dumazet @ 2017-09-20 17:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <CAM_iQpV8qV8a_yFT0--4Z==FuWda=4EQzKJzHTL+op4S5b9E0A@mail.gmail.com>

On Wed, 2017-09-20 at 10:50 -0700, Cong Wang wrote:
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Sorry for top-posting, but this is to give context to Wei, since Pawel
> > used a top posting way to report his bisection.
> >
> > Wei, can you take a look at Pawel report ?
> >
> > Crash happens in dst_destroy() at following :
> >
> > if (dst->dev)
> >      dev_put(dst->dev); <<CRASH>>
> >
> >
> > dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
> >
> > 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
> >
> >
> >
> > Pawel, please share your netdevices and routing setup  ?
> 
> Looks like a double dev_put() on some dev...
> 
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.

Or a missing dev_hold() somewhere.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 17:58 UTC (permalink / raw)
  To: Wei Wang; +Cc: Eric Dumazet, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAEA6p_D8Av0QAJ46Gzx=Bt3uP-NshTmz2QoBzYaO8OK9+CFQNA@mail.gmail.com>



W dniu 2017-09-20 o 19:46, Wei Wang pisze:
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index
>>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>>> 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>     */
>>>    static inline void dev_put(struct net_device *dev)
>>>    {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>>> %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>    }
>>>      /**
>>>
> Thanks a lot Eric for the debug patch.
>
> Pawel,
>
> I want to confirm with you about the last good commit when you did bisection.
> You mentioned:
>
>> And the last one
>>
>> git bisect good
>> Bisecting: 1 revision left to test after this (roughly 1 step)
>> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
>> insertion into fib6 tree
>>
>> With this have kernel panic same as always
>>
>> git bisect bad
>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
>> remove the operation of dst_free()
>
> So it breaks right at:
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()
> Right?
> If you sync the image to one commit before the above one:
> [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
> Does it crash?
Later today i will repeat last three steps - in about next 3 hours after 
rush hours of internet traffic - now i cant touch backup router  :)

>
> And could you confirm that your config does not have any IPv6
> addresses or routes configured?
There is ipv6 enabled
And yes there are some ipv6 ip's
One interface have ipv6 enabled with one static route

  but no ipv6 bgp sessions - so nt many ipv6 prefixes and ipv6 fib is 
almost empty

ip -6 r ls | wc -l
57


>
> Thanks.
> Wei
>
>
> 6:03 +0200, Paweł Staszewski wrote:
>>>> Nit much more after adding this patch
>>>>
>>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>>
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index
>>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>>> 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>     */
>>>    static inline void dev_put(struct net_device *dev)
>>>    {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>>> %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>    }
>>>      /**
>>>
>>>
>>>
>> Full panic
>>
>> https://bugzilla.kernel.org/attachment.cgi?id=258531
>>
>>
>> I will change patch and apply but later today cause now cant use backup
>> router as testlab - Internet rush hours if something happens this will be
>> bed when second router will have bugged kernel :)
>>
>>

^ permalink raw reply

* Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6
From: Tom Herbert @ 2017-09-20 18:03 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Linux Kernel Network Developers, Pablo Neira Ayuso,
	Harald Welte, Rohit Seth
In-Reply-To: <20170918.211952.1397675528248642600.davem@davemloft.net>

On Mon, Sep 18, 2017 at 9:19 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@quantonium.net>
> Date: Mon, 18 Sep 2017 17:38:58 -0700
>
>> Allow peers to be specified by IPv6 addresses.
>>
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>
> Hmmm, can you just check the socket family or something like that?

I'm not sure what code you're referring to.

Thanks

^ permalink raw reply

* Re: [PATCH net-next 12/14] gtp: Configuration for zero UDP checksum
From: Tom Herbert @ 2017-09-20 18:09 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Linux Kernel Network Developers, Pablo Neira Ayuso,
	Harald Welte, Rohit Seth
In-Reply-To: <20170918.212445.1680068602644706922.davem@davemloft.net>

On Mon, Sep 18, 2017 at 9:24 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@quantonium.net>
> Date: Mon, 18 Sep 2017 17:39:02 -0700
>
>> Add configuration to control use of zero checksums on transmit for both
>> IPv4 and IPv6, and control over accepting zero IPv6 checksums on
>> receive.
>>
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>
> I thought we were trying to move away from this special case of allowing
> zero UDP checksums with tunnels, especially for ipv6.

I don't have a strong preference either way. I like consistency with
VXLAN and foo/UDP, but I guess it's not required. Interestingly, since
GTP only carries IP, IPv6 zero checksums are actually safer here than
VXLAN or GRE/UDP.

Tom

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 18:22 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Eric Dumazet, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <3c227be7-a954-a406-1987-24e908cf214c@itcare.pl>

On Wed, Sep 20, 2017 at 10:55 AM, Paweł Staszewski
<pstaszewski@itcare.pl> wrote:
>
>
> W dniu 2017-09-20 o 19:50, Cong Wang pisze:
>
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
>
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>      dev_put(dst->dev); <<CRASH>>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?
>
> Looks like a double dev_put() on some dev...
>
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.
>
> Just after i start bgp sessions
> So when host is starting i have all bgp sessions to upstreams shutdown
>
> To trigger panic i just enable all 6x bgp sessions at once to upstreams -
> and zebra is start to pull prefixes and push them to the kernel
>
> Then some traffic is generated from test hosts thru this backup router and
> panic is generated - every time after 10 to 15 seconds after bgp sessions
> are connected.
>
> I'm not removing any interface at this time or do anything with interfaces -
> just wait.
>
> And yes there are vlans attached to the bond devices
> but dmesg at this time shows nothing about interfaces or flaps.

This is very odd.

We only free netdevice in free_netdev() and it is only called when
we unregister a netdevice. Otherwise pcpu_refcnt is impossible
to be NULL.

^ permalink raw reply

* Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY
From: Corentin Labbe @ 2017-09-20 18:23 UTC (permalink / raw)
  To: Rob Herring
  Cc: Andrew Lunn, Mark Rutland, Florian Fainelli, Alexandre Torgue,
	devicetree@vger.kernel.org, Catalin Marinas, Will Deacon,
	Russell King, linux-kernel@vger.kernel.org, Chen-Yu Tsai, netdev,
	Giuseppe CAVALLARO, Maxime Ripard,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <CAL_JsqJ2eVA9Zg60Hagqm732ZoOzHWTwVBqOabOhjE5aQ26L+g@mail.gmail.com>

On Tue, Sep 19, 2017 at 09:49:52PM -0500, Rob Herring wrote:
> On Thu, Sep 14, 2017 at 2:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >> > Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"?
> >> > If the latter, then I think the node is fine, but then the mux should be
> >> > a child node of it. IOW, the child of an MDIO controller should either
> >> > be a mux node or slave devices.
> >
> > Hi Rob
> >
> > Up until now, children of an MDIO bus have been MDIO devices. Those
> > MDIO devices are either Ethernet PHYs, Ethernet Switches, or the
> > oddball devices that Broadcom iProc has, like generic PHYs.
> >
> > We have never had MDIO-muxes as MDIO children. A Mux is not an MDIO
> > device, and does not have the properties of an MDIO device. It is not
> > addressable on the MDIO bus. The current MUXes are addressed via GPIOs
> > or MMIO.
> 
> The DT parent/child relationship defines the bus topology. We describe
> MDIO buses in that way and if a mux is sitting between the controller
> and the devices, then the DT hierarchy should reflect that. Now
> sometimes we have 2 options for what interface has the parent/child
> relationship (e.g. an I2C controlled USB hub chip), but in this case
> we don't.
> 

Putting mdio-mux as a child of it (the mdio node) give me:
[   18.175338] libphy: stmmac: probed
[   18.175379] mdio_bus stmmac-0: /soc/ethernet@1c30000/mdio/mdio-mux has invalid PHY address
[   18.175408] mdio_bus stmmac-0: scan phy mdio-mux at address 0
[   18.175450] mdio_bus stmmac-0: scan phy mdio-mux at address 1
[   18.175482] mdio_bus stmmac-0: scan phy mdio-mux at address 2
[   18.175513] mdio_bus stmmac-0: scan phy mdio-mux at address 3
[   18.175544] mdio_bus stmmac-0: scan phy mdio-mux at address 4
[   18.175575] mdio_bus stmmac-0: scan phy mdio-mux at address 5
[   18.175607] mdio_bus stmmac-0: scan phy mdio-mux at address 6
[   18.175638] mdio_bus stmmac-0: scan phy mdio-mux at address 7
[   18.175669] mdio_bus stmmac-0: scan phy mdio-mux at address 8
[   18.175700] mdio_bus stmmac-0: scan phy mdio-mux at address 9
[   18.175731] mdio_bus stmmac-0: scan phy mdio-mux at address 10
[   18.175762] mdio_bus stmmac-0: scan phy mdio-mux at address 11
[   18.175795] mdio_bus stmmac-0: scan phy mdio-mux at address 12
[   18.175827] mdio_bus stmmac-0: scan phy mdio-mux at address 13
[   18.175858] mdio_bus stmmac-0: scan phy mdio-mux at address 14
[   18.175889] mdio_bus stmmac-0: scan phy mdio-mux at address 15
[   18.175919] mdio_bus stmmac-0: scan phy mdio-mux at address 16
[   18.175951] mdio_bus stmmac-0: scan phy mdio-mux at address 17
[   18.175982] mdio_bus stmmac-0: scan phy mdio-mux at address 18
[   18.176014] mdio_bus stmmac-0: scan phy mdio-mux at address 19
[   18.176045] mdio_bus stmmac-0: scan phy mdio-mux at address 20
[   18.176076] mdio_bus stmmac-0: scan phy mdio-mux at address 21
[   18.176107] mdio_bus stmmac-0: scan phy mdio-mux at address 22
[   18.176139] mdio_bus stmmac-0: scan phy mdio-mux at address 23
[   18.176170] mdio_bus stmmac-0: scan phy mdio-mux at address 24
[   18.176202] mdio_bus stmmac-0: scan phy mdio-mux at address 25
[   18.176233] mdio_bus stmmac-0: scan phy mdio-mux at address 26
[   18.176271] mdio_bus stmmac-0: scan phy mdio-mux at address 27
[   18.176320] mdio_bus stmmac-0: scan phy mdio-mux at address 28
[   18.176371] mdio_bus stmmac-0: scan phy mdio-mux at address 29
[   18.176420] mdio_bus stmmac-0: scan phy mdio-mux at address 30
[   18.176452] mdio_bus stmmac-0: scan phy mdio-mux at address 31

Adding a fake <reg> to mdio-mux remove it, but I found that a bit ugly.
Or perhaps patching of_mdiobus_register() to not scan node with compatible "mdio-mux".

What do you think ?

Regards

^ permalink raw reply

* Re: cross namespace interface notification for tun devices
From: Cong Wang @ 2017-09-20 18:29 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Netdev, Mathias
In-Reply-To: <CAHmME9oSZeAoDtSLxoGiEcTcCbmPLf-nj3OJcJ1ma8fs4BWQAw@mail.gmail.com>

On Tue, Sep 19, 2017 at 2:02 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> By "notification" I assume you mean netlink notification.
>
> Yes, netlink notification.
>
>> The question is why does the process in A still care about
>> the device sitting in B?
>>
>> Also, the process should be able to receive a last notification
>> on IFF_UP|IFF_RUNNING before device is finally moved to B.
>> After this point, it should not have any relation to netns A
>> any more, like the device were completely gone.
>
> That's very clearly not the case with a tun device. Tun devices work
> by letting a userspace process control the inputs (ndo_start_xmit) and
> outputs (netif_rx) of the actual network device. This controlling
> userspace process needs to know when its own interface that it
> controls goes up and down. In the kernel, we can do this by just
> checking dev->flags&IFF_UP, and receive notifications on ndo_open and
> ndo_stop. In userspace, the controlling process looses the ability to
> receive notifications like ndo_open/ndo_stop when the interface is
> moved to a new namespace. After the interface is moved to a namespace,
> the process will still control inputs and ouputs (ndo_start_xmit and
> netif_rx), but it will no longer receive netlink notifications for the
> equivalent of ndo_open and ndo_stop. This is problematic.

Sounds like we should set NETIF_F_NETNS_LOCAL for tun
device.

What is your legitimate use case of send/receive packet to/from
a tun device in a different netns?

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Eric Dumazet @ 2017-09-20 18:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <CAM_iQpUOOCcES4d_ueK2wizn2oyDZPEzLfWQFu3oOq=BuRzdiw@mail.gmail.com>

On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
> but dmesg at this time shows nothing about interfaces or flaps.
> 
> This is very odd.
> 
> We only free netdevice in free_netdev() and it is only called when
> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
> to be NULL.

If there is a missing dev_hold() or one dev_put() in excess,
this would allow the netdev to be freed too soon.

-> Use after free.
memory holding netdev could be reallocated-cleared by some other kernel
user.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 18:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <1505932231.29839.106.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>> but dmesg at this time shows nothing about interfaces or flaps.
>>
>> This is very odd.
>>
>> We only free netdevice in free_netdev() and it is only called when
>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>> to be NULL.
>
> If there is a missing dev_hold() or one dev_put() in excess,
> this would allow the netdev to be freed too soon.
>
> -> Use after free.
> memory holding netdev could be reallocated-cleared by some other kernel
> user.
>

Sure, but only unregister could trigger a free. If there is no unregister,
like what Pawel claims, then there is no free, the refcnt just goes to
0 but the memory is still there.

^ permalink raw reply

* usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Andrey Konovalov @ 2017-09-20 18:37 UTC (permalink / raw)
  To: Christian Lamparter, Kalle Valo, linux-wireless, netdev, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller

Hi!

I've got the following report while fuzzing the kernel with syzkaller.

On commit ebb2c2437d8008d46796902ff390653822af6cc4 (Sep 18).

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
4.14.0-rc1-42251-gebb2c2437d80-dirty #205
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
 __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
 lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
 flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
 __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
 cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
 p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
 p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
 p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
 process_scheduled_works kernel/workqueue.c:2179
 worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
 kthread+0x3a1/0x470 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

^ permalink raw reply

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
From: Craig Gallek @ 2017-09-20 18:51 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev
In-Reply-To: <47e47f0d-d04d-b52b-647a-7883aa5f3268@zonque.org>

On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>       struct lpm_trie_node __rcu **trim;
>>       struct lpm_trie_node *node;
>>       unsigned long irq_flags;
>> -     unsigned int next_bit;
>> +     unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized the the root-removal
case).

> [...]
>
>> +     /* If the node has one child, we may be able to collapse the tree
>> +      * while removing this node if the node's child is in the same
>> +      * 'next bit' slot as this node was in its parent or if the node
>> +      * itself is the root.
>> +      */
>> +     if (trim == &trie->root) {
>> +             next_bit = node->child[0] ? 0 : 1;
>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>> +             kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox