Netdev List

* Development Setup
From: sdnlabs Janakaraj @ 2017-09-19 23:44 UTC
  To: netdev, dev, linux-kernel

Dear all,
I am a newbie, and I am curious to know which development tools, with
Ubuntu as the host OS, best fit people entering Linux kernel
development with a focus on Netlink, Netdev and Wireless MAC.

I have read many blogs describing the basic setup, but I feel that
input from developers currently working in this area would be more
useful.

-devprabhu-

* Re: [PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers
From: David Miller @ 2017-09-19 23:44 UTC
  To: mcroce; +Cc: netdev, linux-doc, ek
In-Reply-To: <20170912154637.12996-1-mcroce@redhat.com>

From: Matteo Croce <mcroce@redhat.com>
Date: Tue, 12 Sep 2017 17:46:37 +0200

> Currently, writing into
> net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
> Fix handling of these flags by:
> 
> - using the maximum of global and per-interface values for the
>   accept_dad flag. That is, if at least one of the two values is
>   non-zero, enable DAD on the interface. If at least one value is
>   set to 2, enable DAD and disable IPv6 operation on the interface if
>   MAC-based link-local address was found
> 
> - using the logical OR of global and per-interface values for the
>   optimistic_dad flag. If at least one of them is set to one, optimistic
>   duplicate address detection (RFC 4429) is enabled on the interface
> 
> - using the logical OR of global and per-interface values for the
>   use_optimistic flag. If at least one of them is set to one,
>   optimistic addresses won't be marked as deprecated during source address
>   selection on the interface.
> 
> While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
> drop inline, and let the compiler decide.
> 
> Fixes: 7fd2561e4ebd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates")
> Signed-off-by: Matteo Croce <mcroce@redhat.com>

Applied, thank you.
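
As a rough illustration of the combination rules quoted above, the
following userspace sketch computes the effective per-interface settings.
The struct dad_conf type and the effective_*() helpers are hypothetical
names used only for this example; they are not the functions touched by
the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one set of sysctl values ("all" or one interface). */
struct dad_conf {
	int accept_dad;		/* 0, 1 or 2 */
	int optimistic_dad;	/* 0 or 1 */
	int use_optimistic;	/* 0 or 1 */
};

/* accept_dad: the larger of the global and per-interface values wins. */
static int effective_accept_dad(const struct dad_conf *all,
				const struct dad_conf *iface)
{
	assert(all && iface);
	return all->accept_dad > iface->accept_dad ?
	       all->accept_dad : iface->accept_dad;
}

/* optimistic_dad: logical OR of the global and per-interface values. */
static bool effective_optimistic_dad(const struct dad_conf *all,
				     const struct dad_conf *iface)
{
	assert(all && iface);
	return all->optimistic_dad || iface->optimistic_dad;
}

/* use_optimistic: logical OR of the global and per-interface values. */
static bool effective_use_optimistic(const struct dad_conf *all,
				     const struct dad_conf *iface)
{
	assert(all && iface);
	return all->use_optimistic || iface->use_optimistic;
}
```

With "all" set to accept_dad=2 and an interface set to accept_dad=1, the
effective value in this model is 2, which matches the "maximum" rule in
the quoted description.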

* Re: [PATCH] net: ipv6: fix regression of no RTM_DELADDR sent after DAD failure
From: David Miller @ 2017-09-19 23:43 UTC
  To: mmanning; +Cc: netdev, maheshb
In-Reply-To: <ce045f3d-c99a-e9a2-a0e7-c4d0410f0665@brocade.com>

From: Mike Manning <mmanning@brocade.com>
Date: Mon, 18 Sep 2017 14:06:40 +0100

> In the absence of a reply from Mahesh, I would be most grateful for
> anyone familiar with the IPv6 code to review this 1-line fix.
> 
> Or if not, then I request that the commit f784ad3d79e5 is backed out,
> as its intention is to remove the redundant but harmless RTM_DELADDR
> for addresses in tentative state, but is also incorrectly removing the
> very necessary RTM_DELADDR when an address is deleted that was previously
> notified with an RTM_NEWADDR as being in tentative dadfailed state.

I've applied your patch, and queued it up for -stable, thanks.

* Re: [PATCH net v2] bpf: fix ri->map_owner pointer on bpf_prog_realloc
From: David Miller @ 2017-09-19 23:39 UTC
  To: daniel; +Cc: john.fastabend, ast, netdev
In-Reply-To: <19ba0964a02127c74fbf6fb41f06ab68117d9989.1505860401.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 20 Sep 2017 00:44:21 +0200

> Commit 109980b894e9 ("bpf: don't select potentially stale
> ri->map from buggy xdp progs") passed the pointer to the prog
> itself to be loaded into r4 prior on bpf_redirect_map() helper
> call, so that we can store the owner into ri->map_owner out of
> the helper.
> 
> Issue with that is that the actual address of the prog is still
> subject to change when subsequent rewrites occur that require
> slow path in bpf_prog_realloc() to alloc more memory, e.g. from
> patching inlining helper functions or constant blinding. Thus,
> we really need to take prog->aux as the address we're holding,
> which also works with prog clones as they share the same aux
> object.
> 
> Instead of then fetching aux->prog during runtime, which could
> potentially incur cache misses due to false sharing, we are
> going to just use aux for comparison on the map owner. This
> will also keep the patchlet of the same size, and later check
> in xdp_map_invalid() only accesses read-only aux pointer from
> the prog, it's also in the same cacheline already from prior
> access when calling bpf_func.
> 
> Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  v1->v2:
>   - Decided to go with prog->aux instead.

Applied, thanks Daniel.
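
The address problem described in the quoted commit message can be
modelled in userspace. This is only an analogy under stated assumptions:
struct prog, struct prog_aux, prog_alloc() and prog_grow() are made-up
stand-ins, and realloc() plays the role of the slow path in
bpf_prog_realloc(). The point is that the address of the program object
may change when it grows, while the separately allocated aux object keeps
its address and is therefore the safer thing to remember as an owner.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct bpf_prog_aux. */
struct prog_aux {
	int id;
};

/* Hypothetical, simplified stand-in for struct bpf_prog. */
struct prog {
	struct prog_aux *aux;	/* allocated separately, survives a grow */
	size_t len;
	unsigned char insns[];	/* grows when the program is rewritten */
};

static struct prog *prog_alloc(size_t len)
{
	struct prog *p = calloc(1, sizeof(*p) + len);

	if (!p)
		return NULL;
	p->aux = calloc(1, sizeof(*p->aux));
	if (!p->aux) {
		free(p);
		return NULL;
	}
	p->len = len;
	return p;
}

/* Grow the program; the returned address may differ from @p. */
static struct prog *prog_grow(struct prog *p, size_t new_len)
{
	struct prog *np;

	assert(p && new_len >= p->len);
	np = realloc(p, sizeof(*np) + new_len);
	if (!np)
		return NULL;
	np->len = new_len;
	return np;		/* np->aux is the same pointer as before */
}

static void prog_free(struct prog *p)
{
	free(p->aux);
	free(p);
}
```

A caller that stored the old struct prog pointer before prog_grow() may
be left with a stale address, whereas a stored aux pointer still compares
equal afterwards.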

* Re: [PATCH v2 net-next 0/7] net: speedup netns create/delete time
From: David Miller @ 2017-09-19 23:32 UTC
  To: edumazet; +Cc: netdev, ebiederm, eric.dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 19 Sep 2017 16:27:02 -0700

> When rate of netns creation/deletion is high enough,
> we observe softlockups in cleanup_net() caused by huge list
> of netns and way too many rcu_barrier() calls.
> 
> This patch series does some optimizations in kobject,
> and add batching to tunnels so that netns dismantles are
> less costly.
 ...

Series applied, thanks Eric.

* Re: [PATCH net-next] net_sched: no need to free qdisc in RCU callback
From: David Miller @ 2017-09-19 23:30 UTC
  To: xiyou.wangcong; +Cc: netdev, jhs, edumazet
In-Reply-To: <20170919201542.14890-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 19 Sep 2017 13:15:42 -0700

> gen estimator has been rewritten in commit 1c0d32fde5bd
> ("net_sched: gen_estimator: complete rewrite of rate estimators"),
> the caller no longer needs to wait for a grace period. So this
> patch gets rid of it.
> 
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Nice.

Applied, thanks.

* Re: [PATCH] isdn/i4l: check the message proto does not change across fetches
From: David Miller @ 2017-09-19 23:29 UTC
  To: mengxu.gatech
  Cc: isdn, johannes.berg, netdev, linux-kernel, meng.xu, sanidhya,
	taesoo
In-Reply-To: <1505847178-3179-1-git-send-email-mengxu.gatech@gmail.com>

From: Meng Xu <mengxu.gatech@gmail.com>
Date: Tue, 19 Sep 2017 14:52:58 -0400

> In isdn_ppp_write(), the header (i.e., protobuf) of the buffer is fetched
> twice from userspace. The first fetch is used to peek at the protocol
> of the message and reset the huptimer if necessary; while the second
> fetch copies in the whole buffer. However, given that buf resides in
> userspace memory, a user process can race to change its memory content
> across fetches. By doing so, we can either avoid resetting the huptimer
> for any type of packets (by first setting proto to PPP_LCP and later
> change to the actual type) or force resetting the huptimer for LCP packets.
> 
> This patch does a memcmp between the two fetches and abort if changes to
> the protobuf is detected across fetches.
> 
> Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>

Doing a memcmp() for every buffer is expensive, ugly, and not the
way we usually handle this kind of issue.

Instead, atomically copy the entire buffer, as needed.

Something like:

	struct sk_buff *skb = NULL;
	unsigned char protobuf[4];
	unsigned char *cpy_buf;

	if (lp->isdn_device >= 0 && lp->isdn_channel >= 0 &&
	    (dev->drv[lp->isdn_device]->flags & DRV_FLAG_RUNNING) &&
	    lp->dialstate == 0 &&
	    (lp->flags & ISDN_NET_CONNECTED)) {
		/*
		 * We need to reserve enough space in front of the
		 * sk_buff. The old call to dev_alloc_skb() only reserved
		 * 16 bytes; now we look at what the driver wants.
		 */
		hl = dev->drv[lp->isdn_device]->interface->hl_hdrlen;
		skb = alloc_skb(hl + count, GFP_ATOMIC);
		if (!skb) {
			printk(KERN_WARNING "isdn_ppp_write: out of memory!\n");
			return count;
		}
		skb_reserve(skb, hl);
		cpy_buf = skb_put(skb, count);
	} else {
		cpy_buf = protobuf;
		count = sizeof(protobuf);
	}
	if (copy_from_user(cpy_buf, buf, count)) {
		kfree_skb(skb);
		return -EFAULT;
	}
	proto = PPP_PROTOCOL(cpy_buf);
	if (proto != PPP_LCP)
		lp->huptimer = 0;
	...
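
The single-copy idea can also be shown outside the kernel. What follows
is a minimal userspace sketch, assuming memcpy() stands in for
copy_from_user(); snapshot_and_parse() is a hypothetical helper, and the
two macros are redefined locally with the same shape as the kernel's
PPP_PROTOCOL() and PPP_LCP so that the example builds on its own. The
source buffer is read exactly once, and every later decision is made on
the private copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copies so the sketch builds in userspace. */
#define PPP_LCP		0xc021
#define PPP_PROTOCOL(p) \
	((((const unsigned char *)(p))[2] << 8) + ((const unsigned char *)(p))[3])

/*
 * Hypothetical helper: snapshot @count bytes from @src (standing in for
 * user memory) exactly once, then derive the protocol from the snapshot.
 * Returns the protocol, or -1 if the frame is too short or does not fit.
 */
static int snapshot_and_parse(unsigned char *dst, size_t dst_len,
			      const unsigned char *src, size_t count)
{
	assert(dst && src);
	if (count < 4 || count > dst_len)
		return -1;
	memcpy(dst, src, count);	/* the only read of @src */
	return PPP_PROTOCOL(dst);
}
```

Because the protocol is taken from the copy, a later change to the source
buffer cannot alter what was parsed.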

* [PATCH v2 net-next 7/7] ipv4: speedup ipv6 tunnels dismantle
From: Eric Dumazet @ 2017-09-19 23:27 UTC
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

Implement exit_batch() method to dismantle more devices
per round.

(rtnl_lock() ...
 unregister_netdevice_many() ...
 rtnl_unlock())
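
The effect of batching can be illustrated with a small userspace model.
It is a sketch only: struct teardown_stats and both functions are
hypothetical, and a counter stands in for each rtnl_lock() /
unregister_netdevice_many() / rtnl_unlock() round. The model shows the
number of rounds, not real timings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for the cost model. */
struct teardown_stats {
	unsigned int lock_rounds;	/* lock + flush + unlock cycles */
	unsigned int devices_freed;
};

/* Per-namespace exit: N namespaces cost N lock/flush rounds. */
static void exit_one_by_one(const unsigned int *devs_per_net, size_t n_nets,
			    struct teardown_stats *st)
{
	assert(st);
	for (size_t i = 0; i < n_nets; i++) {
		st->lock_rounds++;
		st->devices_freed += devs_per_net[i];
	}
}

/* Batched exit: queue devices from every namespace, then flush once. */
static void exit_batched(const unsigned int *devs_per_net, size_t n_nets,
			 struct teardown_stats *st)
{
	unsigned int queued = 0;

	assert(st);
	if (!n_nets)
		return;
	st->lock_rounds++;
	for (size_t i = 0; i < n_nets; i++)
		queued += devs_per_net[i];
	st->devices_freed += queued;
}
```

For four namespaces the first variant takes four rounds and the second
takes one, while both free the same number of devices.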

Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch:
$ time ./add_del_unshare.sh
net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0

real    1m38.965s
user    0m0.688s
sys     0m37.017s

After patch:
$ time ./add_del_unshare.sh
net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0

real	0m22.117s
user	0m0.728s
sys	0m35.328s

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ip_tunnels.h |  3 ++-
 net/ipv4/ip_gre.c        | 22 +++++++++-------------
 net/ipv4/ip_tunnel.c     | 12 +++++++++---
 net/ipv4/ip_vti.c        |  7 +++----
 net/ipv4/ipip.c          |  7 +++----
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 992652856fe8c7c1032e0f5f92ce7ee5aa0119da..b41a1e057fcec9d6e4c5a0c1cafd1f1d537ccd53 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -258,7 +258,8 @@ int ip_tunnel_get_iflink(const struct net_device *dev);
 int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
+void ip_tunnel_delete_nets(struct list_head *list_net, unsigned int id,
+			   struct rtnl_link_ops *ops);
 
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		    const struct iphdr *tnl_params, const u8 protocol);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0162fb955b33abf18514cbfd482e72a0ebce6e48..9cee986ac6b8ed04ff95e193fe1e8e60e74d84a9 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1013,15 +1013,14 @@ static int __net_init ipgre_init_net(struct net *net)
 	return ip_tunnel_init_net(net, ipgre_net_id, &ipgre_link_ops, NULL);
 }
 
-static void __net_exit ipgre_exit_net(struct net *net)
+static void __net_exit ipgre_exit_batch_net(struct list_head *list_net)
 {
-	struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id);
-	ip_tunnel_delete_net(itn, &ipgre_link_ops);
+	ip_tunnel_delete_nets(list_net, ipgre_net_id, &ipgre_link_ops);
 }
 
 static struct pernet_operations ipgre_net_ops = {
 	.init = ipgre_init_net,
-	.exit = ipgre_exit_net,
+	.exit_batch = ipgre_exit_batch_net,
 	.id   = &ipgre_net_id,
 	.size = sizeof(struct ip_tunnel_net),
 };
@@ -1540,15 +1539,14 @@ static int __net_init ipgre_tap_init_net(struct net *net)
 	return ip_tunnel_init_net(net, gre_tap_net_id, &ipgre_tap_ops, "gretap0");
 }
 
-static void __net_exit ipgre_tap_exit_net(struct net *net)
+static void __net_exit ipgre_tap_exit_batch_net(struct list_head *list_net)
 {
-	struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id);
-	ip_tunnel_delete_net(itn, &ipgre_tap_ops);
+	ip_tunnel_delete_nets(list_net, gre_tap_net_id, &ipgre_tap_ops);
 }
 
 static struct pernet_operations ipgre_tap_net_ops = {
 	.init = ipgre_tap_init_net,
-	.exit = ipgre_tap_exit_net,
+	.exit_batch = ipgre_tap_exit_batch_net,
 	.id   = &gre_tap_net_id,
 	.size = sizeof(struct ip_tunnel_net),
 };
@@ -1559,16 +1557,14 @@ static int __net_init erspan_init_net(struct net *net)
 				  &erspan_link_ops, "erspan0");
 }
 
-static void __net_exit erspan_exit_net(struct net *net)
+static void __net_exit erspan_exit_batch_net(struct list_head *net_list)
 {
-	struct ip_tunnel_net *itn = net_generic(net, erspan_net_id);
-
-	ip_tunnel_delete_net(itn, &erspan_link_ops);
+	ip_tunnel_delete_nets(net_list, erspan_net_id, &erspan_link_ops);
 }
 
 static struct pernet_operations erspan_net_ops = {
 	.init = erspan_init_net,
-	.exit = erspan_exit_net,
+	.exit_batch = erspan_exit_batch_net,
 	.id   = &erspan_net_id,
 	.size = sizeof(struct ip_tunnel_net),
 };
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index e9805ad664ac24c3405ad015cfaab89dc1c95279..fe6fee728ce49d01b55aa478698e1a3bcf9a3bdb 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -1061,16 +1061,22 @@ static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head,
 	}
 }
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops)
+void ip_tunnel_delete_nets(struct list_head *net_list, unsigned int id,
+			   struct rtnl_link_ops *ops)
 {
+	struct ip_tunnel_net *itn;
+	struct net *net;
 	LIST_HEAD(list);
 
 	rtnl_lock();
-	ip_tunnel_destroy(itn, &list, ops);
+	list_for_each_entry(net, net_list, exit_list) {
+		itn = net_generic(net, id);
+		ip_tunnel_destroy(itn, &list, ops);
+	}
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
-EXPORT_SYMBOL_GPL(ip_tunnel_delete_net);
+EXPORT_SYMBOL_GPL(ip_tunnel_delete_nets);
 
 int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
 		      struct ip_tunnel_parm *p, __u32 fwmark)
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 5ed63d25095062d44dacfd291e227290d24ea0ed..02d70ca99db16f2a50e3e179a05e74b535865f46 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -452,15 +452,14 @@ static int __net_init vti_init_net(struct net *net)
 	return 0;
 }
 
-static void __net_exit vti_exit_net(struct net *net)
+static void __net_exit vti_exit_batch_net(struct list_head *list_net)
 {
-	struct ip_tunnel_net *itn = net_generic(net, vti_net_id);
-	ip_tunnel_delete_net(itn, &vti_link_ops);
+	ip_tunnel_delete_nets(list_net, vti_net_id, &vti_link_ops);
 }
 
 static struct pernet_operations vti_net_ops = {
 	.init = vti_init_net,
-	.exit = vti_exit_net,
+	.exit_batch = vti_exit_batch_net,
 	.id   = &vti_net_id,
 	.size = sizeof(struct ip_tunnel_net),
 };
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index fb1ad22b5e292d5669c70b5640ad3207c353c6bb..1e47818e38c766a3dab63dfa6bfa9610fa9550ac 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -634,15 +634,14 @@ static int __net_init ipip_init_net(struct net *net)
 	return ip_tunnel_init_net(net, ipip_net_id, &ipip_link_ops, "tunl0");
 }
 
-static void __net_exit ipip_exit_net(struct net *net)
+static void __net_exit ipip_exit_batch_net(struct list_head *list_net)
 {
-	struct ip_tunnel_net *itn = net_generic(net, ipip_net_id);
-	ip_tunnel_delete_net(itn, &ipip_link_ops);
+	ip_tunnel_delete_nets(list_net, ipip_net_id, &ipip_link_ops);
 }
 
 static struct pernet_operations ipip_net_ops = {
 	.init = ipip_init_net,
-	.exit = ipip_exit_net,
+	.exit_batch = ipip_exit_batch_net,
 	.id   = &ipip_net_id,
 	.size = sizeof(struct ip_tunnel_net),
 };
-- 
2.14.1.690.gbb1197296e-goog

* [PATCH v2 net-next 6/7] ipv6: speedup ipv6 tunnels dismantle
From: Eric Dumazet @ 2017-09-19 23:27 UTC
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

Implement exit_batch() method to dismantle more devices
per round.

(rtnl_lock() ...
 unregister_netdevice_many() ...
 rtnl_unlock())

Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch:
$ time ./add_del_unshare.sh
net_namespace        110    267   5504    1    2 : tunables    8    4    0 : slabdata    110    267      0

real    3m25.292s
user    0m0.644s
sys     0m40.153s

After patch:

$ time ./add_del_unshare.sh
net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0

real	1m38.965s
user	0m0.688s
sys	0m37.017s

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_gre.c    |  8 +++++---
 net/ipv6/ip6_tunnel.c | 20 +++++++++++---------
 net/ipv6/ip6_vti.c    | 23 ++++++++++++++---------
 net/ipv6/sit.c        |  9 ++++++---
 4 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index b7a72d40933441f835708f55e2d8af371661a5fb..c82d41ef25e283ff92b1eed1f8b927c9d7b8f333 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1155,19 +1155,21 @@ static int __net_init ip6gre_init_net(struct net *net)
 	return err;
 }
 
-static void __net_exit ip6gre_exit_net(struct net *net)
+static void __net_exit ip6gre_exit_batch_net(struct list_head *net_list)
 {
+	struct net *net;
 	LIST_HEAD(list);
 
 	rtnl_lock();
-	ip6gre_destroy_tunnels(net, &list);
+	list_for_each_entry(net, net_list, exit_list)
+		ip6gre_destroy_tunnels(net, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
 
 static struct pernet_operations ip6gre_net_ops = {
 	.init = ip6gre_init_net,
-	.exit = ip6gre_exit_net,
+	.exit_batch = ip6gre_exit_batch_net,
 	.id   = &ip6gre_net_id,
 	.size = sizeof(struct ip6gre_net),
 };
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index ae73164559d5c4d7f2650ae63c56d76dc93b165c..3d6df489b39f00014f330340927c4d11a64911c2 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -2167,17 +2167,16 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = {
 	.priority	=	1,
 };
 
-static void __net_exit ip6_tnl_destroy_tunnels(struct net *net)
+static void __net_exit ip6_tnl_destroy_tunnels(struct net *net, struct list_head *list)
 {
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 	struct net_device *dev, *aux;
 	int h;
 	struct ip6_tnl *t;
-	LIST_HEAD(list);
 
 	for_each_netdev_safe(net, dev, aux)
 		if (dev->rtnl_link_ops == &ip6_link_ops)
-			unregister_netdevice_queue(dev, &list);
+			unregister_netdevice_queue(dev, list);
 
 	for (h = 0; h < IP6_TUNNEL_HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
@@ -2186,12 +2185,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct net *net)
 			 * been added to the list by the previous loop.
 			 */
 			if (!net_eq(dev_net(t->dev), net))
-				unregister_netdevice_queue(t->dev, &list);
+				unregister_netdevice_queue(t->dev, list);
 			t = rtnl_dereference(t->next);
 		}
 	}
-
-	unregister_netdevice_many(&list);
 }
 
 static int __net_init ip6_tnl_init_net(struct net *net)
@@ -2235,16 +2232,21 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	return err;
 }
 
-static void __net_exit ip6_tnl_exit_net(struct net *net)
+static void __net_exit ip6_tnl_exit_batch_net(struct list_head *net_list)
 {
+	struct net *net;
+	LIST_HEAD(list);
+
 	rtnl_lock();
-	ip6_tnl_destroy_tunnels(net);
+	list_for_each_entry(net, net_list, exit_list)
+		ip6_tnl_destroy_tunnels(net, &list);
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
 
 static struct pernet_operations ip6_tnl_net_ops = {
 	.init = ip6_tnl_init_net,
-	.exit = ip6_tnl_exit_net,
+	.exit_batch = ip6_tnl_exit_batch_net,
 	.id   = &ip6_tnl_net_id,
 	.size = sizeof(struct ip6_tnl_net),
 };
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 79444a4bfd6d245b66a7edcefe2b5b32801bf2c0..714914d1bb987c46cc98817903ec7bcc367a1b2d 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -1052,23 +1052,22 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
 	.get_link_net	= ip6_tnl_get_link_net,
 };
 
-static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
+static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n,
+					    struct list_head *list)
 {
 	int h;
 	struct ip6_tnl *t;
-	LIST_HEAD(list);
 
 	for (h = 0; h < IP6_VTI_HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
 		while (t) {
-			unregister_netdevice_queue(t->dev, &list);
+			unregister_netdevice_queue(t->dev, list);
 			t = rtnl_dereference(t->next);
 		}
 	}
 
 	t = rtnl_dereference(ip6n->tnls_wc[0]);
-	unregister_netdevice_queue(t->dev, &list);
-	unregister_netdevice_many(&list);
+	unregister_netdevice_queue(t->dev, list);
 }
 
 static int __net_init vti6_init_net(struct net *net)
@@ -1108,18 +1107,24 @@ static int __net_init vti6_init_net(struct net *net)
 	return err;
 }
 
-static void __net_exit vti6_exit_net(struct net *net)
+static void __net_exit vti6_exit_batch_net(struct list_head *net_list)
 {
-	struct vti6_net *ip6n = net_generic(net, vti6_net_id);
+	struct vti6_net *ip6n;
+	struct net *net;
+	LIST_HEAD(list);
 
 	rtnl_lock();
-	vti6_destroy_tunnels(ip6n);
+	list_for_each_entry(net, net_list, exit_list) {
+		ip6n = net_generic(net, vti6_net_id);
+		vti6_destroy_tunnels(ip6n, &list);
+	}
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
 
 static struct pernet_operations vti6_net_ops = {
 	.init = vti6_init_net,
-	.exit = vti6_exit_net,
+	.exit_batch = vti6_exit_batch_net,
 	.id   = &vti6_net_id,
 	.size = sizeof(struct vti6_net),
 };
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index ac912bb217471c048df3b76aa3d7b82886221dc1..a799f525861487ad5b822ab62cdc90f6ca06762f 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1848,19 +1848,22 @@ static int __net_init sit_init_net(struct net *net)
 	return err;
 }
 
-static void __net_exit sit_exit_net(struct net *net)
+static void __net_exit sit_exit_batch_net(struct list_head *net_list)
 {
 	LIST_HEAD(list);
+	struct net *net;
 
 	rtnl_lock();
-	sit_destroy_tunnels(net, &list);
+	list_for_each_entry(net, net_list, exit_list)
+		sit_destroy_tunnels(net, &list);
+
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
 
 static struct pernet_operations sit_net_ops = {
 	.init = sit_init_net,
-	.exit = sit_exit_net,
+	.exit_batch = sit_exit_batch_net,
 	.id   = &sit_net_id,
 	.size = sizeof(struct sit_net),
 };
-- 
2.14.1.690.gbb1197296e-goog

* [PATCH v2 net-next 5/7] tcp: batch tcp_net_metrics_exit
From: Eric Dumazet @ 2017-09-19 23:27 UTC
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

When dealing with a list of dismantling netns, we can scan
tcp_metrics once, saving cpu cycles.
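
A userspace sketch of the single-scan idea, under stated assumptions:
struct ns, struct metric, push() and flush_all() are hypothetical
stand-ins, a plain singly linked list replaces the hash table, and free()
replaces kfree_rcu(). Passing NULL asks for every entry whose namespace
count has dropped to zero, so one walk covers the whole batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net; count == 0 means "being dismantled". */
struct ns {
	int count;
};

struct metric {
	struct ns *ns;
	struct metric *next;
};

/* Prepend one entry owned by @ns and return the new list head. */
static struct metric *push(struct metric *head, struct ns *ns)
{
	struct metric *tm = malloc(sizeof(*tm));

	assert(tm);
	tm->ns = ns;
	tm->next = head;
	return tm;
}

/*
 * Remove matching entries in a single pass and return how many were freed.
 * With a non-NULL @ns only that namespace matches; with NULL, every
 * namespace whose count is zero matches.
 */
static unsigned int flush_all(struct metric **pp, const struct ns *ns)
{
	unsigned int removed = 0;
	struct metric *tm;

	while ((tm = *pp) != NULL) {
		bool match = ns ? tm->ns == ns : tm->ns->count == 0;

		if (match) {
			*pp = tm->next;
			free(tm);
			removed++;
		} else {
			pp = &tm->next;
		}
	}
	return removed;
}
```

Calling flush_all(&head, NULL) once removes the entries of all dead
namespaces, instead of walking the list once per namespace.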

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_metrics.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 102b2c90bb807d3a88d31b59324baf72cf901cdf..0ab78abc811bef0388089befed672e3d4ee9d881 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -892,10 +892,14 @@ static void tcp_metrics_flush_all(struct net *net)
 
 	for (row = 0; row < max_rows; row++, hb++) {
 		struct tcp_metrics_block __rcu **pp;
+		bool match;
+
 		spin_lock_bh(&tcp_metrics_lock);
 		pp = &hb->chain;
 		for (tm = deref_locked(*pp); tm; tm = deref_locked(*pp)) {
-			if (net_eq(tm_net(tm), net)) {
+			match = net ? net_eq(tm_net(tm), net) :
+				!atomic_read(&tm_net(tm)->count);
+			if (match) {
 				*pp = tm->tcpm_next;
 				kfree_rcu(tm, rcu_head);
 			} else {
@@ -1018,14 +1022,14 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	return 0;
 }
 
-static void __net_exit tcp_net_metrics_exit(struct net *net)
+static void __net_exit tcp_net_metrics_exit_batch(struct list_head *net_exit_list)
 {
-	tcp_metrics_flush_all(net);
+	tcp_metrics_flush_all(NULL);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
-	.init	=	tcp_net_metrics_init,
-	.exit	=	tcp_net_metrics_exit,
+	.init		=	tcp_net_metrics_init,
+	.exit_batch	=	tcp_net_metrics_exit_batch,
 };
 
 void __init tcp_metrics_init(void)
-- 
2.14.1.690.gbb1197296e-goog

* [PATCH v2 net-next 4/7] ipv6: addrlabel: per netns list
From: Eric Dumazet @ 2017-09-19 23:27 UTC
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

Having a global list of labels does not scale to thousands of
netns in the cloud era. It causes quadratic behavior on
netns creation and deletion.

It is time to have a per netns list of ~10 labels.
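
To make the scaling argument concrete, here is a small userspace model.
It is a sketch with hypothetical names (struct label,
global_lookup_steps(), pernet_lookup_steps()) and plain arrays instead of
hlists. With one shared table, a lookup walks over labels owned by every
namespace; with one table per namespace, it only walks that namespace's
own labels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced label entry. */
struct label {
	int net_id;	/* only needed when the table is shared */
	int prefixlen;
};

/* Shared table: entries of all namespaces are visited and filtered. */
static size_t global_lookup_steps(const struct label *tbl, size_t n,
				  int net_id, int prefixlen)
{
	size_t steps = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		steps++;
		if (tbl[i].net_id == net_id && tbl[i].prefixlen == prefixlen)
			break;
	}
	return steps;
}

/* Per-namespace table: no owner check, and far fewer entries to visit. */
static size_t pernet_lookup_steps(const struct label *tbl, size_t n,
				  int prefixlen)
{
	size_t steps = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		steps++;
		if (tbl[i].prefixlen == prefixlen)
			break;
	}
	return steps;
}
```

In the shared model the walk length grows with the number of namespaces,
so adding labels for each of N namespaces costs on the order of N squared
steps in total; in the per-namespace model each walk stays bounded by the
handful of labels in that namespace.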

Tested:

$ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]

real    0m20.837s # instead of 0m24.227s
user    0m0.328s
sys     0m20.338s # instead of 0m23.753s

    16.17%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
    12.30%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
     6.76%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.78%       ip  [kernel.kallsyms]  [k] memset_erms
     5.77%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
     5.18%       ip  [kernel.kallsyms]  [k] refcount_sub_and_test
     4.96%       ip  [kernel.kallsyms]  [k] _raw_read_lock
     3.82%       ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
     3.33%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     2.11%       ip  [kernel.kallsyms]  [k] unmap_page_range
     1.77%       ip  [kernel.kallsyms]  [k] __wake_up
     1.69%       ip  [kernel.kallsyms]  [k] strlen
     1.17%       ip  [kernel.kallsyms]  [k] __wake_up_common
     1.09%       ip  [kernel.kallsyms]  [k] insert_header
     1.04%       ip  [kernel.kallsyms]  [k] page_remove_rmap
     1.01%       ip  [kernel.kallsyms]  [k] consume_skb
     0.98%       ip  [kernel.kallsyms]  [k] netlink_trim
     0.51%       ip  [kernel.kallsyms]  [k] kernfs_link_sibling
     0.51%       ip  [kernel.kallsyms]  [k] filemap_map_pages
     0.46%       ip  [kernel.kallsyms]  [k] memcpy_erms

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/netns/ipv6.h |  5 +++
 net/ipv6/addrlabel.c     | 81 ++++++++++++++++++------------------------------
 2 files changed, 35 insertions(+), 51 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 2544f9760a4263b7f1b8d622331ca63038586137..2ea1ed341ef81901b4fa271b0f7f4592e17c4f8a 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -89,6 +89,11 @@ struct netns_ipv6 {
 	atomic_t		fib6_sernum;
 	struct seg6_pernet_data *seg6_data;
 	struct fib_notifier_ops	*notifier_ops;
+	struct {
+		struct hlist_head head;
+		spinlock_t	lock;
+		u32		seq;
+	} ip6addrlbl_table;
 };
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index b055bc79f56d555c89684116c1580984950f77a8..c6311d7108f651c7385cd6316752ba4a86667dcc 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -30,7 +30,6 @@
  * Policy Table
  */
 struct ip6addrlbl_entry {
-	possible_net_t lbl_net;
 	struct in6_addr prefix;
 	int prefixlen;
 	int ifindex;
@@ -41,19 +40,6 @@ struct ip6addrlbl_entry {
 	struct rcu_head rcu;
 };
 
-static struct ip6addrlbl_table
-{
-	struct hlist_head head;
-	spinlock_t lock;
-	u32 seq;
-} ip6addrlbl_table;
-
-static inline
-struct net *ip6addrlbl_net(const struct ip6addrlbl_entry *lbl)
-{
-	return read_pnet(&lbl->lbl_net);
-}
-
 /*
  * Default policy table (RFC6724 + extensions)
  *
@@ -148,13 +134,10 @@ static inline void ip6addrlbl_put(struct ip6addrlbl_entry *p)
 }
 
 /* Find label */
-static bool __ip6addrlbl_match(struct net *net,
-			       const struct ip6addrlbl_entry *p,
+static bool __ip6addrlbl_match(const struct ip6addrlbl_entry *p,
 			       const struct in6_addr *addr,
 			       int addrtype, int ifindex)
 {
-	if (!net_eq(ip6addrlbl_net(p), net))
-		return false;
 	if (p->ifindex && p->ifindex != ifindex)
 		return false;
 	if (p->addrtype && p->addrtype != addrtype)
@@ -169,8 +152,9 @@ static struct ip6addrlbl_entry *__ipv6_addr_label(struct net *net,
 						  int type, int ifindex)
 {
 	struct ip6addrlbl_entry *p;
-	hlist_for_each_entry_rcu(p, &ip6addrlbl_table.head, list) {
-		if (__ip6addrlbl_match(net, p, addr, type, ifindex))
+
+	hlist_for_each_entry_rcu(p, &net->ipv6.ip6addrlbl_table.head, list) {
+		if (__ip6addrlbl_match(p, addr, type, ifindex))
 			return p;
 	}
 	return NULL;
@@ -196,8 +180,7 @@ u32 ipv6_addr_label(struct net *net,
 }
 
 /* allocate one entry */
-static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
-						 const struct in6_addr *prefix,
+static struct ip6addrlbl_entry *ip6addrlbl_alloc(const struct in6_addr *prefix,
 						 int prefixlen, int ifindex,
 						 u32 label)
 {
@@ -236,24 +219,23 @@ static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
 	newp->addrtype = addrtype;
 	newp->label = label;
 	INIT_HLIST_NODE(&newp->list);
-	write_pnet(&newp->lbl_net, net);
 	refcount_set(&newp->refcnt, 1);
 	return newp;
 }
 
 /* add a label */
-static int __ip6addrlbl_add(struct ip6addrlbl_entry *newp, int replace)
+static int __ip6addrlbl_add(struct net *net, struct ip6addrlbl_entry *newp,
+			    int replace)
 {
-	struct hlist_node *n;
 	struct ip6addrlbl_entry *last = NULL, *p = NULL;
+	struct hlist_node *n;
 	int ret = 0;
 
 	ADDRLABEL(KERN_DEBUG "%s(newp=%p, replace=%d)\n", __func__, newp,
 		  replace);
 
-	hlist_for_each_entry_safe(p, n,	&ip6addrlbl_table.head, list) {
+	hlist_for_each_entry_safe(p, n,	&net->ipv6.ip6addrlbl_table.head, list) {
 		if (p->prefixlen == newp->prefixlen &&
-		    net_eq(ip6addrlbl_net(p), ip6addrlbl_net(newp)) &&
 		    p->ifindex == newp->ifindex &&
 		    ipv6_addr_equal(&p->prefix, &newp->prefix)) {
 			if (!replace) {
@@ -273,10 +255,10 @@ static int __ip6addrlbl_add(struct ip6addrlbl_entry *newp, int replace)
 	if (last)
 		hlist_add_behind_rcu(&newp->list, &last->list);
 	else
-		hlist_add_head_rcu(&newp->list, &ip6addrlbl_table.head);
+		hlist_add_head_rcu(&newp->list, &net->ipv6.ip6addrlbl_table.head);
 out:
 	if (!ret)
-		ip6addrlbl_table.seq++;
+		net->ipv6.ip6addrlbl_table.seq++;
 	return ret;
 }
 
@@ -292,12 +274,12 @@ static int ip6addrlbl_add(struct net *net,
 		  __func__, prefix, prefixlen, ifindex, (unsigned int)label,
 		  replace);
 
-	newp = ip6addrlbl_alloc(net, prefix, prefixlen, ifindex, label);
+	newp = ip6addrlbl_alloc(prefix, prefixlen, ifindex, label);
 	if (IS_ERR(newp))
 		return PTR_ERR(newp);
-	spin_lock(&ip6addrlbl_table.lock);
-	ret = __ip6addrlbl_add(newp, replace);
-	spin_unlock(&ip6addrlbl_table.lock);
+	spin_lock(&net->ipv6.ip6addrlbl_table.lock);
+	ret = __ip6addrlbl_add(net, newp, replace);
+	spin_unlock(&net->ipv6.ip6addrlbl_table.lock);
 	if (ret)
 		ip6addrlbl_free(newp);
 	return ret;
@@ -315,9 +297,8 @@ static int __ip6addrlbl_del(struct net *net,
 	ADDRLABEL(KERN_DEBUG "%s(prefix=%pI6, prefixlen=%d, ifindex=%d)\n",
 		  __func__, prefix, prefixlen, ifindex);
 
-	hlist_for_each_entry_safe(p, n, &ip6addrlbl_table.head, list) {
+	hlist_for_each_entry_safe(p, n, &net->ipv6.ip6addrlbl_table.head, list) {
 		if (p->prefixlen == prefixlen &&
-		    net_eq(ip6addrlbl_net(p), net) &&
 		    p->ifindex == ifindex &&
 		    ipv6_addr_equal(&p->prefix, prefix)) {
 			hlist_del_rcu(&p->list);
@@ -340,9 +321,9 @@ static int ip6addrlbl_del(struct net *net,
 		  __func__, prefix, prefixlen, ifindex);
 
 	ipv6_addr_prefix(&prefix_buf, prefix, prefixlen);
-	spin_lock(&ip6addrlbl_table.lock);
+	spin_lock(&net->ipv6.ip6addrlbl_table.lock);
 	ret = __ip6addrlbl_del(net, &prefix_buf, prefixlen, ifindex);
-	spin_unlock(&ip6addrlbl_table.lock);
+	spin_unlock(&net->ipv6.ip6addrlbl_table.lock);
 	return ret;
 }
 
@@ -354,6 +335,9 @@ static int __net_init ip6addrlbl_net_init(struct net *net)
 
 	ADDRLABEL(KERN_DEBUG "%s\n", __func__);
 
+	spin_lock_init(&net->ipv6.ip6addrlbl_table.lock);
+	INIT_HLIST_HEAD(&net->ipv6.ip6addrlbl_table.head);
+
 	for (i = 0; i < ARRAY_SIZE(ip6addrlbl_init_table); i++) {
 		int ret = ip6addrlbl_add(net,
 					 ip6addrlbl_init_table[i].prefix,
@@ -373,14 +357,12 @@ static void __net_exit ip6addrlbl_net_exit(struct net *net)
 	struct hlist_node *n;
 
 	/* Remove all labels belonging to the exiting net */
-	spin_lock(&ip6addrlbl_table.lock);
-	hlist_for_each_entry_safe(p, n, &ip6addrlbl_table.head, list) {
-		if (net_eq(ip6addrlbl_net(p), net)) {
-			hlist_del_rcu(&p->list);
-			ip6addrlbl_put(p);
-		}
+	spin_lock(&net->ipv6.ip6addrlbl_table.lock);
+	hlist_for_each_entry_safe(p, n, &net->ipv6.ip6addrlbl_table.head, list) {
+		hlist_del_rcu(&p->list);
+		ip6addrlbl_put(p);
 	}
-	spin_unlock(&ip6addrlbl_table.lock);
+	spin_unlock(&net->ipv6.ip6addrlbl_table.lock);
 }
 
 static struct pernet_operations ipv6_addr_label_ops = {
@@ -390,8 +372,6 @@ static struct pernet_operations ipv6_addr_label_ops = {
 
 int __init ipv6_addr_label_init(void)
 {
-	spin_lock_init(&ip6addrlbl_table.lock);
-
 	return register_pernet_subsys(&ipv6_addr_label_ops);
 }
 
@@ -510,11 +490,10 @@ static int ip6addrlbl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	int err;
 
 	rcu_read_lock();
-	hlist_for_each_entry_rcu(p, &ip6addrlbl_table.head, list) {
-		if (idx >= s_idx &&
-		    net_eq(ip6addrlbl_net(p), net)) {
+	hlist_for_each_entry_rcu(p, &net->ipv6.ip6addrlbl_table.head, list) {
+		if (idx >= s_idx) {
 			err = ip6addrlbl_fill(skb, p,
-					      ip6addrlbl_table.seq,
+					      net->ipv6.ip6addrlbl_table.seq,
 					      NETLINK_CB(cb->skb).portid,
 					      cb->nlh->nlmsg_seq,
 					      RTM_NEWADDRLABEL,
@@ -571,7 +550,7 @@ static int ip6addrlbl_get(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	p = __ipv6_addr_label(net, addr, ipv6_addr_type(addr), ifal->ifal_index);
 	if (p && !ip6addrlbl_hold(p))
 		p = NULL;
-	lseq = ip6addrlbl_table.seq;
+	lseq = net->ipv6.ip6addrlbl_table.seq;
 	rcu_read_unlock();
 
 	if (!p) {
-- 
2.14.1.690.gbb1197296e-goog


* [PATCH v2 net-next 3/7] kobject: factorize skb setup in kobject_uevent_net_broadcast()
From: Eric Dumazet @ 2017-09-19 23:27 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

We can build one skb and let it be cloned in netlink.

This is much faster and uses less memory (all clones will
share the same skb->head)
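
A user-space sketch of the build-once idea, under loud assumptions:
struct buf, buf_get() and buf_put() are hypothetical stand-ins for
sk_buff, skb_get() and consume_skb(), and a plain counter replaces the
real refcount. It is not the kernel code, only an illustration of why
one allocation can serve every listener:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: payload plus a reference count. */
struct buf {
	int refcnt;
	size_t len;
	char data[64];
};

static int nr_allocs;	/* counts allocations, for illustration only */

static struct buf *buf_alloc(const char *msg)
{
	size_t len = strlen(msg) + 1;
	struct buf *b;

	if (len > sizeof(b->data))
		return NULL;
	b = calloc(1, sizeof(*b));
	if (!b)
		return NULL;
	b->refcnt = 1;
	b->len = len;
	memcpy(b->data, msg, len);
	nr_allocs++;
	return b;
}

/* Take an extra reference (plays the role of skb_get() in this sketch). */
static struct buf *buf_get(struct buf *b)
{
	b->refcnt++;
	return b;
}

/* Drop a reference and free on the last one; a NULL argument is accepted. */
static void buf_put(struct buf *b)
{
	if (b && --b->refcnt == 0)
		free(b);
}

/* Deliver one message to nr_socks listeners, building the buffer only once. */
static int broadcast_shared(const char *msg, int nr_socks)
{
	struct buf *b = NULL;
	int i, delivered = 0;

	for (i = 0; i < nr_socks; i++) {
		if (!b) {
			b = buf_alloc(msg);
			if (!b)
				return -1;
		}
		/* each receiver takes its own reference and drops it when done */
		buf_put(buf_get(b));
		delivered++;
	}
	buf_put(b);	/* drop the builder's reference */
	return delivered;
}
```

With five listeners, broadcast_shared() allocates once instead of five
times; the final buf_put() mirrors the consume_skb() call added after
the loop in the patch.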

Tested:

time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.110 MB perf.data (~179584 samples) ]

real    0m24.227s # instead of 0m52.554s
user    0m0.329s
sys 0m23.753s # instead of 0m51.375s

    14.77%       ip  [kernel.kallsyms]  [k] __ip6addrlbl_add
    14.56%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
    11.65%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
     6.19%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.66%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
     4.97%       ip  [kernel.kallsyms]  [k] memset_erms
     4.67%       ip  [kernel.kallsyms]  [k] refcount_sub_and_test
     4.41%       ip  [kernel.kallsyms]  [k] _raw_read_lock
     3.59%       ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
     3.13%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.55%       ip  [kernel.kallsyms]  [k] __wake_up
     1.20%       ip  [kernel.kallsyms]  [k] strlen
     1.03%       ip  [kernel.kallsyms]  [k] __wake_up_common
     0.93%       ip  [kernel.kallsyms]  [k] consume_skb
     0.92%       ip  [kernel.kallsyms]  [k] netlink_trim
     0.87%       ip  [kernel.kallsyms]  [k] insert_header
     0.63%       ip  [kernel.kallsyms]  [k] unmap_page_range

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 lib/kobject_uevent.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 78b2a7e378c0deda3b32b1178d7f44203702c3f2..147db91c10d06485868ff56626a5a9b073a8a846 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -301,23 +301,26 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 {
 	int retval = 0;
 #if defined(CONFIG_NET)
+	struct sk_buff *skb = NULL;
 	struct uevent_sock *ue_sk;
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
 		struct sock *uevent_sock = ue_sk->sk;
-		struct sk_buff *skb;
-		size_t len;
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
-		/* allocate message with the maximum possible size */
-		len = strlen(action_string) + strlen(devpath) + 2;
-		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
-		if (skb) {
+		if (!skb) {
+			/* allocate message with the maximum possible size */
+			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			retval = -ENOMEM;
+			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			if (!skb)
+				continue;
+
 			/* add header */
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
@@ -325,16 +328,17 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
-			retval = netlink_broadcast_filtered(uevent_sock, skb,
-							    0, 1, GFP_KERNEL,
-							    kobj_bcast_filter,
-							    kobj);
-			/* ENOBUFS should be handled in userspace */
-			if (retval == -ENOBUFS || retval == -ESRCH)
-				retval = 0;
-		} else
-			retval = -ENOMEM;
+		}
+
+		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
+						    0, 1, GFP_KERNEL,
+						    kobj_bcast_filter,
+						    kobj);
+		/* ENOBUFS should be handled in userspace */
+		if (retval == -ENOBUFS || retval == -ESRCH)
+			retval = 0;
 	}
+	consume_skb(skb);
 #endif
 	return retval;
 }
-- 
2.14.1.690.gbb1197296e-goog


* [PATCH v2 net-next 2/7] kobject: copy env blob in one go
From: Eric Dumazet @ 2017-09-19 23:27 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

No need to iterate over strings, just copy in one efficient memcpy() call.
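
A minimal user-space sketch of why the single copy is equivalent,
assuming (as the patch relies on) that every envp[] string lives back
to back inside buf. struct env and env_add() are simplified,
hypothetical stand-ins for struct kobj_uevent_env and add_uevent_var():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct kobj_uevent_env. */
struct env {
	char *envp[4];	/* pointers into buf, one per string */
	int envp_idx;	/* number of strings stored */
	char buf[64];	/* strings stored back to back, each NUL-terminated */
	size_t buflen;	/* bytes used in buf */
};

/* Append one string to the blob and record where it starts. */
static int env_add(struct env *e, const char *s)
{
	size_t len = strlen(s) + 1;

	if (e->envp_idx >= 4 || e->buflen + len > sizeof(e->buf))
		return -1;
	e->envp[e->envp_idx++] = &e->buf[e->buflen];
	memcpy(&e->buf[e->buflen], s, len);
	e->buflen += len;
	return 0;
}

/* Old approach: walk the strings and copy each one. Returns bytes written. */
static size_t copy_per_string(const struct env *e, char *dst)
{
	size_t off = 0;
	int i;

	for (i = 0; i < e->envp_idx; i++) {
		size_t len = strlen(e->envp[i]) + 1;

		memcpy(dst + off, e->envp[i], len);
		off += len;
	}
	return off;
}

/* New approach: the strings are already contiguous, so copy them at once. */
static size_t copy_blob(const struct env *e, char *dst)
{
	memcpy(dst, e->buf, e->buflen);
	return e->buflen;
}
```

Both functions produce the same bytes; the second one avoids a strlen()
and a separate copy per variable.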

Tested:
time perf record "(for f in `seq 1 3000` ; do ip netns add tast$f; done)"
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 8.224 MB perf.data (~359301 samples) ]

real    0m52.554s  # instead of 1m7.492s
user    0m0.309s
sys 0m51.375s # instead of 1m6.875s

     9.88%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
     8.86%       ip  [kernel.kallsyms]  [k] string
     7.37%       ip  [kernel.kallsyms]  [k] __ip6addrlbl_add
     5.68%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
     5.52%       ip  [kernel.kallsyms]  [k] memcpy_erms
     4.76%       ip  [kernel.kallsyms]  [k] __alloc_skb
     4.54%       ip  [kernel.kallsyms]  [k] vsnprintf
     3.94%       ip  [kernel.kallsyms]  [k] format_decode
     3.80%       ip  [kernel.kallsyms]  [k] kmem_cache_alloc_node_trace
     3.71%       ip  [kernel.kallsyms]  [k] kmem_cache_alloc_node
     3.66%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
     3.38%       ip  [kernel.kallsyms]  [k] strlen
     2.65%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.20%       ip  [kernel.kallsyms]  [k] kfree
     2.09%       ip  [kernel.kallsyms]  [k] memset_erms
     2.07%       ip  [kernel.kallsyms]  [k] ___cache_free
     1.95%       ip  [kernel.kallsyms]  [k] kmem_cache_free
     1.91%       ip  [kernel.kallsyms]  [k] _raw_read_lock
     1.45%       ip  [kernel.kallsyms]  [k] ksize
     1.25%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.00%       ip  [kernel.kallsyms]  [k] widen_string

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 lib/kobject_uevent.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 4f48cc3b11d566e44c4115cc7716bc3b1cdf96df..78b2a7e378c0deda3b32b1178d7f44203702c3f2 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -317,18 +317,12 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
 		if (skb) {
 			char *scratch;
-			int i;
 
 			/* add header */
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
-			/* copy keys to our continuous event payload buffer */
-			for (i = 0; i < env->envp_idx; i++) {
-				len = strlen(env->envp[i]) + 1;
-				scratch = skb_put(skb, len);
-				strcpy(scratch, env->envp[i]);
-			}
+			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 			retval = netlink_broadcast_filtered(uevent_sock, skb,
-- 
2.14.1.690.gbb1197296e-goog


* [PATCH v2 net-next 1/7] kobject: add kobject_uevent_net_broadcast()
From: Eric Dumazet @ 2017-09-19 23:27 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170919232709.14690-1-edumazet@google.com>

This removes some #ifdef pollution and will ease follow-up patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 lib/kobject_uevent.c | 96 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 43 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index e590523ea4761425df5e112a2c2aab873dbaa90d..4f48cc3b11d566e44c4115cc7716bc3b1cdf96df 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -294,6 +294,57 @@ static void cleanup_uevent_env(struct subprocess_info *info)
 }
 #endif
 
+static int kobject_uevent_net_broadcast(struct kobject *kobj,
+					struct kobj_uevent_env *env,
+					const char *action_string,
+					const char *devpath)
+{
+	int retval = 0;
+#if defined(CONFIG_NET)
+	struct uevent_sock *ue_sk;
+
+	/* send netlink message */
+	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		struct sock *uevent_sock = ue_sk->sk;
+		struct sk_buff *skb;
+		size_t len;
+
+		if (!netlink_has_listeners(uevent_sock, 1))
+			continue;
+
+		/* allocate message with the maximum possible size */
+		len = strlen(action_string) + strlen(devpath) + 2;
+		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+		if (skb) {
+			char *scratch;
+			int i;
+
+			/* add header */
+			scratch = skb_put(skb, len);
+			sprintf(scratch, "%s@%s", action_string, devpath);
+
+			/* copy keys to our continuous event payload buffer */
+			for (i = 0; i < env->envp_idx; i++) {
+				len = strlen(env->envp[i]) + 1;
+				scratch = skb_put(skb, len);
+				strcpy(scratch, env->envp[i]);
+			}
+
+			NETLINK_CB(skb).dst_group = 1;
+			retval = netlink_broadcast_filtered(uevent_sock, skb,
+							    0, 1, GFP_KERNEL,
+							    kobj_bcast_filter,
+							    kobj);
+			/* ENOBUFS should be handled in userspace */
+			if (retval == -ENOBUFS || retval == -ESRCH)
+				retval = 0;
+		} else
+			retval = -ENOMEM;
+	}
+#endif
+	return retval;
+}
+
 /**
  * kobject_uevent_env - send an uevent with environmental data
  *
@@ -316,9 +367,6 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	const struct kset_uevent_ops *uevent_ops;
 	int i = 0;
 	int retval = 0;
-#ifdef CONFIG_NET
-	struct uevent_sock *ue_sk;
-#endif
 
 	pr_debug("kobject: '%s' (%p): %s\n",
 		 kobject_name(kobj), kobj, __func__);
@@ -427,46 +475,8 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 		mutex_unlock(&uevent_sock_mutex);
 		goto exit;
 	}
-
-#if defined(CONFIG_NET)
-	/* send netlink message */
-	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
-		struct sock *uevent_sock = ue_sk->sk;
-		struct sk_buff *skb;
-		size_t len;
-
-		if (!netlink_has_listeners(uevent_sock, 1))
-			continue;
-
-		/* allocate message with the maximum possible size */
-		len = strlen(action_string) + strlen(devpath) + 2;
-		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
-		if (skb) {
-			char *scratch;
-
-			/* add header */
-			scratch = skb_put(skb, len);
-			sprintf(scratch, "%s@%s", action_string, devpath);
-
-			/* copy keys to our continuous event payload buffer */
-			for (i = 0; i < env->envp_idx; i++) {
-				len = strlen(env->envp[i]) + 1;
-				scratch = skb_put(skb, len);
-				strcpy(scratch, env->envp[i]);
-			}
-
-			NETLINK_CB(skb).dst_group = 1;
-			retval = netlink_broadcast_filtered(uevent_sock, skb,
-							    0, 1, GFP_KERNEL,
-							    kobj_bcast_filter,
-							    kobj);
-			/* ENOBUFS should be handled in userspace */
-			if (retval == -ENOBUFS || retval == -ESRCH)
-				retval = 0;
-		} else
-			retval = -ENOMEM;
-	}
-#endif
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
+					      devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
-- 
2.14.1.690.gbb1197296e-goog


* [PATCH v2 net-next 0/7] net: speedup netns create/delete time
From: Eric Dumazet @ 2017-09-19 23:27 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric W . Biederman, Eric Dumazet, Eric Dumazet

When the rate of netns creation/deletion is high enough,
we observe softlockups in cleanup_net() caused by a huge list
of netns and way too many rcu_barrier() calls.

This patch series does some optimizations in kobject,
and adds batching to tunnels so that netns dismantles are
less costly.

IPv6 addrlabels also get a per netns list, and tcp_metrics
also benefits from batch flushing.
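
To illustrate the per netns list point, here is a small user-space
sketch (hypothetical types, not the addrlabel code): with one shared
list, finding one namespace's labels touches every label in the system,
while a per-namespace list touches only its own entries.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NS		3	/* illustrative number of namespaces */
#define LABELS_PER_NS	4	/* illustrative labels per namespace */

struct label {
	int ns_id;		/* owner, only needed on the shared list */
	struct label *next;
};

struct netns {
	struct label *labels;	/* per-namespace list head */
};

/*
 * Walk a shared list and count the entries owned by ns_id.
 * *visited receives the number of nodes that had to be touched.
 */
static int count_in_global(const struct label *head, int ns_id, int *visited)
{
	int n = 0;

	*visited = 0;
	for (; head; head = head->next) {
		(*visited)++;
		if (head->ns_id == ns_id)
			n++;
	}
	return n;
}

/* Walk one namespace's own list: every node visited belongs to it. */
static int count_in_ns(const struct netns *ns, int *visited)
{
	const struct label *p;
	int n = 0;

	for (p = ns->labels; p; p = p->next)
		n++;
	*visited = n;
	return n;
}
```

With NR_NS namespaces the shared walk visits NR_NS * LABELS_PER_NS
nodes to find LABELS_PER_NS matches, which is the cost that grows with
the number of live namespaces.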

This gives me one order of magnitude gain.
(~50 ms -> ~5 ms for one netns create/delete pair)

Tested:

for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch series :

$ time ./add_del_unshare.sh
net_namespace        116    258   5504    1    2 : tunables    8    4    0 : slabdata    116    258      0

real	3m24.910s
user	0m0.747s
sys	0m43.162s

After :
$ time ./add_del_unshare.sh
net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0

real	0m22.117s
user	0m0.728s
sys	0m35.328s

Eric Dumazet (7):
  kobject: add kobject_uevent_net_broadcast()
  kobject: copy env blob in one go
  kobject: factorize skb setup in kobject_uevent_net_broadcast()
  ipv6: addrlabel: per netns list
  tcp: batch tcp_net_metrics_exit
  ipv6: speedup ipv6 tunnels dismantle
  ipv4: speedup ipv6 tunnels dismantle

 include/net/ip_tunnels.h |  3 +-
 include/net/netns/ipv6.h |  5 +++
 lib/kobject_uevent.c     | 94 ++++++++++++++++++++++++++----------------------
 net/ipv4/ip_gre.c        | 22 +++++-------
 net/ipv4/ip_tunnel.c     | 12 +++++--
 net/ipv4/ip_vti.c        |  7 ++--
 net/ipv4/ipip.c          |  7 ++--
 net/ipv4/tcp_metrics.c   | 14 +++++---
 net/ipv6/addrlabel.c     | 81 ++++++++++++++++-------------------------
 net/ipv6/ip6_gre.c       |  8 +++--
 net/ipv6/ip6_tunnel.c    | 20 ++++++-----
 net/ipv6/ip6_vti.c       | 23 +++++++-----
 net/ipv6/sit.c           |  9 +++--
 13 files changed, 157 insertions(+), 148 deletions(-)

-- 
2.14.1.690.gbb1197296e-goog


* Re: [PATCH net-next 00/14] gtp: Additional feature support
From: Harald Welte @ 2017-09-19 23:19 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Pablo Neira Ayuso, Rohit Seth
In-Reply-To: <CALx6S35n-V3d00SyV=Lz97KPq=CSCmOaerXpFBYSknKtdiwYug@mail.gmail.com>

Hi Tom,

On Tue, Sep 19, 2017 at 08:59:28AM -0700, Tom Herbert wrote:
> On Tue, Sep 19, 2017 at 5:43 AM, Harald Welte <laforge@gnumonks.org>
> wrote:
> > On Mon, Sep 18, 2017 at 05:38:50PM -0700, Tom Herbert wrote:
> >>   - IPv6 support
> >
> > see my detailed comments in other mails.  It's unfortunately only
> > support for the already "deprecated" IPv6-only PDP contexts, not the
> > more modern v4v6 type.  In order to interoperate with old and new
> > approach, all three cases (v4, v6 and v4v6) should be supported from
> > one code base.
> >
> It sounds like something that can be subsequently added. 

Not entirely, at least on the netlink (and any other configuration
interface) you will have to reflect this from the very beginning.  You
have to have an explicit PDP type and cannot rely on the address type to
specify the type of PDP context.  Whatever interfaces are introduced
now will have to remain compatible with any future change.

My strategy to avoid any such possible 'road blocks' from being
introduced would be to simply add v4v6 and v6 support in one go.  The
differences are marginal (having both an IPv6 prefix and a v4 address in
parallel, rather than mutually exclusive only).
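
To make the "explicit PDP type" point concrete, a rough sketch of what
such a configuration record could look like. All names here (enum
pdp_type, struct pdp_cfg, pdp_cfg_valid()) are hypothetical and are not
an existing netlink attribute set or libgtpnl API; the validity rules
are my assumption of what "v4 address and v6 prefix in parallel" means:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PDP context types; illustrative names, not an existing uAPI. */
enum pdp_type {
	PDP_TYPE_V4,
	PDP_TYPE_V6,
	PDP_TYPE_V4V6,
};

struct pdp_cfg {
	enum pdp_type type;	/* stated explicitly by the control plane */
	bool has_v4_addr;	/* an IPv4 address was supplied */
	bool has_v6_prefix;	/* an IPv6 /64 prefix was supplied */
};

/* Check that the addresses supplied match the explicitly configured type. */
static bool pdp_cfg_valid(const struct pdp_cfg *c)
{
	switch (c->type) {
	case PDP_TYPE_V4:
		return c->has_v4_addr && !c->has_v6_prefix;
	case PDP_TYPE_V6:
		return !c->has_v4_addr && c->has_v6_prefix;
	case PDP_TYPE_V4V6:
		return c->has_v4_addr && c->has_v6_prefix;
	}
	return false;
}
```

The type is carried as its own field instead of being inferred from
which address happens to be present, so a v4v6 context stays
distinguishable from two unrelated single-stack contexts.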

> Do you have a reference to the spec?

See http://osmocom.org/issues/2418#note-7 which lists Section 11.2.1.3.2
of 3GPP TS 29.061 in combination with RFC3314, RFC7066, RFC6459 and
3GPP TS 23.060 9.2.1 as well as a summary of my understanding of it some
months ago.

> >>   - Configurable networking interfaces so that GTP kernel can be
> >>   used and tested without needing GSN network emulation (i.e. no
> >>   user space daemon needed).
> >
> > We have some pretty decent userspace utilities for configuring the
> > GTP interfaces and tunnels in the libgtpnl repository, but if it
> > helps people to have another way of configuration, I won't be
> > against it.
> >
> AFAIK those userspace utilities don't support IPv6. 

Of course not [yet]. libgtpnl and the command line tools have been
implemented specifically for the in-kernel GTP driver, and you have to
make sure to add related support on both the kernel and the userspace
side (libgtpnl). So there's little point in adding features on either
side before the other side.  There would be no way to test...

> Being able to configure GTP like any other encapsulation will
> facilitate development of IPv6 and other features.

That may very well be the case, but adding "IPv6 support" to kernel GTP
in a way that is not in line with the existing userspace libraries and
control-plane implementations means that you're developing those
features in an artificial environment that doesn't resemble real 3GPP
interoperable networks out there.

As indicated, I'm not against adding additional interfaces, but we have
to make sure that we add IPv6 support (or any new feature support) to at
least libgtpnl, and to make sure we test interoperability with existing
3GPP network equipment such as real IPv6 capable phones and SGSNs.

> > I'm not sure if this is a useful feature.  GTP is used only in
> > operator-controlled networks and only on standard ports.  It's not
> > possible to negotiate any non-standard ports on the signaling plane
> > either.
> >
> Bear in mind that we're not required to do everything the GTP spec
> says. 

Yes, we are, at least as long as it affects interoperability with other
implementations out there.

GTP uses well-known port numbers on *both* sides of the tunnel, and you
cannot deviate from that.

There's no point in having all kinds of features in the GTP user plane
which are not interoperable with other implementations, and which are
completely outside of the information model / architecture of GTP.

In the real world, GTP-U is only used in combination with GTP-C.  And in
GTP-C you can only negotiate the IP address of both sides of GTP-U, and
not the port number information.  As a result, the port numbers are
static on both sides.

> My impression is GTP designers probably didn't think in terms of
> getting best performance. But we can ;-)

I think it's wasted efforts if it's about "random udp ports" as no
standards-compliant implementation out there with which you will have to
interoperate will be able to support it.

GTP is used between home and roaming operator.  If you want to introduce
changes to how it works, you will have to have control over both sides
of the implementation of both the GTP-C and the GTP-U plane, which is
very unlikely and rather the exception in the hundreds of operators you
interoperate with.  Also keep in mind that there often are various
"middleboxes" that will suddenly have to reflect your changes.  That
starts from packet filters at various locations in the operator networks
and/or roaming hubs, down to GTP hubs and others.

My opinion is: Non-standard GTP ports are not going to happen.

> I also brought up open_ggsn. ggsn to sgsn.

That's good to hear.  For both v4 and v6 PDP contexts?  Which phones
did you use for testing?  Particularly given how convoluted the address
allocation is (see below), I'm surprised it would work.

> > For IPv6 (and v4v6) PDP contexts there is quite a bit of extra headache
> > related to the way how router solicitation/advertisements are modified
> > in the 3GPP world.
> >
> > The address allocation in v4 is simple:
> > * MS/UE requests dynamic or fixed IPv4 address via EUA IE of PDP context
> >   activation
> > * GGSN responds with IPv4 address in EUA of Activate PDP context
> >   response (and then uses netlink to tell the kernel about that
> >   IPv4 address)
> >
> > In v6 or the v6 portion of v4v6 it works differently:
> > * MS/UE requests dynamic or fixed IPv6 address in EUA IE of PDP context
> >   activation
> > * GGSN responds with an IPv6 address, but that address is *not* used
> >   for communication, but simply used as an "interface identifier" to
> >   build a link-local address.
> > * MS then uses router solicitation using that link-local address
> > * GGSN responds with router advertisement, allocating a single /64
> >   prefix, from which the MS then generates a fully-qualified IPv6
> >   source address for communication.
> >
> > How did you envision this to be done with the v6 support you just added?
> > At the very least, the /64 prefix matching would have to be implemented
> > so that in fact all addresses within that /64 prefix are matched +
> > encapsulated for a given PDP context in the downlink (to phone)
> > direction.
> > 
> > [...]
> I would hope all the above you're describing is mostly control plane
> matters. 

It is not.  The control plane is GTP-C and runs on different UDP ports
(at least for GTPv1/v2).  The user plane is GTP-U and is what's done in
the kernel.  And by its very nature, IPv6 router
solicitations/advertisements (as well as neighbor
solicitations/advertisements) are part of the user plane and thus
handled in GTP-U.
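
As a rough illustration of the /64 matching asked for in the quoted
text above: a downlink lookup would compare only the first 64 bits of
the destination against the prefix assigned to the PDP context. The
helper name in6_prefix64_match() is hypothetical; this is a user-space
sketch, not proposed driver code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* True if addr lies within the /64 given by the first 8 bytes of prefix. */
static bool in6_prefix64_match(const struct in6_addr *addr,
			       const struct in6_addr *prefix)
{
	return memcmp(addr->s6_addr, prefix->s6_addr, 8) == 0;
}
```

Any interface identifier the MS picks inside its /64 then maps to the
same PDP context, which is what matching on a full 128-bit address
would fail to do.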

> At least a good design decouples data plane and control
> plane. I know that GTP is a bit convoluted in this regard.

The problem is that IPv6 has never been specified properly for
point-to-point links.  There's no decent PPP specs for IPv6.  So the
3GPP folks had to try to be as close as possible to the existing
(broadcast) link layer model to facilitate existing IPv6 implementations
to work over 3GPP bearers.  That's why they kept whatever possible to
re-use in terms of neighbor/router discovery.

So the problem is now: Unless you handle GTP-U *entirely* in the kernel
(including router + neighbor advertisement/solicitation), you will have
a "split GTP-U" plane between kernel and userspace.  And in that context
the question is who owns the sequence numbers, how will you avoid race
conditions, ... - my simple suggestion is thus to keep with the current
split and do everything GTP-U related inside the kernel and everything
GTP-C related in userspace.

I think there has to be a clear plan/architecture on how to implement
those bits in terms of the kernel/userspace split, and at least a proof
of concept implementation that we can show works with some real phones
out there - otherwise there's no point in having IPv6 support that works
well with some custom tools.

Regards,
	Harald
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)


* Re: [PATCH] net: emac: Fix napi poll list corruption
From: David Miller @ 2017-09-19 23:20 UTC (permalink / raw)
  To: chunkeey; +Cc: netdev
In-Reply-To: <20170919173518.3694-1-chunkeey@googlemail.com>

From: Christian Lamparter <chunkeey@googlemail.com>
Date: Tue, 19 Sep 2017 19:35:18 +0200

> This patch is pretty much a carbon copy of
> commit 3079c652141f ("caif: Fix napi poll list corruption")
> with "caif" replaced by "emac".
> 
> The commit d75b1ade567f ("net: less interrupt masking in NAPI")
> breaks emac.
> 
> It is now required that if the entire budget is consumed when poll
> returns, the napi poll_list must remain empty.  However, like some
> other drivers emac tries to do a last-ditch check and if there is
> more work it will call napi_reschedule and then immediately process
> some of this new work.  Should the entire budget be consumed while
> processing such new work then we will violate the new caller
> contract.
> 
> This patch fixes this by not touching any work when we reschedule
> in emac.
> 
> Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>

Applied and queued up for -stable, thanks.


* Re: [PATCH net-next 0/7] net: speedup netns create/delete time
From: Eric Dumazet @ 2017-09-19 23:20 UTC (permalink / raw)
  To: David Miller; +Cc: edumazet, netdev, ebiederm
In-Reply-To: <20170919.160244.1149642515939268316.davem@davemloft.net>

On Tue, 2017-09-19 at 16:02 -0700, David Miller wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 18 Sep 2017 12:07:26 -0700
> 
> > When rate of netns creation/deletion is high enough,
> > we observe softlockups in cleanup_net() caused by huge list
> > of netns and way too many rcu_barrier() calls.
> > 
> > This patch series does some optimizations in kobject,
> > and add batching to tunnels so that netns dismantles are
> > less costly.
> > 
> > IPv6 addrlabels also get a per netns list, and tcp_metrics
> > also benefit from batch flushing.
> > 
> > This gives me one order of magnitude gain.
> > (~50 ms -> ~5 ms for one netns create/delete pair)
> 
> I like it.
> 
> Please address the feedback about using skb_put_data() and
> resubmit.

Sure, will also remove a spurious // comment I accidentally left in
patch 7/7.


* Re: [patch net-next] team: fall back to hash if table entry is empty
From: David Miller @ 2017-09-19 23:19 UTC (permalink / raw)
  To: hanko; +Cc: jiri, netdev, linux-kernel
In-Reply-To: <1505846019-6785-1-git-send-email-hanko@drivescale.com>

From: Jim Hanko <hanko@drivescale.com>
Date: Tue, 19 Sep 2017 11:33:39 -0700

> If the hash to port mapping table does not have a valid port (i.e. when
> a port goes down), fall back to the simple hashing mechanism to avoid
> dropping packets.
> 
> Signed-off-by: Jim Hanko <hanko@drivescale.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
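
The fallback described in the quoted changelog can be sketched in user
space as follows. This is not the team driver code: struct port,
NR_BUCKETS and select_port() are hypothetical names, and the table size
is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUCKETS 8	/* hypothetical hash-to-port table size */

struct port {
	int id;
};

/*
 * Pick a port for a given hash: use the mapping table when the slot is
 * populated, otherwise fall back to plain hash % nr_ports over the
 * ports that are currently available.
 */
static struct port *select_port(struct port *table[NR_BUCKETS],
				struct port *ports, unsigned int nr_ports,
				unsigned int hash)
{
	struct port *p = table[hash % NR_BUCKETS];

	if (p)
		return p;
	if (!nr_ports)
		return NULL;	/* nothing to fall back to */
	return &ports[hash % nr_ports];
}
```

An empty slot no longer means a dropped packet; the packet is dropped
only when no port is available at all.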

Applied, thanks.


* Re: [PATCH net] tcp: fastopen: fix on syn-data transmit failure
From: David Miller @ 2017-09-19 23:17 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ycheng, ncardwell, netdev
In-Reply-To: <1505840757.29839.77.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 19 Sep 2017 10:05:57 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Our recent change exposed a bug in TCP Fastopen Client that syzkaller
> found right away [1]
> 
> When we prepare skb with SYN+DATA, we attempt to transmit it,
> and we update socket state as if the transmit was a success.
> 
> In socket RTX queue we have two skbs, one with the SYN alone,
> and a second one containing the DATA.
> 
> When (malicious) ACK comes in, we now complain that second one had no
> skb_mstamp.
> 
> The proper fix is to make sure that if the transmit failed, we do not
> pretend we sent the DATA skb, and make it our send_head.
> 
> When 3WHS completes, we can now send the DATA right away, without having
> to wait for a timeout.
> 
> [1]
 ...
> Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
> Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

Applied, thanks Eric.


* Re: [net-next v2 0/4] test_rhashtable: don't allocate huge static array
From: David Miller @ 2017-09-19 23:16 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20170919231214.2281-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed, 20 Sep 2017 01:12:10 +0200

> Add a test case for the rhlist interface.
> While at it, cleanup current rhashtable test a bit and add a check
> for max_size support.
> 
> No changes since v1, except in last patch.
> kbuild robot complained about large onstack allocation caused by
> struct rhltable when lockdep is enabled.

Looks good, series applied, thanks Florian.


* [net-next v2 4/4] test_rhashtable: add test case for rhl_table interface
From: Florian Westphal @ 2017-09-19 23:12 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170919231214.2281-1-fw@strlen.de>

Also test rhltable.  rhltable remove operations are slow as
deletions require a list walk, thus test with 1/16th of the given
entry count to get a run duration similar to the rhashtable one.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 change since v1:
 place struct rhltable in initdata section to avoid large onstack
 allocation warnings when lockdep is enabled.

 lib/test_rhashtable.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 194 insertions(+), 2 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 1eee90e6e394..de4d0584631a 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -23,6 +23,7 @@
 #include <linux/semaphore.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/random.h>
 #include <linux/vmalloc.h>
 
 #define MAX_ENTRIES	1000000
@@ -66,6 +67,11 @@ struct test_obj {
 	struct rhash_head	node;
 };
 
+struct test_obj_rhl {
+	struct test_obj_val	value;
+	struct rhlist_head	list_node;
+};
+
 struct thread_data {
 	unsigned int entries;
 	int id;
@@ -245,6 +251,186 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
 }
 
 static struct rhashtable ht;
+static struct rhltable rhlt __initdata;
+
+static int __init test_rhltable(unsigned int entries)
+{
+	struct test_obj_rhl *rhl_test_objects;
+	unsigned long *obj_in_table;
+	unsigned int i, j, k;
+	int ret, err;
+
+	if (entries == 0)
+		entries = 1;
+
+	rhl_test_objects = vzalloc(sizeof(*rhl_test_objects) * entries);
+	if (!rhl_test_objects)
+		return -ENOMEM;
+
+	ret = -ENOMEM;
+	obj_in_table = vzalloc(BITS_TO_LONGS(entries) * sizeof(unsigned long));
+	if (!obj_in_table)
+		goto out_free;
+
+	/* nulls_base not supported in rhlist interface */
+	test_rht_params.nulls_base = 0;
+	err = rhltable_init(&rhlt, &test_rht_params);
+	if (WARN_ON(err))
+		goto out_free;
+
+	k = prandom_u32();
+	ret = 0;
+	for (i = 0; i < entries; i++) {
+		rhl_test_objects[i].value.id = k;
+		err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node,
+				      test_rht_params);
+		if (WARN(err, "error %d on element %d\n", err, i))
+			break;
+		if (err == 0)
+			set_bit(i, obj_in_table);
+	}
+
+	if (err)
+		ret = err;
+
+	pr_info("test %d add/delete pairs into rhlist\n", entries);
+	for (i = 0; i < entries; i++) {
+		struct rhlist_head *h, *pos;
+		struct test_obj_rhl *obj;
+		struct test_obj_val key = {
+			.id = k,
+		};
+		bool found;
+
+		rcu_read_lock();
+		h = rhltable_lookup(&rhlt, &key, test_rht_params);
+		if (WARN(!h, "key not found during iteration %d of %d", i, entries)) {
+			rcu_read_unlock();
+			break;
+		}
+
+		if (i) {
+			j = i - 1;
+			rhl_for_each_entry_rcu(obj, pos, h, list_node) {
+				if (WARN(pos == &rhl_test_objects[j].list_node, "old element found, should be gone"))
+					break;
+			}
+		}
+
+		cond_resched_rcu();
+
+		found = false;
+
+		rhl_for_each_entry_rcu(obj, pos, h, list_node) {
+			if (pos == &rhl_test_objects[i].list_node) {
+				found = true;
+				break;
+			}
+		}
+
+		rcu_read_unlock();
+
+		if (WARN(!found, "element %d not found", i))
+			break;
+
+		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+		WARN(err, "rhltable_remove: err %d for iteration %d\n", err, i);
+		if (err == 0)
+			clear_bit(i, obj_in_table);
+	}
+
+	if (ret == 0 && err)
+		ret = err;
+
+	for (i = 0; i < entries; i++) {
+		WARN(test_bit(i, obj_in_table), "elem %d allegedly still present", i);
+
+		err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node,
+				      test_rht_params);
+		if (WARN(err, "error %d on element %d\n", err, i))
+			break;
+		if (err == 0)
+			set_bit(i, obj_in_table);
+	}
+
+	pr_info("test %d random rhlist add/delete operations\n", entries);
+	for (j = 0; j < entries; j++) {
+		u32 i = prandom_u32_max(entries);
+		u32 prand = prandom_u32();
+
+		cond_resched();
+
+		if (prand == 0)
+			prand = prandom_u32();
+
+		if (prand & 1) {
+			prand >>= 1;
+			continue;
+		}
+
+		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+		if (test_bit(i, obj_in_table)) {
+			clear_bit(i, obj_in_table);
+			if (WARN(err, "cannot remove element at slot %d", i))
+				continue;
+		} else {
+			if (WARN(err != -ENOENT, "removed non-existent element %d, error %d not %d",
+			     i, err, -ENOENT))
+				continue;
+		}
+
+		if (prand & 1) {
+			prand >>= 1;
+			continue;
+		}
+
+		err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+		if (err == 0) {
+			if (WARN(test_and_set_bit(i, obj_in_table), "succeeded to insert same object %d", i))
+				continue;
+		} else {
+			if (WARN(!test_bit(i, obj_in_table), "failed to insert object %d", i))
+				continue;
+		}
+
+		if (prand & 1) {
+			prand >>= 1;
+			continue;
+		}
+
+		i = prandom_u32_max(entries);
+		if (test_bit(i, obj_in_table)) {
+			err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+			WARN(err, "cannot remove element at slot %d", i);
+			if (err == 0)
+				clear_bit(i, obj_in_table);
+		} else {
+			err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+			WARN(err, "failed to insert object %d", i);
+			if (err == 0)
+				set_bit(i, obj_in_table);
+		}
+	}
+
+	for (i = 0; i < entries; i++) {
+		cond_resched();
+		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+		if (test_bit(i, obj_in_table)) {
+			if (WARN(err, "cannot remove element at slot %d", i))
+				continue;
+		} else {
+			if (WARN(err != -ENOENT, "removed non-existent element, error %d not %d",
+				 err, -ENOENT))
+				continue;
+		}
+	}
+
+	rhltable_destroy(&rhlt);
+out_free:
+	vfree(rhl_test_objects);
+	vfree(obj_in_table);
+	return ret;
+}
 
 static int __init test_rhashtable_max(struct test_obj *array,
 				      unsigned int entries)
@@ -480,11 +666,17 @@ static int __init test_rht_init(void)
 			failed_threads++;
 		}
 	}
-	pr_info("Started %d threads, %d failed\n",
-	        started_threads, failed_threads);
 	rhashtable_destroy(&ht);
 	vfree(tdata);
 	vfree(objs);
+
+	/*
+	 * rhltable_remove is very expensive; the default values can cause the
+	 * test to run for 2 minutes or more, so use a smaller number instead.
+	 */
+	err = test_rhltable(entries / 16);
+	pr_info("Started %d threads, %d failed, rhltable test returns %d\n",
+	        started_threads, failed_threads, err);
 	return 0;
 }
 
-- 
2.13.5

^ permalink raw reply related

* [net-next v2 3/4] test_rhashtable: add a check for max_size
From: Florian Westphal @ 2017-09-19 23:12 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170919231214.2281-1-fw@strlen.de>

Add a test that tries to insert more than max_size elements.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
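The check added here follows a fill-then-overflow pattern: insert until the
table reports it is full, then expect one more insert to fail with -E2BIG.
Below is a minimal user-space sketch of that pattern, not the kernel code
itself. struct toy_table, toy_insert() and toy_test_max() are hypothetical
stand-ins introduced only for illustration; the real test uses
rhashtable_init(), insert_retry() and ht.max_elems as shown in the diff.

```c
#include <errno.h>

/* Hypothetical stand-in for a size-limited table: a counter with a cap. */
struct toy_table {
	unsigned int nelems;
	unsigned int max_elems;
};

/* Accept an element unless the table is already at its limit. */
static int toy_insert(struct toy_table *t)
{
	if (t->nelems >= t->max_elems)
		return -E2BIG;
	t->nelems++;
	return 0;
}

/*
 * Fill the table to capacity, then check that one more insert is rejected.
 * Returns 0 if the limit held, a negative value otherwise.
 */
static int toy_test_max(struct toy_table *t)
{
	unsigned int i;
	int err;

	for (i = 0; i < t->max_elems; i++) {
		err = toy_insert(t);
		if (err)
			return err;	/* filling must not fail early */
	}

	err = toy_insert(t);
	if (err == -E2BIG)
		return 0;		/* expected: overflow was refused */

	return err ? err : -1;		/* insert succeeded: limit was exceeded */
}
```

As in the patch, a successful overflow insert is mapped to a non-zero return
so the caller can report the failure.
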
 lib/test_rhashtable.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 69f5b3849980..1eee90e6e394 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -246,6 +246,43 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
 
 static struct rhashtable ht;
 
+static int __init test_rhashtable_max(struct test_obj *array,
+				      unsigned int entries)
+{
+	unsigned int i, insert_retries = 0;
+	int err;
+
+	test_rht_params.max_size = roundup_pow_of_two(entries / 8);
+	err = rhashtable_init(&ht, &test_rht_params);
+	if (err)
+		return err;
+
+	for (i = 0; i < ht.max_elems; i++) {
+		struct test_obj *obj = &array[i];
+
+		obj->value.id = i * 2;
+		err = insert_retry(&ht, obj, test_rht_params);
+		if (err > 0)
+			insert_retries += err;
+		else if (err)
+			return err;
+	}
+
+	err = insert_retry(&ht, &array[ht.max_elems], test_rht_params);
+	if (err == -E2BIG) {
+		err = 0;
+	} else {
+		pr_info("insert element %u should have failed with %d, got %d\n",
+				ht.max_elems, -E2BIG, err);
+		if (err == 0)
+			err = -1;
+	}
+
+	rhashtable_destroy(&ht);
+
+	return err;
+}
+
 static int thread_lookup_test(struct thread_data *tdata)
 {
 	unsigned int entries = tdata->entries;
@@ -386,7 +423,11 @@ static int __init test_rht_init(void)
 		total_time += time;
 	}
 
+	pr_info("test if it's possible to exceed max_size %d: %s\n",
+			test_rht_params.max_size, test_rhashtable_max(objs, entries) == 0 ?
+			"no, ok" : "YES, failed");
 	vfree(objs);
+
 	do_div(total_time, runs);
 	pr_info("Average test time: %llu\n", total_time);
 
-- 
2.13.5

^ permalink raw reply related

* [net-next v2 2/4] test_rhashtable: don't use global entries variable
From: Florian Westphal @ 2017-09-19 23:12 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170919231214.2281-1-fw@strlen.de>

Pass the number of entries to test as an argument instead.
A follow-up patch will add an rhlist test case; rhlist delete operations
are slow, so we need to use a smaller number to test it.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
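The idea is to clamp the signed module parameter once and then hand the
resulting count to each helper explicitly, so that a caller can pass a
smaller value where needed. Below is a minimal user-space sketch under that
assumption. clamp_entries() and count_even_keys() are hypothetical names
used only for illustration; in the patch the clamping is done inline in
test_rht_init().

```c
#define MAX_ENTRIES 1000000

/*
 * Hypothetical helper: turn a signed parameter into a bounded count.
 * Mirrors the patch: negative values become 1, large values are capped.
 */
static unsigned int clamp_entries(int parm_entries)
{
	if (parm_entries < 0)
		parm_entries = 1;

	return parm_entries < MAX_ENTRIES ? (unsigned int)parm_entries
					  : MAX_ENTRIES;
}

/*
 * Helpers take the count as an argument instead of reading a global,
 * so different tests can run with different sizes.
 */
static unsigned int count_even_keys(unsigned int entries)
{
	unsigned int i, n = 0;

	for (i = 0; i < entries; i++) {
		if (!(i % 2))
			n++;
	}

	return n;
}
```

A slow test can then be called with a reduced count, for example
count_even_keys(clamp_entries(50000) / 16).
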
 lib/test_rhashtable.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c40d6e636f33..69f5b3849980 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -28,9 +28,9 @@
 #define MAX_ENTRIES	1000000
 #define TEST_INSERT_FAIL INT_MAX
 
-static int entries = 50000;
-module_param(entries, int, 0);
-MODULE_PARM_DESC(entries, "Number of entries to add (default: 50000)");
+static int parm_entries = 50000;
+module_param(parm_entries, int, 0);
+MODULE_PARM_DESC(parm_entries, "Number of entries to add (default: 50000)");
 
 static int runs = 4;
 module_param(runs, int, 0);
@@ -67,6 +67,7 @@ struct test_obj {
 };
 
 struct thread_data {
+	unsigned int entries;
 	int id;
 	struct task_struct *task;
 	struct test_obj *objs;
@@ -105,11 +106,12 @@ static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
 	return err ? : retries;
 }
 
-static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array)
+static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array,
+				  unsigned int entries)
 {
 	unsigned int i;
 
-	for (i = 0; i < entries * 2; i++) {
+	for (i = 0; i < entries; i++) {
 		struct test_obj *obj;
 		bool expected = !(i % 2);
 		struct test_obj_val key = {
@@ -142,7 +144,7 @@ static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array)
 	return 0;
 }
 
-static void test_bucket_stats(struct rhashtable *ht)
+static void test_bucket_stats(struct rhashtable *ht, unsigned int entries)
 {
 	unsigned int err, total = 0, chain_len = 0;
 	struct rhashtable_iter hti;
@@ -184,7 +186,8 @@ static void test_bucket_stats(struct rhashtable *ht)
 		pr_warn("Test failed: Total count mismatch ^^^");
 }
 
-static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array)
+static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
+				  unsigned int entries)
 {
 	struct test_obj *obj;
 	int err;
@@ -212,12 +215,12 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array)
 		pr_info("  %u insertions retried due to memory pressure\n",
 			insert_retries);
 
-	test_bucket_stats(ht);
+	test_bucket_stats(ht, entries);
 	rcu_read_lock();
-	test_rht_lookup(ht, array);
+	test_rht_lookup(ht, array, entries);
 	rcu_read_unlock();
 
-	test_bucket_stats(ht);
+	test_bucket_stats(ht, entries);
 
 	pr_info("  Deleting %d keys\n", entries);
 	for (i = 0; i < entries; i++) {
@@ -245,6 +248,7 @@ static struct rhashtable ht;
 
 static int thread_lookup_test(struct thread_data *tdata)
 {
+	unsigned int entries = tdata->entries;
 	int i, err = 0;
 
 	for (i = 0; i < entries; i++) {
@@ -281,7 +285,7 @@ static int threadfunc(void *data)
 	if (down_interruptible(&startup_sem))
 		pr_err("  thread[%d]: down_interruptible failed\n", tdata->id);
 
-	for (i = 0; i < entries; i++) {
+	for (i = 0; i < tdata->entries; i++) {
 		tdata->objs[i].value.id = i;
 		tdata->objs[i].value.tid = tdata->id;
 		err = insert_retry(&ht, &tdata->objs[i], test_rht_params);
@@ -305,7 +309,7 @@ static int threadfunc(void *data)
 	}
 
 	for (step = 10; step > 0; step--) {
-		for (i = 0; i < entries; i += step) {
+		for (i = 0; i < tdata->entries; i += step) {
 			if (tdata->objs[i].value.id == TEST_INSERT_FAIL)
 				continue;
 			err = rhashtable_remove_fast(&ht, &tdata->objs[i].node,
@@ -336,12 +340,16 @@ static int threadfunc(void *data)
 
 static int __init test_rht_init(void)
 {
+	unsigned int entries;
 	int i, err, started_threads = 0, failed_threads = 0;
 	u64 total_time = 0;
 	struct thread_data *tdata;
 	struct test_obj *objs;
 
-	entries = min(entries, MAX_ENTRIES);
+	if (parm_entries < 0)
+		parm_entries = 1;
+
+	entries = min(parm_entries, MAX_ENTRIES);
 
 	test_rht_params.automatic_shrinking = shrinking;
 	test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
@@ -367,7 +375,7 @@ static int __init test_rht_init(void)
 			continue;
 		}
 
-		time = test_rhashtable(&ht, objs);
+		time = test_rhashtable(&ht, objs, entries);
 		rhashtable_destroy(&ht);
 		if (time < 0) {
 			vfree(objs);
@@ -409,6 +417,7 @@ static int __init test_rht_init(void)
 	}
 	for (i = 0; i < tcount; i++) {
 		tdata[i].id = i;
+		tdata[i].entries = entries;
 		tdata[i].objs = objs + i * entries;
 		tdata[i].task = kthread_run(threadfunc, &tdata[i],
 		                            "rhashtable_thrad[%d]", i);
-- 
2.13.5

^ permalink raw reply related

* [net-next v2 1/4] test_rhashtable: don't allocate huge static array
From: Florian Westphal @ 2017-09-19 23:12 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170919231214.2281-1-fw@strlen.de>

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 lib/test_rhashtable.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 0ffca990a833..c40d6e636f33 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -72,8 +72,6 @@ struct thread_data {
 	struct test_obj *objs;
 };
 
-static struct test_obj array[MAX_ENTRIES];
-
 static struct rhashtable_params test_rht_params = {
 	.head_offset = offsetof(struct test_obj, node),
 	.key_offset = offsetof(struct test_obj, value),
@@ -85,7 +83,7 @@ static struct rhashtable_params test_rht_params = {
 static struct semaphore prestart_sem;
 static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 
-static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
+static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
                         const struct rhashtable_params params)
 {
 	int err, retries = -1, enomem_retries = 0;
@@ -93,7 +91,7 @@ static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
 	do {
 		retries++;
 		cond_resched();
-		err = rhashtable_insert_fast(ht, obj, params);
+		err = rhashtable_insert_fast(ht, &obj->node, params);
 		if (err == -ENOMEM && enomem_retry) {
 			enomem_retries++;
 			err = -EBUSY;
@@ -107,7 +105,7 @@ static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
 	return err ? : retries;
 }
 
-static int __init test_rht_lookup(struct rhashtable *ht)
+static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array)
 {
 	unsigned int i;
 
@@ -186,7 +184,7 @@ static void test_bucket_stats(struct rhashtable *ht)
 		pr_warn("Test failed: Total count mismatch ^^^");
 }
 
-static s64 __init test_rhashtable(struct rhashtable *ht)
+static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array)
 {
 	struct test_obj *obj;
 	int err;
@@ -203,7 +201,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 		struct test_obj *obj = &array[i];
 
 		obj->value.id = i * 2;
-		err = insert_retry(ht, &obj->node, test_rht_params);
+		err = insert_retry(ht, obj, test_rht_params);
 		if (err > 0)
 			insert_retries += err;
 		else if (err)
@@ -216,7 +214,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 
 	test_bucket_stats(ht);
 	rcu_read_lock();
-	test_rht_lookup(ht);
+	test_rht_lookup(ht, array);
 	rcu_read_unlock();
 
 	test_bucket_stats(ht);
@@ -286,7 +284,7 @@ static int threadfunc(void *data)
 	for (i = 0; i < entries; i++) {
 		tdata->objs[i].value.id = i;
 		tdata->objs[i].value.tid = tdata->id;
-		err = insert_retry(&ht, &tdata->objs[i].node, test_rht_params);
+		err = insert_retry(&ht, &tdata->objs[i], test_rht_params);
 		if (err > 0) {
 			insert_retries += err;
 		} else if (err) {
@@ -349,6 +347,10 @@ static int __init test_rht_init(void)
 	test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
 	test_rht_params.nelem_hint = size;
 
+	objs = vzalloc((test_rht_params.max_size + 1) * sizeof(struct test_obj));
+	if (!objs)
+		return -ENOMEM;
+
 	pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
 		size, max_size, shrinking);
 
@@ -356,7 +358,8 @@ static int __init test_rht_init(void)
 		s64 time;
 
 		pr_info("Test %02d:\n", i);
-		memset(&array, 0, sizeof(array));
+		memset(objs, 0, test_rht_params.max_size * sizeof(struct test_obj));
+
 		err = rhashtable_init(&ht, &test_rht_params);
 		if (err < 0) {
 			pr_warn("Test failed: Unable to initialize hashtable: %d\n",
@@ -364,9 +367,10 @@ static int __init test_rht_init(void)
 			continue;
 		}
 
-		time = test_rhashtable(&ht);
+		time = test_rhashtable(&ht, objs);
 		rhashtable_destroy(&ht);
 		if (time < 0) {
+			vfree(objs);
 			pr_warn("Test failed: return code %lld\n", time);
 			return -EINVAL;
 		}
@@ -374,6 +378,7 @@ static int __init test_rht_init(void)
 		total_time += time;
 	}
 
+	vfree(objs);
 	do_div(total_time, runs);
 	pr_info("Average test time: %llu\n", total_time);
 
-- 
2.13.5

^ permalink raw reply related
