Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-24 22:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <87k1sw46i3.fsf@xmission.com>

On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking everything.
> > Testing revealed significant performance improvements. For details see
> > "Testing" below.
> 
> Maybe.  Your locking is wrong, and a few other things are wrong.  see
> below.

Thanks for the review! Happy to rework this until it's in a mergeable shape.

> 
> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> > owned by the intial user namespace can be sent uevents from a sufficiently
> > privileged userspace process.
> > In order to send a uevent into a network namespace not owned by the initial
> > user namespace we currently still need to take the *global mutex* that
> > locks the uevent socket list even though the list *only contains network
> > namespaces owned by the initial user namespace*. This needs to be done
> > because the uevent counter is a global variable. Taking the global lock is
> > performance sensitive since a user on the host can spawn a pool of n
> > process that each create their own new user and network namespaces and then
> > go on to inject uevents in parallel into the network namespace of all of
> > these processes. This can have a significant performance impact for the
> > host's udevd since it means that there can be a lot of delay between a
> > device being added and the corresponding uevent being sent out and
> > available for processing by udevd. It also means that each network
> > namespace not owned by the initial user namespace which userspace has sent
> > a uevent to will need to wait until the lock becomes available.
> >
> > Implementation:
> > This patch gives each network namespace its own uevent sequence number.
> > Each network namespace not owned by the initial user namespace receives its
> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> > file it is clearly documented which lock has to be taken. All network
> > namespaces owned by the initial user namespace will still share the same
> > lock since they are all served sequentially via the uevent socket list.
> > This decouples the locking and ensures that the host retrieves uevents as
> > fast as possible even if there are a lot of uevents injected into network
> > namespaces not owned by the initial user namespace.  In addition, each
> > network namespace not owned by the initial user namespace does not have to
> > wait on any other network namespace not sharing the same user namespace.
> >
> > Testing:
> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> > with decoupled locking and one without. To ensure that testing made sense
> > both kernels carried the patch to remove network namespaces not owned by
> > the initial user namespace from the uevent socket list.
> > Three tests were constructed. All of them showed significant performance
> > improvements with per-netns uevent sequence numbers and decoupled locking.
> >
> >  # Testcase 1:
> >    Only Injecting Uevents into network namespaces not owned by the initial
> >    user namespace.
> >    - created 1000 new user namespace + network namespace pairs
> >    - opened a uevent listener in each of those namespace pairs
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high
> >      number of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >    - mean transaction time was calculated:
> >      - *without* uevent sequence number namespacing: 67 μs
> >      - *with* uevent sequence number namespacing:    55 μs
> >      - makes a difference of:                        12 μs
> >    - a t-test was performed on the two data vectors which revealed
> >      shows significant performance improvements:
> >      Welch Two Sample t-test
> >      data:  x1 and y1
> >      t = 405.16, df = 18883000, p-value < 2.2e-16
> >      alternative hypothesis: true difference in means is not equal to 0
> >      95 percent confidence interval:
> >      12.14949 12.26761
> >      sample estimates:
> >      mean of x mean of y
> >      68.48594  56.27739
> >
> >  # Testcase 2:
> >    Injecting Uevents into network namespaces not owned by the initial user
> >    namespace and network namespaces owned by the initial user namespace.
> >    - created 500 new user namespace + network namespace pairs
> >    - created 500 new network namespace pairs
> >    - opened a uevent listener in each of those namespace pairs
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high
> >      number of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >    - mean transaction time was calculated:
> >      - *without* uevent sequence number namespacing: 572 μs
> >      - *with* uevent sequence number namespacing:    514 μs
> >      - makes a difference of:                         58 μs
> >    - a t-test was performed on the two data vectors which revealed
> >      shows significant performance improvements:
> >      Welch Two Sample t-test
> >      data:  x2 and y2
> >      t = 38.685, df = 19682000, p-value < 2.2e-16
> >      alternative hypothesis: true difference in means is not equal to 0
> >      95 percent confidence interval:
> >      55.10630 60.98815
> >      sample estimates:
> >      mean of x mean of y
> >      572.9684  514.9211
> >
> >  # Testcase 3:
> >    Created 500 new user namespace + network namespace pairs *without uevent
> >    listeners*
> >    - created 500 new network namespace pairs *without uevent listeners*
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high number
> >      of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >     - mean transaction time was calculated:
> >       - *without* uevent sequence number namespacing: 206 μs
> >       - *with* uevent sequence number namespacing:    163 μs
> >       - makes a difference of:                         43 μs
> >     - a t-test was performed on the two data vectors which revealed
> >       shows significant performance improvements:
> >       Welch Two Sample t-test
> >       data:  x3 and y3
> >       t = 58.37, df = 17711000, p-value < 2.2e-16
> >       alternative hypothesis: true difference in means is not equal to 0
> >       95 percent confidence interval:
> >       41.77860 44.68178
> >       sample estimates:
> >       mean of x mean of y
> >       207.2632  164.0330
> >
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > Changelog v1->v2:
> > * non-functional change: fix indendation for C directives in
> >   kernel/ksysfs.c
> > Changelog v0->v1:
> > * add detailed test results to the commit message
> > * account for kernels compiled without CONFIG_NET
> > ---
> >  include/linux/kobject.h     |   2 +
> >  include/net/net_namespace.h |   3 ++
> >  kernel/ksysfs.c             |  11 +++-
> >  lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
> >  net/core/net_namespace.c    |  14 +++++
> >  5 files changed, 114 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> > index 7f6f93c3df9c..4e608968907f 100644
> > --- a/include/linux/kobject.h
> > +++ b/include/linux/kobject.h
> > @@ -36,8 +36,10 @@
> >  extern char uevent_helper[];
> >  #endif
> >  
> > +#ifndef CONFIG_NET
> >  /* counter to tag the uevent, read only except for the kobject core */
> >  extern u64 uevent_seqnum;
> > +#endif
> 
> That smells like an implementation bug somewhere.

Sorry, I'm not following. I'm I'm not mistaken there won't be any struct
net when CONFIG_NET=n. This has been reported by kbuild robot with alpha
and CONFIG_NET=n.

> 
> >  /*
> >   * The actions here must match the index to the string array
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 47e35cce3b64..e4e171b1ba69 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -85,6 +85,8 @@ struct net {
> >  	struct sock		*genl_sock;
> >  
> >  	struct uevent_sock	*uevent_sock;		/* uevent socket */
> > +	/* counter to tag the uevent, read only except for the kobject core */
> > +	u64                     uevent_seqnum;
> >  
> >  	struct list_head 	dev_base_head;
> >  	struct hlist_head 	*dev_name_head;
> > @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
> >  
> >  struct net *get_net_ns_by_pid(pid_t pid);
> >  struct net *get_net_ns_by_fd(int fd);
> > +u64 get_ns_uevent_seqnum_by_vpid(void);
> >  
> >  #ifdef CONFIG_SYSCTL
> >  void ipx_register_sysctl(void);
> > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
> > index 46ba853656f6..38b70b90a21f 100644
> > --- a/kernel/ksysfs.c
> > +++ b/kernel/ksysfs.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/capability.h>
> >  #include <linux/compiler.h>
> > +#include <net/net_namespace.h>
> >  
> >  #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
> >  
> > @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
> >  static ssize_t uevent_seqnum_show(struct kobject *kobj,
> >  				  struct kobj_attribute *attr, char *buf)
> >  {
> > -	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
> > +	u64 seqnum;
> > +
> > +#ifdef CONFIG_NET
> > +	seqnum = get_ns_uevent_seqnum_by_vpid();
> > +#else
> > +	seqnum = uevent_seqnum;
> > +#endif
> 
> This can be simplified to be just:
> 	seqnum = current->nsproxy->net_ns->uevent_seqnum;

Does that work even if CONFIG_NET=n?

> 
> Except that is not correct either.  As every instance of sysfs has a
> network namespace associated with it, and you are not fetching that
> network namespace.

I'm not yet familiar with all aspects of sysfs so thanks for pointing
that out. Then I'll try to come up with a way to fetch the network
namespace associated with sysfs. Unless you already know exactly how to
do this and can point it out.
This would also lets us drop get_ns_uevent_seqnum_by_vpid().

> 
> Typically this would call for making this file per network namespace
> so you would have this information available.  Sigh.  I don't know if
> there is an easy way to do that for this file.
> 
> > +
> > +	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
> >  }
> >  KERNEL_ATTR_RO(uevent_seqnum);
> >  
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index f5f5038787ac..5da20def556d 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -29,21 +29,42 @@
> >  #include <net/net_namespace.h>
> >  
> >  
> > +#ifndef CONFIG_NET
> >  u64 uevent_seqnum;
> > +#endif
> > +
> >  #ifdef CONFIG_UEVENT_HELPER
> >  char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
> >  #endif
> >  
> > +/*
> > + * Size a buffer needs to be in order to hold the largest possible sequence
> > + * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
> > + */
> > +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
> >  struct uevent_sock {
> >  	struct list_head list;
> >  	struct sock *sk;
> > +	/*
> > +	 * This mutex protects uevent sockets and the uevent counter of
> > +	 * network namespaces *not* owned by init_user_ns.
> > +	 * For network namespaces owned by init_user_ns this lock is *not*
> > +	 * valid instead the global uevent_sock_mutex must be used!
> > +	 */
> > +	struct mutex sk_mutex;
> >  };
> >  
> >  #ifdef CONFIG_NET
> >  static LIST_HEAD(uevent_sock_list);
> >  #endif
> >  
> > -/* This lock protects uevent_seqnum and uevent_sock_list */
> > +/*
> > + * This mutex protects uevent sockets and the uevent counter of network
> > + * namespaces owned by init_user_ns.
> > + * For network namespaces not owned by init_user_ns this lock is *not*
> > + * valid instead the network namespace specific sk_mutex in struct
> > + * uevent_sock must be used!
> > + */
> >  static DEFINE_MUTEX(uevent_sock_mutex);
> >  
> >  /* the strings here must match the enum in include/linux/kobject.h */
> > @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  
> >  	return 0;
> >  }
> > +
> > +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
> > +{
> > +	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
> > +		WARN(1, KERN_ERR "Failed to append sequence number. "
> > +		     "Too many uevent variables\n");
> > +		return false;
> > +	}
> > +
> > +	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
> > +		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> >  #endif
> >  
> >  #ifdef CONFIG_UEVENT_HELPER
> > @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  
> >  	/* send netlink message */
> >  	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> > +		/* bump sequence number */
> > +		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
> >  		struct sock *uevent_sock = ue_sk->sk;
> > +		char buf[SEQNUM_BUFSIZE];
> >  
> >  		if (!netlink_has_listeners(uevent_sock, 1))
> >  			continue;
> >  
> >  		if (!skb) {
> > -			/* allocate message with the maximum possible size */
> > +			/* calculate header length */
> >  			size_t len = strlen(action_string) + strlen(devpath) + 2;
> >  			char *scratch;
> >  
> > +			/* allocate message with the maximum possible size */
> >  			retval = -ENOMEM;
> > -			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> > +			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
> >  			if (!skb)
> >  				continue;
> >  
> > @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  			scratch = skb_put(skb, len);
> >  			sprintf(scratch, "%s@%s", action_string, devpath);
> >  
> > +			/* add env */
> >  			skb_put_data(skb, env->buf, env->buflen);
> >  
> >  			NETLINK_CB(skb).dst_group = 1;
> >  		}
> >  
> > +		/* prepare netns seqnum */
> > +		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
> > +		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
> > +			continue;
> > +		retval++;
> > +
> > +		if (!can_hold_seqnum(env, retval))
> > +			continue;
> 
> You have allocated enough space in the skb why does can_hold_seqnum make
> sense?

Because it doesn't check whether the socket buffer can hold the sequence
number but checks whether the uevent buffer size in "env" can hold it.
uevents are only delivered if the env buffer is large enough to hold all
of the info including the sequence number. That's independent of the
socket buffer.

> 
> Do you need to back seqnum out of the env later for this to work twice
> in a row?

I guess I can just override it. It just felt cleaner to trim it.

> 
> > +
> > +		/* append netns seqnum */
> > +		skb_put_data(skb, buf, retval);
> > +
> >  		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
> >  						    0, 1, GFP_KERNEL,
> >  						    kobj_bcast_filter,
> > @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  		/* ENOBUFS should be handled in userspace */
> >  		if (retval == -ENOBUFS || retval == -ESRCH)
> >  			retval = 0;
> > +
> > +		/* remove netns seqnum */
> > +		skb_trim(skb, env->buflen);
> 
> Have you checked to see if the seqnum actually makes it to userspace.

Yes, I did. I also wonder why it wouldn't make it. Any specific reason
why you suspect this?

> >  	}
> >  	consume_skb(skb);
> > +#else
> > +	uevent_seqnum++;
> >  #endif
> >  	return retval;
> >  }
> > @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> >  	}
> >  
> >  	mutex_lock(&uevent_sock_mutex);
> > -	/* we will send an event, so request a new sequence number */
> > -	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
> > -	if (retval) {
> > -		mutex_unlock(&uevent_sock_mutex);
> > -		goto exit;
> > -	}
> > -	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
> > -					      devpath);
> > +	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
> >  	mutex_unlock(&uevent_sock_mutex);
> 
> How does all of this work with events for network devices that are not
> in the initial network namespace.  This looks to me like this code fails
> to take the sk_mutex.

But in this list only non-initial network namespaces that are owned by
the initial user namespace are recorded and for these uevent_sock_mutex
has to be taken. Am I missing something?

> 
> >  #ifdef CONFIG_UEVENT_HELPER
> > @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> >  EXPORT_SYMBOL_GPL(add_uevent_var);
> >  
> >  #if defined(CONFIG_NET)
> > -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> > +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
> >  				struct netlink_ext_ack *extack)
> >  {
> > -	/* u64 to chars: 2^64 - 1 = 21 chars */
> > -	char buf[sizeof("SEQNUM=") + 21];
> > +	struct sock *usk = ue_sk->sk;
> > +	char buf[SEQNUM_BUFSIZE];
> >  	struct sk_buff *skbc;
> >  	int ret;
> >  
> >  	/* bump and prepare sequence number */
> > -	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
> > -	if (ret < 0 || (size_t)ret >= sizeof(buf))
> > +	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
> > +		       ++sock_net(ue_sk->sk)->uevent_seqnum);
> > +	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
> >  		return -ENOMEM;
> >  	ret++;
> >  
> > @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
> >  		return -EPERM;
> >  	}
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	if (net->user_ns == &init_user_ns)
> > +		mutex_lock(&uevent_sock_mutex);
> > +	else
> > +		mutex_lock(&net->uevent_sock->sk_mutex);
> > +	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
> > +	if (net->user_ns == &init_user_ns)
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	else
> > +		mutex_unlock(&net->uevent_sock->sk_mutex);
> >  
> >  	return ret;
> >  }
> > @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
> >  		mutex_lock(&uevent_sock_mutex);
> >  		list_add_tail(&ue_sk->list, &uevent_sock_list);
> >  		mutex_unlock(&uevent_sock_mutex);
> > +	} else {
> > +		/*
> > +		 * Uevent sockets and counters for network namespaces
> > +		 * not owned by the initial user namespace have their
> > +		 * own mutex.
> > +		 */
> > +		mutex_init(&ue_sk->sk_mutex);
> >  	}
> >  
> >  	return 0;
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index a11e03f920d3..8894638f5150 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
> >  }
> >  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
> >  
> > +u64 get_ns_uevent_seqnum_by_vpid(void)
> > +{
> > +	pid_t cur_pid;
> > +	struct net *net;
> > +
> > +	cur_pid = task_pid_vnr(current);
> > +	net = get_net_ns_by_pid(cur_pid);
> > +	if (IS_ERR(net))
> > +		return 0;
> > +
> > +	return net->uevent_seqnum;
> > +}
> > +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
> 
> I just have to say this function is completely crazy.
> You go from the tsk to the pid back to the tsk.
> And you leak a struct net pointer.
> 
> It is much simpler and less racy to say:
> 
> 	current->nsproxy->net_ns->uevent_seqnum;
> 
> That you are accessing current->nsproxy means nsproxy can't
> change.  The rcu_read_lock etc that get_net_ns_by_pid does
> is there for accessing non-current tasks.
> 
> 
> 
> >  static __net_init int net_ns_net_init(struct net *net)
> >  {
> >  #ifdef CONFIG_NET_NS

^ permalink raw reply

* [PATCH] selftests: net: add in_netns.sh TEST_GEN_PROGS_EXTENDED
From: Anders Roxell @ 2018-04-24 22:19 UTC (permalink / raw)
  To: davem, shuah; +Cc: netdev, linux-kselftest, linux-kernel, Anders Roxell

Script in_netns.sh is a utility function and not its own test so it
shouldn't be part of the TEST_PROGS. The in_netns.sh get used by
run_afpackettests.
To install in_netns.sh without being added to the main run_kselftest.sh
script use the TEST_GEN_PROGS_EXTENDED variable.

Fixes: 5ff9c1a3dd92 ("selftests: net: add in_netns.sh to TEST_PROGS")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
 tools/testing/selftests/net/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index c3761c35f542..4a8cfe8071a7 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,7 +5,8 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
+TEST_GEN_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH bpf-next] tools/bpf: remove test_sock_addr from TEST_GEN_PROGS
From: Daniel Borkmann @ 2018-04-24 22:01 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180424214504.3000168-1-yhs@fb.com>

On 04/24/2018 11:45 PM, Yonghong Song wrote:
> Since test_sock_addr is not supposed to run by itself,
> remove it from TEST_GEN_PROGS and add it to
> TEST_GEN_PROGS_EXTENDED. This way, run_tests will
> not run test_sock_addr. The corresponding test to run
> is test_sock_addr.sh.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf-next, thanks Yonghong!

^ permalink raw reply

* Re: [PATCH v2] selftests: bpf: update .gitignore with missing file
From: Daniel Borkmann @ 2018-04-24 21:56 UTC (permalink / raw)
  To: Anders Roxell, ast, shuah; +Cc: netdev, linux-kernel, linux-kselftest
In-Reply-To: <20180423225305.18947-1-anders.roxell@linaro.org>

On 04/24/2018 12:53 AM, Anders Roxell wrote:
> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> ---
> Rebased against bpf-next.

Applied to bpf-next, thanks Anders!

^ permalink raw reply

* Re: [PATCH net-next 1/2 v2] netns: restrict uevents
From: Eric W. Biederman @ 2018-04-24 21:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180424204335.12904-2-christian.brauner@ubuntu.com>


We already do this in practice in userspace.  It doesn't make much
sense to perform this delivery.  So we might as well make this optimization.

Christian Brauner <christian.brauner@ubuntu.com> writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
>  lib/kobject_uevent.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..f5f5038787ac 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
>  
>  	net->uevent_sock = ue_sk;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_add_tail(&ue_sk->list, &uevent_sock_list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	/* Restrict uevents to initial user namespace. */
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_add_tail(&ue_sk->list, &uevent_sock_list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
>  {
>  	struct uevent_sock *ue_sk = net->uevent_sock;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_del(&ue_sk->list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_del(&ue_sk->list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
>  
>  	netlink_kernel_release(ue_sk->sk);
>  	kfree(ue_sk);

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/2] bpf: allow map helpers access to map values directly
From: Daniel Borkmann @ 2018-04-24 21:54 UTC (permalink / raw)
  To: Paul Chaignon, Alexei Starovoitov, netdev; +Cc: iovisor-dev, paul.chaignon
In-Reply-To: <20180424130636.GA6270@Nover>

On 04/24/2018 03:06 PM, Paul Chaignon wrote:
> Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
> can only access stack and packet memory.  This patchset allows these
> helpers to directly access map values by passing registers of type
> PTR_TO_MAP_VALUE.
> 
> The first patch changes the verifier; the second adds new test cases.
> 
> The first three versions of this patchset were sent on the iovisor-dev
> mailing list only.
> 
> Changelogs:
>   Changes in v5:
>     - Refactor using check_helper_mem_access.
>   Changes in v4:
>     - Rebase.
>   Changes in v3:
>     - Bug fixes.
>     - Negative test cases.
>   Changes in v2:
>     - Additional test cases for adjusted maps.

Applied to bpf-next, thanks Paul!

^ permalink raw reply

* Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks
From: Eric W. Biederman @ 2018-04-24 21:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180424204335.12904-3-christian.brauner@ubuntu.com>

Christian Brauner <christian.brauner@ubuntu.com> writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking everything.
> Testing revealed significant performance improvements. For details see
> "Testing" below.

Maybe.  Your locking is wrong, and a few other things are wrong.  see
below.

> Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> owned by the intial user namespace can be sent uevents from a sufficiently
> privileged userspace process.
> In order to send a uevent into a network namespace not owned by the initial
> user namespace we currently still need to take the *global mutex* that
> locks the uevent socket list even though the list *only contains network
> namespaces owned by the initial user namespace*. This needs to be done
> because the uevent counter is a global variable. Taking the global lock is
> performance sensitive since a user on the host can spawn a pool of n
> process that each create their own new user and network namespaces and then
> go on to inject uevents in parallel into the network namespace of all of
> these processes. This can have a significant performance impact for the
> host's udevd since it means that there can be a lot of delay between a
> device being added and the corresponding uevent being sent out and
> available for processing by udevd. It also means that each network
> namespace not owned by the initial user namespace which userspace has sent
> a uevent to will need to wait until the lock becomes available.
>
> Implementation:
> This patch gives each network namespace its own uevent sequence number.
> Each network namespace not owned by the initial user namespace receives its
> own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> file it is clearly documented which lock has to be taken. All network
> namespaces owned by the initial user namespace will still share the same
> lock since they are all served sequentially via the uevent socket list.
> This decouples the locking and ensures that the host retrieves uevents as
> fast as possible even if there are a lot of uevents injected into network
> namespaces not owned by the initial user namespace.  In addition, each
> network namespace not owned by the initial user namespace does not have to
> wait on any other network namespace not sharing the same user namespace.
>
> Testing:
> Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> with decoupled locking and one without. To ensure that testing made sense
> both kernels carried the patch to remove network namespaces not owned by
> the initial user namespace from the uevent socket list.
> Three tests were constructed. All of them showed significant performance
> improvements with per-netns uevent sequence numbers and decoupled locking.
>
>  # Testcase 1:
>    Only Injecting Uevents into network namespaces not owned by the initial
>    user namespace.
>    - created 1000 new user namespace + network namespace pairs
>    - opened a uevent listener in each of those namespace pairs
>    - injected uevents into each of those network namespaces 10,000 times
>      meaning 10,000,000 (10 million) uevents were injected. (The high
>      number of uevent injections should get rid of a lot of jitter.)
>      The injection was done by fork()ing 1000 uevent injectors in a simple
>      for-loop to ensure that uevents were injected in parallel.
>    - mean transaction time was calculated:
>      - *without* uevent sequence number namespacing: 67 μs
>      - *with* uevent sequence number namespacing:    55 μs
>      - makes a difference of:                        12 μs
>    - a t-test was performed on the two data vectors which revealed
>      shows significant performance improvements:
>      Welch Two Sample t-test
>      data:  x1 and y1
>      t = 405.16, df = 18883000, p-value < 2.2e-16
>      alternative hypothesis: true difference in means is not equal to 0
>      95 percent confidence interval:
>      12.14949 12.26761
>      sample estimates:
>      mean of x mean of y
>      68.48594  56.27739
>
>  # Testcase 2:
>    Injecting Uevents into network namespaces not owned by the initial user
>    namespace and network namespaces owned by the initial user namespace.
>    - created 500 new user namespace + network namespace pairs
>    - created 500 new network namespace pairs
>    - opened a uevent listener in each of those namespace pairs
>    - injected uevents into each of those network namespaces 10,000 times
>      meaning 10,000,000 (10 million) uevents were injected. (The high
>      number of uevent injections should get rid of a lot of jitter.)
>      The injection was done by fork()ing 1000 uevent injectors in a simple
>      for-loop to ensure that uevents were injected in parallel.
>    - mean transaction time was calculated:
>      - *without* uevent sequence number namespacing: 572 μs
>      - *with* uevent sequence number namespacing:    514 μs
>      - makes a difference of:                         58 μs
>    - a t-test was performed on the two data vectors which revealed
>      shows significant performance improvements:
>      Welch Two Sample t-test
>      data:  x2 and y2
>      t = 38.685, df = 19682000, p-value < 2.2e-16
>      alternative hypothesis: true difference in means is not equal to 0
>      95 percent confidence interval:
>      55.10630 60.98815
>      sample estimates:
>      mean of x mean of y
>      572.9684  514.9211
>
>  # Testcase 3:
>    Created 500 new user namespace + network namespace pairs *without uevent
>    listeners*
>    - created 500 new network namespace pairs *without uevent listeners*
>    - injected uevents into each of those network namespaces 10,000 times
>      meaning 10,000,000 (10 million) uevents were injected. (The high number
>      of uevent injections should get rid of a lot of jitter.)
>      The injection was done by fork()ing 1000 uevent injectors in a simple
>      for-loop to ensure that uevents were injected in parallel.
>     - mean transaction time was calculated:
>       - *without* uevent sequence number namespacing: 206 μs
>       - *with* uevent sequence number namespacing:    163 μs
>       - makes a difference of:                         43 μs
>     - a t-test was performed on the two data vectors which revealed
>       shows significant performance improvements:
>       Welch Two Sample t-test
>       data:  x3 and y3
>       t = 58.37, df = 17711000, p-value < 2.2e-16
>       alternative hypothesis: true difference in means is not equal to 0
>       95 percent confidence interval:
>       41.77860 44.68178
>       sample estimates:
>       mean of x mean of y
>       207.2632  164.0330
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> Changelog v1->v2:
> * non-functional change: fix indendation for C directives in
>   kernel/ksysfs.c
> Changelog v0->v1:
> * add detailed test results to the commit message
> * account for kernels compiled without CONFIG_NET
> ---
>  include/linux/kobject.h     |   2 +
>  include/net/net_namespace.h |   3 ++
>  kernel/ksysfs.c             |  11 +++-
>  lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
>  net/core/net_namespace.c    |  14 +++++
>  5 files changed, 114 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> index 7f6f93c3df9c..4e608968907f 100644
> --- a/include/linux/kobject.h
> +++ b/include/linux/kobject.h
> @@ -36,8 +36,10 @@
>  extern char uevent_helper[];
>  #endif
>  
> +#ifndef CONFIG_NET
>  /* counter to tag the uevent, read only except for the kobject core */
>  extern u64 uevent_seqnum;
> +#endif

That smells like an implementation bug somewhere.

>  /*
>   * The actions here must match the index to the string array
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 47e35cce3b64..e4e171b1ba69 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -85,6 +85,8 @@ struct net {
>  	struct sock		*genl_sock;
>  
>  	struct uevent_sock	*uevent_sock;		/* uevent socket */
> +	/* counter to tag the uevent, read only except for the kobject core */
> +	u64                     uevent_seqnum;
>  
>  	struct list_head 	dev_base_head;
>  	struct hlist_head 	*dev_name_head;
> @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
>  
>  struct net *get_net_ns_by_pid(pid_t pid);
>  struct net *get_net_ns_by_fd(int fd);
> +u64 get_ns_uevent_seqnum_by_vpid(void);
>  
>  #ifdef CONFIG_SYSCTL
>  void ipx_register_sysctl(void);
> diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
> index 46ba853656f6..38b70b90a21f 100644
> --- a/kernel/ksysfs.c
> +++ b/kernel/ksysfs.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched.h>
>  #include <linux/capability.h>
>  #include <linux/compiler.h>
> +#include <net/net_namespace.h>
>  
>  #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
>  
> @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
>  static ssize_t uevent_seqnum_show(struct kobject *kobj,
>  				  struct kobj_attribute *attr, char *buf)
>  {
> -	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
> +	u64 seqnum;
> +
> +#ifdef CONFIG_NET
> +	seqnum = get_ns_uevent_seqnum_by_vpid();
> +#else
> +	seqnum = uevent_seqnum;
> +#endif

This can be simplified to be just:
	seqnum = current->nsproxy->net_ns->uevent_seqnum;

Except that is not correct either.  As every instance of sysfs has a
network namespace associated with it, and you are not fetching that
network namespace.

Typically this would call for making this file per network namespace
so you would have this information available.  Sigh.  I don't know if
there is an easy way to do that for this file.

> +
> +	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
>  }
>  KERNEL_ATTR_RO(uevent_seqnum);
>  
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index f5f5038787ac..5da20def556d 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -29,21 +29,42 @@
>  #include <net/net_namespace.h>
>  
>  
> +#ifndef CONFIG_NET
>  u64 uevent_seqnum;
> +#endif
> +
>  #ifdef CONFIG_UEVENT_HELPER
>  char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
>  #endif
>  
> +/*
> + * Size a buffer needs to be in order to hold the largest possible sequence
> + * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
> + */
> +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
>  struct uevent_sock {
>  	struct list_head list;
>  	struct sock *sk;
> +	/*
> +	 * This mutex protects uevent sockets and the uevent counter of
> +	 * network namespaces *not* owned by init_user_ns.
> +	 * For network namespaces owned by init_user_ns this lock is *not*
> +	 * valid instead the global uevent_sock_mutex must be used!
> +	 */
> +	struct mutex sk_mutex;
>  };
>  
>  #ifdef CONFIG_NET
>  static LIST_HEAD(uevent_sock_list);
>  #endif
>  
> -/* This lock protects uevent_seqnum and uevent_sock_list */
> +/*
> + * This mutex protects uevent sockets and the uevent counter of network
> + * namespaces owned by init_user_ns.
> + * For network namespaces not owned by init_user_ns this lock is *not*
> + * valid instead the network namespace specific sk_mutex in struct
> + * uevent_sock must be used!
> + */
>  static DEFINE_MUTEX(uevent_sock_mutex);
>  
>  /* the strings here must match the enum in include/linux/kobject.h */
> @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  
>  	return 0;
>  }
> +
> +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
> +{
> +	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
> +		WARN(1, KERN_ERR "Failed to append sequence number. "
> +		     "Too many uevent variables\n");
> +		return false;
> +	}
> +
> +	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
> +		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
> +		return false;
> +	}
> +
> +	return true;
> +}
>  #endif
>  
>  #ifdef CONFIG_UEVENT_HELPER
> @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>  
>  	/* send netlink message */
>  	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> +		/* bump sequence number */
> +		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
>  		struct sock *uevent_sock = ue_sk->sk;
> +		char buf[SEQNUM_BUFSIZE];
>  
>  		if (!netlink_has_listeners(uevent_sock, 1))
>  			continue;
>  
>  		if (!skb) {
> -			/* allocate message with the maximum possible size */
> +			/* calculate header length */
>  			size_t len = strlen(action_string) + strlen(devpath) + 2;
>  			char *scratch;
>  
> +			/* allocate message with the maximum possible size */
>  			retval = -ENOMEM;
> -			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> +			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
>  			if (!skb)
>  				continue;
>  
> @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>  			scratch = skb_put(skb, len);
>  			sprintf(scratch, "%s@%s", action_string, devpath);
>  
> +			/* add env */
>  			skb_put_data(skb, env->buf, env->buflen);
>  
>  			NETLINK_CB(skb).dst_group = 1;
>  		}
>  
> +		/* prepare netns seqnum */
> +		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
> +		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
> +			continue;
> +		retval++;
> +
> +		if (!can_hold_seqnum(env, retval))
> +			continue;

You have allocated enough space in the skb why does can_hold_seqnum make
sense?

Do you need to back seqnum out of the env later for this to work twice
in a row?

> +
> +		/* append netns seqnum */
> +		skb_put_data(skb, buf, retval);
> +
>  		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
>  						    0, 1, GFP_KERNEL,
>  						    kobj_bcast_filter,
> @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>  		/* ENOBUFS should be handled in userspace */
>  		if (retval == -ENOBUFS || retval == -ESRCH)
>  			retval = 0;
> +
> +		/* remove netns seqnum */
> +		skb_trim(skb, env->buflen);

Have you checked to see if the seqnum actually makes it to userspace.
>  	}
>  	consume_skb(skb);
> +#else
> +	uevent_seqnum++;
>  #endif
>  	return retval;
>  }
> @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>  	}
>  
>  	mutex_lock(&uevent_sock_mutex);
> -	/* we will send an event, so request a new sequence number */
> -	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
> -	if (retval) {
> -		mutex_unlock(&uevent_sock_mutex);
> -		goto exit;
> -	}
> -	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
> -					      devpath);
> +	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
>  	mutex_unlock(&uevent_sock_mutex);

How does all of this work with events for network devices that are not
in the initial network namespace.  This looks to me like this code fails
to take the sk_mutex.

>  #ifdef CONFIG_UEVENT_HELPER
> @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>  EXPORT_SYMBOL_GPL(add_uevent_var);
>  
>  #if defined(CONFIG_NET)
> -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
>  				struct netlink_ext_ack *extack)
>  {
> -	/* u64 to chars: 2^64 - 1 = 21 chars */
> -	char buf[sizeof("SEQNUM=") + 21];
> +	struct sock *usk = ue_sk->sk;
> +	char buf[SEQNUM_BUFSIZE];
>  	struct sk_buff *skbc;
>  	int ret;
>  
>  	/* bump and prepare sequence number */
> -	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
> -	if (ret < 0 || (size_t)ret >= sizeof(buf))
> +	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
> +		       ++sock_net(ue_sk->sk)->uevent_seqnum);
> +	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
>  		return -ENOMEM;
>  	ret++;
>  
> @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		return -EPERM;
>  	}
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
> -	mutex_unlock(&uevent_sock_mutex);
> +	if (net->user_ns == &init_user_ns)
> +		mutex_lock(&uevent_sock_mutex);
> +	else
> +		mutex_lock(&net->uevent_sock->sk_mutex);
> +	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
> +	if (net->user_ns == &init_user_ns)
> +		mutex_unlock(&uevent_sock_mutex);
> +	else
> +		mutex_unlock(&net->uevent_sock->sk_mutex);
>  
>  	return ret;
>  }
> @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
>  		mutex_lock(&uevent_sock_mutex);
>  		list_add_tail(&ue_sk->list, &uevent_sock_list);
>  		mutex_unlock(&uevent_sock_mutex);
> +	} else {
> +		/*
> +		 * Uevent sockets and counters for network namespaces
> +		 * not owned by the initial user namespace have their
> +		 * own mutex.
> +		 */
> +		mutex_init(&ue_sk->sk_mutex);
>  	}
>  
>  	return 0;
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a11e03f920d3..8894638f5150 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
>  }
>  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>  
> +u64 get_ns_uevent_seqnum_by_vpid(void)
> +{
> +	pid_t cur_pid;
> +	struct net *net;
> +
> +	cur_pid = task_pid_vnr(current);
> +	net = get_net_ns_by_pid(cur_pid);
> +	if (IS_ERR(net))
> +		return 0;
> +
> +	return net->uevent_seqnum;
> +}
> +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);

I just have to say this function is completely crazy.
You go from the tsk to the pid back to the tsk.
And you leak a struct net pointer.

It is much simpler and less racy to say:

	current->nsproxy->net_ns->uevent_seqnum;

That you are accessing current->nsproxy means nsproxy can't
change.  The rcu_read_lock etc that get_net_ns_by_pid does
is there for accessing non-current tasks.



>  static __net_init int net_ns_net_init(struct net *net)
>  {
>  #ifdef CONFIG_NET_NS

^ permalink raw reply

* Re: [PATCH bpf-next,v3 0/2] bpf: add helper for getting xfrm states
From: Daniel Borkmann @ 2018-04-24 21:53 UTC (permalink / raw)
  To: Eyal Birger, netdev; +Cc: shmulik, ast
In-Reply-To: <1524581430-11921-1-git-send-email-eyal.birger@gmail.com>

On 04/24/2018 04:50 PM, Eyal Birger wrote:
> This patchset adds support for fetching XFRM state information from
> an eBPF program called from TC.
> 
> The first patch introduces a helper for fetching an XFRM state from the
> skb's secpath. The XFRM state is modeled using a new virtual struct which
> contains the SPI, peer address, and reqid values of the state; This struct
> can be extended in the future to provide additional state information.
> 
> The second patch adds a test example in test_tunnel_bpf.sh. The sample
> validates the correct extraction of state information by the eBPF program.
> 
> ---
> v3:
>   - Kept SPI and peer IPv4 address in state in network byte order
>     following suggestion from Alexei Starovoitov
> v2:
>   - Fixed two comments by Daniel Borkmann:
>     - disallow reserved flags in helper call
>     - avoid compiling in helper code when CONFIG_XFRM is off

Applied to bpf-next, thanks Eyal!

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Bjorn Helgaas @ 2018-04-24 21:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0Ufcq6aQW5zPzBUePO4R_BCa2B_aZ8U2=meOQeOvshZ=Mw@mail.gmail.com>

On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:

> > For example, I'm not sure what you mean by "devices where the PF is
> > not capable of managing VF resources."
> >
> > It *sounds* like you're saying the hardware works differently on some
> > devices, but I don't think that's what you mean.  I think you're
> > saying something about which drivers are used for the PF and the VF.
> 
> That is sort of what I am saying.
> 
> So for example with ixgbe there is functionality which is controlled
> in the MMIO space of the PF that affects the functionality of the VFs
> that are generated on the device. The PF has to rearrange the
> resources such as queues and interrupts on the device before it can
> enable SR-IOV, and it could alter those later to limit what the VF is
> capable of doing.
> 
> The model I am dealing with via this patch set has a PF that is not
> much different than the VFs other than the fact that it has some
> extended configuration space bits in place for SR-IOV, ARI, ACS, and
> whatever other bits are needed in order to support spawning isolated
> VFs.

OK, thanks for the explanation, I think I understand what's going on
now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
too.

These are on pci/virtualization for v4.18.


commit 8effc395c209
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date:   Sat Apr 21 15:23:09 2018 -0500

    PCI/IOV: Add pci_sriov_configure_simple()
    
    SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
    PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
    to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
    VFs (Virtual Functions) may be created.  The VFs can be individually
    assigned to virtual machines, which allows them to share a single hardware
    device while being isolated from each other.
    
    Some SR-IOV devices have resources such as queues and interrupts that must
    be set up in the PF before enabling the VFs, so they require a PF driver to
    do that.
    
    Other SR-IOV devices don't require any PF setup before enabling VFs.  Add a
    pci_sriov_configure_simple() interface so PF drivers for such devices can
    use it without repeating the VF-enabling code.
    
    Tested-by: Mark Rustad <mark.d.rustad@intel.com>
    Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
    [bhelgaas: changelog, comment]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Greg Rose <gvrose8192@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>:wq

commit a8ccf8a66663
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date:   Tue Apr 24 16:47:16 2018 -0500

    PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs
    
    Some SR-IOV PF devices provide no functionality other than acting as a
    means of enabling VFs.  For these devices, we want to enable the VFs and
    assign them to guest virtual machines, but there's no need to have a driver
    for the PF itself.
    
    Add a new pci-pf-stub driver to claim those PF devices and provide the
    generic VF enable functionality.  An administrator can use the sysfs
    "sriov_numvfs" file to enable VFs, then assign them to guests.
    
    For now I only have one example ID provided by Amazon in terms of devices
    that require this functionality.  The general idea is that in the future we
    will see other devices added as vendors come up with devices where the PF
    is more or less just a lightweight shim used to allocate VFs.
    
    Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
    [bhelgaas: changelog]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Greg Rose <gvrose8192@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

commit 115ddc491922
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date:   Tue Apr 24 16:47:22 2018 -0500

    net: ena: Use pci_sriov_configure_simple() to enable VFs
    
    Instead of implementing our own version of a SR-IOV configuration stub in
    the ena driver, use the existing pci_sriov_configure_simple() function.
    
    Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Greg Rose <gvrose8192@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

commit 74d986abc20b
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date:   Tue Apr 24 16:47:27 2018 -0500

    nvme-pci: Use pci_sriov_configure_simple() to enable VFs
    
    Instead of implementing our own version of a SR-IOV configuration stub in
    the nvme driver, use the existing pci_sriov_configure_simple() function.
    
    Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* [PATCH bpf-next] tools/bpf: remove test_sock_addr from TEST_GEN_PROGS
From: Yonghong Song @ 2018-04-24 21:45 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Since test_sock_addr is not supposed to run by itself,
remove it from TEST_GEN_PROGS and add it to
TEST_GEN_PROGS_EXTENDED. This way, run_tests will
not run test_sock_addr. The corresponding test to run
is test_sock_addr.sh.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..64037ee 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-	test_sock test_sock_addr test_btf
+	test_sock test_btf
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
@@ -43,7 +43,7 @@ TEST_PROGS := test_kmod.sh \
 	test_sock_addr.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
 
 include ../lib.mk
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next] sctp: fix identification of new acks for SFR-CACC
From: Marcelo Ricardo Leitner @ 2018-04-24 21:17 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Xin Long, Vlad Yasevich, Neil Horman

It's currently written as:

if (!tchunk->tsn_gap_acked) {   [1]
	tchunk->tsn_gap_acked = 1;
	...
}

if (TSN_lte(tsn, sack_ctsn)) {
	if (!tchunk->tsn_gap_acked) {
		/* SFR-CACC processing */
		...
	}
}

Which causes the SFR-CACC processing on ack reception to never process,
as tchunk->tsn_gap_acked is always true by then. Block [1] was
moved to that position by the commit marked below.

This patch fixes it by doing SFR-CACC processing earlier, before
tsn_gap_acked is set to true.

Fixes: 31b02e154940 ("sctp: Failover transmitted list on transport delete")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---

Even though this is a -stable candidate, please apply it to net-next
to avoid conflicts with subsequent patches in my queue. Thanks.

 net/sctp/outqueue.c | 48 +++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index f211b3db6a3543073e113da121bb28518b0af491..dee7cbd5483149024f2f3195db2fe4d473b1a00a 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1457,7 +1457,7 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 			 * the outstanding bytes for this chunk, so only
 			 * count bytes associated with a transport.
 			 */
-			if (transport) {
+			if (transport && !tchunk->tsn_gap_acked) {
 				/* If this chunk is being used for RTT
 				 * measurement, calculate the RTT and update
 				 * the RTO using this value.
@@ -1469,14 +1469,34 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 				 * first instance of the packet or a later
 				 * instance).
 				 */
-				if (!tchunk->tsn_gap_acked &&
-				    !sctp_chunk_retransmitted(tchunk) &&
+				if (!sctp_chunk_retransmitted(tchunk) &&
 				    tchunk->rtt_in_progress) {
 					tchunk->rtt_in_progress = 0;
 					rtt = jiffies - tchunk->sent_at;
 					sctp_transport_update_rto(transport,
 								  rtt);
 				}
+
+				if (TSN_lte(tsn, sack_ctsn)) {
+					/*
+					 * SFR-CACC algorithm:
+					 * 2) If the SACK contains gap acks
+					 * and the flag CHANGEOVER_ACTIVE is
+					 * set the receiver of the SACK MUST
+					 * take the following action:
+					 *
+					 * B) For each TSN t being acked that
+					 * has not been acked in any SACK so
+					 * far, set cacc_saw_newack to 1 for
+					 * the destination that the TSN was
+					 * sent to.
+					 */
+					if (sack->num_gap_ack_blocks &&
+					    q->asoc->peer.primary_path->cacc.
+					    changeover_active)
+						transport->cacc.cacc_saw_newack
+							= 1;
+				}
 			}

 			/* If the chunk hasn't been marked as ACKED,
@@ -1508,28 +1528,6 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 				restart_timer = 1;
 				forward_progress = true;

-				if (!tchunk->tsn_gap_acked) {
-					/*
-					 * SFR-CACC algorithm:
-					 * 2) If the SACK contains gap acks
-					 * and the flag CHANGEOVER_ACTIVE is
-					 * set the receiver of the SACK MUST
-					 * take the following action:
-					 *
-					 * B) For each TSN t being acked that
-					 * has not been acked in any SACK so
-					 * far, set cacc_saw_newack to 1 for
-					 * the destination that the TSN was
-					 * sent to.
-					 */
-					if (transport &&
-					    sack->num_gap_ack_blocks &&
-					    q->asoc->peer.primary_path->cacc.
-					    changeover_active)
-						transport->cacc.cacc_saw_newack
-							= 1;
-				}
-
 				list_add_tail(&tchunk->transmitted_list,
 					      &q->sacked);
 			} else {

^ permalink raw reply related

* [PATCH net-next] sctp: fix const parameter violation in sctp_make_sack
From: Marcelo Ricardo Leitner @ 2018-04-24 21:17 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Xin Long, Vlad Yasevich, Neil Horman

sctp_make_sack() make changes to the asoc and this cast is just
bypassing the const attribute. As there is no need to have the const
there, just remove it and fix the violation.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---

This one can go to net or net-next, but targetting net-next here just to
keep it together with the rest (which I'll post as patches get in).

 include/net/sctp/sm.h    | 2 +-
 net/sctp/sm_make_chunk.c | 9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 2d0e782c90551377ad654bcef1224bbdb75ba394..f4b657478a304050851f33d92c71162a4a4a2e50 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -207,7 +207,7 @@ struct sctp_chunk *sctp_make_datafrag_empty(const struct sctp_association *asoc,
 					    int len, __u8 flags, gfp_t gfp);
 struct sctp_chunk *sctp_make_ecne(const struct sctp_association *asoc,
 				  const __u32 lowest_tsn);
-struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc);
+struct sctp_chunk *sctp_make_sack(struct sctp_association *asoc);
 struct sctp_chunk *sctp_make_shutdown(const struct sctp_association *asoc,
 				      const struct sctp_chunk *chunk);
 struct sctp_chunk *sctp_make_shutdown_ack(const struct sctp_association *asoc,
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 5a4fb1dc8400a0316177ce65be8126857297eb5e..db93eabd6ef500ab20be6e091ee06bd3b60923c9 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -779,10 +779,9 @@ struct sctp_chunk *sctp_make_datafrag_empty(const struct sctp_association *asoc,
  * association.  This reports on which TSN's we've seen to date,
  * including duplicates and gaps.
  */
-struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
+struct sctp_chunk *sctp_make_sack(struct sctp_association *asoc)
 {
 	struct sctp_tsnmap *map = (struct sctp_tsnmap *)&asoc->peer.tsn_map;
-	struct sctp_association *aptr = (struct sctp_association *)asoc;
 	struct sctp_gap_ack_block gabs[SCTP_MAX_GABS];
 	__u16 num_gabs, num_dup_tsns;
 	struct sctp_transport *trans;
@@ -857,7 +856,7 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)

 	/* Add the duplicate TSN information.  */
 	if (num_dup_tsns) {
-		aptr->stats.idupchunks += num_dup_tsns;
+		asoc->stats.idupchunks += num_dup_tsns;
 		sctp_addto_chunk(retval, sizeof(__u32) * num_dup_tsns,
 				 sctp_tsnmap_get_dups(map));
 	}
@@ -869,11 +868,11 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
 	 * association so no transport will match after a wrap event like this,
 	 * Until the next sack
 	 */
-	if (++aptr->peer.sack_generation == 0) {
+	if (++asoc->peer.sack_generation == 0) {
 		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
 				    transports)
 			trans->sack_generation = 0;
-		aptr->peer.sack_generation = 1;
+		asoc->peer.sack_generation = 1;
 	}
 nodata:
 	return retval;

^ permalink raw reply related

* [PATCH net-next] neighbour: support for NTF_EXT_LEARNED flag
From: Roopa Prabhu @ 2018-04-24 20:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, dsa

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch extends NTF_EXT_LEARNED support to the neighbour system.
Example use-case: An Ethernet VPN implementation (eg in FRR routing suite)
can use this flag to add dynamic reachable external neigh entires
learned via control plane. The use of neigh NTF_EXT_LEARNED in this
patch is consistent with its use with bridge and vxlan fdb entries.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/net/neighbour.h | 19 ++++++++++++++++++-
 net/core/neighbour.c    |  8 +++++++-
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e421f86..6c1eecd 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -246,6 +246,7 @@ static inline void *neighbour_priv(const struct neighbour *n)
 #define NEIGH_UPDATE_F_OVERRIDE			0x00000001
 #define NEIGH_UPDATE_F_WEAK_OVERRIDE		0x00000002
 #define NEIGH_UPDATE_F_OVERRIDE_ISROUTER	0x00000004
+#define NEIGH_UPDATE_F_EXT_LEARNED		0x20000000
 #define NEIGH_UPDATE_F_ISROUTER			0x40000000
 #define NEIGH_UPDATE_F_ADMIN			0x80000000
 
@@ -526,5 +527,21 @@ static inline void neigh_ha_snapshot(char *dst, const struct neighbour *n,
 	} while (read_seqretry(&n->ha_lock, seq));
 }
 
-
+static inline void neigh_update_ext_learned(struct neighbour *neigh, u32 flags,
+					    int *notify)
+{
+	u8 ndm_flags = 0;
+
+	if (!(flags & NEIGH_UPDATE_F_ADMIN))
+		return;
+
+	ndm_flags |= (flags & NEIGH_UPDATE_F_EXT_LEARNED) ? NTF_EXT_LEARNED : 0;
+	if ((neigh->flags ^ ndm_flags) & NTF_EXT_LEARNED) {
+		if (ndm_flags & NTF_EXT_LEARNED)
+			neigh->flags |= NTF_EXT_LEARNED;
+		else
+			neigh->flags &= ~NTF_EXT_LEARNED;
+		*notify = 1;
+	}
+}
 #endif
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ce51986..5afae29 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -820,7 +820,8 @@ static void neigh_periodic_work(struct work_struct *work)
 			write_lock(&n->lock);
 
 			state = n->nud_state;
-			if (state & (NUD_PERMANENT | NUD_IN_TIMER)) {
+			if ((state & (NUD_PERMANENT | NUD_IN_TIMER)) ||
+			    (n->flags & NTF_EXT_LEARNED)) {
 				write_unlock(&n->lock);
 				goto next_elt;
 			}
@@ -1136,6 +1137,8 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 	if (neigh->dead)
 		goto out;
 
+	neigh_update_ext_learned(neigh, flags, &notify);
+
 	if (!(new & NUD_VALID)) {
 		neigh_del_timer(neigh);
 		if (old & NUD_CONNECTED)
@@ -1781,6 +1784,9 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 			flags &= ~NEIGH_UPDATE_F_OVERRIDE;
 	}
 
+	if (ndm->ndm_flags & NTF_EXT_LEARNED)
+		flags |= NEIGH_UPDATE_F_EXT_LEARNED;
+
 	if (ndm->ndm_flags & NTF_USE) {
 		neigh_event_send(neigh, NULL);
 		err = 0;
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-24 20:45 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180424.153925.2063217911734444324.davem@davemloft.net>

On Tue, Apr 24, 2018 at 03:39:25PM -0400, David Miller wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> Date: Mon, 23 Apr 2018 12:24:43 +0200
> 
> > +	#ifdef CONFIG_NET
> > +		seqnum = get_ns_uevent_seqnum_by_vpid();
> > +	#else
> > +		seqnum = uevent_seqnum;
> > +	#endif
> 
> Please don't indend the code like this.
> 
> By indenting the CPP directives, which should be at column zero, the
> actual code became double indented.

Ah, sorry. Sent v2 with the indendation fixed just now.

Thanks!
Christian

^ permalink raw reply

* [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-24 20:43 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180424204335.12904-1-christian.brauner@ubuntu.com>

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.
Testing revealed significant performance improvements. For details see
"Testing" below.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the intial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
process that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.

Testing:
Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
with decoupled locking and one without. To ensure that testing made sense
both kernels carried the patch to remove network namespaces not owned by
the initial user namespace from the uevent socket list.
Three tests were constructed. All of them showed significant performance
improvements with per-netns uevent sequence numbers and decoupled locking.

 # Testcase 1:
   Only Injecting Uevents into network namespaces not owned by the initial
   user namespace.
   - created 1000 new user namespace + network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 67 μs
     - *with* uevent sequence number namespacing:    55 μs
     - makes a difference of:                        12 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x1 and y1
     t = 405.16, df = 18883000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     12.14949 12.26761
     sample estimates:
     mean of x mean of y
     68.48594  56.27739

 # Testcase 2:
   Injecting Uevents into network namespaces not owned by the initial user
   namespace and network namespaces owned by the initial user namespace.
   - created 500 new user namespace + network namespace pairs
   - created 500 new network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 572 μs
     - *with* uevent sequence number namespacing:    514 μs
     - makes a difference of:                         58 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x2 and y2
     t = 38.685, df = 19682000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     55.10630 60.98815
     sample estimates:
     mean of x mean of y
     572.9684  514.9211

 # Testcase 3:
   Created 500 new user namespace + network namespace pairs *without uevent
   listeners*
   - created 500 new network namespace pairs *without uevent listeners*
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high number
     of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
    - mean transaction time was calculated:
      - *without* uevent sequence number namespacing: 206 μs
      - *with* uevent sequence number namespacing:    163 μs
      - makes a difference of:                         43 μs
    - a t-test was performed on the two data vectors which revealed
      shows significant performance improvements:
      Welch Two Sample t-test
      data:  x3 and y3
      t = 58.37, df = 17711000, p-value < 2.2e-16
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval:
      41.77860 44.68178
      sample estimates:
      mean of x mean of y
      207.2632  164.0330

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v1->v2:
* non-functional change: fix indendation for C directives in
  kernel/ksysfs.c
Changelog v0->v1:
* add detailed test results to the commit message
* account for kernels compiled without CONFIG_NET
---
 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..4e608968907f 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,8 +36,10 @@
 extern char uevent_helper[];
 #endif
 
+#ifndef CONFIG_NET
 /* counter to tag the uevent, read only except for the kobject core */
 extern u64 uevent_seqnum;
+#endif
 
 /*
  * The actions here must match the index to the string array
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	/* counter to tag the uevent, read only except for the kobject core */
+	u64                     uevent_seqnum;
 
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..38b70b90a21f 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/capability.h>
 #include <linux/compiler.h>
+#include <net/net_namespace.h>
 
 #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+	u64 seqnum;
+
+#ifdef CONFIG_NET
+	seqnum = get_ns_uevent_seqnum_by_vpid();
+#else
+	seqnum = uevent_seqnum;
+#endif
+
+	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..5da20def556d 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,42 @@
 #include <net/net_namespace.h>
 
 
+#ifndef CONFIG_NET
 u64 uevent_seqnum;
+#endif
+
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
 struct uevent_sock {
 	struct list_head list;
 	struct sock *sk;
+	/*
+	 * This mutex protects uevent sockets and the uevent counter of
+	 * network namespaces *not* owned by init_user_ns.
+	 * For network namespaces owned by init_user_ns this lock is *not*
+	 * valid instead the global uevent_sock_mutex must be used!
+	 */
+	struct mutex sk_mutex;
 };
 
 #ifdef CONFIG_NET
 static LIST_HEAD(uevent_sock_list);
 #endif
 
-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
 static DEFINE_MUTEX(uevent_sock_mutex);
 
 /* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 
 	return 0;
 }
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+		WARN(1, KERN_ERR "Failed to append sequence number. "
+		     "Too many uevent variables\n");
+		return false;
+	}
+
+	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+		return false;
+	}
+
+	return true;
+}
 #endif
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		/* bump sequence number */
+		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
 		struct sock *uevent_sock = ue_sk->sk;
+		char buf[SEQNUM_BUFSIZE];
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
 		if (!skb) {
-			/* allocate message with the maximum possible size */
+			/* calculate header length */
 			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			/* allocate message with the maximum possible size */
 			retval = -ENOMEM;
-			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
 			if (!skb)
 				continue;
 
@@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
+			/* add env */
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 		}
 
+		/* prepare netns seqnum */
+		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+			continue;
+		retval++;
+
+		if (!can_hold_seqnum(env, retval))
+			continue;
+
+		/* append netns seqnum */
+		skb_put_data(skb, buf, retval);
+
 		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
 						    0, 1, GFP_KERNEL,
 						    kobj_bcast_filter,
@@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		/* ENOBUFS should be handled in userspace */
 		if (retval == -ENOBUFS || retval == -ESRCH)
 			retval = 0;
+
+		/* remove netns seqnum */
+		skb_trim(skb, env->buflen);
 	}
 	consume_skb(skb);
+#else
+	uevent_seqnum++;
 #endif
 	return retval;
 }
@@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	}
 
 	mutex_lock(&uevent_sock_mutex);
-	/* we will send an event, so request a new sequence number */
-	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
-	if (retval) {
-		mutex_unlock(&uevent_sock_mutex);
-		goto exit;
-	}
-	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
-					      devpath);
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
 				struct netlink_ext_ack *extack)
 {
-	/* u64 to chars: 2^64 - 1 = 21 chars */
-	char buf[sizeof("SEQNUM=") + 21];
+	struct sock *usk = ue_sk->sk;
+	char buf[SEQNUM_BUFSIZE];
 	struct sk_buff *skbc;
 	int ret;
 
 	/* bump and prepare sequence number */
-	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
-	if (ret < 0 || (size_t)ret >= sizeof(buf))
+	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+		       ++sock_net(ue_sk->sk)->uevent_seqnum);
+	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
 		return -ENOMEM;
 	ret++;
 
@@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EPERM;
 	}
 
-	mutex_lock(&uevent_sock_mutex);
-	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
-	mutex_unlock(&uevent_sock_mutex);
+	if (net->user_ns == &init_user_ns)
+		mutex_lock(&uevent_sock_mutex);
+	else
+		mutex_lock(&net->uevent_sock->sk_mutex);
+	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+	if (net->user_ns == &init_user_ns)
+		mutex_unlock(&uevent_sock_mutex);
+	else
+		mutex_unlock(&net->uevent_sock->sk_mutex);
 
 	return ret;
 }
@@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
 		mutex_lock(&uevent_sock_mutex);
 		list_add_tail(&ue_sk->list, &uevent_sock_list);
 		mutex_unlock(&uevent_sock_mutex);
+	} else {
+		/*
+		 * Uevent sockets and counters for network namespaces
+		 * not owned by the initial user namespace have their
+		 * own mutex.
+		 */
+		mutex_init(&ue_sk->sk_mutex);
 	}
 
 	return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..8894638f5150 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+	pid_t cur_pid;
+	struct net *net;
+
+	cur_pid = task_pid_vnr(current);
+	net = get_net_ns_by_pid(cur_pid);
+	if (IS_ERR(net))
+		return 0;
+
+	return net->uevent_seqnum;
+}
+EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
+
 static __net_init int net_ns_net_init(struct net *net)
 {
 #ifdef CONFIG_NET_NS
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 1/2 v2] netns: restrict uevents
From: Christian Brauner @ 2018-04-24 20:43 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180424204335.12904-1-christian.brauner@ubuntu.com>

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v1->v2:
* patch unchanged
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 0/2 v2] netns: uevent performance tweaks
From: Christian Brauner @ 2018-04-24 20:43 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Hey everyone,

This is v2 of "netns: uevent performance tweaks" which contains *no
functional changes* just a minor indendation fix as requested by David.

Like Eric requested, I did extensive testing that prove significant
performance improvements when using per-netns uevent sequence numbers
with decoupled locks. The results and test descriptions were added to
the commit message of
[PATCH 2/2 v1] netns: isolate seqnums to use per-netns locks.

This series deals with a bunch of performance improvements when sending
out uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  A very detailed rationale is present in the commit message for
  [PATCH 1/2] netns: restrict uevents
- Decouple the locking for network namespaces in the global uevent
  socket list from the locking for network namespaces not in the global
  uevent socket list.
  A very detailed rationale including performance test results is
  present in the commit message for
  [PATCH 2/2] netns: isolate seqnums to use per-netns locks

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 122 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 126 insertions(+), 26 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2018-04-24 20:38 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix rtnl deadlock in ipvs, from Julian Anastasov.

2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls, bad MAC
   address update sequence, request side races on command IO timeouts).

3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.

4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.

5) Packet scheduler ife action fixes (malformed TLV lengths, etc.)
   from Alexander Aring.

6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.

7) Missing netlink attribute policies in rtm_ipv6_policy table, from
   Eric Dumazet.

8) Missing socket address length checks in l2tp and pppoe connect,
   from Guillaume Nault.

9) Fix netconsole over team and bonding, from Xin Long.

10) Fix race with AF_PACKET socket state bitfields, from Willem de
    Bruijn.

Pulled, thanks a lot!

The following changes since commit 83beed7b2b26f232d782127792dd0cd4362fdc41:

  Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal (2018-04-20 10:56:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to d19efb729f10339f91c35003d480dc718cae3b3c:

  Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue (2018-04-24 16:17:59 -0400)

----------------------------------------------------------------
Ahmed Abdelsalam (1):
      ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts

Alexander Aring (3):
      net: sched: ife: signal not finding metaid
      net: sched: ife: handle malformed tlv length
      net: sched: ife: check on metadata length

Anders Roxell (1):
      selftests: bpf: update .gitignore with missing generated files

Anirudh Venkataramanan (2):
      ice: Fix initialization for num_nodes_added
      ice: Fix incorrect comment for action type

Arnd Bergmann (1):
      netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

Ben Shelton (1):
      ice: Do not check INTEVENT bit for OICR interrupts

Colin Ian King (1):
      ixgbevf: ensure xdp_ring resources are free'd on error exit

Cong Wang (3):
      netfilter: conntrack: silent a memory leak warning
      llc: delete timers synchronously in llc_sk_free()
      llc: fix NULL pointer deref for SOCK_ZAPPED

David S. Miller (6):
      Merge branch 's390-qeth-fixes'
      Merge branch 'net-sched-ife-malformed-ife-packet-fixes'
      Merge git://git.kernel.org/.../bpf/bpf
      Merge git://git.kernel.org/.../pablo/nf
      Merge branch 'amd-xgbe-fixes'
      Merge branch '1GbE' of git://git.kernel.org/.../jkirsher/net-queue

Doron Roberts-Kedes (1):
      strparser: Do not call mod_delayed_work with a timeout of LONG_MAX

Edward Cree (1):
      sfc: ARFS filter IDs

Eric Dumazet (1):
      ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy

Florian Fainelli (1):
      net: ethtool: Add missing kernel doc for FEC parameters

Florian Westphal (4):
      netfilter: nf_conntrack_sip: allow duplicate SDP expectations
      netfilter: ebtables: don't attempt to allocate 0-sized compat array
      netfilter: nf_tables: can't fail after linking rule into active rule list
      netfilter: nf_tables: free set name in error path

Guillaume Nault (3):
      l2tp: fix {pppol2tp, l2tp_dfs}_seq_stop() in case of seq_file overflow
      l2tp: check sockaddr length in pppol2tp_connect()
      pppoe: check sockaddr length in pppoe_connect()

Ivan Khoronzhuk (1):
      net: ethernet: ti: cpsw: fix tx vlan priority mapping

Jack Ma (1):
      netfilter: xt_connmark: Add bit mapping for bit-shift operation.

Jann Horn (2):
      bpf: sockmap remove dead check
      tcp: don't read out-of-bounds opsize

Jingju Hou (1):
      net: phy: marvell: clear wol event before setting it

Julian Anastasov (1):
      ipvs: fix rtnl_lock lockups caused by start_sync_thread

Julian Wiedmann (6):
      s390/qeth: fix error handling in adapter command callbacks
      s390/qeth: avoid control IO completion stalls
      s390/qeth: handle failure on workqueue creation
      s390/qeth: fix MAC address update sequence
      s390/qeth: fix request-side race during cmd IO timeout
      s390/qeth: use Read device to query hypervisor for MAC

Md Fahad Iqbal Polash (1):
      ice: Fix insufficient memory issue in ice_aq_manage_mac_read

Pablo Neira Ayuso (2):
      netfilter: nf_tables: NAT chain and extensions require NF_TABLES
      netfilter: xt_connmark: do not cast xt_connmark_tginfo1 to xt_connmark_tginfo2

Stephen Rothwell (1):
      netfilter: conntrack: include kmemleak.h for kmemleak_not_leak()

Taehee Yoo (1):
      netfilter: nf_tables: fix out-of-bounds in nft_chain_commit_update

Thomas Falcon (1):
      ibmvnic: Clean actual number of RX or TX pools

Tom Lendacky (3):
      amd-xgbe: Add pre/post auto-negotiation phy hooks
      amd-xgbe: Improve KR auto-negotiation and training
      amd-xgbe: Only use the SFP supported transceiver signals

Vinicius Costa Gomes (1):
      igb: Fix the transmission mode of queue 0 for Qav mode

Willem de Bruijn (1):
      packet: fix bitfield update race

Xin Long (2):
      bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
      team: fix netconsole setup over team

Yonghong Song (2):
      bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
      tools/bpf: fix test_sock and test_sock_addr.sh failure

 drivers/net/bonding/bond_main.c                   |   3 +-
 drivers/net/ethernet/amd/xgbe/xgbe-common.h       |   8 +++++
 drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c      |  16 +++++++++
 drivers/net/ethernet/amd/xgbe/xgbe-main.c         |   1 +
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c         |  24 ++++++++++---
 drivers/net/ethernet/amd/xgbe/xgbe-pci.c          |   2 ++
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c       | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
 drivers/net/ethernet/amd/xgbe/xgbe.h              |   9 +++++
 drivers/net/ethernet/ibm/ibmvnic.c                |   4 +--
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h   |   2 +-
 drivers/net/ethernet/intel/ice/ice_common.c       |  22 +++++++++---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h   |   2 --
 drivers/net/ethernet/intel/ice/ice_main.c         |   4 ---
 drivers/net/ethernet/intel/ice/ice_sched.c        |   4 +--
 drivers/net/ethernet/intel/igb/igb_main.c         |  17 ++++++++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   2 +-
 drivers/net/ethernet/sfc/ef10.c                   |  80 +++++++++++++++++++++++------------------
 drivers/net/ethernet/sfc/efx.c                    | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/efx.h                    |  21 +++++++++++
 drivers/net/ethernet/sfc/farch.c                  |  41 +++++++++++++++++----
 drivers/net/ethernet/sfc/net_driver.h             |  36 +++++++++++++++++++
 drivers/net/ethernet/sfc/rx.c                     |  62 +++++++++++++++++++++++++++++---
 drivers/net/ethernet/ti/cpsw.c                    |   2 +-
 drivers/net/phy/marvell.c                         |   9 +++++
 drivers/net/ppp/pppoe.c                           |   4 +++
 drivers/net/team/team.c                           |  19 ++++++----
 drivers/s390/net/qeth_core.h                      |   2 --
 drivers/s390/net/qeth_core_main.c                 | 158 ++++++++++++++++++++++++++++++++++++--------------------------------------------
 drivers/s390/net/qeth_core_mpc.h                  |  12 +++++++
 drivers/s390/net/qeth_l2_main.c                   |  59 ++++++++++++++++--------------
 include/linux/bpf.h                               |   4 +--
 include/linux/ethtool.h                           |   2 ++
 include/net/ife.h                                 |   3 +-
 include/net/llc_conn.h                            |   1 +
 kernel/bpf/core.c                                 |  45 ++++++++++++++---------
 kernel/bpf/sockmap.c                              |   3 --
 kernel/trace/bpf_trace.c                          |  25 ++++++++++---
 net/bridge/netfilter/ebtables.c                   |  11 +++---
 net/ife/ife.c                                     |  38 ++++++++++++++++++--
 net/ipv4/tcp_input.c                              |   7 ++--
 net/ipv6/netfilter/Kconfig                        |  55 ++++++++++++++--------------
 net/ipv6/route.c                                  |   2 ++
 net/ipv6/seg6_iptunnel.c                          |   2 +-
 net/l2tp/l2tp_debugfs.c                           |   5 ++-
 net/l2tp/l2tp_ppp.c                               |  12 ++++++-
 net/llc/af_llc.c                                  |  21 ++++++-----
 net/llc/llc_c_ac.c                                |   9 +----
 net/llc/llc_conn.c                                |  22 +++++++++++-
 net/netfilter/Kconfig                             |   1 +
 net/netfilter/ipvs/ip_vs_ctl.c                    |   8 -----
 net/netfilter/ipvs/ip_vs_sync.c                   | 155 ++++++++++++++++++++++++++++++++++++++++--------------------------------------
 net/netfilter/nf_conntrack_expect.c               |   5 ++-
 net/netfilter/nf_conntrack_extend.c               |   2 ++
 net/netfilter/nf_conntrack_sip.c                  |  16 ++++++---
 net/netfilter/nf_tables_api.c                     |  69 +++++++++++++++++++----------------
 net/netfilter/xt_connmark.c                       |  49 ++++++++++++++-----------
 net/packet/af_packet.c                            |  60 ++++++++++++++++++++++---------
 net/packet/internal.h                             |  10 +++---
 net/sched/act_ife.c                               |   9 +++--
 net/strparser/strparser.c                         |   2 +-
 tools/testing/selftests/bpf/.gitignore            |   3 ++
 tools/testing/selftests/bpf/test_sock.c           |   1 +
 tools/testing/selftests/bpf/test_sock_addr.c      |   1 +
 tools/testing/selftests/bpf/test_sock_addr.sh     |   4 +--
 64 files changed, 1163 insertions(+), 463 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next] liquidio: Swap VF representor Tx and Rx statistics
From: David Miller @ 2018-04-24 20:21 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	srinivasa.jampala
In-Reply-To: <20180424172327.GA7060@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Tue, 24 Apr 2018 10:23:27 -0700

> From: Srinivas Jampala <srinivasa.jampala@cavium.com>
> 
> Swap VF representor tx and rx interface statistics since it is a
> virtual switchdev port and tx for VM should be rx for VF representor
> and vice-versa.
> 
> Signed-off-by: Srinivas Jampala <srinivasa.jampala@cavium.com>
> Acked-by: Derek Chickles <derek.chickles@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()
From: David Miller @ 2018-04-24 20:19 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, dsahern
In-Reply-To: <20180424162249.41820-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 24 Apr 2018 09:22:49 -0700

> rt6_remove_exception_rt() is called under rcu_read_lock() only.
> 
> We lock rt6_exception_lock a bit later, so we do not hold
> rt6_exception_lock yet.
> 
> Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24
From: David Miller @ 2018-04-24 20:18 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <1524599024.23142.0.camel@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 24 Apr 2018 12:43:44 -0700

> On Tue, 2018-04-24 at 12:29 -0700, Jeff Kirsher wrote:
>> This series contains fixes to ixgbevf, igb and ice drivers.
>> 
>> Colin Ian King fixes the return value on error for the new XDP
>> support
>> that went into ixgbevf for 4.16.
> 
> Oops, I meant 4.17, not 4.16.

Pulled with this fixed, thanks Jeff.

^ permalink raw reply

* packetdrill 2.0 release
From: Neal Cardwell @ 2018-04-24 20:13 UTC (permalink / raw)
  To: packetdrill, Netdev

Hi All,

We're happy to announce the 2.0 release of the Google version of the
packetdrill network testing tool.

The code may be found at the packetdrill-v2.0 tag in the Google packetdrill
github repo:
   https://github.com/google/packetdrill

The commit is here:

https://github.com/google/packetdrill/commit/9a0ade62b7c8e3a19854b5855178dc3bb9d7f453

The 2.0 commit message, summarizing features and contributors, is included
below for a quick overview.

cheers,
neal

---
net-test: packetdrill: merge Google packetdrill changes through April 2018

This commit merges into Google's public packetdrill repository the
majority of the packetdrill tool changes made at Google in the period
2013-2018 (after the initial open source release of packetdrill).

Major features added in this commit include:

+ support for testing:
   + cmsg data
   + TCP send timestamping
   + TCP timestamping opt stats (TCP_NLA_BUSY and friends)
   + TCP zero-copy (e.g. see --send_omit_free)
   + TCP MD5 options
   + TCP urgent pointer field
   + experimental and RFC-compliant TCP Fast Open options
   + ICMP sockets
   + the IPv4 or IPv6 TOS field
   + IPv6 flow labels
   + in IPv6-only environments
+ wider system call support:
   + epoll system calls (epoll_create(), epoll_ctl(), epoll_wait())
   + pipe()
   + splice()
   + cap_set()
+ optional final clean-up commands for destructor-like tear-down
   commands that are basically always executed at termination, whether
   scripts fail or succeed
+ improved Python support:
   + exporting symbolic names for tcpi_state values
   + exporting recent additions to Linux struct tcp_info
   + exporting TCP_CC_INFO for Vegas, DCTCP, BBR
   + exporting SO_MEMINFO
+ the ability to test shared libraries that support the sockets API,
   rather than just the kernel sockets API (see packetdrill.h)
+ preprocessor-style symbol definitions, e.g. -Dfoo=bar
+ support for random local IP addresses

Willem de Bruijn spearheaded this effort to upstream this batch of
changes, and put in a huge amount of work to make this happen. I would
like to thank him for all his work on this.

I would also like to thank the following Googlers for their
contributions over the years to the packetdrill code base, which are
reflected in this patch:

   Wei Wang
   Maciej Żenczykowski
   Yuchung Cheng
   Eric Dumazet
   Soheil Hassas Yeganeh
   Dimitris Michailidis
   Willem de Bruijn
   Yaogong Wang
   Eric Salo
   Chonggang Li
   Priyaranjan Jha
   Andreas Terzis
   Xiao Jia
   Mike Maloney
   Barath Raghavan
   Yousuk Seung
   Nandita Dukkipati
   Michael Davidson
   Hsiao-keng Jerry Chu
   Greg Thelen
   Chema Gonzalez
   Luigi Rizzo
   Kevin Athey
   Jeff Grafton
   Francis Y. Yan
   Fabien Duchene
   Bill Sommerfeld
   Anatol Pomazau

This commit has been verified to build cleanly with the default gcc
compiler on the following Linux distributions:

   Debian 8
   Debian 9
   Red Hat Enterprise Linux 7.4
   Ubuntu 14.04
   Ubuntu 17.10

This commit has not been tested on or ported to any BSD variants, due
to lack of time among members of our team. We are happy to accept
patches to get it to compile/run on popular BSD variants.

^ permalink raw reply

* Re: Boot failures with net-next after rebase to v4.17.0-rc1
From: Linus Torvalds @ 2018-04-24 20:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, LKML, David Miller,
	Toke Høiland-Jørgensen, Paul E. McKenney, David Ahern
In-Reply-To: <20180424215429.1de8b1b3@redhat.com>

On Tue, Apr 24, 2018 at 12:54 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Hi all,
>
> I'm experiencing boot failures with net-next git-tree after it got
> rebased/merged with Linus'es tree at v4.17.0-rc1.

I suspect it's the global bit stuff that came in very late in the
merge window, and had been developed and tested for a while before,
but showed some problems under some configs.

The fix is currently in the x86/pti tree in -tip, see:

   x86/pti: Fix boot problems from Global-bit setting

and I expect it will percolate upstream soon.

In the meantime, it would be good to verify that merging that x86/pti
branch fixes it for you?

There is another candidate for boot problems - do you happen to have
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled? That can under certain
circumstances get a percpu setup page fault because memory hadn't been
initialized sufficiently.

The fix there is to move the mm_init() call one step earlier in
init_main(): start_kernel() (to before trap_init()).

And if it's neither of the above, I think you'll need to help bisect it.

               Linus

^ permalink raw reply

* Re: [PATCH] net/tls: remove redundant second null check on sgout
From: David Miller @ 2018-04-24 20:02 UTC (permalink / raw)
  To: colin.king
  Cc: ilyal, aviadye, davejwatson, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180424123658.6541-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue, 24 Apr 2018 13:36:58 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> A duplicated null check on sgout is redundant as it is known to be
> already true because of the identical earlier check. Remove it.
> Detected by cppcheck:
> 
> net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
> true.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH] fsl/fman_port: remove redundant check on port->rev_info.major
From: David Miller @ 2018-04-24 20:01 UTC (permalink / raw)
  To: colin.king; +Cc: madalin.bucur, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180424113945.16371-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue, 24 Apr 2018 12:39:45 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The check port->rev_info.major >= 6 is being performed twice, thus
> the inner second check is always true and is redundant, hence it
> can be removed. Detected by cppcheck.
> 
> drivers/net/ethernet/freescale/fman/fman_port.c:1394]: (warning)
> Identical inner 'if' condition is always true.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox