Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next PATCH 2/2] bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER
From: David Ahern @ 2017-08-16 22:06 UTC (permalink / raw)
  To: John Fastabend, daniel, davem, eric.dumazet; +Cc: netdev
In-Reply-To: <20170816220232.25438.47447.stgit@john-Precision-Tower-5810>

On 8/16/17 4:02 PM, John Fastabend wrote:
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index aa24287..897daa0 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -3,7 +3,10 @@ obj-y := core.o
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>  ifeq ($(CONFIG_NET),y)
> -obj-$(CONFIG_BPF_SYSCALL) += devmap.o sockmap.o
> +obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> +ifeq ($(CONFIG_STREAM_PARSER),y)
> +obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
> +endif
>  endif

sockmap requires KCM? I thought it just needed STREAM_PARSER. Can't
STREAM_PARSER be used outside of KCM? If so why not make STREAM_PARSER
enabled if sockmap is wanted vs requiring KCM to be compiled in to get
stream parser to get sockmap?

^ permalink raw reply

* Re: [net-next PATCH 1/2] bpf: sockmap state change warning fix
From: Daniel Borkmann @ 2017-08-16 22:10 UTC (permalink / raw)
  To: John Fastabend, davem, eric.dumazet, dsahern; +Cc: netdev
In-Reply-To: <20170816220211.25438.24547.stgit@john-Precision-Tower-5810>

On 08/17/2017 12:02 AM, John Fastabend wrote:
> psock will uninitialized in default case we need to do the same psock lookup
> and check as in other branch. Fixes compile warning below.
>
> kernel/bpf/sockmap.c: In function ‘smap_state_change’:
> kernel/bpf/sockmap.c:156:21: warning: ‘psock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>    struct smap_psock *psock;
>
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: David Miller <davem@davemloft.net>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Thanks for fixing up quickly!

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [net-next PATCH 2/2] bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER
From: Daniel Borkmann @ 2017-08-16 22:11 UTC (permalink / raw)
  To: John Fastabend, davem, eric.dumazet, dsahern; +Cc: netdev
In-Reply-To: <20170816220232.25438.47447.stgit@john-Precision-Tower-5810>

On 08/17/2017 12:02 AM, John Fastabend wrote:
> Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER
>
> net/core/filter.c: In function ‘do_sk_redirect_map’:
> net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration]
>     sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
>     ^
> net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default]
>     sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
>
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [net-next PATCH 2/2] bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER
From: John Fastabend @ 2017-08-16 22:20 UTC (permalink / raw)
  To: David Ahern, daniel, davem, eric.dumazet; +Cc: netdev
In-Reply-To: <1b280444-18dc-7207-1708-f651a1cf2a52@gmail.com>

On 08/16/2017 03:06 PM, David Ahern wrote:
> On 8/16/17 4:02 PM, John Fastabend wrote:
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index aa24287..897daa0 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -3,7 +3,10 @@ obj-y := core.o
>>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>>  ifeq ($(CONFIG_NET),y)
>> -obj-$(CONFIG_BPF_SYSCALL) += devmap.o sockmap.o
>> +obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>> +ifeq ($(CONFIG_STREAM_PARSER),y)
>> +obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
>> +endif
>>  endif
> 
> sockmap requires KCM? I thought it just needed STREAM_PARSER. Can't
> STREAM_PARSER be used outside of KCM? If so why not make STREAM_PARSER
> enabled if sockmap is wanted vs requiring KCM to be compiled in to get
> stream parser to get sockmap?
> 

KCM is not required all sockmap needs is STREAM_PARSER. STREAM_PARSER can
be used outside of KCM so no problem there.

I didn't want to require all users of BPF maps to automatically get
STREAM_PARSER however. I'll add a Kconfig option for STREAM_PARSER so we
can pull it in without KCM easily. This is very similar to how the cgroup
BPF build logic works as well with CONFIG_CGROUP_BPF.

I wanted to get the build fix out though as soon as possible rather than
wait for me to write the STREAM_PARSER patch and test it.

Thanks,
John

^ permalink raw reply

* RE: [PATCH net-next 1/3] VMCI: only load on VMware hypervisor
From: Dexuan Cui @ 2017-08-16 22:33 UTC (permalink / raw)
  To: Jorgen S. Hansen
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	gregkh@linuxfoundation.org, devel@linuxdriverproject.org,
	KY Srinivasan, Haiyang Zhang, Stephen Hemminger, George Zhang,
	Michal Kubecek, Asias He, Stefan Hajnoczi, Vitaly Kuznetsov,
	Cathy Avery, jasowang@redhat.com, Rolf Neugebauer, Dave Scott,
	Marcelo Cerri, apw@canonical.com
In-Reply-To: <E1772BF8-8A8D-4768-AA8C-67EE8FE21D10@vmware.com>

> From: Jorgen S. Hansen [mailto:jhansen@vmware.com]
> > Without the patch, vmw_vsock_vmci_transport.ko and vmw_vmci.ko can
> > automatically load when an application creates an AF_VSOCK socket.
> >
> > This is the expected good behavior on VMware hypervisor, but as we
> > are going to add hv_sock.ko (i.e. Hyper-V transport for AF_VSOCK), we
> > should make sure vmw_vsock_vmci_transport.ko doesn't load on Hyper-V,
> > otherwise there is a -EBUSY conflict when both vmw_vsock_vmci_transport.ko
> > and hv_sock.ko try to call vsock_core_init() on Hyper-V.
> 
> The VMCI driver (vmw_vmci.ko) is used both by the VMware guest support
> (VMware Tools primarily) and by our Workstation product. Always disabling the
> VMCI driver on Hyper-V means that user won’t be able to run Workstation
> nested in Linux VMs on Hyper-V. Since the VMCI driver itself isn’t the problem
> here, maybe we could move the check to vmw_vsock_vmci_transport.ko?
> Ideally, there should be some way for a user to have access to both protocols,
> but for now disabling the VMCI socket transport for Hyper-V (possibly with a
> module option to skip that check and always load it) but leaving the VMCI driver
> functional would be better,
> 
> Jorgen

Thank you for explaining the background!
Then I'll make a new patch, following your suggestion.

-- Dexuan

^ permalink raw reply

* Re: [net-next PATCH 0/2] bpf: sockmap build fixes
From: David Miller @ 2017-08-16 22:35 UTC (permalink / raw)
  To: john.fastabend; +Cc: daniel, eric.dumazet, dsahern, netdev
In-Reply-To: <20170816220049.25438.62373.stgit@john-Precision-Tower-5810>

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 16 Aug 2017 15:01:52 -0700

> Two build fixes for sockmap, this should resolve the build errors
> and warnings that were reported. Thanks everyone.

Series applied, thanks John.

^ permalink raw reply

* [PATCH net-next] net: ipv6: put host and anycast routes on device with address
From: David Ahern @ 2017-08-16 22:37 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji, hannes, David Ahern

One nagging difference between ipv4 and ipv6 is host routes for ipv6
addresses are installed using the loopback device or VRF / L3 Master
device. e.g.,

    2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium
    local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium

Using the loopback device is convenient -- necessary for local tx, but
has some nasty side effects, most notably setting the 'lo' device down
causes all host routes for all local IPv6 address to be removed from the
FIB and completely breaks IPv6 networking across all interfaces.

This patch puts FIB entries for IPv6 routes against the device. This
simplifies the routes in the FIB, for example by making dst->dev and
rt6i_idev->dev the same (a future patch can look at removing the device
reference taken for rt6i_idev for FIB entries). For example:

    $ ip -6 r ls table all | grep veth1
    2001:db8:99::/120 dev veth1 proto kernel metric 256 pref medium
    anycast 2001:db8:99:: dev veth1 table local proto kernel metric 0 pref medium
    local 2001:db8:99::1 dev veth1 table local proto kernel metric 0 pref medium

When copies are made on FIB lookups, the cloned route has dst->dev
set to loopback (or the L3 master device). This is needed for the
local Tx of packets to local addresses.

With fib entries allocated against the real network device, the addrconf
code that reinserts host routes on admin up of 'lo' is no longer needed.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/addrconf.c | 42 ------------------------------------------
 net/ipv6/route.c    | 46 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 34 insertions(+), 54 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 640792e1ecb7..45d0a24644de 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3030,9 +3030,6 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 static void init_loopback(struct net_device *dev)
 {
 	struct inet6_dev  *idev;
-	struct net_device *sp_dev;
-	struct inet6_ifaddr *sp_ifa;
-	struct rt6_info *sp_rt;
 
 	/* ::1 */
 
@@ -3045,45 +3042,6 @@ static void init_loopback(struct net_device *dev)
 	}
 
 	add_addr(idev, &in6addr_loopback, 128, IFA_HOST);
-
-	/* Add routes to other interface's IPv6 addresses */
-	for_each_netdev(dev_net(dev), sp_dev) {
-		if (!strcmp(sp_dev->name, dev->name))
-			continue;
-
-		idev = __in6_dev_get(sp_dev);
-		if (!idev)
-			continue;
-
-		read_lock_bh(&idev->lock);
-		list_for_each_entry(sp_ifa, &idev->addr_list, if_list) {
-
-			if (sp_ifa->flags & (IFA_F_DADFAILED | IFA_F_TENTATIVE))
-				continue;
-
-			if (sp_ifa->rt) {
-				/* This dst has been added to garbage list when
-				 * lo device down, release this obsolete dst and
-				 * reallocate a new router for ifa.
-				 */
-				if (!sp_ifa->rt->rt6i_node) {
-					ip6_rt_put(sp_ifa->rt);
-					sp_ifa->rt = NULL;
-				} else {
-					continue;
-				}
-			}
-
-			sp_rt = addrconf_dst_alloc(idev, &sp_ifa->addr, false);
-
-			/* Failure cases are ignored */
-			if (!IS_ERR(sp_rt)) {
-				sp_ifa->rt = sp_rt;
-				ip6_ins_rt(sp_rt);
-			}
-		}
-		read_unlock_bh(&idev->lock);
-	}
 }
 
 void addrconf_add_linklocal(struct inet6_dev *idev,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dc021ed6dd37..1864effcaf00 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -958,10 +958,34 @@ int ip6_ins_rt(struct rt6_info *rt)
 	return __ip6_ins_rt(rt, &info, &mxc, NULL);
 }
 
+/* called with rcu_lock held */
+static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt)
+{
+	struct net_device *dev = rt->dst.dev;
+
+	if (rt->rt6i_flags & RTF_LOCAL) {
+		/* for copies of local routes, dst->dev needs to be the
+		 * device if it is a master device, the master device if
+		 * device is enslaved, and the loopback as the default
+		 */
+		if (netif_is_l3_slave(dev) &&
+		    !rt6_need_strict(&rt->rt6i_dst.addr))
+			dev = l3mdev_master_dev_rcu(dev);
+		else if (!netif_is_l3_master(dev))
+			dev = dev_net(dev)->loopback_dev;
+		/* last case is netif_is_l3_master(dev) is true in which
+		 * case we want dev returned to be dev
+		 */
+	}
+
+	return dev;
+}
+
 static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 					   const struct in6_addr *daddr,
 					   const struct in6_addr *saddr)
 {
+	struct net_device *dev;
 	struct rt6_info *rt;
 
 	/*
@@ -971,8 +995,10 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 	if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU))
 		ort = (struct rt6_info *)ort->dst.from;
 
-	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev, 0);
-
+	rcu_read_lock();
+	dev = ip6_rt_get_dev_rcu(ort);
+	rt = __ip6_dst_alloc(dev_net(dev), dev, 0);
+	rcu_read_unlock();
 	if (!rt)
 		return NULL;
 
@@ -1000,11 +1026,13 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 
 static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt)
 {
+	struct net_device *dev;
 	struct rt6_info *pcpu_rt;
 
-	pcpu_rt = __ip6_dst_alloc(dev_net(rt->dst.dev),
-				  rt->dst.dev, rt->dst.flags);
-
+	rcu_read_lock();
+	dev = ip6_rt_get_dev_rcu(rt);
+	pcpu_rt = __ip6_dst_alloc(dev_net(dev), dev, rt->dst.flags);
+	rcu_read_unlock();
 	if (!pcpu_rt)
 		return NULL;
 	ip6_rt_copy_init(pcpu_rt, rt);
@@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 {
 	u32 tb_id;
 	struct net *net = dev_net(idev->dev);
-	struct net_device *dev = net->loopback_dev;
+	struct net_device *dev = idev->dev;
 	struct rt6_info *rt;
 
-	/* use L3 Master device as loopback for host routes if device
-	 * is enslaved and address is not link local or multicast
-	 */
-	if (!rt6_need_strict(addr))
-		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
-
 	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);
 	if (!rt)
 		return ERR_PTR(-ENOMEM);
-- 
2.1.4

^ permalink raw reply related

* Re: [net-next PATCH 0/2] bpf: sockmap build fixes
From: David Miller @ 2017-08-16 22:42 UTC (permalink / raw)
  To: john.fastabend; +Cc: daniel, eric.dumazet, dsahern, netdev
In-Reply-To: <20170816.153516.1392924482619263360.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 16 Aug 2017 15:35:16 -0700 (PDT)

> From: John Fastabend <john.fastabend@gmail.com>
> Date: Wed, 16 Aug 2017 15:01:52 -0700
> 
>> Two build fixes for sockmap, this should resolve the build errors
>> and warnings that were reported. Thanks everyone.
> 
> Series applied, thanks John.

Actually, now there is a new problem:

ERROR: "tcp_sendpage_locked" [net/ipv6/ipv6.ko] undefined!
ERROR: "tcp_sendmsg_locked" [net/ipv6/ipv6.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
make: *** [Makefile:1217: modules] Error 2

I added GPL exports for these, but please do an allmodconfig
build before you submit anything to me when there are weird
inter-module dependencies like this.

Thanks.

^ permalink raw reply

* Re: [net-next PATCH 0/2] bpf: sockmap build fixes
From: John Fastabend @ 2017-08-16 22:58 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, eric.dumazet, dsahern, netdev
In-Reply-To: <20170816.154204.2107336943249325947.davem@davemloft.net>

On 08/16/2017 03:42 PM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 16 Aug 2017 15:35:16 -0700 (PDT)
> 
>> From: John Fastabend <john.fastabend@gmail.com>
>> Date: Wed, 16 Aug 2017 15:01:52 -0700
>>
>>> Two build fixes for sockmap, this should resolve the build errors
>>> and warnings that were reported. Thanks everyone.
>>
>> Series applied, thanks John.
> 
> Actually, now there is a new problem:
> 
> ERROR: "tcp_sendpage_locked" [net/ipv6/ipv6.ko] undefined!
> ERROR: "tcp_sendmsg_locked" [net/ipv6/ipv6.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
> make: *** [Makefile:1217: modules] Error 2
> 
> I added GPL exports for these, but please do an allmodconfig
> build before you submit anything to me when there are weird
> inter-module dependencies like this.
> 
> Thanks.
> 

Thanks and allmodconfig noted.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t
From: Eric Dumazet @ 2017-08-16 23:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <1502911469.4936.148.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, 2017-08-16 at 12:24 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Please disregard. I will send a v2.

^ permalink raw reply

* [PATCH v2 net-next] ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t
From: Eric Dumazet @ 2017-08-16 23:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <1502911469.4936.148.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v2: fix a missing change in net/ipv4/fib_semantics.c

 include/net/dst.h        |    2 +-
 net/core/dst.c           |    6 +++---
 net/ipv4/fib_semantics.c |    4 ++--
 net/ipv4/route.c         |    4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index f73611ec401754d4f52b5310a24da53566dafce6..dd38177c3a61f5c4e48be9d57d4d10d6b7d14672 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -107,7 +107,7 @@ struct dst_entry {
 
 struct dst_metrics {
 	u32		metrics[RTAX_MAX];
-	atomic_t	refcnt;
+	refcount_t	refcnt;
 };
 extern const struct dst_metrics dst_default_metrics;
 
diff --git a/net/core/dst.c b/net/core/dst.c
index 00aa972ad1a1a451c24f3f8211243ad35c19433a..d6ead757c25895da01eb61bc9636c7e9b3cdfb3e 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -55,7 +55,7 @@ const struct dst_metrics dst_default_metrics = {
 	 * We really want to avoid false sharing on this variable, and catch
 	 * any writes on it.
 	 */
-	.refcnt = ATOMIC_INIT(1),
+	.refcnt = REFCOUNT_INIT(1),
 };
 
 void dst_init(struct dst_entry *dst, struct dst_ops *ops,
@@ -213,7 +213,7 @@ u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 		struct dst_metrics *old_p = (struct dst_metrics *)__DST_METRICS_PTR(old);
 		unsigned long prev, new;
 
-		atomic_set(&p->refcnt, 1);
+		refcount_set(&p->refcnt, 1);
 		memcpy(p->metrics, old_p->metrics, sizeof(p->metrics));
 
 		new = (unsigned long) p;
@@ -225,7 +225,7 @@ u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 			if (prev & DST_METRICS_READ_ONLY)
 				p = NULL;
 		} else if (prev & DST_METRICS_REFCOUNTED) {
-			if (atomic_dec_and_test(&old_p->refcnt))
+			if (refcount_dec_and_test(&old_p->refcnt))
 				kfree(old_p);
 		}
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d521caf57385fa05f76036708057b95052330cb1..394d800db50c77c21b65e14569eb4d8b5246406f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -220,7 +220,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
 	} endfor_nexthops(fi);
 
 	m = fi->fib_metrics;
-	if (m != &dst_default_metrics && atomic_dec_and_test(&m->refcnt))
+	if (m != &dst_default_metrics && refcount_dec_and_test(&m->refcnt))
 		kfree(m);
 	kfree(fi);
 }
@@ -1090,7 +1090,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
 			kfree(fi);
 			return ERR_PTR(err);
 		}
-		atomic_set(&fi->fib_metrics->refcnt, 1);
+		refcount_set(&fi->fib_metrics->refcnt, 1);
 	} else {
 		fi->fib_metrics = (struct dst_metrics *)&dst_default_metrics;
 	}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d400c05431063fc7bdd15b83ab540acc86decb3d..872b4cb136d3fa0cda403836cc83a156a65310a3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1398,7 +1398,7 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 	struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
 	struct rtable *rt = (struct rtable *) dst;
 
-	if (p != &dst_default_metrics && atomic_dec_and_test(&p->refcnt))
+	if (p != &dst_default_metrics && refcount_dec_and_test(&p->refcnt))
 		kfree(p);
 
 	if (!list_empty(&rt->rt_uncached)) {
@@ -1456,7 +1456,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 		dst_init_metrics(&rt->dst, fi->fib_metrics->metrics, true);
 		if (fi->fib_metrics != &dst_default_metrics) {
 			rt->dst._metrics |= DST_METRICS_REFCOUNTED;
-			atomic_inc(&fi->fib_metrics->refcnt);
+			refcount_inc(&fi->fib_metrics->refcnt);
 		}
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;

^ permalink raw reply related

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-16 23:27 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago
In-Reply-To: <1502914822.2796.6.camel@redhat.com>

On Wed, Aug 16, 2017 at 4:20 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Wed, 2017-08-16 at 11:18 -0400, Willem de Bruijn wrote:
>> > If I read the above correctly, you are arguining in favor of the
>> > addittional flag version, right?
>>
>> I was. Though if we are going to thread the argument from the caller
>> to __skb_try_recv_from_queue to avoid rereading sk->sk_peek_off,
>> on second thought it might be simpler to do it through off:
> [...]
>> This, of course, requires restricting sk_peek_off to protect against overflow.
>
> Ok, even if I'm not 100% sure overall this will be simpler when adding
> also the overflow check.

Actually, it is safe even without the check. Overflow of the signed integer
is benign here.

>> If I'm not mistaken, the test in udp_recvmsg currently incorrectly sets
>> peeking to false when peeking at offset zero:
>>
>>         peeking = off = sk_peek_offset(sk, flags);
>
> I think you are right, does not look correct.

By shifting the offset by two, we could even make both assignments
become correct. Return 0 without peek, 1 on peek without SO_PEEK_OFF,
2+ otherwise, including overflow up to INT_MIN + 1.

But the end result is more readable if we just separate those two
assignments.

@@ -1574,7 +1574,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len, int noblock,
                return ip_recv_error(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
+       off = sk_peek_offset(sk, flags);
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)
                return err;

@@ -362,7 +362,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len,
                return ipv6_recv_rxpmtu(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
+       off = sk_peek_offset(sk, flags);
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)
                return err;

At which point there is also no longer a need for the variable shift
at sk_peek_offset. Just pass the raw value down to
__skb_try_recv_from_queue and disambiguate there:

@@ -506,11 +506,8 @@ int sk_set_peek_off(struct sock *sk, int val);

 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
-       if (unlikely(flags & MSG_PEEK)) {
-               s32 off = READ_ONCE(sk->sk_peek_off);
-               if (off >= 0)
-                       return off;
-       }
+       if (unlikely(flags & MSG_PEEK))
+               return READ_ONCE(sk->sk_peek_off);

        return 0;
 }

@@ -169,14 +169,20 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
                                          int *peeked, int *off, int *err,
                                          struct sk_buff **last)
 {
+       bool peek_at_off = false;
        struct sk_buff *skb;
-       int _off = *off;
+       int _off = 0;
+
+       if (flags & MSG_PEEK && (*off) >= 0) {
+               peek_at_off = true;
+               _off = *off;
+       }

        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (peek_at_off && _off >= skb->len &&
+                           (skb->len || _off || skb->peeked)) {
                                _off -= skb->len;
                                continue;
                        }

^ permalink raw reply

* Re: [PATCH net-next 0/3] vmbus sendpacket cleanups
From: David Miller @ 2017-08-16 23:27 UTC (permalink / raw)
  To: stephen; +Cc: kys, haiyangz, sthemmin, devel, netdev
In-Reply-To: <20170816155626.16580-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed, 16 Aug 2017 08:56:23 -0700

> These patches remove and consolidate vmbus_sendpacket functions.
> 
> They should go through the net-next tree since these API's
> were only used by the netvsc driver.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net] ptr_ring: use kmalloc_array()
From: David Miller @ 2017-08-16 23:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, mst, jasowang
In-Reply-To: <1502905007.4936.133.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 16 Aug 2017 10:36:47 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> As found by syzkaller, malicious users can set whatever tx_queue_len
> on a tun device and eventually crash the kernel.
> 
> Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small
> ring buffer is not fast anyway.
> 
> Fixes: 2e0ab8ca83c1 ("ptr_ring: array based FIFO for pointers")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] ipv4: better IP_MAX_MTU enforcement
From: David Miller @ 2017-08-16 23:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1502906952.4936.140.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 16 Aug 2017 11:09:12 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While working on yet another syzkaller report, I found
> that our IP_MAX_MTU enforcements were not properly done.
> 
> gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
> final result can be bigger than IP_MAX_MTU :/
> 
> This is a problem because device mtu can be changed on other cpus or
> threads.
> 
> While this patch does not fix the issue I am working on, it is
> probably worth addressing it.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Also applied and queued up for -stable.

Thanks.

^ permalink raw reply

* Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile
From: kbuild test robot @ 2017-08-16 23:36 UTC (permalink / raw)
  To: mohamedalrshah
  Cc: kbuild-all, davem, netdev, torvalds, linux-kernel,
	Mohamed A . Alrshah, Mohamed Othman, Borhanuddin Ali,
	Zurina Hanapi
In-Reply-To: <20170815130806.25168-1-mohamed.a.alrshah@ieee.org>

[-- Attachment #1: Type: text/plain, Size: 1647 bytes --]

Hi mohamedalrshah,

[auto build test ERROR on net/master]
[also build test ERROR on v4.13-rc5 next-20170816]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/mohamedalrshah/Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile/20170817-055643
config: ia64-allyesconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

>> net//ipv4/tcp_agilesd.c:169:16: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
     .pkts_acked = agilesdtcp_acked,
                   ^~~~~~~~~~~~~~~~
   net//ipv4/tcp_agilesd.c:169:16: note: (near initialization for 'agilesdtcp.pkts_acked')
   cc1: some warnings being treated as errors

vim +169 net//ipv4/tcp_agilesd.c

   162	
   163	static struct tcp_congestion_ops agilesdtcp __read_mostly = {
   164		.init		= agilesdtcp_init,
   165		.ssthresh	= agilesdtcp_recalc_ssthresh,
   166		.cong_avoid	= agilesdtcp_cong_avoid,
   167		.set_state	= agilesdtcp_state,
   168		.undo_cwnd	= agilesdtcp_undo_cwnd,
 > 169		.pkts_acked	= agilesdtcp_acked,
   170		.owner		= THIS_MODULE,
   171		.name		= "agilesd",
   172	};
   173	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 51492 bytes --]

^ permalink raw reply

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-16 23:40 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago
In-Reply-To: <CAF=yD-JryS8g=8nB7yq9WVdCOjqSb7uNwncrRfWPmMQbdYrh3w@mail.gmail.com>

>  static inline int sk_peek_offset(struct sock *sk, int flags)
>  {
> -       if (unlikely(flags & MSG_PEEK)) {
> -               s32 off = READ_ONCE(sk->sk_peek_off);
> -               if (off >= 0)
> -                       return off;
> -       }
> +       if (unlikely(flags & MSG_PEEK))
> +               return READ_ONCE(sk->sk_peek_off);
>
>         return 0;
>  }

I noticed too late that this function is used in one case where the
result is not blindly passed to __skb_try_recv_from_queue.

The AF_UNIX stream case will need a floor on its offset.

-               skip = sk_peek_offset(sk, flags);
+               skip = max(sk_peek_offset(sk, flags), 0);

^ permalink raw reply

* Re: [PATCH v3] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-16 23:49 UTC (permalink / raw)
  To: Eric Garver
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	jbenc-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <20170816151528.GG27729@dev-rhel7>

On Wed, Aug 16, 2017 at 11:15:28PM +0800, Eric Garver wrote:
> On Wed, Aug 16, 2017 at 01:35:30PM +0800, Yi Yang wrote:
> > +
> > +#define NSH_DST_PORT    4790     /* UDP Port for NSH on VXLAN. */
> > +#define ETH_P_NSH       0x894F   /* Ethertype for NSH. */
> 
> ETH_P_NSH probably belongs in include/uapi/linux/if_ether.h with all the
> other ETH_P_* defines.
> 

Ok, I'll move it to include/uapi/linux/if_ether.h, but in userspace, we
still need to keep it in nsh.h.

> >  
> > +struct ovs_key_nsh {
> > +	__u8 flags;
> > +	__u8 mdtype;
> > +	__u8 np;
> > +	__u8 pad;
> > +	__be32 path_hdr;
> > +	__be32 context[NSH_MD1_CONTEXT_SIZE];
> > +};
> > +
> >  struct sw_flow_key {
> >  	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
> >  	u8 tun_opts_len;
> > @@ -144,6 +154,7 @@ struct sw_flow_key {
> >  			};
> >  		} ipv6;
> >  	};
> > +	struct ovs_key_nsh nsh;         /* network service header */
> 
> Are you intentionally not reserving space in the flow key for
> OVS_NSH_KEY_ATTR_MD2? I know it's not supported yet, but much of the
> code is stubbed out for it - just making sure this wasn't an oversight.
> 

For MD type 2, we'll reuse tun_metedata keys in struct flow_tnl which
will be reworked and it will be shared by NSH and GENEVE, so we won't
have new keys in "struct ovs_key_nsh" for MD type 2.

> >  	struct {
> >  		/* Connection tracking fields not packed above. */
> >  		struct {
> > diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> > index f07d10a..79059db 100644
> > --- a/net/openvswitch/flow_netlink.c
> > +++ b/net/openvswitch/flow_netlink.c
> > @@ -78,9 +78,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
> >  		case OVS_ACTION_ATTR_HASH:
> >  		case OVS_ACTION_ATTR_POP_ETH:
> >  		case OVS_ACTION_ATTR_POP_MPLS:
> > +		case OVS_ACTION_ATTR_POP_NSH:
> >  		case OVS_ACTION_ATTR_POP_VLAN:
> >  		case OVS_ACTION_ATTR_PUSH_ETH:
> >  		case OVS_ACTION_ATTR_PUSH_MPLS:
> > +		case OVS_ACTION_ATTR_PUSH_NSH:
> >  		case OVS_ACTION_ATTR_PUSH_VLAN:
> >  		case OVS_ACTION_ATTR_SAMPLE:
> >  		case OVS_ACTION_ATTR_SET:
> > @@ -322,12 +324,22 @@ size_t ovs_tun_key_attr_size(void)
> >  		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
> >  }
> >  
> > +size_t ovs_nsh_key_attr_size(void)
> > +{
> > +	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
> > +	 * updating this function.
> > +	 */
> > +	return  nla_total_size(8)      /* OVS_NSH_KEY_ATTR_BASE */
> > +		+ nla_total_size(16)   /* OVS_NSH_KEY_ATTR_MD1 */
> > +		+ nla_total_size(248); /* OVS_NSH_KEY_ATTR_MD2 */
> 
> _MD1 and _MD2 are mutually exclusive. You only need the larger of the
> two.
> 
> Maybe use OVS_PUSH_NSH_MAX_MD_LEN instead of 248.

Got it, will use nla_total_size(NSH_M_TYPE2_MAX_LEN - NSH_BASE_HDR_LEN)

> > +
> > +		switch (type) {
> > +		case OVS_NSH_KEY_ATTR_BASE: {
> > +			const struct ovs_nsh_key_base *base =
> > +				(struct ovs_nsh_key_base *)nla_data(a);
> > +			nsh->flags = base->flags;
> > +			nsh->mdtype = base->mdtype;
> > +			nsh->np = base->np;
> > +			nsh->path_hdr = base->path_hdr;
> > +			break;
> > +		}
> > +		case OVS_NSH_KEY_ATTR_MD1: {
> > +			const struct ovs_nsh_key_md1 *md1 =
> > +				(struct ovs_nsh_key_md1 *)nla_data(a);
> > +			memcpy(nsh->context, md1->context, sizeof(*md1));
> > +			break;
> > +		}
> > +		case OVS_NSH_KEY_ATTR_MD2:
> > +			/* Not supported yet */
> > +			break;
> 
> When we encounter these metadata attributes do we need to verify that
> nsh->mdtype is set correctly? Keep in mind we may parse ATTR_MD{1,2}
> before ATTR_BASE.

Good point, will add check code for this.

^ permalink raw reply

* Re: [PATCH v3] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-16 23:37 UTC (permalink / raw)
  To: Jiri Benc; +Cc: netdev, dev, e
In-Reply-To: <20170816160322.6a0def09@griffin>

On Wed, Aug 16, 2017 at 10:03:22PM +0800, Jiri Benc wrote:
> On Wed, 16 Aug 2017 17:31:30 +0800, Yang, Yi wrote:
> > On Wed, Aug 16, 2017 at 11:19:21AM +0200, Jiri Benc wrote:
> > > > --- a/include/uapi/linux/openvswitch.h
> > > > +++ b/include/uapi/linux/openvswitch.h
> > > [...]
> > > > +#define NSH_MD1_CONTEXT_SIZE 4
> > > 
> > > Please move this to nsh.h and use it there, too, instead of the open
> > > coded 4.
> > 
> > ovs code is very ugly, it will convert array[4] in
> > datapath/linux/compat/include/linux/openvswitch.h to other struct, I
> > have to change context[4] to such format :-), we can use 4 here for
> > Linux kernel.
> 
> Oh, right, this is uAPI and nsh.h is kernel internal. My suggestion was
> nonsense, let's keep it as it was in your patch.
> 
> > > > +	case OVS_KEY_ATTR_NSH: {
> > > > +		struct ovs_key_nsh nsh;
> > > > +		struct ovs_key_nsh nsh_mask;
> > > > +		size_t size = nla_len(a) / 2;
> > > > +		struct nlattr attr[1 + DIV_ROUND_UP(sizeof(struct ovs_key_ipv6)
> > > > +						    , sizeof(struct nlattr))];
> > > > +		struct nlattr mask[1 + DIV_ROUND_UP(sizeof(struct ovs_key_ipv6)
> > > > +						    , sizeof(struct nlattr))];
> > > > +
> > > > +		attr->nla_type = nla_type(a);
> > > > +		mask->nla_type = attr->nla_type;
> > > > +		attr->nla_len = NLA_HDRLEN + size;
> > > > +		mask->nla_len = attr->nla_len;
> > > > +		memcpy(attr + 1, (char *)(a + 1), size);
> > > > +		memcpy(mask + 1, (char *)(a + 1) + size, size);
> > > 
> > > This is too hacky. Please find a better way to handle this.
> > > 
> > > One option is to create a struct with struct nlattr as the first member
> > > followed by a char buffer. Still not nice but at least it's clear
> > > what's the intent.
> > 
> > The issue is nested attributes only can use this way, nested attribute
> > for SET_MASKED is very special, we have to handle it specially.
> 
> I'm not sure you understood what I meant. Let me explain in code:
> 
> 	struct {
> 		struct nlattr attr;
> 		struct ovs_key_ipv6 data;
> 	} attr, mask;
> 
> 	attr->attr.nla_type = nla_type(a);
> 	attr->attr.nl_len = NLA_HDRLEN + size;
> 	memcpy(&attr->data, a + 1, size);
> 
> It's much less hacky and doing the same thing.
> 
> I'm not sure we don't need verification of size not overflowing the
> available buffer. Is it checked beforehand elsewhere?
> 
> I also spotted one more bug: the 'mask' variable is not used anywhere.
> The second call of nsh_key_from_nlattr should use mask, not attr.
>

I have verified it in my real sfc test environment, but it is indeed
a bug, because mask and attr are same, so the result is still attr.

But here attr and mask shoud be of "struct nlattr attr", I'll polish it.
 
> > > > +	key->nsh.path_hdr = nsh->path_hdr;
> > > > +	switch (key->nsh.mdtype) {
> > > > +	case NSH_M_TYPE1:
> > > > +		if ((length << 2) != NSH_M_TYPE1_LEN)
> > > 
> > > Why length << 2?
> > 
> > len in NSH header is number of 4 octets, so need to multiply 4.
> 
> Should nsh_get_len take care of that, then?
> 

There are two helpers for this, nsh_hdr_len gets actual length,
nsh_get_len gets "len" field in nsh header.

> Thanks,
> 
>  Jiri

^ permalink raw reply

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Thiago Macieira @ 2017-08-16 23:55 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Paolo Abeni, Matthew Dawson, Network Development
In-Reply-To: <CAF=yD-JryS8g=8nB7yq9WVdCOjqSb7uNwncrRfWPmMQbdYrh3w@mail.gmail.com>

On Wednesday, 16 August 2017 16:27:17 PDT Willem de Bruijn wrote:
> Actually, it is safe even without the check. Overflow of the signed integer
> is benign here.

Usually, it's a bad idea to allow UB to happen.

Where is the overflow? I didn't see it in the patches so far.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply

* [PATCH next] neigh: initialize neigh entry correctly during arp processing
From: Mahesh Bandewar @ 2017-08-17  0:02 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, netdev, Mahesh Bandewar, Ido Schimmel,
	Hans Liljestrand, Kees Cook, Reshetova Elena, Sowmini Varadhan,
	Florian Westphal, Roopa Prabhu, Ihar Hrachyshka, David Ahern,
	Zhang Shengju, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

If the ARP processing creates a neigh entry, it's immediately marked
as STALE without timer and stays that way in that state as long as
host do not send traffic to that neighbour.

I observed this on hosts which are in IPv6 environment, where there is
very little to no IPv4 traffic and neigh-entries are stuck in STALE
mode. Ideally, the host should have PROBEd these neighbours before it
can send the first packet out.

It happens as a result of following call sequence in an environment
where host is mostly quiet as far as IPv4 traffic but few connected
hosts/gateways are sending ARPs.

   arp_process()
     neigh_event_ns()
       neigh_lookup()
         neigh_create()
           neigh_alloc()
             nud_state=NUD_NONE
     neigh_update(nud_state=NUD_STALE)

In the above scenario, the neighbour entry does not get a chance to get
PROBEd as subsequent call to neigh_update() marks  this entry STALE.
This patch initializes the neigh-entry correctly if it was created as a
result of neigh_lookup instead of just updating it in neigh_event_ns()
right after creating it.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 net/core/neighbour.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 16a1a4c4eb57..d8a35db6c43b 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1300,9 +1300,13 @@ struct neighbour *neigh_event_ns(struct neigh_table *tbl,
 {
 	struct neighbour *neigh = __neigh_lookup(tbl, saddr, dev,
 						 lladdr || !dev->addr_len);
-	if (neigh)
-		neigh_update(neigh, lladdr, NUD_STALE,
-			     NEIGH_UPDATE_F_OVERRIDE, 0);
+	if (neigh) {
+		if (neigh->nud_state & NUD_VALID)
+			neigh_update(neigh, lladdr, NUD_STALE,
+				     NEIGH_UPDATE_F_OVERRIDE, 0);
+		else
+			neigh_event_send(neigh, NULL);
+	}
 	return neigh;
 }
 EXPORT_SYMBOL(neigh_event_ns);
-- 
2.14.1.480.gb18f417b89-goog

^ permalink raw reply related

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-17  0:10 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Paolo Abeni, Matthew Dawson, Network Development
In-Reply-To: <2308911.oTjYErtyEh@tjmaciei-mobl1>

On Wed, Aug 16, 2017 at 7:55 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Wednesday, 16 August 2017 16:27:17 PDT Willem de Bruijn wrote:
>> Actually, it is safe even without the check. Overflow of the signed integer
>> is benign here.
>
> Usually, it's a bad idea to allow UB to happen.
>
> Where is the overflow? I didn't see it in the patches so far.

There isn't. Instead of using a shift, as I proposed earlier, the latest
patch just passes the raw value of sk_peek_off.

The shift was an attempt to disambiguate between the cases that
return zero in sk_peek_offset:
(a) peek without SO_PEEK_OFF:
      each consecutive peek must look at the first packet
(b) peek with offset zero:
      peek at the first packet, move forward with each call

After coding up a version that shifts by one to differentiate the two
cases of zero it was obvious that there is no need to shift at all:
the case without SO_PEEK_OFF is identified by all negative values
if sk_peek_offset just does not explicitly return 0 for this case.

Of course, then all callers need to be verified to correctly handle
negative values. __sk_try_recv_from_queue uses this information.
The only other cases, unix stream, does not care about the difference,
as the bytestream has no record separators.

What is UB in this context?

^ permalink raw reply

* Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile
From: kbuild test robot @ 2017-08-17  0:12 UTC (permalink / raw)
  To: mohamedalrshah
  Cc: kbuild-all, davem, netdev, torvalds, linux-kernel,
	Mohamed A . Alrshah, Mohamed Othman, Borhanuddin Ali,
	Zurina Hanapi
In-Reply-To: <20170815130806.25168-1-mohamed.a.alrshah@ieee.org>

[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]

Hi mohamedalrshah,

[auto build test ERROR on net/master]
[also build test ERROR on v4.13-rc5 next-20170816]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/mohamedalrshah/Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile/20170817-055643
config: i386-allyesconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> net//ipv4/tcp_agilesd.c:35:15: error: expected identifier before numeric constant
     enum dystate{SS=0, CA=1} agilesd_tcp_status;
                  ^~
   net//ipv4/tcp_agilesd.c: In function 'agilesdtcp_cong_avoid':
   net//ipv4/tcp_agilesd.c:87:28: error: 'CA' undeclared (first use in this function)
      ca->agilesd_tcp_status = CA;
                               ^~
   net//ipv4/tcp_agilesd.c:87:28: note: each undeclared identifier is reported only once for each function it appears in
   net//ipv4/tcp_agilesd.c: In function 'agilesdtcp_recalc_ssthresh':
   net//ipv4/tcp_agilesd.c:121:32: error: 'CA' undeclared (first use in this function)
     if (ca->agilesd_tcp_status == CA)
                                   ^~
   net//ipv4/tcp_agilesd.c: At top level:
   net//ipv4/tcp_agilesd.c:169:16: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
     .pkts_acked = agilesdtcp_acked,
                   ^~~~~~~~~~~~~~~~
   net//ipv4/tcp_agilesd.c:169:16: note: (near initialization for 'agilesdtcp.pkts_acked')
   cc1: some warnings being treated as errors

vim +35 net//ipv4/tcp_agilesd.c

    29	
    30	/* Agile-SD Parameters */
    31	struct agilesdtcp {
    32		u32	loss_cwnd;             /* congestion window at last loss.*/
    33		u32	frac_tracer;           /* This is to trace the fractions of the increment.*/
    34		u32	degraded_loss_cwnd;    /* loss_cwnd after degradation.*/
  > 35		enum	dystate{SS=0, CA=1} agilesd_tcp_status;
    36	};
    37	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 60442 bytes --]

^ permalink raw reply

* Re: [PATCH v2] i40e/i40evf: use cpumask_copy() for assigning cpumask
From: Jeff Kirsher @ 2017-08-17  0:24 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, netdev, intel-wired-lan; +Cc: stable
In-Reply-To: <20170816175257.9647-1-jgross@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1795 bytes --]

On Wed, 2017-08-16 at 19:52 +0200, Juergen Gross wrote:
> Using direct assignment for a cpumask is wrong, cpumask_copy() should
> be used instead.
> 
> Otherwise crashes like the following might happen:
> 
> [62792.326374] BUG: unable to handle kernel paging request at
> ffff8800049ff000
> [62792.340118] IP: [<ffffffffa2043341>]
> i40e_irq_affinity_notify+0x11/0x20 [i40e]
> ...
> [62792.810770] Call Trace:
> [62792.815722]  [<ffffffff810d77a5>] irq_affinity_notify+0xb5/0xf0
> [62792.827593]  [<ffffffff8109593e>] process_one_work+0x14e/0x410
> [62792.839282]  [<ffffffff81096196>] worker_thread+0x116/0x490
> [62792.850459]  [<ffffffff8109b667>] kthread+0xc7/0xe0
> [62792.860255]  [<ffffffff816094bf>] ret_from_fork+0x3f/0x70
> [62792.871996] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
> 
> Fixes: 96db776a3682 ("i40e/i40evf: fix interrupt affinity bug")
> Cc: <stable@vger.kernel.org> # 4.10+
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2: enhance commit message, merge patches
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c     | 2 +-
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

This is already resolved with a previous patch from Jacob Keller, see
the following commit in my tree:

commit f15ac286b0d111499e0fec4b50c8c870ad3b4573
Author: Jacob Keller <jacob.e.keller@intel.com>
Date:   Wed Aug 16 17:12:00 2017 -0700

    i40e: use cpumask_copy instead of direct assignment
    
    According to the header file cpumask.h, we shouldn't be directly
copying
    a cpumask_t, since its a bitmap and might not be copied correctly.
Lets
    use the provided cpumask_copy() function instead.
    
    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] i40e{,vf}: Fix out-of-bound cpumask read in IRQ affinity handler
From: Jeff Kirsher @ 2017-08-17  0:25 UTC (permalink / raw)
  To: Stefano Brivio, netdev, intel-wired-lan
  Cc: David S . Miller, Alan Brady, Stefan Assmann
In-Reply-To: <ae9c9586f61e914dc1c6fe2e6ac1fb2bf07283bc.1502792828.git.sbrivio@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4179 bytes --]

On Tue, 2017-08-15 at 12:30 +0200, Stefano Brivio wrote:
> The cpumask used in i40e{,vf}_irq_affinity_notify() is allocated
> by irq_affinity_notify() with alloc_cpumask_var(), which doesn't
> allocate NR_CPUS bits, but only nr_cpumask_bits bits. If we just
> dereference it, we'll read way more than what is allocated, e.g.
> 1024 bytes vs. 8 bytes allocated on x86_64 machine with 24 CPUs.
> 
> Use cpumask_copy() instead. A comprehensive explanation is given
> in the comments about cpumask_var_t, in include/linux/cpumask.h.
> 
> KASAN reports:
> [   25.242312] BUG: KASAN: slab-out-of-bounds in
> i40e_irq_affinity_notify+0x30/0x50 [i40e] at addr ffff880462eea960
> [   25.242315] Read of size 1024 by task kworker/2:1/170
> [   25.242322] CPU: 2 PID: 170 Comm: kworker/2:1 Not tainted 4.11.0-
> 22.el7a.x86_64 #1
> [   25.242325] Hardware name: HP ProLiant DL380 Gen9, BIOS P89
> 05/06/2015
> [   25.242336] Workqueue: events irq_affinity_notify
> [   25.242340] Call Trace:
> [   25.242350]  dump_stack+0x63/0x8d
> [   25.242358]  kasan_object_err+0x21/0x70
> [   25.242364]  kasan_report+0x288/0x540
> [   25.242397]  ? i40e_irq_affinity_notify+0x30/0x50 [i40e]
> [   25.242403]  check_memory_region+0x13c/0x1a0
> [   25.242408]  __asan_loadN+0xf/0x20
> [   25.242440]  i40e_irq_affinity_notify+0x30/0x50 [i40e]
> [   25.242446]  irq_affinity_notify+0x1b4/0x230
> [   25.242452]  ? irq_set_affinity_notifier+0x130/0x130
> [   25.242457]  ? kasan_slab_free+0x89/0xc0
> [   25.242466]  process_one_work+0x32f/0x6f0
> [   25.242472]  worker_thread+0x89/0x770
> [   25.242481]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> [   25.242488]  kthread+0x18c/0x1e0
> [   25.242493]  ? process_one_work+0x6f0/0x6f0
> [   25.242499]  ? kthread_create_on_node+0xc0/0xc0
> [   25.242506]  ret_from_fork+0x2c/0x40
> [   25.242511] Object at ffff880462eea960, in cache kmalloc-8 size: 8
> [   25.242513] Allocated:
> [   25.242514] PID = 170
> [   25.242522]  save_stack_trace+0x1b/0x20
> [   25.242529]  save_stack+0x46/0xd0
> [   25.242533]  kasan_kmalloc+0xad/0xe0
> [   25.242537]  __kmalloc_node+0x12c/0x2b0
> [   25.242542]  alloc_cpumask_var_node+0x3c/0x60
> [   25.242546]  alloc_cpumask_var+0xe/0x10
> [   25.242550]  irq_affinity_notify+0x94/0x230
> [   25.242555]  process_one_work+0x32f/0x6f0
> [   25.242559]  worker_thread+0x89/0x770
> [   25.242564]  kthread+0x18c/0x1e0
> [   25.242568]  ret_from_fork+0x2c/0x40
> [   25.242569] Freed:
> [   25.242570] PID = 0
> [   25.242572] (stack is not available)
> [   25.242573] Memory state around the buggy address:
> [   25.242578]  ffff880462eea800: fc fc 00 fc fc 00 fc fc 00 fc fc 00
> fc fc fb fc
> [   25.242582]  ffff880462eea880: fc fb fc fc fb fc fc 00 fc fc 00 fc
> fc 00 fc fc
> [   25.242586] >ffff880462eea900: 00 fc fc 00 fc fc 00 fc fc fb fc fc
> 00 fc fc fc
> [  
> 25.242588]                                                          
> ^
> [   25.242592]  ffff880462eea980: fc fc fc fc fc fc fc fc fc fc fc fc
> fc fc fc fc
> [   25.242596]  ffff880462eeaa00: fc fc fc fc fc fc fc fc fc fc fc fc
> fc fc fc fc
> [   25.242597]
> ==================================================================
> 
> Fixes: 96db776a3682 ("i40e/i40evf: fix interrupt affinity bug")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> This should be considered for -stable, back to 4.10.
> 
>  drivers/net/ethernet/intel/i40e/i40e_main.c     | 2 +-
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

This is already resolved with a previous patch from Jacob Keller, see
the following commit in my tree:

commit f15ac286b0d111499e0fec4b50c8c870ad3b4573
Author: Jacob Keller <jacob.e.keller@intel.com>
Date:   Wed Aug 16 17:12:00 2017 -0700

    i40e: use cpumask_copy instead of direct assignment
    
    According to the header file cpumask.h, we shouldn't be directly
copying
    a cpumask_t, since its a bitmap and might not be copied correctly.
Lets
    use the provided cpumask_copy() function instead.
    
    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox