[PATCHES] Networking - David Miller

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: David Miller <davem@davemloft.net>
To: stable@vger.kernel.org
Subject: [PATCHES] Networking
Date: Mon, 28 Sep 2015 21:54:08 -0700 (PDT)	[thread overview]
Message-ID: <20150928.215408.458905239999743126.davem@davemloft.net> (raw)

[-- Attachment #1: Type: Text/Plain, Size: 117 bytes --]


Please queue up the following networking bug fixes for v3.14, v3.18,
v4.1, and v4.2 -stable, respectively.

Thanks!

[-- Attachment #2: net_314.mbox --]
[-- Type: Application/Octet-Stream, Size: 24549 bytes --]

From 64a1954f6811ea50b5fc7477d1ac43cb8bd054b5 Mon Sep 17 00:00:00 2001
From: huaibin Wang <huaibin.wang@6wind.com>
Date: Tue, 25 Aug 2015 16:20:34 +0200
Subject: [PATCH 1/8] ip6_gre: release cached dst on tunnel removal

[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4a230b1..baffa3b 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -357,6 +357,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev)
 	struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
 
 	ip6gre_tunnel_unlink(ign, netdev_priv(dev));
+	ip6_tnl_dst_reset(netdev_priv(dev));
 	dev_put(dev);
 }
 
-- 
2.1.0


From 85f774be138b8b06c2fbd88cfdaabc1899ee9704 Mon Sep 17 00:00:00 2001
From: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Date: Mon, 24 Aug 2015 23:13:42 +0300
Subject: [PATCH 2/8] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared

[ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ]

It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
usbnet_stop(), but its value should be read before it is cleared
when dev->flags is set to 0.

The problem was spotted and the fix was provided by
Oliver Neukum <oneukum@suse.de>.

Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/usb/usbnet.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index f9e96c4..23dc131 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -778,7 +778,7 @@ int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			retval, pm;
+	int			retval, pm, mpn;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 	netif_stop_queue (net);
@@ -809,6 +809,8 @@ int usbnet_stop (struct net_device *net)
 
 	usbnet_purge_paused_rxq(dev);
 
+	mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);
+
 	/* deferred work (task, timer, softirq) must also stop.
 	 * can't flush_scheduled_work() until we drop rtnl (later),
 	 * else workers could deadlock; so make workers a NOP.
@@ -819,8 +821,7 @@ int usbnet_stop (struct net_device *net)
 	if (!pm)
 		usb_autopm_put_interface(dev->intf);
 
-	if (info->manage_power &&
-	    !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags))
+	if (info->manage_power && mpn)
 		info->manage_power(dev, 0);
 	else
 		usb_autopm_put_interface(dev->intf);
-- 
2.1.0


From 43fdbd6841747b2f924c2a108f70c24871c7f858 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 3 Sep 2015 00:29:07 +0200
Subject: [PATCH 3/8] ipv6: fix exthdrs offload registration in out_rt path

[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
2.1.0


From 311ab9d79c1add43ad8e76e4562d9132b1c9759c Mon Sep 17 00:00:00 2001
From: Richard Laing <richard.laing@alliedtelesis.co.nz>
Date: Thu, 3 Sep 2015 13:52:31 +1200
Subject: [PATCH 4/8] net/ipv6: Correct PIM6 mrt_lock handling

[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 8737400..821d8df 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -552,7 +552,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 
-- 
2.1.0


From 63be25ac5910449798b3e545431d4d8babaf8ec3 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 10 Sep 2015 20:05:46 +0200
Subject: [PATCH 5/8] netlink, mmap: transform mmap skb into full skb on taps

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
 net/netlink/af_netlink.h |  9 +++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index a0b0ea9..28119f6 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -115,6 +115,24 @@ static inline struct hlist_head *nl_portid_hashfn(struct nl_portid_hash *hash, u
 	return &hash->table[jhash_1word(portid, hash->rnd) & hash->mask];
 }
 
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+					   gfp_t gfp_mask)
+{
+	unsigned int len = skb_end_offset(skb);
+	struct sk_buff *new;
+
+	new = alloc_skb(len, gfp_mask);
+	if (new == NULL)
+		return NULL;
+
+	NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+	NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+	NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+	memcpy(skb_put(new, len), skb->data, len);
+	return new;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -199,7 +217,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	int ret = -ENOMEM;
 
 	dev_hold(dev);
-	nskb = skb_clone(skb, GFP_ATOMIC);
+
+	if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	else
+		nskb = skb_clone(skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -271,11 +293,6 @@ static void netlink_rcv_wake(struct sock *sk)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
-	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
 static bool netlink_rx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -812,7 +829,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
 }
 
 #else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb)	false
 #define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index acbd774..dcc89c7 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -65,6 +65,15 @@ struct nl_portid_hash {
 	u32			rnd;
 };
 
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+	return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
 struct netlink_table {
 	struct nl_portid_hash	hash;
 	struct hlist_head	mc_list;
-- 
2.1.0


From 4d926a5772e7c840bfce1a43d52e243ab57135a1 Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Thu, 10 Sep 2015 17:31:15 -0300
Subject: [PATCH 6/8] sctp: fix race on protocol/netns initialization

[ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

Consider sctp module is unloaded and is being requested because an user
is creating a sctp socket.

During initialization, sctp will add the new protocol type and then
initialize pernet subsys:

        status = sctp_v4_protosw_init();
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_net_ops);

The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets like if the module is
already fully loaded. If that happens, one of the possible effects is
that we will have readers for net->sctp.local_addr_list list earlier
than expected and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.

The race happens like this:

     CPU 0                           |  CPU 1
  socket()                           |
   __sock_create                     | socket()
    inet_create                      |  __sock_create
     list_for_each_entry_rcu(        |
        answer, &inetsw[sock->type], |
        list) {                      |   inet_create
      /* no hits */                  |
     if (unlikely(err)) {            |
      ...                            |
      request_module()               |
      /* socket creation is blocked  |
       * the module is fully loaded  |
       */                            |
       sctp_init                     |
        sctp_v4_protosw_init         |
         inet_register_protosw       |
          list_add_rcu(&p->list,     |
                       last_perm);   |
                                     |  list_for_each_entry_rcu(
                                     |     answer, &inetsw[sock->type],
        sctp_v6_protosw_init         |     list) {
                                     |     /* hit, so assumes protocol
                                     |      * is already loaded
                                     |      */
                                     |  /* socket creation continues
                                     |   * before netns is initialized
                                     |   */
        register_pernet_subsys       |

Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a work-queue is not good specially because we loose the
ability to handle its errors.

So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.

Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sctp/protocol.c | 64 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index a62a215..1237a23 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1167,7 +1167,7 @@ static void sctp_v4_del_protocol(void)
 	unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
 }
 
-static int __net_init sctp_net_init(struct net *net)
+static int __net_init sctp_defaults_init(struct net *net)
 {
 	int status;
 
@@ -1260,12 +1260,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	sctp_dbg_objcnt_init(net);
 
-	/* Initialize the control inode/socket for handling OOTB packets.  */
-	if ((status = sctp_ctl_sock_init(net))) {
-		pr_err("Failed to initialize the SCTP control sock\n");
-		goto err_ctl_sock_init;
-	}
-
 	/* Initialize the local address list. */
 	INIT_LIST_HEAD(&net->sctp.local_addr_list);
 	spin_lock_init(&net->sctp.local_addr_lock);
@@ -1281,9 +1275,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	return 0;
 
-err_ctl_sock_init:
-	sctp_dbg_objcnt_exit(net);
-	sctp_proc_exit(net);
 err_init_proc:
 	cleanup_sctp_mibs(net);
 err_init_mibs:
@@ -1292,15 +1283,12 @@ err_sysctl_register:
 	return status;
 }
 
-static void __net_exit sctp_net_exit(struct net *net)
+static void __net_exit sctp_defaults_exit(struct net *net)
 {
 	/* Free the local address list */
 	sctp_free_addr_wq(net);
 	sctp_free_local_addr_list(net);
 
-	/* Free the control endpoint.  */
-	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-
 	sctp_dbg_objcnt_exit(net);
 
 	sctp_proc_exit(net);
@@ -1308,9 +1296,32 @@ static void __net_exit sctp_net_exit(struct net *net)
 	sctp_sysctl_net_unregister(net);
 }
 
-static struct pernet_operations sctp_net_ops = {
-	.init = sctp_net_init,
-	.exit = sctp_net_exit,
+static struct pernet_operations sctp_defaults_ops = {
+	.init = sctp_defaults_init,
+	.exit = sctp_defaults_exit,
+};
+
+static int __net_init sctp_ctrlsock_init(struct net *net)
+{
+	int status;
+
+	/* Initialize the control inode/socket for handling OOTB packets.  */
+	status = sctp_ctl_sock_init(net);
+	if (status)
+		pr_err("Failed to initialize the SCTP control sock\n");
+
+	return status;
+}
+
+static void __net_init sctp_ctrlsock_exit(struct net *net)
+{
+	/* Free the control endpoint.  */
+	inet_ctl_sock_destroy(net->sctp.ctl_sock);
+}
+
+static struct pernet_operations sctp_ctrlsock_ops = {
+	.init = sctp_ctrlsock_init,
+	.exit = sctp_ctrlsock_exit,
 };
 
 /* Initialize the universe into something sensible.  */
@@ -1444,8 +1455,11 @@ static __init int sctp_init(void)
 	sctp_v4_pf_init();
 	sctp_v6_pf_init();
 
-	status = sctp_v4_protosw_init();
+	status = register_pernet_subsys(&sctp_defaults_ops);
+	if (status)
+		goto err_register_defaults;
 
+	status = sctp_v4_protosw_init();
 	if (status)
 		goto err_protosw_init;
 
@@ -1453,9 +1467,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_protosw_init;
 
-	status = register_pernet_subsys(&sctp_net_ops);
+	status = register_pernet_subsys(&sctp_ctrlsock_ops);
 	if (status)
-		goto err_register_pernet_subsys;
+		goto err_register_ctrlsock;
 
 	status = sctp_v4_add_protocol();
 	if (status)
@@ -1471,12 +1485,14 @@ out:
 err_v6_add_protocol:
 	sctp_v4_del_protocol();
 err_add_protocol:
-	unregister_pernet_subsys(&sctp_net_ops);
-err_register_pernet_subsys:
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
+err_register_ctrlsock:
 	sctp_v6_protosw_exit();
 err_v6_protosw_init:
 	sctp_v4_protosw_exit();
 err_protosw_init:
+	unregister_pernet_subsys(&sctp_defaults_ops);
+err_register_defaults:
 	sctp_v4_pf_exit();
 	sctp_v6_pf_exit();
 	sctp_sysctl_unregister();
@@ -1509,12 +1525,14 @@ static __exit void sctp_exit(void)
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
 
-	unregister_pernet_subsys(&sctp_net_ops);
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
 
 	/* Free protosw registrations */
 	sctp_v6_protosw_exit();
 	sctp_v4_protosw_exit();
 
+	unregister_pernet_subsys(&sctp_defaults_ops);
+
 	/* Unregister with socket layer. */
 	sctp_v6_pf_exit();
 	sctp_v4_pf_exit();
-- 
2.1.0


From c746e0a10bd5da46f589bdaffa17575ba14df439 Mon Sep 17 00:00:00 2001
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 21 Sep 2015 20:21:20 -0700
Subject: [PATCH 7/8] openvswitch: Zero flows on allocation.

[ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/openvswitch/datapath.c   |  2 +-
 net/openvswitch/flow_table.c | 21 ++++++++++++---------
 net/openvswitch/flow_table.h |  2 +-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 270b77d..8bb1a5a 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -803,7 +803,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 		if (IS_ERR(acts))
 			goto error;
 
-		ovs_flow_mask_key(&masked_key, &key, &mask);
+		ovs_flow_mask_key(&masked_key, &key, true, &mask);
 		error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS],
 					     &masked_key, 0, &acts);
 		if (error) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 3c268b3..4877d5a 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -55,18 +55,21 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 }
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask)
+		       bool full, const struct sw_flow_mask *mask)
 {
-	const long *m = (long *)((u8 *)&mask->key + mask->range.start);
-	const long *s = (long *)((u8 *)src + mask->range.start);
-	long *d = (long *)((u8 *)dst + mask->range.start);
+	int start = full ? 0 : mask->range.start;
+	int len = full ? sizeof *dst : range_n_bytes(&mask->range);
+	const long *m = (const long *)((const u8 *)&mask->key + start);
+	const long *s = (const long *)((const u8 *)src + start);
+	long *d = (long *)((u8 *)dst + start);
 	int i;
 
-	/* The memory outside of the 'mask->range' are not set since
-	 * further operations on 'dst' only uses contents within
-	 * 'mask->range'.
+	/* If 'full' is true then all of 'dst' is fully initialized. Otherwise,
+	 * if 'full' is false the memory outside of the 'mask->range' is left
+	 * uninitialized. This can be used as an optimization when further
+	 * operations on 'dst' only use contents within 'mask->range'.
 	 */
-	for (i = 0; i < range_n_bytes(&mask->range); i += sizeof(long))
+	for (i = 0; i < len; i += sizeof(long))
 		*d++ = *s++ & *m++;
 }
 
@@ -436,7 +439,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	u32 hash;
 	struct sw_flow_key masked_key;
 
-	ovs_flow_mask_key(&masked_key, unmasked, mask);
+	ovs_flow_mask_key(&masked_key, unmasked, false, mask);
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index baaeb10..82e64a9 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -79,5 +79,5 @@ bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 			       struct sw_flow_match *match);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask);
+		       bool full, const struct sw_flow_mask *mask);
 #endif /* flow_table.h */
-- 
2.1.0


From 6c4dea7ae27cd5e42bf2176211070c6846893129 Mon Sep 17 00:00:00 2001
From: Wilson Kok <wkok@cumulusnetworks.com>
Date: Tue, 22 Sep 2015 21:40:22 -0700
Subject: [PATCH 8/8] fib_rules: fix fib rule dumps across multiple skbs

[ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/fib_rules.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 185c341..aeedc3a 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -621,15 +621,17 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int idx = 0;
 	struct fib_rule *rule;
+	int err = 0;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
 		if (idx < cb->args[1])
 			goto skip;
 
-		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
-				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
-				     NLM_F_MULTI, ops) < 0)
+		err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
+				       cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				       NLM_F_MULTI, ops);
+		if (err)
 			break;
 skip:
 		idx++;
@@ -638,7 +640,7 @@ skip:
 	cb->args[1] = idx;
 	rules_ops_put(ops);
 
-	return skb->len;
+	return err;
 }
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
@@ -654,7 +656,9 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
-		return dump_rules(skb, cb, ops);
+		dump_rules(skb, cb, ops);
+
+		return skb->len;
 	}
 
 	rcu_read_lock();
-- 
2.1.0


[-- Attachment #3: net_318.mbox --]
[-- Type: Application/Octet-Stream, Size: 32017 bytes --]

From 786c86281cef8a44a3a49db9bcbb4dd04039c802 Mon Sep 17 00:00:00 2001
From: huaibin Wang <huaibin.wang@6wind.com>
Date: Tue, 25 Aug 2015 16:20:34 +0200
Subject: [PATCH 01/12] ip6_gre: release cached dst on tunnel removal

[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 0e32d2e..fe85d46 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -361,6 +361,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev)
 	struct ip6gre_net *ign = net_generic(t->net, ip6gre_net_id);
 
 	ip6gre_tunnel_unlink(ign, t);
+	ip6_tnl_dst_reset(t);
 	dev_put(dev);
 }
 
-- 
2.1.0


From 330448ee09755fa2a746e2165a59ea971a7179d5 Mon Sep 17 00:00:00 2001
From: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Date: Mon, 24 Aug 2015 23:13:42 +0300
Subject: [PATCH 02/12] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is
 cleared

[ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ]

It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
usbnet_stop(), but its value should be read before it is cleared
when dev->flags is set to 0.

The problem was spotted and the fix was provided by
Oliver Neukum <oneukum@suse.de>.

Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/usb/usbnet.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index e7ed251..7a59893 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -779,7 +779,7 @@ int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			retval, pm;
+	int			retval, pm, mpn;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 	netif_stop_queue (net);
@@ -810,6 +810,8 @@ int usbnet_stop (struct net_device *net)
 
 	usbnet_purge_paused_rxq(dev);
 
+	mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);
+
 	/* deferred work (task, timer, softirq) must also stop.
 	 * can't flush_scheduled_work() until we drop rtnl (later),
 	 * else workers could deadlock; so make workers a NOP.
@@ -820,8 +822,7 @@ int usbnet_stop (struct net_device *net)
 	if (!pm)
 		usb_autopm_put_interface(dev->intf);
 
-	if (info->manage_power &&
-	    !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags))
+	if (info->manage_power && mpn)
 		info->manage_power(dev, 0);
 	else
 		usb_autopm_put_interface(dev->intf);
-- 
2.1.0


From aaa82c63e101e99c4efa2537695c260a2245f1d7 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 3 Sep 2015 00:29:07 +0200
Subject: [PATCH 03/12] ipv6: fix exthdrs offload registration in out_rt path

[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
2.1.0


From f86594cbe2531cc440b9270d7714e178c1af4d17 Mon Sep 17 00:00:00 2001
From: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Wed, 2 Sep 2015 17:24:14 +0800
Subject: [PATCH 04/12] net: fec: clear receive interrupts before processing a
 packet

[ Upstream commit ed63f1dcd5788d36f942fbcce350742385e3e18c ]

The patch just to re-submit the patch "db3421c114cfa6326" because the
patch "4d494cdc92b3b9a0" remove the change.

Clear any pending receive interrupt before we process a pending packet.
This helps to avoid any spurious interrupts being raised after we have
fully cleaned the receive ring, while still allowing an interrupt to be
raised if we receive another packet.

The position of this is critical: we must do this prior to reading the
next packet status to avoid potentially dropping an interrupt when a
packet is still pending.

Acked-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/freescale/fec_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 96ba23e..56fa3d5 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1390,6 +1390,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
 		if ((status & BD_ENET_RX_LAST) == 0)
 			netdev_err(ndev, "rcv is not +last\n");
 
+		writel(FEC_ENET_RXF, fep->hwp + FEC_IEVENT);
 
 		/* Check for errors. */
 		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
-- 
2.1.0


From d9b0ef90d9eeddb96d72a0a75799a6f1f768b188 Mon Sep 17 00:00:00 2001
From: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Date: Wed, 2 Sep 2015 17:49:29 +0900
Subject: [PATCH 05/12] net: eth: altera: fix napi poll_list corruption

[ Upstream commit 4548a697e4969d695047cebd6d9af5e2f6cc728e ]

tse_poll() calls __napi_complete() with irq enabled.  This leads napi
poll_list corruption and may stop all napi drivers working.
Use napi_complete() instead of __napi_complete().

Signed-off-by: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/altera/altera_tse_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index 4efc435..2eb6404 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -501,8 +501,7 @@ static int tse_poll(struct napi_struct *napi, int budget)
 	if (rxcomplete >= budget || txcomplete > 0)
 		return rxcomplete;
 
-	napi_gro_flush(napi, false);
-	__napi_complete(napi);
+	napi_complete(napi);
 
 	netdev_dbg(priv->dev,
 		   "NAPI Complete, did %d packets with budget %d\n",
-- 
2.1.0


From 97aa79a462e1f64861eb6bb75f2220bfaa4bb780 Mon Sep 17 00:00:00 2001
From: Richard Laing <richard.laing@alliedtelesis.co.nz>
Date: Thu, 3 Sep 2015 13:52:31 +1200
Subject: [PATCH 06/12] net/ipv6: Correct PIM6 mrt_lock handling

[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 1a01d79..0d58542 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -552,7 +552,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 
-- 
2.1.0


From bd2de867630d2850e344828be1517e3cf1c1354f Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 8 Sep 2015 20:06:41 -0700
Subject: [PATCH 07/12] net: dsa: bcm_sf2: Fix 64-bits register writes

[ Upstream commit 03679a14739a0d4c14b52ba65a69ff553bfba73b ]

The macro to write 64-bits quantities to the 32-bits register swapped
the value and offsets arguments, we want to preserve the ordering of the
arguments with respect to how writel() is implemented for instance:
value first, offset/base second.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index ee9f650..3ecfda8 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -110,8 +110,8 @@ static inline u64 name##_readq(struct bcm_sf2_priv *priv, u32 off)	\
 	spin_unlock(&priv->indir_lock);					\
 	return (u64)indir << 32 | dir;					\
 }									\
-static inline void name##_writeq(struct bcm_sf2_priv *priv, u32 off,	\
-							u64 val)	\
+static inline void name##_writeq(struct bcm_sf2_priv *priv, u64 val,	\
+							u32 off)	\
 {									\
 	spin_lock(&priv->indir_lock);					\
 	reg_writel(priv, upper_32_bits(val), REG_DIR_DATA_WRITE);	\
-- 
2.1.0


From 1b44705be0bc71eb59e380b31fa824a4e7d607de Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 10 Sep 2015 20:05:46 +0200
Subject: [PATCH 08/12] netlink, mmap: transform mmap skb into full skb on taps

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
 net/netlink/af_netlink.h |  9 +++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c82b2e3..da0b73a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -133,6 +133,24 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+					   gfp_t gfp_mask)
+{
+	unsigned int len = skb_end_offset(skb);
+	struct sk_buff *new;
+
+	new = alloc_skb(len, gfp_mask);
+	if (new == NULL)
+		return NULL;
+
+	NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+	NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+	NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+	memcpy(skb_put(new, len), skb->data, len);
+	return new;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -215,7 +233,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	int ret = -ENOMEM;
 
 	dev_hold(dev);
-	nskb = skb_clone(skb, GFP_ATOMIC);
+
+	if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	else
+		nskb = skb_clone(skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -287,11 +309,6 @@ static void netlink_rcv_wake(struct sock *sk)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
-	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
 static bool netlink_rx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -828,7 +845,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
 }
 
 #else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb)	false
 #define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index b20a173..3951874 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -57,6 +57,15 @@ static inline struct netlink_sock *nlk_sk(struct sock *sk)
 	return container_of(sk, struct netlink_sock, sk);
 }
 
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+	return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
 struct netlink_table {
 	struct rhashtable	hash;
 	struct hlist_head	mc_list;
-- 
2.1.0


From fd90cd689a1a8d3c28b03ab1e4e5ed5c8fbdd0f7 Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Thu, 10 Sep 2015 17:31:15 -0300
Subject: [PATCH 09/12] sctp: fix race on protocol/netns initialization

[ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

Consider sctp module is unloaded and is being requested because an user
is creating a sctp socket.

During initialization, sctp will add the new protocol type and then
initialize pernet subsys:

        status = sctp_v4_protosw_init();
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_net_ops);

The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets like if the module is
already fully loaded. If that happens, one of the possible effects is
that we will have readers for net->sctp.local_addr_list list earlier
than expected and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.

The race happens like this:

     CPU 0                           |  CPU 1
  socket()                           |
   __sock_create                     | socket()
    inet_create                      |  __sock_create
     list_for_each_entry_rcu(        |
        answer, &inetsw[sock->type], |
        list) {                      |   inet_create
      /* no hits */                  |
     if (unlikely(err)) {            |
      ...                            |
      request_module()               |
      /* socket creation is blocked  |
       * the module is fully loaded  |
       */                            |
       sctp_init                     |
        sctp_v4_protosw_init         |
         inet_register_protosw       |
          list_add_rcu(&p->list,     |
                       last_perm);   |
                                     |  list_for_each_entry_rcu(
                                     |     answer, &inetsw[sock->type],
        sctp_v6_protosw_init         |     list) {
                                     |     /* hit, so assumes protocol
                                     |      * is already loaded
                                     |      */
                                     |  /* socket creation continues
                                     |   * before netns is initialized
                                     |   */
        register_pernet_subsys       |

Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a work-queue is not good specially because we loose the
ability to handle its errors.

So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.

Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sctp/protocol.c | 64 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 8f34b27..143c4eb 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1166,7 +1166,7 @@ static void sctp_v4_del_protocol(void)
 	unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
 }
 
-static int __net_init sctp_net_init(struct net *net)
+static int __net_init sctp_defaults_init(struct net *net)
 {
 	int status;
 
@@ -1259,12 +1259,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	sctp_dbg_objcnt_init(net);
 
-	/* Initialize the control inode/socket for handling OOTB packets.  */
-	if ((status = sctp_ctl_sock_init(net))) {
-		pr_err("Failed to initialize the SCTP control sock\n");
-		goto err_ctl_sock_init;
-	}
-
 	/* Initialize the local address list. */
 	INIT_LIST_HEAD(&net->sctp.local_addr_list);
 	spin_lock_init(&net->sctp.local_addr_lock);
@@ -1280,9 +1274,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	return 0;
 
-err_ctl_sock_init:
-	sctp_dbg_objcnt_exit(net);
-	sctp_proc_exit(net);
 err_init_proc:
 	cleanup_sctp_mibs(net);
 err_init_mibs:
@@ -1291,15 +1282,12 @@ err_sysctl_register:
 	return status;
 }
 
-static void __net_exit sctp_net_exit(struct net *net)
+static void __net_exit sctp_defaults_exit(struct net *net)
 {
 	/* Free the local address list */
 	sctp_free_addr_wq(net);
 	sctp_free_local_addr_list(net);
 
-	/* Free the control endpoint.  */
-	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-
 	sctp_dbg_objcnt_exit(net);
 
 	sctp_proc_exit(net);
@@ -1307,9 +1295,32 @@ static void __net_exit sctp_net_exit(struct net *net)
 	sctp_sysctl_net_unregister(net);
 }
 
-static struct pernet_operations sctp_net_ops = {
-	.init = sctp_net_init,
-	.exit = sctp_net_exit,
+static struct pernet_operations sctp_defaults_ops = {
+	.init = sctp_defaults_init,
+	.exit = sctp_defaults_exit,
+};
+
+static int __net_init sctp_ctrlsock_init(struct net *net)
+{
+	int status;
+
+	/* Initialize the control inode/socket for handling OOTB packets.  */
+	status = sctp_ctl_sock_init(net);
+	if (status)
+		pr_err("Failed to initialize the SCTP control sock\n");
+
+	return status;
+}
+
+static void __net_init sctp_ctrlsock_exit(struct net *net)
+{
+	/* Free the control endpoint.  */
+	inet_ctl_sock_destroy(net->sctp.ctl_sock);
+}
+
+static struct pernet_operations sctp_ctrlsock_ops = {
+	.init = sctp_ctrlsock_init,
+	.exit = sctp_ctrlsock_exit,
 };
 
 /* Initialize the universe into something sensible.  */
@@ -1443,8 +1454,11 @@ static __init int sctp_init(void)
 	sctp_v4_pf_init();
 	sctp_v6_pf_init();
 
-	status = sctp_v4_protosw_init();
+	status = register_pernet_subsys(&sctp_defaults_ops);
+	if (status)
+		goto err_register_defaults;
 
+	status = sctp_v4_protosw_init();
 	if (status)
 		goto err_protosw_init;
 
@@ -1452,9 +1466,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_protosw_init;
 
-	status = register_pernet_subsys(&sctp_net_ops);
+	status = register_pernet_subsys(&sctp_ctrlsock_ops);
 	if (status)
-		goto err_register_pernet_subsys;
+		goto err_register_ctrlsock;
 
 	status = sctp_v4_add_protocol();
 	if (status)
@@ -1470,12 +1484,14 @@ out:
 err_v6_add_protocol:
 	sctp_v4_del_protocol();
 err_add_protocol:
-	unregister_pernet_subsys(&sctp_net_ops);
-err_register_pernet_subsys:
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
+err_register_ctrlsock:
 	sctp_v6_protosw_exit();
 err_v6_protosw_init:
 	sctp_v4_protosw_exit();
 err_protosw_init:
+	unregister_pernet_subsys(&sctp_defaults_ops);
+err_register_defaults:
 	sctp_v4_pf_exit();
 	sctp_v6_pf_exit();
 	sctp_sysctl_unregister();
@@ -1508,12 +1524,14 @@ static __exit void sctp_exit(void)
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
 
-	unregister_pernet_subsys(&sctp_net_ops);
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
 
 	/* Free protosw registrations */
 	sctp_v6_protosw_exit();
 	sctp_v4_protosw_exit();
 
+	unregister_pernet_subsys(&sctp_defaults_ops);
+
 	/* Unregister with socket layer. */
 	sctp_v6_pf_exit();
 	sctp_v4_pf_exit();
-- 
2.1.0


From a6c5aeb519e0cae934a0aff1c9195bd2f522f006 Mon Sep 17 00:00:00 2001
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 21 Sep 2015 20:21:20 -0700
Subject: [PATCH 10/12] openvswitch: Zero flows on allocation.

[ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/openvswitch/datapath.c   |  4 ++--
 net/openvswitch/flow_table.c | 23 ++++++++++++-----------
 net/openvswitch/flow_table.h |  2 +-
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 28213df..acf6b2e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -834,7 +834,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &new_flow->unmasked_key, &mask);
+	ovs_flow_mask_key(&new_flow->key, &new_flow->unmasked_key, true, &mask);
 
 	/* Validate actions. */
 	acts = ovs_nla_alloc_flow_actions(nla_len(a[OVS_FLOW_ATTR_ACTIONS]));
@@ -949,7 +949,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 	if (IS_ERR(acts))
 		return acts;
 
-	ovs_flow_mask_key(&masked_key, key, mask);
+	ovs_flow_mask_key(&masked_key, key, true, mask);
 	error = ovs_nla_copy_actions(a, &masked_key, 0, &acts);
 	if (error) {
 		OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index cf2d853..740041a 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -56,20 +56,21 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 }
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask)
+		       bool full, const struct sw_flow_mask *mask)
 {
-	const long *m = (const long *)((const u8 *)&mask->key +
-				mask->range.start);
-	const long *s = (const long *)((const u8 *)src +
-				mask->range.start);
-	long *d = (long *)((u8 *)dst + mask->range.start);
+	int start = full ? 0 : mask->range.start;
+	int len = full ? sizeof *dst : range_n_bytes(&mask->range);
+	const long *m = (const long *)((const u8 *)&mask->key + start);
+	const long *s = (const long *)((const u8 *)src + start);
+	long *d = (long *)((u8 *)dst + start);
 	int i;
 
-	/* The memory outside of the 'mask->range' are not set since
-	 * further operations on 'dst' only uses contents within
-	 * 'mask->range'.
+	/* If 'full' is true then all of 'dst' is fully initialized. Otherwise,
+	 * if 'full' is false the memory outside of the 'mask->range' is left
+	 * uninitialized. This can be used as an optimization when further
+	 * operations on 'dst' only use contents within 'mask->range'.
 	 */
-	for (i = 0; i < range_n_bytes(&mask->range); i += sizeof(long))
+	for (i = 0; i < len; i += sizeof(long))
 		*d++ = *s++ & *m++;
 }
 
@@ -418,7 +419,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	u32 hash;
 	struct sw_flow_key masked_key;
 
-	ovs_flow_mask_key(&masked_key, unmasked, mask);
+	ovs_flow_mask_key(&masked_key, unmasked, false, mask);
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index 5918bff..2f0cf20 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -82,5 +82,5 @@ bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 			       struct sw_flow_match *match);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask);
+		       bool full, const struct sw_flow_mask *mask);
 #endif /* flow_table.h */
-- 
2.1.0


From 9149d5def926efb3c3a4e6bc66d3b3f60591ddbe Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 23 Sep 2015 14:00:21 -0700
Subject: [PATCH 11/12] tcp: add proper TS val into RST packets

[ Upstream commit 675ee231d960af2af3606b4480324e26797eb010 ]

RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc9f925..9c7d8887 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2798,6 +2798,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
 	skb_reserve(skb, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
 			     TCPHDR_ACK | TCPHDR_RST);
+	skb_mstamp_get(&skb->skb_mstamp);
 	/* Send it off. */
 	if (tcp_transmit_skb(sk, skb, 0, priority))
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
-- 
2.1.0


From 374d3782e0297a83a398ddad1a67eac606720e22 Mon Sep 17 00:00:00 2001
From: Wilson Kok <wkok@cumulusnetworks.com>
Date: Tue, 22 Sep 2015 21:40:22 -0700
Subject: [PATCH 12/12] fib_rules: fix fib rule dumps across multiple skbs

[ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/fib_rules.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 185c341..aeedc3a 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -621,15 +621,17 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int idx = 0;
 	struct fib_rule *rule;
+	int err = 0;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
 		if (idx < cb->args[1])
 			goto skip;
 
-		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
-				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
-				     NLM_F_MULTI, ops) < 0)
+		err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
+				       cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				       NLM_F_MULTI, ops);
+		if (err)
 			break;
 skip:
 		idx++;
@@ -638,7 +640,7 @@ skip:
 	cb->args[1] = idx;
 	rules_ops_put(ops);
 
-	return skb->len;
+	return err;
 }
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
@@ -654,7 +656,9 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
-		return dump_rules(skb, cb, ops);
+		dump_rules(skb, cb, ops);
+
+		return skb->len;
 	}
 
 	rcu_read_lock();
-- 
2.1.0


[-- Attachment #4: net_41.mbox --]
[-- Type: Application/Octet-Stream, Size: 85559 bytes --]

From 923462db2683c5e46dcf038528097c7f0e5b83d3 Mon Sep 17 00:00:00 2001
From: huaibin Wang <huaibin.wang@6wind.com>
Date: Tue, 25 Aug 2015 16:20:34 +0200
Subject: [PATCH 01/28] ip6_gre: release cached dst on tunnel removal

[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index a38d3ac..69f4f68 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -361,6 +361,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev)
 	struct ip6gre_net *ign = net_generic(t->net, ip6gre_net_id);
 
 	ip6gre_tunnel_unlink(ign, t);
+	ip6_tnl_dst_reset(t);
 	dev_put(dev);
 }
 
-- 
2.1.0


From 9615a90e16e4dc465ccb8d0b0a6d5aae9f3ddfe0 Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Tue, 25 Aug 2015 20:22:35 -0300
Subject: [PATCH 02/28] vxlan: re-ignore EADDRINUSE from igmp_join

[ Upstream commit bef0057b7ba881d5ae67eec876df7a26fe672a59 ]

Before 56ef9c909b40[1] it used to ignore all errors from igmp_join().
That commit enhanced that and made it error out whatever error happened
with igmp_join(), but that's not good because when using multicast
groups vxlan will try to join it multiple times if the socket is reused
and then the 2nd and further attempts will fail with EADDRINUSE.

As we don't track to which groups the socket is already subscribed, it's
okay to just ignore that error.

Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: John Nielsen <lists@jnielsen.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/vxlan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 21a0fbf..0085b8d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2212,6 +2212,8 @@ static int vxlan_open(struct net_device *dev)
 
 	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip)) {
 		ret = vxlan_igmp_join(vxlan);
+		if (ret == -EADDRINUSE)
+			ret = 0;
 		if (ret) {
 			vxlan_sock_release(vs);
 			return ret;
-- 
2.1.0


From 50829ec4c515b379721304dbcf0e3d6288f56761 Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong@gmail.com>
Date: Tue, 25 Aug 2015 16:38:12 -0700
Subject: [PATCH 03/28] cls_u32: complete the check for non-forced case in
 u32_destroy()

[ Upstream commit a6c1aea044e490da3e59124ec55991fe316818d5 ]

In commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
I added a check in u32_destroy() to see if all real filters are gone
for each tp, however, that is only done for root_ht, same is needed
for others.

This can be reproduced by the following tc commands:

tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
ht 15:2: match ip src 10.0.0.2 flowid 1:10
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
ht 15:2: match ip src 10.0.0.3 flowid 1:10

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sched/cls_u32.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index cab9e9b..4fbb674 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -490,6 +490,19 @@ static bool u32_destroy(struct tcf_proto *tp, bool force)
 					return false;
 			}
 		}
+
+		if (tp_c->refcnt > 1)
+			return false;
+
+		if (tp_c->refcnt == 1) {
+			struct tc_u_hnode *ht;
+
+			for (ht = rtnl_dereference(tp_c->hlist);
+			     ht;
+			     ht = rtnl_dereference(ht->next))
+				if (!ht_empty(ht))
+					return false;
+		}
 	}
 
 	if (root_ht && --root_ht->refcnt == 0)
-- 
2.1.0


From 917a5220ce98d3e0c96899d5d926631ee09fc50b Mon Sep 17 00:00:00 2001
From: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Date: Mon, 24 Aug 2015 23:13:42 +0300
Subject: [PATCH 04/28] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is
 cleared

[ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ]

It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
usbnet_stop(), but its value should be read before it is cleared
when dev->flags is set to 0.

The problem was spotted and the fix was provided by
Oliver Neukum <oneukum@suse.de>.

Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/usb/usbnet.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 3c86b10..e049857 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -778,7 +778,7 @@ int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			retval, pm;
+	int			retval, pm, mpn;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 	netif_stop_queue (net);
@@ -809,6 +809,8 @@ int usbnet_stop (struct net_device *net)
 
 	usbnet_purge_paused_rxq(dev);
 
+	mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);
+
 	/* deferred work (task, timer, softirq) must also stop.
 	 * can't flush_scheduled_work() until we drop rtnl (later),
 	 * else workers could deadlock; so make workers a NOP.
@@ -819,8 +821,7 @@ int usbnet_stop (struct net_device *net)
 	if (!pm)
 		usb_autopm_put_interface(dev->intf);
 
-	if (info->manage_power &&
-	    !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags))
+	if (info->manage_power && mpn)
 		info->manage_power(dev, 0);
 	else
 		usb_autopm_put_interface(dev->intf);
-- 
2.1.0


From 4ae73f6fa9d42d42bfa762d48e7d80c6120f58ad Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 2 Sep 2015 14:00:36 +0200
Subject: [PATCH 05/28] sock, diag: fix panic in sock_diag_put_filterinfo

[ Upstream commit b382c08656000c12a146723a153b85b13a855b49 ]

diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
upon request to user space (ss -0 -b). However, native eBPF programs
attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
unconditionally tries to access its filter length resp. wants to copy
the filter insns from there. Internal cBPF to eBPF transformations
attached to sockets don't have this issue, as orig_prog state is kept.

It's currently only used by packet sockets. If we would want to add
native eBPF support in the future, this needs to be done through
a different attribute than PACKET_DIAG_FILTER to not confuse possible
user space disassemblers that work on diag data.

Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/sock_diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 74dddf8..556ecf96 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -86,6 +86,9 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk,
 		goto out;
 
 	fprog = filter->prog->orig_prog;
+	if (!fprog)
+		goto out;
+
 	flen = bpf_classic_proglen(fprog);
 
 	attr = nla_reserve(skb, attrtype, flen);
-- 
2.1.0


From 82918a07fe222feea4ebe2dca23f7c0a3c1eb874 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 3 Sep 2015 00:29:07 +0200
Subject: [PATCH 06/28] ipv6: fix exthdrs offload registration in out_rt path

[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
2.1.0


From dd222d5ded20452f5d1db209dce7b7babf54be5a Mon Sep 17 00:00:00 2001
From: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Wed, 2 Sep 2015 17:24:14 +0800
Subject: [PATCH 07/28] net: fec: clear receive interrupts before processing a
 packet

[ Upstream commit ed63f1dcd5788d36f942fbcce350742385e3e18c ]

The patch just to re-submit the patch "db3421c114cfa6326" because the
patch "4d494cdc92b3b9a0" remove the change.

Clear any pending receive interrupt before we process a pending packet.
This helps to avoid any spurious interrupts being raised after we have
fully cleaned the receive ring, while still allowing an interrupt to be
raised if we receive another packet.

The position of this is critical: we must do this prior to reading the
next packet status to avoid potentially dropping an interrupt when a
packet is still pending.

Acked-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/freescale/fec_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 66d47e4..570390b 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1396,6 +1396,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
 		if ((status & BD_ENET_RX_LAST) == 0)
 			netdev_err(ndev, "rcv is not +last\n");
 
+		writel(FEC_ENET_RXF, fep->hwp + FEC_IEVENT);
 
 		/* Check for errors. */
 		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
-- 
2.1.0


From e12b93d0d23f3cb4dd68d099408f4c953f0e1347 Mon Sep 17 00:00:00 2001
From: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Date: Wed, 2 Sep 2015 17:49:29 +0900
Subject: [PATCH 08/28] net: eth: altera: fix napi poll_list corruption

[ Upstream commit 4548a697e4969d695047cebd6d9af5e2f6cc728e ]

tse_poll() calls __napi_complete() with irq enabled.  This leads napi
poll_list corruption and may stop all napi drivers working.
Use napi_complete() instead of __napi_complete().

Signed-off-by: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/altera/altera_tse_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index da48e66..8207877 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -511,8 +511,7 @@ static int tse_poll(struct napi_struct *napi, int budget)
 
 	if (rxcomplete < budget) {
 
-		napi_gro_flush(napi, false);
-		__napi_complete(napi);
+		napi_complete(napi);
 
 		netdev_dbg(priv->dev,
 			   "NAPI Complete, did %d packets with budget %d\n",
-- 
2.1.0


From d8be124bd7a68ccd82c3522c4a3d132c66ffc328 Mon Sep 17 00:00:00 2001
From: Richard Laing <richard.laing@alliedtelesis.co.nz>
Date: Thu, 3 Sep 2015 13:52:31 +1200
Subject: [PATCH 09/28] net/ipv6: Correct PIM6 mrt_lock handling

[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 74ceb73..5f36266 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -550,7 +550,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 
-- 
2.1.0


From 3ba957b71c4b6de20048dba028556cd4aeee3aec Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sat, 5 Sep 2015 13:07:27 -0700
Subject: [PATCH 10/28] net: dsa: bcm_sf2: Fix ageing conditions and operation

[ Upstream commit 39797a279d62972cd914ef580fdfacb13e508bf8 ]

The comparison check between cur_hw_state and hw_state is currently
invalid because cur_hw_state is right shifted by G_MISTP_SHIFT, while
hw_state is not, so we end-up comparing bits 2:0 with bits 7:5, which is
going to cause an additional aging to occur. Fix this by not shifting
cur_hw_state while reading it, but instead, mask the value with the
appropriately shitfted bitmask.

The other problem with the fast-ageing process is that we did not set
the EN_AGE_DYNAMIC bit to request the ageing to occur for dynamically
learned MAC addresses. Finally, write back 0 to the FAST_AGE_CTRL
register to avoid leaving spurious bits sets from one operation to the
other.

Fixes: 12f460f23423 ("net: dsa: bcm_sf2: add HW bridging support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index cedb572..f24c4fe 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -417,7 +417,7 @@ static int bcm_sf2_sw_fast_age_port(struct dsa_switch  *ds, int port)
 	core_writel(priv, port, CORE_FAST_AGE_PORT);
 
 	reg = core_readl(priv, CORE_FAST_AGE_CTRL);
-	reg |= EN_AGE_PORT | FAST_AGE_STR_DONE;
+	reg |= EN_AGE_PORT | EN_AGE_DYNAMIC | FAST_AGE_STR_DONE;
 	core_writel(priv, reg, CORE_FAST_AGE_CTRL);
 
 	do {
@@ -431,6 +431,8 @@ static int bcm_sf2_sw_fast_age_port(struct dsa_switch  *ds, int port)
 	if (!timeout)
 		return -ETIMEDOUT;
 
+	core_writel(priv, 0, CORE_FAST_AGE_CTRL);
+
 	return 0;
 }
 
@@ -506,7 +508,7 @@ static int bcm_sf2_sw_br_set_stp_state(struct dsa_switch *ds, int port,
 	u32 reg;
 
 	reg = core_readl(priv, CORE_G_PCTL_PORT(port));
-	cur_hw_state = reg >> G_MISTP_STATE_SHIFT;
+	cur_hw_state = reg & (G_MISTP_STATE_MASK << G_MISTP_STATE_SHIFT);
 
 	switch (state) {
 	case BR_STATE_DISABLED:
@@ -530,10 +532,12 @@ static int bcm_sf2_sw_br_set_stp_state(struct dsa_switch *ds, int port,
 	}
 
 	/* Fast-age ARL entries if we are moving a port from Learning or
-	 * Forwarding state to Disabled, Blocking or Listening state
+	 * Forwarding (cur_hw_state) state to Disabled, Blocking or Listening
+	 * state (hw_state)
 	 */
 	if (cur_hw_state != hw_state) {
-		if (cur_hw_state & 4 && !(hw_state & 4)) {
+		if (cur_hw_state >= G_MISTP_LEARN_STATE &&
+		    hw_state <= G_MISTP_LISTEN_STATE) {
 			ret = bcm_sf2_sw_fast_age_port(ds, port);
 			if (ret) {
 				pr_err("%s: fast-ageing failed\n", __func__);
-- 
2.1.0


From 54ae54b11332d089dc4d9ad989bbf63428d49641 Mon Sep 17 00:00:00 2001
From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Tue, 8 Sep 2015 10:53:04 -0700
Subject: [PATCH 11/28] ipv6: fix multipath route replace error recovery

[ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ]

Problem:
The ecmp route replace support for ipv6 in the kernel, deletes the
existing ecmp route too early, ie when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, its too late
to recover the already deleted existing route leaving the fib
in an inconsistent state.

This patch reduces the possibility of this by doing the following:
a) Changes the existing multipath route add code to a two stage process:
  build rt6_infos + insert them
	ip6_route_add rt6_info creation code is moved into
	ip6_route_info_create.
b) This ensures that most errors are caught during building rt6_infos
  and we fail early
c) Separates multipath add and del code. Because add needs the special
  two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
  warning is printed (This should be unlikely)

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 dissappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/route.c | 201 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 175 insertions(+), 26 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c73ae50..f371fef 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1515,7 +1515,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 	return -EINVAL;
 }
 
-int ip6_route_add(struct fib6_config *cfg)
+int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret)
 {
 	int err;
 	struct net *net = cfg->fc_nlinfo.nl_net;
@@ -1523,7 +1523,6 @@ int ip6_route_add(struct fib6_config *cfg)
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	struct fib6_table *table;
-	struct mx6_config mxc = { .mx = NULL, };
 	int addr_type;
 
 	if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
@@ -1719,6 +1718,32 @@ install_route:
 
 	cfg->fc_nlinfo.nl_net = dev_net(dev);
 
+	*rt_ret = rt;
+
+	return 0;
+out:
+	if (dev)
+		dev_put(dev);
+	if (idev)
+		in6_dev_put(idev);
+	if (rt)
+		dst_free(&rt->dst);
+
+	*rt_ret = NULL;
+
+	return err;
+}
+
+int ip6_route_add(struct fib6_config *cfg)
+{
+	struct mx6_config mxc = { .mx = NULL, };
+	struct rt6_info *rt = NULL;
+	int err;
+
+	err = ip6_route_info_create(cfg, &rt);
+	if (err)
+		goto out;
+
 	err = ip6_convert_metrics(&mxc, cfg);
 	if (err)
 		goto out;
@@ -1726,14 +1751,12 @@ install_route:
 	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc);
 
 	kfree(mxc.mx);
+
 	return err;
 out:
-	if (dev)
-		dev_put(dev);
-	if (idev)
-		in6_dev_put(idev);
 	if (rt)
 		dst_free(&rt->dst);
+
 	return err;
 }
 
@@ -2496,19 +2519,78 @@ errout:
 	return err;
 }
 
-static int ip6_route_multipath(struct fib6_config *cfg, int add)
+struct rt6_nh {
+	struct rt6_info *rt6_info;
+	struct fib6_config r_cfg;
+	struct mx6_config mxc;
+	struct list_head next;
+};
+
+static void ip6_print_replace_route_err(struct list_head *rt6_nh_list)
+{
+	struct rt6_nh *nh;
+
+	list_for_each_entry(nh, rt6_nh_list, next) {
+		pr_warn("IPV6: multipath route replace failed (check consistency of installed routes): %pI6 nexthop %pI6 ifi %d\n",
+		        &nh->r_cfg.fc_dst, &nh->r_cfg.fc_gateway,
+		        nh->r_cfg.fc_ifindex);
+	}
+}
+
+static int ip6_route_info_append(struct list_head *rt6_nh_list,
+				 struct rt6_info *rt, struct fib6_config *r_cfg)
+{
+	struct rt6_nh *nh;
+	struct rt6_info *rtnh;
+	int err = -EEXIST;
+
+	list_for_each_entry(nh, rt6_nh_list, next) {
+		/* check if rt6_info already exists */
+		rtnh = nh->rt6_info;
+
+		if (rtnh->dst.dev == rt->dst.dev &&
+		    rtnh->rt6i_idev == rt->rt6i_idev &&
+		    ipv6_addr_equal(&rtnh->rt6i_gateway,
+				    &rt->rt6i_gateway))
+			return err;
+	}
+
+	nh = kzalloc(sizeof(*nh), GFP_KERNEL);
+	if (!nh)
+		return -ENOMEM;
+	nh->rt6_info = rt;
+	err = ip6_convert_metrics(&nh->mxc, r_cfg);
+	if (err) {
+		kfree(nh);
+		return err;
+	}
+	memcpy(&nh->r_cfg, r_cfg, sizeof(*r_cfg));
+	list_add_tail(&nh->next, rt6_nh_list);
+
+	return 0;
+}
+
+static int ip6_route_multipath_add(struct fib6_config *cfg)
 {
 	struct fib6_config r_cfg;
 	struct rtnexthop *rtnh;
+	struct rt6_info *rt;
+	struct rt6_nh *err_nh;
+	struct rt6_nh *nh, *nh_safe;
 	int remaining;
 	int attrlen;
-	int err = 0, last_err = 0;
+	int err = 1;
+	int nhn = 0;
+	int replace = (cfg->fc_nlinfo.nlh &&
+		       (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
+	LIST_HEAD(rt6_nh_list);
 
 	remaining = cfg->fc_mp_len;
-beginning:
 	rtnh = (struct rtnexthop *)cfg->fc_mp;
 
-	/* Parse a Multipath Entry */
+	/* Parse a Multipath Entry and build a list (rt6_nh_list) of
+	 * rt6_info structs per nexthop
+	 */
 	while (rtnh_ok(rtnh, remaining)) {
 		memcpy(&r_cfg, cfg, sizeof(*cfg));
 		if (rtnh->rtnh_ifindex)
@@ -2524,22 +2606,32 @@ beginning:
 				r_cfg.fc_flags |= RTF_GATEWAY;
 			}
 		}
-		err = add ? ip6_route_add(&r_cfg) : ip6_route_del(&r_cfg);
+
+		err = ip6_route_info_create(&r_cfg, &rt);
+		if (err)
+			goto cleanup;
+
+		err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
 		if (err) {
-			last_err = err;
-			/* If we are trying to remove a route, do not stop the
-			 * loop when ip6_route_del() fails (because next hop is
-			 * already gone), we should try to remove all next hops.
-			 */
-			if (add) {
-				/* If add fails, we should try to delete all
-				 * next hops that have been already added.
-				 */
-				add = 0;
-				remaining = cfg->fc_mp_len - remaining;
-				goto beginning;
-			}
+			dst_free(&rt->dst);
+			goto cleanup;
+		}
+
+		rtnh = rtnh_next(rtnh, &remaining);
+	}
+
+	err_nh = NULL;
+	list_for_each_entry(nh, &rt6_nh_list, next) {
+		err = __ip6_ins_rt(nh->rt6_info, &cfg->fc_nlinfo, &nh->mxc);
+		/* nh->rt6_info is used or freed at this point, reset to NULL*/
+		nh->rt6_info = NULL;
+		if (err) {
+			if (replace && nhn)
+				ip6_print_replace_route_err(&rt6_nh_list);
+			err_nh = nh;
+			goto add_errout;
 		}
+
 		/* Because each route is added like a single route we remove
 		 * these flags after the first nexthop: if there is a collision,
 		 * we have already failed to add the first nexthop:
@@ -2549,6 +2641,63 @@ beginning:
 		 */
 		cfg->fc_nlinfo.nlh->nlmsg_flags &= ~(NLM_F_EXCL |
 						     NLM_F_REPLACE);
+		nhn++;
+	}
+
+	goto cleanup;
+
+add_errout:
+	/* Delete routes that were already added */
+	list_for_each_entry(nh, &rt6_nh_list, next) {
+		if (err_nh == nh)
+			break;
+		ip6_route_del(&nh->r_cfg);
+	}
+
+cleanup:
+	list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, next) {
+		if (nh->rt6_info)
+			dst_free(&nh->rt6_info->dst);
+		if (nh->mxc.mx)
+			kfree(nh->mxc.mx);
+		list_del(&nh->next);
+		kfree(nh);
+	}
+
+	return err;
+}
+
+static int ip6_route_multipath_del(struct fib6_config *cfg)
+{
+	struct fib6_config r_cfg;
+	struct rtnexthop *rtnh;
+	int remaining;
+	int attrlen;
+	int err = 1, last_err = 0;
+
+	remaining = cfg->fc_mp_len;
+	rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+	/* Parse a Multipath Entry */
+	while (rtnh_ok(rtnh, remaining)) {
+		memcpy(&r_cfg, cfg, sizeof(*cfg));
+		if (rtnh->rtnh_ifindex)
+			r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
+
+		attrlen = rtnh_attrlen(rtnh);
+		if (attrlen > 0) {
+			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+
+			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+			if (nla) {
+				nla_memcpy(&r_cfg.fc_gateway, nla, 16);
+				r_cfg.fc_flags |= RTF_GATEWAY;
+			}
+		}
+		err = ip6_route_del(&r_cfg);
+		if (err)
+			last_err = err;
+
 		rtnh = rtnh_next(rtnh, &remaining);
 	}
 
@@ -2565,7 +2714,7 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return err;
 
 	if (cfg.fc_mp)
-		return ip6_route_multipath(&cfg, 0);
+		return ip6_route_multipath_del(&cfg);
 	else
 		return ip6_route_del(&cfg);
 }
@@ -2580,7 +2729,7 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return err;
 
 	if (cfg.fc_mp)
-		return ip6_route_multipath(&cfg, 1);
+		return ip6_route_multipath_add(&cfg);
 	else
 		return ip6_route_add(&cfg);
 }
-- 
2.1.0


From 9ebe12001cb76d9b47fe0fab691980fe8e1e7c25 Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 8 Sep 2015 20:06:41 -0700
Subject: [PATCH 12/28] net: dsa: bcm_sf2: Fix 64-bits register writes

[ Upstream commit 03679a14739a0d4c14b52ba65a69ff553bfba73b ]

The macro to write 64-bits quantities to the 32-bits register swapped
the value and offsets arguments, we want to preserve the ordering of the
arguments with respect to how writel() is implemented for instance:
value first, offset/base second.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 22e2ebf..789d7b7 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -112,8 +112,8 @@ static inline u64 name##_readq(struct bcm_sf2_priv *priv, u32 off)	\
 	spin_unlock(&priv->indir_lock);					\
 	return (u64)indir << 32 | dir;					\
 }									\
-static inline void name##_writeq(struct bcm_sf2_priv *priv, u32 off,	\
-							u64 val)	\
+static inline void name##_writeq(struct bcm_sf2_priv *priv, u64 val,	\
+							u32 off)	\
 {									\
 	spin_lock(&priv->indir_lock);					\
 	reg_writel(priv, upper_32_bits(val), REG_DIR_DATA_WRITE);	\
-- 
2.1.0


From 3179751dd88c387d12fe607b111fde3f3ecf5a02 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 10 Sep 2015 20:05:46 +0200
Subject: [PATCH 13/28] netlink, mmap: transform mmap skb into full skb on taps

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
 net/netlink/af_netlink.h |  9 +++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index bf6e766..ca08ffb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -123,6 +123,24 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+					   gfp_t gfp_mask)
+{
+	unsigned int len = skb_end_offset(skb);
+	struct sk_buff *new;
+
+	new = alloc_skb(len, gfp_mask);
+	if (new == NULL)
+		return NULL;
+
+	NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+	NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+	NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+	memcpy(skb_put(new, len), skb->data, len);
+	return new;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -204,7 +222,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	int ret = -ENOMEM;
 
 	dev_hold(dev);
-	nskb = skb_clone(skb, GFP_ATOMIC);
+
+	if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	else
+		nskb = skb_clone(skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -276,11 +298,6 @@ static void netlink_rcv_wake(struct sock *sk)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
-	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
 static bool netlink_rx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -817,7 +834,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
 }
 
 #else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb)	false
 #define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..df9a060 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -59,6 +59,15 @@ static inline struct netlink_sock *nlk_sk(struct sock *sk)
 	return container_of(sk, struct netlink_sock, sk);
 }
 
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+	return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
 struct netlink_table {
 	struct rhashtable	hash;
 	struct hlist_head	mc_list;
-- 
2.1.0


From 8d895660866bca3fd683e2cdc331b651f855f6d2 Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Thu, 10 Sep 2015 17:31:15 -0300
Subject: [PATCH 14/28] sctp: fix race on protocol/netns initialization

[ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

Consider sctp module is unloaded and is being requested because an user
is creating a sctp socket.

During initialization, sctp will add the new protocol type and then
initialize pernet subsys:

        status = sctp_v4_protosw_init();
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_net_ops);

The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets like if the module is
already fully loaded. If that happens, one of the possible effects is
that we will have readers for net->sctp.local_addr_list list earlier
than expected and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.

The race happens like this:

     CPU 0                           |  CPU 1
  socket()                           |
   __sock_create                     | socket()
    inet_create                      |  __sock_create
     list_for_each_entry_rcu(        |
        answer, &inetsw[sock->type], |
        list) {                      |   inet_create
      /* no hits */                  |
     if (unlikely(err)) {            |
      ...                            |
      request_module()               |
      /* socket creation is blocked  |
       * the module is fully loaded  |
       */                            |
       sctp_init                     |
        sctp_v4_protosw_init         |
         inet_register_protosw       |
          list_add_rcu(&p->list,     |
                       last_perm);   |
                                     |  list_for_each_entry_rcu(
                                     |     answer, &inetsw[sock->type],
        sctp_v6_protosw_init         |     list) {
                                     |     /* hit, so assumes protocol
                                     |      * is already loaded
                                     |      */
                                     |  /* socket creation continues
                                     |   * before netns is initialized
                                     |   */
        register_pernet_subsys       |

Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a work-queue is not good specially because we loose the
ability to handle its errors.

So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.

Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sctp/protocol.c | 64 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 53b7acd..e13c3c3 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1166,7 +1166,7 @@ static void sctp_v4_del_protocol(void)
 	unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
 }
 
-static int __net_init sctp_net_init(struct net *net)
+static int __net_init sctp_defaults_init(struct net *net)
 {
 	int status;
 
@@ -1259,12 +1259,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	sctp_dbg_objcnt_init(net);
 
-	/* Initialize the control inode/socket for handling OOTB packets.  */
-	if ((status = sctp_ctl_sock_init(net))) {
-		pr_err("Failed to initialize the SCTP control sock\n");
-		goto err_ctl_sock_init;
-	}
-
 	/* Initialize the local address list. */
 	INIT_LIST_HEAD(&net->sctp.local_addr_list);
 	spin_lock_init(&net->sctp.local_addr_lock);
@@ -1280,9 +1274,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	return 0;
 
-err_ctl_sock_init:
-	sctp_dbg_objcnt_exit(net);
-	sctp_proc_exit(net);
 err_init_proc:
 	cleanup_sctp_mibs(net);
 err_init_mibs:
@@ -1291,15 +1282,12 @@ err_sysctl_register:
 	return status;
 }
 
-static void __net_exit sctp_net_exit(struct net *net)
+static void __net_exit sctp_defaults_exit(struct net *net)
 {
 	/* Free the local address list */
 	sctp_free_addr_wq(net);
 	sctp_free_local_addr_list(net);
 
-	/* Free the control endpoint.  */
-	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-
 	sctp_dbg_objcnt_exit(net);
 
 	sctp_proc_exit(net);
@@ -1307,9 +1295,32 @@ static void __net_exit sctp_net_exit(struct net *net)
 	sctp_sysctl_net_unregister(net);
 }
 
-static struct pernet_operations sctp_net_ops = {
-	.init = sctp_net_init,
-	.exit = sctp_net_exit,
+static struct pernet_operations sctp_defaults_ops = {
+	.init = sctp_defaults_init,
+	.exit = sctp_defaults_exit,
+};
+
+static int __net_init sctp_ctrlsock_init(struct net *net)
+{
+	int status;
+
+	/* Initialize the control inode/socket for handling OOTB packets.  */
+	status = sctp_ctl_sock_init(net);
+	if (status)
+		pr_err("Failed to initialize the SCTP control sock\n");
+
+	return status;
+}
+
+static void __net_init sctp_ctrlsock_exit(struct net *net)
+{
+	/* Free the control endpoint.  */
+	inet_ctl_sock_destroy(net->sctp.ctl_sock);
+}
+
+static struct pernet_operations sctp_ctrlsock_ops = {
+	.init = sctp_ctrlsock_init,
+	.exit = sctp_ctrlsock_exit,
 };
 
 /* Initialize the universe into something sensible.  */
@@ -1442,8 +1453,11 @@ static __init int sctp_init(void)
 	sctp_v4_pf_init();
 	sctp_v6_pf_init();
 
-	status = sctp_v4_protosw_init();
+	status = register_pernet_subsys(&sctp_defaults_ops);
+	if (status)
+		goto err_register_defaults;
 
+	status = sctp_v4_protosw_init();
 	if (status)
 		goto err_protosw_init;
 
@@ -1451,9 +1465,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_protosw_init;
 
-	status = register_pernet_subsys(&sctp_net_ops);
+	status = register_pernet_subsys(&sctp_ctrlsock_ops);
 	if (status)
-		goto err_register_pernet_subsys;
+		goto err_register_ctrlsock;
 
 	status = sctp_v4_add_protocol();
 	if (status)
@@ -1469,12 +1483,14 @@ out:
 err_v6_add_protocol:
 	sctp_v4_del_protocol();
 err_add_protocol:
-	unregister_pernet_subsys(&sctp_net_ops);
-err_register_pernet_subsys:
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
+err_register_ctrlsock:
 	sctp_v6_protosw_exit();
 err_v6_protosw_init:
 	sctp_v4_protosw_exit();
 err_protosw_init:
+	unregister_pernet_subsys(&sctp_defaults_ops);
+err_register_defaults:
 	sctp_v4_pf_exit();
 	sctp_v6_pf_exit();
 	sctp_sysctl_unregister();
@@ -1507,12 +1523,14 @@ static __exit void sctp_exit(void)
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
 
-	unregister_pernet_subsys(&sctp_net_ops);
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
 
 	/* Free protosw registrations */
 	sctp_v6_protosw_exit();
 	sctp_v4_protosw_exit();
 
+	unregister_pernet_subsys(&sctp_defaults_ops);
+
 	/* Unregister with socket layer. */
 	sctp_v6_pf_exit();
 	sctp_v4_pf_exit();
-- 
2.1.0


From 902fd2280ee30d1be55739643704d79d8c92b1ac Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Linus=20L=C3=BCssing?= <linus.luessing@c0d3.blue>
Date: Fri, 11 Sep 2015 18:39:48 +0200
Subject: [PATCH 15/28] bridge: fix igmpv3 / mldv2 report parsing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[ Upstream commit c2d4fbd2163e607915cc05798ce7fb7f31117cc1 ]

With the newly introduced helper functions the skb pulling is hidden in
the checksumming function - and undone before returning to the caller.

The IGMPv3 and MLDv2 report parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header,
breaking the message parsing and creating packet loss.

Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.

Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
Reported-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/bridge/br_multicast.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index ff667e1..9ba383f 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -980,7 +980,7 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
 
 	ih = igmpv3_report_hdr(skb);
 	num = ntohs(ih->ngrec);
-	len = sizeof(*ih);
+	len = skb_transport_offset(skb) + sizeof(*ih);
 
 	for (i = 0; i < num; i++) {
 		len += sizeof(*grec);
@@ -1035,7 +1035,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
 
 	icmp6h = icmp6_hdr(skb);
 	num = ntohs(icmp6h->icmp6_dataun.un_data16[1]);
-	len = sizeof(*icmp6h);
+	len = skb_transport_offset(skb) + sizeof(*icmp6h);
 
 	for (i = 0; i < num; i++) {
 		__be16 *nsrcs, _nsrcs;
-- 
2.1.0


From 3b10baa848d5421e7300e629dd1d47dc5d5cf7e5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 15 Sep 2015 18:29:47 -0700
Subject: [PATCH 16/28] net/mlx4_en: really allow to change RSS key

[ Upsteam commit 4671fc6d47e0a0108fe24a4d830347d6a6ef4aa7 ]

When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eyal Perry <eyalpe@mellanox.com>
CC: Amir Vadai <amirv@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eab4e08..80aac20 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1256,8 +1256,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		rss_context->hash_fn = MLX4_RSS_HASH_TOP;
 		memcpy(rss_context->rss_key, priv->rss_key,
 		       MLX4_EN_RSS_KEY_SIZE);
-		netdev_rss_key_fill(rss_context->rss_key,
-				    MLX4_EN_RSS_KEY_SIZE);
 	} else {
 		en_err(priv, "Unknown RSS hash function requested\n");
 		err = -EINVAL;
-- 
2.1.0


From 550e098bd3241e02a60ee1d513e01dfa58a2cd16 Mon Sep 17 00:00:00 2001
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 18 Sep 2015 13:41:09 +0300
Subject: [PATCH 17/28] macvtap: fix TUNSETSNDBUF values > 64k

[ Upstream commit 3ea79249e81e5ed051f2e6480cbde896d99046e8 ]

Upon TUNSETSNDBUF,  macvtap reads the requested sndbuf size into
a local variable u.
commit 39ec7de7092b ("macvtap: fix uninitialized access on
TUNSETIFF") changed its type to u16 (which is the right thing to
do for all other macvtap ioctls), breaking all values > 64k.

The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
the right thing to do is to read it into an int.

Cc: David S. Miller <davem@davemloft.net>
Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
Reported-by: Mark A. Peloquin
Bisected-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by:  Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/macvtap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 8c350c5..58858c5 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1054,10 +1054,10 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return 0;
 
 	case TUNSETSNDBUF:
-		if (get_user(u, up))
+		if (get_user(s, sp))
 			return -EFAULT;
 
-		q->sk.sk_sndbuf = u;
+		q->sk.sk_sndbuf = s;
 		return 0;
 
 	case TUNGETVNETHDRSZ:
-- 
2.1.0


From b934f9581fe4db0923ee82f633ab8efab675e2e0 Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 18 Sep 2015 19:16:50 +0800
Subject: [PATCH 18/28] netlink: Fix autobind race condition that leads to zero
 port ID

[ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 12 +++++++-----
 net/netlink/af_netlink.h |  1 +
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ca08ffb..7d20abe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1002,7 +1002,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->portid != x->portid ||
+	return nlk->rhash_portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1028,7 +1028,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1090,17 +1090,19 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->portid = portid;
+	nlk_sk(sk)->rhash_portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
 	if (err) {
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->portid = portid;
+
 err:
 	release_sock(sk);
 	return err;
@@ -3147,7 +3149,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index df9a060..80b2b75 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,6 +25,7 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
+	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
-- 
2.1.0


From 403b09ea207341e05c877f101098f3667c232117 Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 22 Sep 2015 11:38:56 +0800
Subject: [PATCH 19/28] netlink: Replace rhash_portid with bound

[ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 39 ++++++++++++++++++++++++++++-----------
 net/netlink/af_netlink.h |  2 +-
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7d20abe..7d824ac 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1002,7 +1002,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1028,7 +1028,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1081,8 +1081,8 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
 	lock_sock(sk);
 
-	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	err = nlk_sk(sk)->portid == portid ? 0 : -EBUSY;
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1090,7 +1090,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1101,7 +1101,9 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* We need to ensure that the socket is hashed and visible. */
+	smp_wmb();
+	nlk_sk(sk)->bound = portid;
 
 err:
 	release_sock(sk);
@@ -1482,6 +1484,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
 	int err;
 	long unsigned int groups = nladdr->nl_groups;
+	bool bound;
 
 	if (addr_len < sizeof(struct sockaddr_nl))
 		return -EINVAL;
@@ -1498,9 +1501,14 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	bound = nlk->bound;
+	if (bound) {
+		/* Ensure nlk->portid is up-to-date. */
+		smp_rmb();
+
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1516,7 +1524,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1564,7 +1575,10 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2321,10 +2335,13 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
+	} else {
+		/* Ensure nlk is hashed and visible. */
+		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
@@ -3149,7 +3166,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 80b2b75..14437d9 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,7 +25,6 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
-- 
2.1.0


From 336f32fe105e34e610d5b7d46c147ba01d24fec9 Mon Sep 17 00:00:00 2001
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 21 Sep 2015 20:21:20 -0700
Subject: [PATCH 20/28] openvswitch: Zero flows on allocation.

[ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/openvswitch/datapath.c   |  4 ++--
 net/openvswitch/flow_table.c | 23 ++++++++++++-----------
 net/openvswitch/flow_table.h |  2 +-
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 096c627..27e1496 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -906,7 +906,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &key, &mask);
+	ovs_flow_mask_key(&new_flow->key, &key, true, &mask);
 
 	/* Extract flow identifier. */
 	error = ovs_nla_get_identifier(&new_flow->id, a[OVS_FLOW_ATTR_UFID],
@@ -1033,7 +1033,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 	struct sw_flow_key masked_key;
 	int error;
 
-	ovs_flow_mask_key(&masked_key, key, mask);
+	ovs_flow_mask_key(&masked_key, key, true, mask);
 	error = ovs_nla_copy_actions(a, &masked_key, &acts, log);
 	if (error) {
 		OVS_NLERR(log,
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 4613df8..aa34951 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -56,20 +56,21 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 }
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask)
+		       bool full, const struct sw_flow_mask *mask)
 {
-	const long *m = (const long *)((const u8 *)&mask->key +
-				mask->range.start);
-	const long *s = (const long *)((const u8 *)src +
-				mask->range.start);
-	long *d = (long *)((u8 *)dst + mask->range.start);
+	int start = full ? 0 : mask->range.start;
+	int len = full ? sizeof *dst : range_n_bytes(&mask->range);
+	const long *m = (const long *)((const u8 *)&mask->key + start);
+	const long *s = (const long *)((const u8 *)src + start);
+	long *d = (long *)((u8 *)dst + start);
 	int i;
 
-	/* The memory outside of the 'mask->range' are not set since
-	 * further operations on 'dst' only uses contents within
-	 * 'mask->range'.
+	/* If 'full' is true then all of 'dst' is fully initialized. Otherwise,
+	 * if 'full' is false the memory outside of the 'mask->range' is left
+	 * uninitialized. This can be used as an optimization when further
+	 * operations on 'dst' only use contents within 'mask->range'.
 	 */
-	for (i = 0; i < range_n_bytes(&mask->range); i += sizeof(long))
+	for (i = 0; i < len; i += sizeof(long))
 		*d++ = *s++ & *m++;
 }
 
@@ -473,7 +474,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	u32 hash;
 	struct sw_flow_key masked_key;
 
-	ovs_flow_mask_key(&masked_key, unmasked, mask);
+	ovs_flow_mask_key(&masked_key, unmasked, false, mask);
 	hash = flow_hash(&masked_key, &mask->range);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver]) {
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index 616eda1..2dd9900 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -86,5 +86,5 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *,
 bool ovs_flow_cmp(const struct sw_flow *, const struct sw_flow_match *);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask);
+		       bool full, const struct sw_flow_mask *mask);
 #endif /* flow_table.h */
-- 
2.1.0


From f702441c691d47fa63f69da6b69f041b2b6f3261 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 23 Sep 2015 14:00:21 -0700
Subject: [PATCH 21/28] tcp: add proper TS val into RST packets

[ Upstream commit 675ee231d960af2af3606b4480324e26797eb010 ]

RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a369e8a..986440b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2893,6 +2893,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
 	skb_reserve(skb, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
 			     TCPHDR_ACK | TCPHDR_RST);
+	skb_mstamp_get(&skb->skb_mstamp);
 	/* Send it off. */
 	if (tcp_transmit_skb(sk, skb, 0, priority))
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
-- 
2.1.0


From 8563bea6b79e1c25773944d6c5276cb8e72f2084 Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong@gmail.com>
Date: Tue, 22 Sep 2015 17:01:11 -0700
Subject: [PATCH 22/28] net: revert "net_sched: move tp->root allocation into
 fw_init()"

[ Upstream commit d8aecb10115497f6cdf841df8c88ebb3ba25fa28 ]

fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sched/cls_fw.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 715e01e..f23a3b6 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -33,7 +33,6 @@
 
 struct fw_head {
 	u32			mask;
-	bool			mask_set;
 	struct fw_filter __rcu	*ht[HTSIZE];
 	struct rcu_head		rcu;
 };
@@ -84,7 +83,7 @@ static int fw_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			}
 		}
 	} else {
-		/* old method */
+		/* Old method: classify the packet using its skb mark. */
 		if (id && (TC_H_MAJ(id) == 0 ||
 			   !(TC_H_MAJ(id ^ tp->q->handle)))) {
 			res->classid = id;
@@ -114,14 +113,9 @@ static unsigned long fw_get(struct tcf_proto *tp, u32 handle)
 
 static int fw_init(struct tcf_proto *tp)
 {
-	struct fw_head *head;
-
-	head = kzalloc(sizeof(struct fw_head), GFP_KERNEL);
-	if (head == NULL)
-		return -ENOBUFS;
-
-	head->mask_set = false;
-	rcu_assign_pointer(tp->root, head);
+	/* We don't allocate fw_head here, because in the old method
+	 * we don't need it at all.
+	 */
 	return 0;
 }
 
@@ -252,7 +246,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	int err;
 
 	if (!opt)
-		return handle ? -EINVAL : 0;
+		return handle ? -EINVAL : 0; /* Succeed if it is old method. */
 
 	err = nla_parse_nested(tb, TCA_FW_MAX, opt, fw_policy);
 	if (err < 0)
@@ -302,11 +296,17 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	if (!handle)
 		return -EINVAL;
 
-	if (!head->mask_set) {
-		head->mask = 0xFFFFFFFF;
+	if (!head) {
+		u32 mask = 0xFFFFFFFF;
 		if (tb[TCA_FW_MASK])
-			head->mask = nla_get_u32(tb[TCA_FW_MASK]);
-		head->mask_set = true;
+			mask = nla_get_u32(tb[TCA_FW_MASK]);
+
+		head = kzalloc(sizeof(*head), GFP_KERNEL);
+		if (!head)
+			return -ENOBUFS;
+		head->mask = mask;
+
+		rcu_assign_pointer(tp->root, head);
 	}
 
 	f = kzalloc(sizeof(struct fw_filter), GFP_KERNEL);
-- 
2.1.0


From 5d9dfb608531d4b95b8d000652d4ced4bd3adf41 Mon Sep 17 00:00:00 2001
From: Wilson Kok <wkok@cumulusnetworks.com>
Date: Tue, 22 Sep 2015 21:40:22 -0700
Subject: [PATCH 23/28] fib_rules: fix fib rule dumps across multiple skbs

[ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/fib_rules.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..0ad144f 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -615,15 +615,17 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int idx = 0;
 	struct fib_rule *rule;
+	int err = 0;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
 		if (idx < cb->args[1])
 			goto skip;
 
-		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
-				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
-				     NLM_F_MULTI, ops) < 0)
+		err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
+				       cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				       NLM_F_MULTI, ops);
+		if (err)
 			break;
 skip:
 		idx++;
@@ -632,7 +634,7 @@ skip:
 	cb->args[1] = idx;
 	rules_ops_put(ops);
 
-	return skb->len;
+	return err;
 }
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
@@ -648,7 +650,9 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
-		return dump_rules(skb, cb, ops);
+		dump_rules(skb, cb, ops);
+
+		return skb->len;
 	}
 
 	rcu_read_lock();
-- 
2.1.0


From a682bd590e9fe3284054df5cf5ca5f2e82065042 Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 20 Jul 2015 17:49:55 -0700
Subject: [PATCH 24/28] net: dsa: bcm_sf2: Do not override speed settings

[ Upstream d2eac98f7d1b950b762a7eca05a9ce0ea1d878d2 in net-next tree,
  will be pushed to Linus very soon. ]

The SF2 driver currently overrides speed settings for its port
configured using a fixed PHY, this is both unnecessary and incorrect,
because we keep feedback to the hardware parameters that we read from
the PHY device, which in the case of a fixed PHY cannot possibly change
speed.

This is a required change to allow the fixed PHY code to allow
registering a PHY with a link configured as DOWN by default and avoid
some sort of circular dependency where we require the link_update
callback to run to program the hardware, and we then utilize the fixed
PHY parameters to program the hardware with the same settings.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index f24c4fe..db9ebbc 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -893,15 +893,11 @@ static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
 					 struct fixed_phy_status *status)
 {
 	struct bcm_sf2_priv *priv = ds_to_priv(ds);
-	u32 duplex, pause, speed;
+	u32 duplex, pause;
 	u32 reg;
 
 	duplex = core_readl(priv, CORE_DUPSTS);
 	pause = core_readl(priv, CORE_PAUSESTS);
-	speed = core_readl(priv, CORE_SPDSTS);
-
-	speed >>= (port * SPDSTS_SHIFT);
-	speed &= SPDSTS_MASK;
 
 	status->link = 0;
 
@@ -929,18 +925,6 @@ static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
 		reg &= ~LINK_STS;
 	core_writel(priv, reg, CORE_STS_OVERRIDE_GMIIP_PORT(port));
 
-	switch (speed) {
-	case SPDSTS_10:
-		status->speed = SPEED_10;
-		break;
-	case SPDSTS_100:
-		status->speed = SPEED_100;
-		break;
-	case SPDSTS_1000:
-		status->speed = SPEED_1000;
-		break;
-	}
-
 	if ((pause & (1 << port)) &&
 	    (pause & (1 << (port + PAUSESTS_TX_PAUSE_SHIFT)))) {
 		status->asym_pause = 1;
-- 
2.1.0


From 426e5e860e4d076992943ec7837c1234f7384a67 Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:56 -0700
Subject: [PATCH 25/28] net: phy: fixed_phy: handle link-down case

[ Upstream 868a4215be9a6d80548ccb74763b883dc99d32a2 in net-next tree,
  will be pushed to Linus very soon. ]

fixed_phy_register() currently hardcodes the fixed PHY link to 1, and
expects to find a "speed" parameter to provide correct information
towards the fixed PHY consumer.

In a subsequent change, where we allow "managed" (e.g: (RS)GMII in-band
status auto-negotiation) fixed PHYs, none of these parameters can be
provided since they will be auto-negotiated, hence, we just provide a
zero-initialized fixed_phy_status to fixed_phy_register() which makes it
fail when we call fixed_phy_update_regs() since status.speed = 0 which
makes us hit the "default" label and error out.

Without this change, we would also see potentially inconsistent
speed/duplex parameters for fixed PHYs when the link is DOWN.

CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
[florian: add more background to why this is correct and desirable]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/fixed_phy.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index 1960b46..479b93f 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -52,6 +52,10 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
 	u16 lpagb = 0;
 	u16 lpa = 0;
 
+	if (!fp->status.link)
+		goto done;
+	bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
+
 	if (fp->status.duplex) {
 		bmcr |= BMCR_FULLDPLX;
 
@@ -96,15 +100,13 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
 		}
 	}
 
-	if (fp->status.link)
-		bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
-
 	if (fp->status.pause)
 		lpa |= LPA_PAUSE_CAP;
 
 	if (fp->status.asym_pause)
 		lpa |= LPA_PAUSE_ASYM;
 
+done:
 	fp->regs[MII_PHYSID1] = 0;
 	fp->regs[MII_PHYSID2] = 0;
 
-- 
2.1.0


From a231920c27e977bd8f6ca5fc909b6de42f6ec5b3 Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:57 -0700
Subject: [PATCH 26/28] of_mdio: add new DT property 'managed' to specify the
 PHY management type

[ Upstream commit 4cba5c2103657d43d0886e4cff8004d95a3d0def in net-next tree,
  will be pushed to Linus very soon. ]

Currently the PHY management type is selected by the MAC driver arbitrary.
The decision is based on the presence of the "fixed-link" node and on a
will of the driver's authors.
This caused a regression recently, when mvneta driver suddenly started
to use the in-band status for auto-negotiation on fixed links.
It appears the auto-negotiation may not work when expected by the MAC driver.
Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206

This patch introduces the new string property 'managed' that allows
the user to set the management type explicitly.
The supported values are:
"auto" - default. Uses either MDIO or nothing, depending on the presence
of the fixed-link node
"in-band-status" - use in-band status

Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>

CC: Rob Herring <robh+dt@kernel.org>
CC: Pawel Moll <pawel.moll@arm.com>
CC: Mark Rutland <mark.rutland@arm.com>
CC: Ian Campbell <ijc+devicetree@hellion.org.uk>
CC: Kumar Gala <galak@codeaurora.org>
CC: Florian Fainelli <f.fainelli@gmail.com>
CC: Grant Likely <grant.likely@linaro.org>
CC: devicetree@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 Documentation/devicetree/bindings/net/ethernet.txt |  4 ++++
 drivers/of/of_mdio.c                               | 19 +++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 41b3f3f..5d88f37 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -25,7 +25,11 @@ The following properties are common to the Ethernet controllers:
   flow control thresholds.
 - tx-fifo-depth: the size of the controller's transmit fifo in bytes. This
   is used for components that can have configurable fifo sizes.
+- managed: string, specifies the PHY management type. Supported values are:
+  "auto", "in-band-status". "auto" is the default, it usess MDIO for
+  management if fixed-link is not specified.
 
 Child nodes of the Ethernet controller are typically the individual PHY devices
 connected via the MDIO bus (sometimes the MDIO bus controller is separate).
 They are described in the phy.txt file in this same directory.
+For non-MDIO PHY management see fixed-link.txt.
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 0c06448..bec8ec2 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -263,7 +263,8 @@ EXPORT_SYMBOL(of_phy_attach);
 bool of_phy_is_fixed_link(struct device_node *np)
 {
 	struct device_node *dn;
-	int len;
+	int len, err;
+	const char *managed;
 
 	/* New binding */
 	dn = of_get_child_by_name(np, "fixed-link");
@@ -272,6 +273,10 @@ bool of_phy_is_fixed_link(struct device_node *np)
 		return true;
 	}
 
+	err = of_property_read_string(np, "managed", &managed);
+	if (err == 0 && strcmp(managed, "auto") != 0)
+		return true;
+
 	/* Old binding */
 	if (of_get_property(np, "fixed-link", &len) &&
 	    len == (5 * sizeof(__be32)))
@@ -286,8 +291,18 @@ int of_phy_register_fixed_link(struct device_node *np)
 	struct fixed_phy_status status = {};
 	struct device_node *fixed_link_node;
 	const __be32 *fixed_link_prop;
-	int len;
+	int len, err;
 	struct phy_device *phy;
+	const char *managed;
+
+	err = of_property_read_string(np, "managed", &managed);
+	if (err == 0) {
+		if (strcmp(managed, "in-band-status") == 0) {
+			/* status is zeroed, namely its .link member */
+			phy = fixed_phy_register(PHY_POLL, &status, np);
+			return IS_ERR(phy) ? PTR_ERR(phy) : 0;
+		}
+	}
 
 	/* New binding */
 	fixed_link_node = of_get_child_by_name(np, "fixed-link");
-- 
2.1.0


From f2a50cea6256782e375a44ba02d39441307433ff Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:58 -0700
Subject: [PATCH 27/28] mvneta: use inband status only when explicitly enabled

[ Upstream commit f8af8e6eb95093d5ce5ebcc52bd1929b0433e172 in net-next tree,
  will be pushed to Linus very soon. ]

The commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state
signaling") implemented the link parameters auto-negotiation unconditionally.
Unfortunately it appears that some HW that implements SGMII protocol,
doesn't generate the inband status, so it is not possible to auto-negotiate
anything with such HW.

This patch enables the auto-negotiation only if explicitly requested with
the 'managed' DT property.

This patch fixes the following regression:
https://lkml.org/lkml/2015/7/8/865

Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>

CC: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/marvell/mvneta.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 74d0389..4d608f0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3029,8 +3029,8 @@ static int mvneta_probe(struct platform_device *pdev)
 	const char *dt_mac_addr;
 	char hw_mac_addr[ETH_ALEN];
 	const char *mac_from;
+	const char *managed;
 	int phy_mode;
-	int fixed_phy = 0;
 	int err;
 
 	/* Our multiqueue support is not complete, so for now, only
@@ -3064,7 +3064,6 @@ static int mvneta_probe(struct platform_device *pdev)
 			dev_err(&pdev->dev, "cannot register fixed PHY\n");
 			goto err_free_irq;
 		}
-		fixed_phy = 1;
 
 		/* In the case of a fixed PHY, the DT node associated
 		 * to the PHY is the Ethernet MAC DT node.
@@ -3088,8 +3087,10 @@ static int mvneta_probe(struct platform_device *pdev)
 	pp = netdev_priv(dev);
 	pp->phy_node = phy_node;
 	pp->phy_interface = phy_mode;
-	pp->use_inband_status = (phy_mode == PHY_INTERFACE_MODE_SGMII) &&
-				fixed_phy;
+
+	err = of_property_read_string(dn, "managed", &managed);
+	pp->use_inband_status = (err == 0 &&
+				 strcmp(managed, "in-band-status") == 0);
 
 	pp->clk = devm_clk_get(&pdev->dev, NULL);
 	if (IS_ERR(pp->clk)) {
-- 
2.1.0


From b5eae410ec845978d1cffab6eae27d7ab28c9fcf Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 7 Aug 2015 00:26:41 +0200
Subject: [PATCH 28/28] netlink: make sure -EBUSY won't escape from
 netlink_insert

[ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ]

Linus reports the following deadlock on rtnl_mutex; triggered only
once so far (extract):

[12236.694209] NetworkManager  D 0000000000013b80     0  1047      1 0x00000000
[12236.694218]  ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
[12236.694224]  ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
[12236.694230]  ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
[12236.694235] Call Trace:
[12236.694250]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
[12236.694257]  [<ffffffff815d133a>] ? schedule+0x2a/0x70
[12236.694263]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
[12236.694271]  [<ffffffff815d2c3f>] ? __mutex_lock_slowpath+0x7f/0xf0
[12236.694280]  [<ffffffff815d2cc6>] ? mutex_lock+0x16/0x30
[12236.694291]  [<ffffffff814f1f90>] ? rtnetlink_rcv+0x10/0x30
[12236.694299]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
[12236.694309]  [<ffffffff814f5ad3>] ? rtnl_getlink+0x113/0x190
[12236.694319]  [<ffffffff814f202a>] ? rtnetlink_rcv_msg+0x7a/0x210
[12236.694331]  [<ffffffff8124565c>] ? sock_has_perm+0x5c/0x70
[12236.694339]  [<ffffffff814f1fb0>] ? rtnetlink_rcv+0x30/0x30
[12236.694346]  [<ffffffff8150d62c>] ? netlink_rcv_skb+0x9c/0xc0
[12236.694354]  [<ffffffff814f1f9f>] ? rtnetlink_rcv+0x1f/0x30
[12236.694360]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
[12236.694367]  [<ffffffff8150d344>] ? netlink_sendmsg+0x484/0x5d0
[12236.694376]  [<ffffffff810a236f>] ? __wake_up+0x2f/0x50
[12236.694387]  [<ffffffff814cad23>] ? sock_sendmsg+0x33/0x40
[12236.694396]  [<ffffffff814cb05e>] ? ___sys_sendmsg+0x22e/0x240
[12236.694405]  [<ffffffff814cab75>] ? ___sys_recvmsg+0x135/0x1a0
[12236.694415]  [<ffffffff811a9d12>] ? eventfd_write+0x82/0x210
[12236.694423]  [<ffffffff811a0f9e>] ? fsnotify+0x32e/0x4c0
[12236.694429]  [<ffffffff8108cb70>] ? wake_up_q+0x60/0x60
[12236.694434]  [<ffffffff814cba09>] ? __sys_sendmsg+0x39/0x70
[12236.694440]  [<ffffffff815d4797>] ? entry_SYSCALL_64_fastpath+0x12/0x6a

It seems so far plausible that the recursive call into rtnetlink_rcv()
looks suspicious. One way, where this could trigger is that the senders
NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so
the rtnl_getlink() request's answer would be sent to the kernel instead
to the actual user process, thus grabbing rtnl_mutex() twice.

One theory would be that netlink_autobind() triggered via netlink_sendmsg()
internally overwrites the -EBUSY error to 0, but where it is wrongly
originating from __netlink_insert() instead. That would reset the
socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
also puts it, -EBUSY should not be propagated from netlink_insert().

It looks like it's very unlikely to reproduce. We need to trigger the
rhashtable_insert_rehash() handler under a situation where rehashing
currently occurs (one /rare/ way would be to hit ht->elasticity limits
while not filled enough to expand the hashtable, but that would rather
require a specifically crafted bind() sequence with knowledge about
destination slots, seems unlikely). It probably makes sense to guard
__netlink_insert() in any case and remap that error. It was suggested
that EOVERFLOW might be better than an already overloaded ENOMEM.

Reference: http://thread.gmane.org/gmane.linux.network/372676
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7d824ac..c87bcf9 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1095,6 +1095,11 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
 	err = __netlink_insert(table, sk);
 	if (err) {
+		/* In case the hashtable backend returns with -EBUSY
+		 * from here, it must not escape to the caller.
+		 */
+		if (unlikely(err == -EBUSY))
+			err = -EOVERFLOW;
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
 		sock_put(sk);
-- 
2.1.0


[-- Attachment #5: net_42.mbox --]
[-- Type: Application/Octet-Stream, Size: 97439 bytes --]

From 42f257d5d3dfb28aeb3bdaf1b885d428a9ab7284 Mon Sep 17 00:00:00 2001
From: Mark Salter <msalter@redhat.com>
Date: Tue, 1 Sep 2015 09:36:05 -0400
Subject: [PATCH 01/30] phylib: fix device deletion order in
 mdiobus_unregister()

[ Upstream commit b6c6aedcbcbacd7b0cb4b64ed5ac835bc1c60a03 ]

commit 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not
the bus' parent.") uncovered a problem in mdiobus_unregister() which
leads to this warning when I reboot an APM Mustang (arm64) platform:

  WARNING: CPU: 7 PID: 4239 at fs/sysfs/group.c:224 sysfs_remove_group+0xa0/0xa4()
  sysfs group fffffe0000e07a10 not found for kobject 'xgene-mii-eth0:03'
  ...
  CPU: 7 PID: 4239 Comm: reboot Tainted: G            E   4.2.0-0.18.el7.test15.aarch64 #1
  Hardware name: AppliedMicro Mustang/Mustang, BIOS 1.1.0 Aug 26 2015
  Call Trace:
  [<fffffe000009739c>] dump_backtrace+0x0/0x170
  [<fffffe000009752c>] show_stack+0x20/0x2c
  [<fffffe00007436f0>] dump_stack+0x78/0x9c
  [<fffffe00000c2cb4>] warn_slowpath_common+0xa0/0xd8
  [<fffffe00000c2d60>] warn_slowpath_fmt+0x74/0x88
  [<fffffe0000293d3c>] sysfs_remove_group+0x9c/0xa4
  [<fffffe00004a8bac>] dpm_sysfs_remove+0x5c/0x70
  [<fffffe000049b388>] device_del+0x44/0x208
  [<fffffe000049b578>] device_unregister+0x2c/0x7c
  [<fffffe000050dc68>] mdiobus_unregister+0x48/0x94
  [<fffffe000052afd0>] xgene_enet_mdio_remove+0x28/0x44
  [<fffffe000052d3f0>] xgene_enet_remove+0xd0/0xd8
  [<fffffe000052d424>] xgene_enet_shutdown+0x2c/0x3c
  [<fffffe00004a204c>] platform_drv_shutdown+0x24/0x40
  [<fffffe000049d4f4>] device_shutdown+0xf0/0x1b4
  [<fffffe00000e31ec>] kernel_restart_prepare+0x40/0x4c
  [<fffffe00000e32f8>] kernel_restart+0x1c/0x80
  [<fffffe00000e3670>] SyS_reboot+0x17c/0x250

The problem is that mdiobus_unregister() deletes the bus device before
unregistering the phy devices on the bus. This wasn't a problem before
because the phys were not children of the bus:

  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0:03
  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0

But now that they are:

  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0/xgene-mii-eth0:03

when mdiobus_unregister deletes the bus device, the phy subdirs are
removed from sysfs also. So when the phys are unregistered afterward,
we get the warning. This patch changes the order so that phys are
unregistered before the bus device is deleted.

Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.")
Signed-off-by: Mark Salter <msalter@redhat.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/mdio_bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 46a14cb..02a4615 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -303,12 +303,12 @@ void mdiobus_unregister(struct mii_bus *bus)
 	BUG_ON(bus->state != MDIOBUS_REGISTERED);
 	bus->state = MDIOBUS_UNREGISTERED;
 
-	device_del(&bus->dev);
 	for (i = 0; i < PHY_MAX_ADDR; i++) {
 		if (bus->phy_map[i])
 			device_unregister(&bus->phy_map[i]->dev);
 		bus->phy_map[i] = NULL;
 	}
+	device_del(&bus->dev);
 }
 EXPORT_SYMBOL(mdiobus_unregister);
 
-- 
2.1.0


From 12fa777b0629951d79d36c3f5899a35eb91f828b Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 2 Sep 2015 14:00:36 +0200
Subject: [PATCH 02/30] sock, diag: fix panic in sock_diag_put_filterinfo

[ Upstream commit b382c08656000c12a146723a153b85b13a855b49 ]

diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
upon request to user space (ss -0 -b). However, native eBPF programs
attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
unconditionally tries to access its filter length resp. wants to copy
the filter insns from there. Internal cBPF to eBPF transformations
attached to sockets don't have this issue, as orig_prog state is kept.

It's currently only used by packet sockets. If we would want to add
native eBPF support in the future, this needs to be done through
a different attribute than PACKET_DIAG_FILTER to not confuse possible
user space disassemblers that work on diag data.

Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/sock_diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index d79866c..817622f 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -90,6 +90,9 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk,
 		goto out;
 
 	fprog = filter->prog->orig_prog;
+	if (!fprog)
+		goto out;
+
 	flen = bpf_classic_proglen(fprog);
 
 	attr = nla_reserve(skb, attrtype, flen);
-- 
2.1.0


From 6f8f9135005900cd00365273df66da12130fa0bc Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 3 Sep 2015 00:29:07 +0200
Subject: [PATCH 03/30] ipv6: fix exthdrs offload registration in out_rt path

[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
2.1.0


From 8bd68ca93e602c06d66eee5f7f87f36bcf8e74a8 Mon Sep 17 00:00:00 2001
From: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Wed, 2 Sep 2015 17:24:14 +0800
Subject: [PATCH 04/30] net: fec: clear receive interrupts before processing a
 packet

[ Upstream commit ed63f1dcd5788d36f942fbcce350742385e3e18c ]

The patch just to re-submit the patch "db3421c114cfa6326" because the
patch "4d494cdc92b3b9a0" remove the change.

Clear any pending receive interrupt before we process a pending packet.
This helps to avoid any spurious interrupts being raised after we have
fully cleaned the receive ring, while still allowing an interrupt to be
raised if we receive another packet.

The position of this is critical: we must do this prior to reading the
next packet status to avoid potentially dropping an interrupt when a
packet is still pending.

Acked-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/freescale/fec_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index b349e6f..de63266d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1402,6 +1402,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
 		if ((status & BD_ENET_RX_LAST) == 0)
 			netdev_err(ndev, "rcv is not +last\n");
 
+		writel(FEC_ENET_RXF, fep->hwp + FEC_IEVENT);
 
 		/* Check for errors. */
 		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
-- 
2.1.0


From f8b294e6c916632036f06ec27f5f8f50ffd025ac Mon Sep 17 00:00:00 2001
From: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Date: Wed, 2 Sep 2015 17:49:29 +0900
Subject: [PATCH 05/30] net: eth: altera: fix napi poll_list corruption

[ Upstream commit 4548a697e4969d695047cebd6d9af5e2f6cc728e ]

tse_poll() calls __napi_complete() with irq enabled.  This leads napi
poll_list corruption and may stop all napi drivers working.
Use napi_complete() instead of __napi_complete().

Signed-off-by: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/altera/altera_tse_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index da48e66..8207877 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -511,8 +511,7 @@ static int tse_poll(struct napi_struct *napi, int budget)
 
 	if (rxcomplete < budget) {
 
-		napi_gro_flush(napi, false);
-		__napi_complete(napi);
+		napi_complete(napi);
 
 		netdev_dbg(priv->dev,
 			   "NAPI Complete, did %d packets with budget %d\n",
-- 
2.1.0


From 29e43d1c1ee9ed0292e2bf235fb6150984acc410 Mon Sep 17 00:00:00 2001
From: Richard Laing <richard.laing@alliedtelesis.co.nz>
Date: Thu, 3 Sep 2015 13:52:31 +1200
Subject: [PATCH 06/30] net/ipv6: Correct PIM6 mrt_lock handling

[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 74ceb73..5f36266 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -550,7 +550,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 
-- 
2.1.0


From 8ef957156771e81b37d25a8227a2534414f28b5b Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sat, 5 Sep 2015 13:07:27 -0700
Subject: [PATCH 07/30] net: dsa: bcm_sf2: Fix ageing conditions and operation

[ Upstream commit 39797a279d62972cd914ef580fdfacb13e508bf8 ]

The comparison check between cur_hw_state and hw_state is currently
invalid because cur_hw_state is right shifted by G_MISTP_SHIFT, while
hw_state is not, so we end-up comparing bits 2:0 with bits 7:5, which is
going to cause an additional aging to occur. Fix this by not shifting
cur_hw_state while reading it, but instead, mask the value with the
appropriately shitfted bitmask.

The other problem with the fast-ageing process is that we did not set
the EN_AGE_DYNAMIC bit to request the ageing to occur for dynamically
learned MAC addresses. Finally, write back 0 to the FAST_AGE_CTRL
register to avoid leaving spurious bits sets from one operation to the
other.

Fixes: 12f460f23423 ("net: dsa: bcm_sf2: add HW bridging support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 079897b..691caea 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -418,7 +418,7 @@ static int bcm_sf2_sw_fast_age_port(struct dsa_switch  *ds, int port)
 	core_writel(priv, port, CORE_FAST_AGE_PORT);
 
 	reg = core_readl(priv, CORE_FAST_AGE_CTRL);
-	reg |= EN_AGE_PORT | FAST_AGE_STR_DONE;
+	reg |= EN_AGE_PORT | EN_AGE_DYNAMIC | FAST_AGE_STR_DONE;
 	core_writel(priv, reg, CORE_FAST_AGE_CTRL);
 
 	do {
@@ -432,6 +432,8 @@ static int bcm_sf2_sw_fast_age_port(struct dsa_switch  *ds, int port)
 	if (!timeout)
 		return -ETIMEDOUT;
 
+	core_writel(priv, 0, CORE_FAST_AGE_CTRL);
+
 	return 0;
 }
 
@@ -507,7 +509,7 @@ static int bcm_sf2_sw_br_set_stp_state(struct dsa_switch *ds, int port,
 	u32 reg;
 
 	reg = core_readl(priv, CORE_G_PCTL_PORT(port));
-	cur_hw_state = reg >> G_MISTP_STATE_SHIFT;
+	cur_hw_state = reg & (G_MISTP_STATE_MASK << G_MISTP_STATE_SHIFT);
 
 	switch (state) {
 	case BR_STATE_DISABLED:
@@ -531,10 +533,12 @@ static int bcm_sf2_sw_br_set_stp_state(struct dsa_switch *ds, int port,
 	}
 
 	/* Fast-age ARL entries if we are moving a port from Learning or
-	 * Forwarding state to Disabled, Blocking or Listening state
+	 * Forwarding (cur_hw_state) state to Disabled, Blocking or Listening
+	 * state (hw_state)
 	 */
 	if (cur_hw_state != hw_state) {
-		if (cur_hw_state & 4 && !(hw_state & 4)) {
+		if (cur_hw_state >= G_MISTP_LEARN_STATE &&
+		    hw_state <= G_MISTP_LISTEN_STATE) {
 			ret = bcm_sf2_sw_fast_age_port(ds, port);
 			if (ret) {
 				pr_err("%s: fast-ageing failed\n", __func__);
-- 
2.1.0


From e2a239cf73304bd8c7b3c71a91de8d6137bca80d Mon Sep 17 00:00:00 2001
From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Tue, 8 Sep 2015 10:53:04 -0700
Subject: [PATCH 08/30] ipv6: fix multipath route replace error recovery

[ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ]

Problem:
The ecmp route replace support for ipv6 in the kernel, deletes the
existing ecmp route too early, ie when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, its too late
to recover the already deleted existing route leaving the fib
in an inconsistent state.

This patch reduces the possibility of this by doing the following:
a) Changes the existing multipath route add code to a two stage process:
  build rt6_infos + insert them
	ip6_route_add rt6_info creation code is moved into
	ip6_route_info_create.
b) This ensures that most errors are caught during building rt6_infos
  and we fail early
c) Separates multipath add and del code. Because add needs the special
  two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
  warning is printed (This should be unlikely)

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 dissappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/route.c | 201 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 175 insertions(+), 26 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d155864..00b64d4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1727,7 +1727,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 	return -EINVAL;
 }
 
-int ip6_route_add(struct fib6_config *cfg)
+int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret)
 {
 	int err;
 	struct net *net = cfg->fc_nlinfo.nl_net;
@@ -1735,7 +1735,6 @@ int ip6_route_add(struct fib6_config *cfg)
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	struct fib6_table *table;
-	struct mx6_config mxc = { .mx = NULL, };
 	int addr_type;
 
 	if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
@@ -1941,6 +1940,32 @@ install_route:
 
 	cfg->fc_nlinfo.nl_net = dev_net(dev);
 
+	*rt_ret = rt;
+
+	return 0;
+out:
+	if (dev)
+		dev_put(dev);
+	if (idev)
+		in6_dev_put(idev);
+	if (rt)
+		dst_free(&rt->dst);
+
+	*rt_ret = NULL;
+
+	return err;
+}
+
+int ip6_route_add(struct fib6_config *cfg)
+{
+	struct mx6_config mxc = { .mx = NULL, };
+	struct rt6_info *rt = NULL;
+	int err;
+
+	err = ip6_route_info_create(cfg, &rt);
+	if (err)
+		goto out;
+
 	err = ip6_convert_metrics(&mxc, cfg);
 	if (err)
 		goto out;
@@ -1948,14 +1973,12 @@ install_route:
 	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc);
 
 	kfree(mxc.mx);
+
 	return err;
 out:
-	if (dev)
-		dev_put(dev);
-	if (idev)
-		in6_dev_put(idev);
 	if (rt)
 		dst_free(&rt->dst);
+
 	return err;
 }
 
@@ -2727,19 +2750,78 @@ errout:
 	return err;
 }
 
-static int ip6_route_multipath(struct fib6_config *cfg, int add)
+struct rt6_nh {
+	struct rt6_info *rt6_info;
+	struct fib6_config r_cfg;
+	struct mx6_config mxc;
+	struct list_head next;
+};
+
+static void ip6_print_replace_route_err(struct list_head *rt6_nh_list)
+{
+	struct rt6_nh *nh;
+
+	list_for_each_entry(nh, rt6_nh_list, next) {
+		pr_warn("IPV6: multipath route replace failed (check consistency of installed routes): %pI6 nexthop %pI6 ifi %d\n",
+		        &nh->r_cfg.fc_dst, &nh->r_cfg.fc_gateway,
+		        nh->r_cfg.fc_ifindex);
+	}
+}
+
+static int ip6_route_info_append(struct list_head *rt6_nh_list,
+				 struct rt6_info *rt, struct fib6_config *r_cfg)
+{
+	struct rt6_nh *nh;
+	struct rt6_info *rtnh;
+	int err = -EEXIST;
+
+	list_for_each_entry(nh, rt6_nh_list, next) {
+		/* check if rt6_info already exists */
+		rtnh = nh->rt6_info;
+
+		if (rtnh->dst.dev == rt->dst.dev &&
+		    rtnh->rt6i_idev == rt->rt6i_idev &&
+		    ipv6_addr_equal(&rtnh->rt6i_gateway,
+				    &rt->rt6i_gateway))
+			return err;
+	}
+
+	nh = kzalloc(sizeof(*nh), GFP_KERNEL);
+	if (!nh)
+		return -ENOMEM;
+	nh->rt6_info = rt;
+	err = ip6_convert_metrics(&nh->mxc, r_cfg);
+	if (err) {
+		kfree(nh);
+		return err;
+	}
+	memcpy(&nh->r_cfg, r_cfg, sizeof(*r_cfg));
+	list_add_tail(&nh->next, rt6_nh_list);
+
+	return 0;
+}
+
+static int ip6_route_multipath_add(struct fib6_config *cfg)
 {
 	struct fib6_config r_cfg;
 	struct rtnexthop *rtnh;
+	struct rt6_info *rt;
+	struct rt6_nh *err_nh;
+	struct rt6_nh *nh, *nh_safe;
 	int remaining;
 	int attrlen;
-	int err = 0, last_err = 0;
+	int err = 1;
+	int nhn = 0;
+	int replace = (cfg->fc_nlinfo.nlh &&
+		       (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
+	LIST_HEAD(rt6_nh_list);
 
 	remaining = cfg->fc_mp_len;
-beginning:
 	rtnh = (struct rtnexthop *)cfg->fc_mp;
 
-	/* Parse a Multipath Entry */
+	/* Parse a Multipath Entry and build a list (rt6_nh_list) of
+	 * rt6_info structs per nexthop
+	 */
 	while (rtnh_ok(rtnh, remaining)) {
 		memcpy(&r_cfg, cfg, sizeof(*cfg));
 		if (rtnh->rtnh_ifindex)
@@ -2755,22 +2837,32 @@ beginning:
 				r_cfg.fc_flags |= RTF_GATEWAY;
 			}
 		}
-		err = add ? ip6_route_add(&r_cfg) : ip6_route_del(&r_cfg);
+
+		err = ip6_route_info_create(&r_cfg, &rt);
+		if (err)
+			goto cleanup;
+
+		err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
 		if (err) {
-			last_err = err;
-			/* If we are trying to remove a route, do not stop the
-			 * loop when ip6_route_del() fails (because next hop is
-			 * already gone), we should try to remove all next hops.
-			 */
-			if (add) {
-				/* If add fails, we should try to delete all
-				 * next hops that have been already added.
-				 */
-				add = 0;
-				remaining = cfg->fc_mp_len - remaining;
-				goto beginning;
-			}
+			dst_free(&rt->dst);
+			goto cleanup;
+		}
+
+		rtnh = rtnh_next(rtnh, &remaining);
+	}
+
+	err_nh = NULL;
+	list_for_each_entry(nh, &rt6_nh_list, next) {
+		err = __ip6_ins_rt(nh->rt6_info, &cfg->fc_nlinfo, &nh->mxc);
+		/* nh->rt6_info is used or freed at this point, reset to NULL*/
+		nh->rt6_info = NULL;
+		if (err) {
+			if (replace && nhn)
+				ip6_print_replace_route_err(&rt6_nh_list);
+			err_nh = nh;
+			goto add_errout;
 		}
+
 		/* Because each route is added like a single route we remove
 		 * these flags after the first nexthop: if there is a collision,
 		 * we have already failed to add the first nexthop:
@@ -2780,6 +2872,63 @@ beginning:
 		 */
 		cfg->fc_nlinfo.nlh->nlmsg_flags &= ~(NLM_F_EXCL |
 						     NLM_F_REPLACE);
+		nhn++;
+	}
+
+	goto cleanup;
+
+add_errout:
+	/* Delete routes that were already added */
+	list_for_each_entry(nh, &rt6_nh_list, next) {
+		if (err_nh == nh)
+			break;
+		ip6_route_del(&nh->r_cfg);
+	}
+
+cleanup:
+	list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, next) {
+		if (nh->rt6_info)
+			dst_free(&nh->rt6_info->dst);
+		if (nh->mxc.mx)
+			kfree(nh->mxc.mx);
+		list_del(&nh->next);
+		kfree(nh);
+	}
+
+	return err;
+}
+
+static int ip6_route_multipath_del(struct fib6_config *cfg)
+{
+	struct fib6_config r_cfg;
+	struct rtnexthop *rtnh;
+	int remaining;
+	int attrlen;
+	int err = 1, last_err = 0;
+
+	remaining = cfg->fc_mp_len;
+	rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+	/* Parse a Multipath Entry */
+	while (rtnh_ok(rtnh, remaining)) {
+		memcpy(&r_cfg, cfg, sizeof(*cfg));
+		if (rtnh->rtnh_ifindex)
+			r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
+
+		attrlen = rtnh_attrlen(rtnh);
+		if (attrlen > 0) {
+			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+
+			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+			if (nla) {
+				nla_memcpy(&r_cfg.fc_gateway, nla, 16);
+				r_cfg.fc_flags |= RTF_GATEWAY;
+			}
+		}
+		err = ip6_route_del(&r_cfg);
+		if (err)
+			last_err = err;
+
 		rtnh = rtnh_next(rtnh, &remaining);
 	}
 
@@ -2796,7 +2945,7 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return err;
 
 	if (cfg.fc_mp)
-		return ip6_route_multipath(&cfg, 0);
+		return ip6_route_multipath_del(&cfg);
 	else
 		return ip6_route_del(&cfg);
 }
@@ -2811,7 +2960,7 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return err;
 
 	if (cfg.fc_mp)
-		return ip6_route_multipath(&cfg, 1);
+		return ip6_route_multipath_add(&cfg);
 	else
 		return ip6_route_add(&cfg);
 }
-- 
2.1.0


From fc01b1b13b865133352c866fbed764f4f619fa98 Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 8 Sep 2015 20:06:41 -0700
Subject: [PATCH 09/30] net: dsa: bcm_sf2: Fix 64-bits register writes

[ Upstream commit 03679a14739a0d4c14b52ba65a69ff553bfba73b ]

The macro to write 64-bits quantities to the 32-bits register swapped
the value and offsets arguments, we want to preserve the ordering of the
arguments with respect to how writel() is implemented for instance:
value first, offset/base second.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 22e2ebf..789d7b7 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -112,8 +112,8 @@ static inline u64 name##_readq(struct bcm_sf2_priv *priv, u32 off)	\
 	spin_unlock(&priv->indir_lock);					\
 	return (u64)indir << 32 | dir;					\
 }									\
-static inline void name##_writeq(struct bcm_sf2_priv *priv, u32 off,	\
-							u64 val)	\
+static inline void name##_writeq(struct bcm_sf2_priv *priv, u64 val,	\
+							u32 off)	\
 {									\
 	spin_lock(&priv->indir_lock);					\
 	reg_writel(priv, upper_32_bits(val), REG_DIR_DATA_WRITE);	\
-- 
2.1.0


From b92b169ab9398c5be10a41ab908ff9ecee1c9457 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 10 Sep 2015 20:05:46 +0200
Subject: [PATCH 10/30] netlink, mmap: transform mmap skb into full skb on taps

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
 net/netlink/af_netlink.h |  9 +++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index a774985..bbd5d43 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -124,6 +124,24 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+					   gfp_t gfp_mask)
+{
+	unsigned int len = skb_end_offset(skb);
+	struct sk_buff *new;
+
+	new = alloc_skb(len, gfp_mask);
+	if (new == NULL)
+		return NULL;
+
+	NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+	NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+	NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+	memcpy(skb_put(new, len), skb->data, len);
+	return new;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -205,7 +223,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	int ret = -ENOMEM;
 
 	dev_hold(dev);
-	nskb = skb_clone(skb, GFP_ATOMIC);
+
+	if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	else
+		nskb = skb_clone(skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -278,11 +300,6 @@ static void netlink_rcv_wake(struct sock *sk)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
-	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
 static bool netlink_rx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -834,7 +851,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
 }
 
 #else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb)	false
 #define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..df9a060 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -59,6 +59,15 @@ static inline struct netlink_sock *nlk_sk(struct sock *sk)
 	return container_of(sk, struct netlink_sock, sk);
 }
 
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+	return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
 struct netlink_table {
 	struct rhashtable	hash;
 	struct hlist_head	mc_list;
-- 
2.1.0


From 3c78d3e283f1e8bb0b27dcd4ffc672f59fb887bb Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Thu, 10 Sep 2015 17:31:15 -0300
Subject: [PATCH 11/30] sctp: fix race on protocol/netns initialization

[ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

Consider sctp module is unloaded and is being requested because an user
is creating a sctp socket.

During initialization, sctp will add the new protocol type and then
initialize pernet subsys:

        status = sctp_v4_protosw_init();
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_net_ops);

The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets like if the module is
already fully loaded. If that happens, one of the possible effects is
that we will have readers for net->sctp.local_addr_list list earlier
than expected and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.

The race happens like this:

     CPU 0                           |  CPU 1
  socket()                           |
   __sock_create                     | socket()
    inet_create                      |  __sock_create
     list_for_each_entry_rcu(        |
        answer, &inetsw[sock->type], |
        list) {                      |   inet_create
      /* no hits */                  |
     if (unlikely(err)) {            |
      ...                            |
      request_module()               |
      /* socket creation is blocked  |
       * the module is fully loaded  |
       */                            |
       sctp_init                     |
        sctp_v4_protosw_init         |
         inet_register_protosw       |
          list_add_rcu(&p->list,     |
                       last_perm);   |
                                     |  list_for_each_entry_rcu(
                                     |     answer, &inetsw[sock->type],
        sctp_v6_protosw_init         |     list) {
                                     |     /* hit, so assumes protocol
                                     |      * is already loaded
                                     |      */
                                     |  /* socket creation continues
                                     |   * before netns is initialized
                                     |   */
        register_pernet_subsys       |

Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a work-queue is not good specially because we loose the
ability to handle its errors.

So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.

Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sctp/protocol.c | 64 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 59e8035..3ac604f 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1166,7 +1166,7 @@ static void sctp_v4_del_protocol(void)
 	unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
 }
 
-static int __net_init sctp_net_init(struct net *net)
+static int __net_init sctp_defaults_init(struct net *net)
 {
 	int status;
 
@@ -1259,12 +1259,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	sctp_dbg_objcnt_init(net);
 
-	/* Initialize the control inode/socket for handling OOTB packets.  */
-	if ((status = sctp_ctl_sock_init(net))) {
-		pr_err("Failed to initialize the SCTP control sock\n");
-		goto err_ctl_sock_init;
-	}
-
 	/* Initialize the local address list. */
 	INIT_LIST_HEAD(&net->sctp.local_addr_list);
 	spin_lock_init(&net->sctp.local_addr_lock);
@@ -1280,9 +1274,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	return 0;
 
-err_ctl_sock_init:
-	sctp_dbg_objcnt_exit(net);
-	sctp_proc_exit(net);
 err_init_proc:
 	cleanup_sctp_mibs(net);
 err_init_mibs:
@@ -1291,15 +1282,12 @@ err_sysctl_register:
 	return status;
 }
 
-static void __net_exit sctp_net_exit(struct net *net)
+static void __net_exit sctp_defaults_exit(struct net *net)
 {
 	/* Free the local address list */
 	sctp_free_addr_wq(net);
 	sctp_free_local_addr_list(net);
 
-	/* Free the control endpoint.  */
-	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-
 	sctp_dbg_objcnt_exit(net);
 
 	sctp_proc_exit(net);
@@ -1307,9 +1295,32 @@ static void __net_exit sctp_net_exit(struct net *net)
 	sctp_sysctl_net_unregister(net);
 }
 
-static struct pernet_operations sctp_net_ops = {
-	.init = sctp_net_init,
-	.exit = sctp_net_exit,
+static struct pernet_operations sctp_defaults_ops = {
+	.init = sctp_defaults_init,
+	.exit = sctp_defaults_exit,
+};
+
+static int __net_init sctp_ctrlsock_init(struct net *net)
+{
+	int status;
+
+	/* Initialize the control inode/socket for handling OOTB packets.  */
+	status = sctp_ctl_sock_init(net);
+	if (status)
+		pr_err("Failed to initialize the SCTP control sock\n");
+
+	return status;
+}
+
+static void __net_init sctp_ctrlsock_exit(struct net *net)
+{
+	/* Free the control endpoint.  */
+	inet_ctl_sock_destroy(net->sctp.ctl_sock);
+}
+
+static struct pernet_operations sctp_ctrlsock_ops = {
+	.init = sctp_ctrlsock_init,
+	.exit = sctp_ctrlsock_exit,
 };
 
 /* Initialize the universe into something sensible.  */
@@ -1442,8 +1453,11 @@ static __init int sctp_init(void)
 	sctp_v4_pf_init();
 	sctp_v6_pf_init();
 
-	status = sctp_v4_protosw_init();
+	status = register_pernet_subsys(&sctp_defaults_ops);
+	if (status)
+		goto err_register_defaults;
 
+	status = sctp_v4_protosw_init();
 	if (status)
 		goto err_protosw_init;
 
@@ -1451,9 +1465,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_protosw_init;
 
-	status = register_pernet_subsys(&sctp_net_ops);
+	status = register_pernet_subsys(&sctp_ctrlsock_ops);
 	if (status)
-		goto err_register_pernet_subsys;
+		goto err_register_ctrlsock;
 
 	status = sctp_v4_add_protocol();
 	if (status)
@@ -1469,12 +1483,14 @@ out:
 err_v6_add_protocol:
 	sctp_v4_del_protocol();
 err_add_protocol:
-	unregister_pernet_subsys(&sctp_net_ops);
-err_register_pernet_subsys:
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
+err_register_ctrlsock:
 	sctp_v6_protosw_exit();
 err_v6_protosw_init:
 	sctp_v4_protosw_exit();
 err_protosw_init:
+	unregister_pernet_subsys(&sctp_defaults_ops);
+err_register_defaults:
 	sctp_v4_pf_exit();
 	sctp_v6_pf_exit();
 	sctp_sysctl_unregister();
@@ -1507,12 +1523,14 @@ static __exit void sctp_exit(void)
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
 
-	unregister_pernet_subsys(&sctp_net_ops);
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
 
 	/* Free protosw registrations */
 	sctp_v6_protosw_exit();
 	sctp_v4_protosw_exit();
 
+	unregister_pernet_subsys(&sctp_defaults_ops);
+
 	/* Unregister with socket layer. */
 	sctp_v6_pf_exit();
 	sctp_v4_pf_exit();
-- 
2.1.0


From 1e7617b8ab80e76f4a7dd09f0198f9b9008e8691 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Linus=20L=C3=BCssing?= <linus.luessing@c0d3.blue>
Date: Fri, 11 Sep 2015 18:39:48 +0200
Subject: [PATCH 12/30] bridge: fix igmpv3 / mldv2 report parsing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[ Upstream commit c2d4fbd2163e607915cc05798ce7fb7f31117cc1 ]

With the newly introduced helper functions the skb pulling is hidden in
the checksumming function - and undone before returning to the caller.

The IGMPv3 and MLDv2 report parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header,
breaking the message parsing and creating packet loss.

Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.

Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
Reported-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/bridge/br_multicast.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 1285eaf..03cdb9e 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -991,7 +991,7 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
 
 	ih = igmpv3_report_hdr(skb);
 	num = ntohs(ih->ngrec);
-	len = sizeof(*ih);
+	len = skb_transport_offset(skb) + sizeof(*ih);
 
 	for (i = 0; i < num; i++) {
 		len += sizeof(*grec);
@@ -1052,7 +1052,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
 
 	icmp6h = icmp6_hdr(skb);
 	num = ntohs(icmp6h->icmp6_dataun.un_data16[1]);
-	len = sizeof(*icmp6h);
+	len = skb_transport_offset(skb) + sizeof(*icmp6h);
 
 	for (i = 0; i < num; i++) {
 		__be16 *nsrcs, _nsrcs;
-- 
2.1.0


From e34419618ff3b9f9b191f847fa92a09939c36d77 Mon Sep 17 00:00:00 2001
From: Simon Guinot <simon.guinot@sequanux.org>
Date: Tue, 15 Sep 2015 22:41:21 +0200
Subject: [PATCH 13/30] net: mvneta: fix DMA buffer unmapping in mvneta_rx()

[ Upstream commit daf158d0d544cec80b7b30deff8cfc59a6e17610 ]

This patch fixes a regression introduced by the commit a84e32894191
("net: mvneta: fix refilling for Rx DMA buffers"). Due to this commit
the newly allocated Rx buffers are DMA-unmapped in place of those passed
to the networking stack. Obviously, this causes data corruptions.

This patch fixes the issue by ensuring that the right Rx buffers are
DMA-unmapped.

Reported-by: Oren Laskin <oren@igneous.io>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Fixes: a84e32894191 ("net: mvneta: fix refilling for Rx DMA buffers")
Cc: <stable@vger.kernel.org> # v3.8+
Tested-by: Oren Laskin <oren@igneous.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/marvell/mvneta.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 62e48bc..03e052a 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1479,6 +1479,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
 		struct sk_buff *skb;
 		unsigned char *data;
+		dma_addr_t phys_addr;
 		u32 rx_status;
 		int rx_bytes, err;
 
@@ -1486,6 +1487,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (unsigned char *)rx_desc->buf_cookie;
+		phys_addr = rx_desc->buf_phys_addr;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
@@ -1534,7 +1536,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		if (!skb)
 			goto err_drop_frame;
 
-		dma_unmap_single(dev->dev.parent, rx_desc->buf_phys_addr,
+		dma_unmap_single(dev->dev.parent, phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 
 		rcvd_pkts++;
-- 
2.1.0


From ff98428008fee09e3552c7f8fc6b0e08a955d048 Mon Sep 17 00:00:00 2001
From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Tue, 15 Sep 2015 14:44:29 -0700
Subject: [PATCH 14/30] rtnetlink: catch -EOPNOTSUPP errors from
 ndo_bridge_getlink

[ Upstream commit d64f69b0373a7d0bcec8b5da7712977518a8f42b ]

problem reported:
	kernel 4.1.3
	------------
	# bridge vlan
	port	vlan ids
	eth0	 1 PVID Egress Untagged
	 	90
	 	91
	 	92
	 	93
	 	94
	 	95
	 	96
	 	97
	 	98
	 	99
	 	100

	vmbr0	 1 PVID Egress Untagged
	 	94

	kernel 4.2
	-----------
	# bridge vlan
	port	vlan ids

ndo_bridge_getlink can return -EOPNOTSUPP when an interfaces
ndo_bridge_getlink op is set to switchdev_port_bridge_getlink
and CONFIG_SWITCHDEV is not defined. This today can happen to
bond, rocker and team devices. This patch adds -EOPNOTSUPP
checks after calls to ndo_bridge_getlink.

Fixes: 85fdb956726ff2a ("switchdev: cut over to new switchdev_port_bridge_getlink")
Reported-by: Alexandre DERUMIER <aderumier@odiso.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/rtnetlink.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dc004b1..0861018 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3021,6 +3021,7 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, struct netlink_callback *cb)
 	u32 portid = NETLINK_CB(cb->skb).portid;
 	u32 seq = cb->nlh->nlmsg_seq;
 	u32 filter_mask = 0;
+	int err;
 
 	if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
 		struct nlattr *extfilt;
@@ -3041,20 +3042,25 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, struct netlink_callback *cb)
 		struct net_device *br_dev = netdev_master_upper_dev_get(dev);
 
 		if (br_dev && br_dev->netdev_ops->ndo_bridge_getlink) {
-			if (idx >= cb->args[0] &&
-			    br_dev->netdev_ops->ndo_bridge_getlink(
-				    skb, portid, seq, dev, filter_mask,
-				    NLM_F_MULTI) < 0)
-				break;
+			if (idx >= cb->args[0]) {
+				err = br_dev->netdev_ops->ndo_bridge_getlink(
+						skb, portid, seq, dev,
+						filter_mask, NLM_F_MULTI);
+				if (err < 0 && err != -EOPNOTSUPP)
+					break;
+			}
 			idx++;
 		}
 
 		if (ops->ndo_bridge_getlink) {
-			if (idx >= cb->args[0] &&
-			    ops->ndo_bridge_getlink(skb, portid, seq, dev,
-						    filter_mask,
-						    NLM_F_MULTI) < 0)
-				break;
+			if (idx >= cb->args[0]) {
+				err = ops->ndo_bridge_getlink(skb, portid,
+							      seq, dev,
+							      filter_mask,
+							      NLM_F_MULTI);
+				if (err < 0 && err != -EOPNOTSUPP)
+					break;
+			}
 			idx++;
 		}
 	}
-- 
2.1.0


From 5d11d899f25b20905c2ea21c46ef8c3ddf54e57a Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 15 Sep 2015 18:29:47 -0700
Subject: [PATCH 15/30] net/mlx4_en: really allow to change RSS key

[ Upsteam commit 4671fc6d47e0a0108fe24a4d830347d6a6ef4aa7 ]

When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eyal Perry <eyalpe@mellanox.com>
CC: Amir Vadai <amirv@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9c145dd..4f95fa7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1250,8 +1250,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		rss_context->hash_fn = MLX4_RSS_HASH_TOP;
 		memcpy(rss_context->rss_key, priv->rss_key,
 		       MLX4_EN_RSS_KEY_SIZE);
-		netdev_rss_key_fill(rss_context->rss_key,
-				    MLX4_EN_RSS_KEY_SIZE);
 	} else {
 		en_err(priv, "Unknown RSS hash function requested\n");
 		err = -EINVAL;
-- 
2.1.0


From c8fb6c89a432b7055eed2fce946bc660a2c6f498 Mon Sep 17 00:00:00 2001
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 18 Sep 2015 13:41:09 +0300
Subject: [PATCH 16/30] macvtap: fix TUNSETSNDBUF values > 64k

[ Upstream commit 3ea79249e81e5ed051f2e6480cbde896d99046e8 ]

Upon TUNSETSNDBUF,  macvtap reads the requested sndbuf size into
a local variable u.
commit 39ec7de7092b ("macvtap: fix uninitialized access on
TUNSETIFF") changed its type to u16 (which is the right thing to
do for all other macvtap ioctls), breaking all values > 64k.

The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
the right thing to do is to read it into an int.

Cc: David S. Miller <davem@davemloft.net>
Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
Reported-by: Mark A. Peloquin
Bisected-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by:  Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/macvtap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index edd7734..248478c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1111,10 +1111,10 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return 0;
 
 	case TUNSETSNDBUF:
-		if (get_user(u, up))
+		if (get_user(s, sp))
 			return -EFAULT;
 
-		q->sk.sk_sndbuf = u;
+		q->sk.sk_sndbuf = s;
 		return 0;
 
 	case TUNGETVNETHDRSZ:
-- 
2.1.0


From 22bed6544249fa903260a5fea28202db990bfa39 Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 18 Sep 2015 19:16:50 +0800
Subject: [PATCH 17/30] netlink: Fix autobind race condition that leads to zero
 port ID

[ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 12 +++++++-----
 net/netlink/af_netlink.h |  1 +
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index bbd5d43..42de385 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1019,7 +1019,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->portid != x->portid ||
+	return nlk->rhash_portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1045,7 +1045,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1107,7 +1107,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->portid = portid;
+	nlk_sk(sk)->rhash_portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1119,10 +1119,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
 			err = -EOVERFLOW;
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->portid = portid;
+
 err:
 	release_sock(sk);
 	return err;
@@ -3233,7 +3235,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index df9a060..80b2b75 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,6 +25,7 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
+	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
-- 
2.1.0


From a5c5dea26e50fcfc2b4553c78df875555f5a20ff Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 22 Sep 2015 11:38:56 +0800
Subject: [PATCH 18/30] netlink: Replace rhash_portid with bound

[ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 39 ++++++++++++++++++++++++++++-----------
 net/netlink/af_netlink.h |  2 +-
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 42de385..0857f72 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1019,7 +1019,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1045,7 +1045,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1098,8 +1098,8 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
 	lock_sock(sk);
 
-	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	err = nlk_sk(sk)->portid == portid ? 0 : -EBUSY;
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1107,7 +1107,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1123,7 +1123,9 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* We need to ensure that the socket is hashed and visible. */
+	smp_wmb();
+	nlk_sk(sk)->bound = portid;
 
 err:
 	release_sock(sk);
@@ -1509,6 +1511,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
 	int err;
 	long unsigned int groups = nladdr->nl_groups;
+	bool bound;
 
 	if (addr_len < sizeof(struct sockaddr_nl))
 		return -EINVAL;
@@ -1525,9 +1528,14 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	bound = nlk->bound;
+	if (bound) {
+		/* Ensure nlk->portid is up-to-date. */
+		smp_rmb();
+
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1543,7 +1551,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1591,7 +1602,10 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2409,10 +2423,13 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
+	} else {
+		/* Ensure nlk is hashed and visible. */
+		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
@@ -3235,7 +3252,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 80b2b75..14437d9 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,7 +25,6 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
-- 
2.1.0


From 4cdf5359b8226ba9c1ce1a4ded955d2adf035fdb Mon Sep 17 00:00:00 2001
From: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Mon, 21 Sep 2015 21:42:59 +0100
Subject: [PATCH 19/30] net: dsa: actually force the speed on the CPU port

[ Upstream commit 53adc9e83028d9e35b6408231ebaf62a94a16e4d ]

Commit 54d792f257c6 ("net: dsa: Centralise global and port setup
code into mv88e6xxx.") merged in the 4.2 merge window broke the link
speed forcing for the CPU port of Marvell DSA switches.  The original
code was:

        /* MAC Forcing register: don't force link, speed, duplex
         * or flow control state to any particular values on physical
         * ports, but force the CPU port and all DSA ports to 1000 Mb/s
         * full duplex.
         */
        if (dsa_is_cpu_port(ds, p) || ds->dsa_port_mask & (1 << p))
                REG_WRITE(addr, 0x01, 0x003e);
        else
                REG_WRITE(addr, 0x01, 0x0003);

but the new code does a read-modify-write:

                reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_PCS_CTRL);
                if (dsa_is_cpu_port(ds, port) ||
                    ds->dsa_port_mask & (1 << port)) {
                        reg |= PORT_PCS_CTRL_FORCE_LINK |
                                PORT_PCS_CTRL_LINK_UP |
                                PORT_PCS_CTRL_DUPLEX_FULL |
                                PORT_PCS_CTRL_FORCE_DUPLEX;
                        if (mv88e6xxx_6065_family(ds))
                                reg |= PORT_PCS_CTRL_100;
                        else
                                reg |= PORT_PCS_CTRL_1000;

The link speed in the PCS control register is a two bit field.  Forcing
the link speed in this way doesn't ensure that the bit field is set to
the correct value - on the hardware I have here, the speed bitfield
remains set to 0x03, resulting in the speed not being forced to gigabit.

We must clear both bits before forcing the link speed.

Fixes: 54d792f257c6 ("net: dsa: Centralise global and port setup code into mv88e6xxx.")
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/mv88e6xxx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 5613424..26ec2fb 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1387,6 +1387,7 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 		reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_PCS_CTRL);
 		if (dsa_is_cpu_port(ds, port) ||
 		    ds->dsa_port_mask & (1 << port)) {
+			reg &= ~PORT_PCS_CTRL_UNFORCED;
 			reg |= PORT_PCS_CTRL_FORCE_LINK |
 				PORT_PCS_CTRL_LINK_UP |
 				PORT_PCS_CTRL_DUPLEX_FULL |
-- 
2.1.0


From a5cbaca9e02496bc3850e4d057c667d98d87a7c7 Mon Sep 17 00:00:00 2001
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 21 Sep 2015 20:21:20 -0700
Subject: [PATCH 20/30] openvswitch: Zero flows on allocation.

[ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/openvswitch/datapath.c   |  4 ++--
 net/openvswitch/flow_table.c | 23 ++++++++++++-----------
 net/openvswitch/flow_table.h |  2 +-
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index ff8c4a4..ff332d1 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -920,7 +920,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &key, &mask);
+	ovs_flow_mask_key(&new_flow->key, &key, true, &mask);
 
 	/* Extract flow identifier. */
 	error = ovs_nla_get_identifier(&new_flow->id, a[OVS_FLOW_ATTR_UFID],
@@ -1047,7 +1047,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 	struct sw_flow_key masked_key;
 	int error;
 
-	ovs_flow_mask_key(&masked_key, key, mask);
+	ovs_flow_mask_key(&masked_key, key, true, mask);
 	error = ovs_nla_copy_actions(a, &masked_key, &acts, log);
 	if (error) {
 		OVS_NLERR(log,
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 6552394..b5c3bba 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -56,20 +56,21 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 }
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask)
+		       bool full, const struct sw_flow_mask *mask)
 {
-	const long *m = (const long *)((const u8 *)&mask->key +
-				mask->range.start);
-	const long *s = (const long *)((const u8 *)src +
-				mask->range.start);
-	long *d = (long *)((u8 *)dst + mask->range.start);
+	int start = full ? 0 : mask->range.start;
+	int len = full ? sizeof *dst : range_n_bytes(&mask->range);
+	const long *m = (const long *)((const u8 *)&mask->key + start);
+	const long *s = (const long *)((const u8 *)src + start);
+	long *d = (long *)((u8 *)dst + start);
 	int i;
 
-	/* The memory outside of the 'mask->range' are not set since
-	 * further operations on 'dst' only uses contents within
-	 * 'mask->range'.
+	/* If 'full' is true then all of 'dst' is fully initialized. Otherwise,
+	 * if 'full' is false the memory outside of the 'mask->range' is left
+	 * uninitialized. This can be used as an optimization when further
+	 * operations on 'dst' only use contents within 'mask->range'.
 	 */
-	for (i = 0; i < range_n_bytes(&mask->range); i += sizeof(long))
+	for (i = 0; i < len; i += sizeof(long))
 		*d++ = *s++ & *m++;
 }
 
@@ -473,7 +474,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	u32 hash;
 	struct sw_flow_key masked_key;
 
-	ovs_flow_mask_key(&masked_key, unmasked, mask);
+	ovs_flow_mask_key(&masked_key, unmasked, false, mask);
 	hash = flow_hash(&masked_key, &mask->range);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver]) {
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index 616eda1..2dd9900 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -86,5 +86,5 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *,
 bool ovs_flow_cmp(const struct sw_flow *, const struct sw_flow_match *);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask);
+		       bool full, const struct sw_flow_mask *mask);
 #endif /* flow_table.h */
-- 
2.1.0


From fe62699d2210be635242035461ab5636bc87d5d4 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 23 Sep 2015 14:00:21 -0700
Subject: [PATCH 21/30] tcp: add proper TS val into RST packets

[ Upstream commit 675ee231d960af2af3606b4480324e26797eb010 ]

RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b1c218d..b7dedd9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2898,6 +2898,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
 	skb_reserve(skb, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
 			     TCPHDR_ACK | TCPHDR_RST);
+	skb_mstamp_get(&skb->skb_mstamp);
 	/* Send it off. */
 	if (tcp_transmit_skb(sk, skb, 0, priority))
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
-- 
2.1.0


From 5829d2eafe497437bda2ef310a1eaa4be5783306 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw2@infradead.org>
Date: Wed, 23 Sep 2015 19:45:08 +0100
Subject: [PATCH 22/30] Fix AF_PACKET ABI breakage in 4.2

[ Upstream commit d3869efe7a8a2298516d9af4f91487cf486ca945 ]

Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/packet/af_packet.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ed458b3..7851b12 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -229,6 +229,8 @@ struct packet_skb_cb {
 	} sa;
 };
 
+#define vio_le() virtio_legacy_is_little_endian()
+
 #define PACKET_SKB_CB(__skb)	((struct packet_skb_cb *)((__skb)->cb))
 
 #define GET_PBDQC_FROM_RB(x)	((struct tpacket_kbdq_core *)(&(x)->prb_bdqc))
@@ -2561,15 +2563,15 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 			goto out_unlock;
 
 		if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-		    (__virtio16_to_cpu(false, vnet_hdr.csum_start) +
-		     __virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2 >
-		      __virtio16_to_cpu(false, vnet_hdr.hdr_len)))
-			vnet_hdr.hdr_len = __cpu_to_virtio16(false,
-				 __virtio16_to_cpu(false, vnet_hdr.csum_start) +
-				__virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2);
+		    (__virtio16_to_cpu(vio_le(), vnet_hdr.csum_start) +
+		     __virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset) + 2 >
+		      __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len)))
+			vnet_hdr.hdr_len = __cpu_to_virtio16(vio_le(),
+				 __virtio16_to_cpu(vio_le(), vnet_hdr.csum_start) +
+				__virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset) + 2);
 
 		err = -EINVAL;
-		if (__virtio16_to_cpu(false, vnet_hdr.hdr_len) > len)
+		if (__virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len) > len)
 			goto out_unlock;
 
 		if (vnet_hdr.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
@@ -2612,7 +2614,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	hlen = LL_RESERVED_SPACE(dev);
 	tlen = dev->needed_tailroom;
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len,
-			       __virtio16_to_cpu(false, vnet_hdr.hdr_len),
+			       __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len),
 			       msg->msg_flags & MSG_DONTWAIT, &err);
 	if (skb == NULL)
 		goto out_unlock;
@@ -2659,8 +2661,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 
 	if (po->has_vnet_hdr) {
 		if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-			u16 s = __virtio16_to_cpu(false, vnet_hdr.csum_start);
-			u16 o = __virtio16_to_cpu(false, vnet_hdr.csum_offset);
+			u16 s = __virtio16_to_cpu(vio_le(), vnet_hdr.csum_start);
+			u16 o = __virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset);
 			if (!skb_partial_csum_set(skb, s, o)) {
 				err = -EINVAL;
 				goto out_free;
@@ -2668,7 +2670,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		}
 
 		skb_shinfo(skb)->gso_size =
-			__virtio16_to_cpu(false, vnet_hdr.gso_size);
+			__virtio16_to_cpu(vio_le(), vnet_hdr.gso_size);
 		skb_shinfo(skb)->gso_type = gso_type;
 
 		/* Header must be checked, and gso_segs computed. */
@@ -3042,9 +3044,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 			/* This is a hint as to how much should be linear. */
 			vnet_hdr.hdr_len =
-				__cpu_to_virtio16(false, skb_headlen(skb));
+				__cpu_to_virtio16(vio_le(), skb_headlen(skb));
 			vnet_hdr.gso_size =
-				__cpu_to_virtio16(false, sinfo->gso_size);
+				__cpu_to_virtio16(vio_le(), sinfo->gso_size);
 			if (sinfo->gso_type & SKB_GSO_TCPV4)
 				vnet_hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 			else if (sinfo->gso_type & SKB_GSO_TCPV6)
@@ -3062,9 +3064,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
 			vnet_hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			vnet_hdr.csum_start = __cpu_to_virtio16(false,
+			vnet_hdr.csum_start = __cpu_to_virtio16(vio_le(),
 					  skb_checksum_start_offset(skb));
-			vnet_hdr.csum_offset = __cpu_to_virtio16(false,
+			vnet_hdr.csum_offset = __cpu_to_virtio16(vio_le(),
 							 skb->csum_offset);
 		} else if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
 			vnet_hdr.flags = VIRTIO_NET_HDR_F_DATA_VALID;
-- 
2.1.0


From 13d563e7a1dc1c5e656285161b3862edab52112a Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong@gmail.com>
Date: Tue, 22 Sep 2015 17:01:11 -0700
Subject: [PATCH 23/30] net: revert "net_sched: move tp->root allocation into
 fw_init()"

[ Upstream commit d8aecb10115497f6cdf841df8c88ebb3ba25fa28 ]

fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sched/cls_fw.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 715e01e..f23a3b6 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -33,7 +33,6 @@
 
 struct fw_head {
 	u32			mask;
-	bool			mask_set;
 	struct fw_filter __rcu	*ht[HTSIZE];
 	struct rcu_head		rcu;
 };
@@ -84,7 +83,7 @@ static int fw_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			}
 		}
 	} else {
-		/* old method */
+		/* Old method: classify the packet using its skb mark. */
 		if (id && (TC_H_MAJ(id) == 0 ||
 			   !(TC_H_MAJ(id ^ tp->q->handle)))) {
 			res->classid = id;
@@ -114,14 +113,9 @@ static unsigned long fw_get(struct tcf_proto *tp, u32 handle)
 
 static int fw_init(struct tcf_proto *tp)
 {
-	struct fw_head *head;
-
-	head = kzalloc(sizeof(struct fw_head), GFP_KERNEL);
-	if (head == NULL)
-		return -ENOBUFS;
-
-	head->mask_set = false;
-	rcu_assign_pointer(tp->root, head);
+	/* We don't allocate fw_head here, because in the old method
+	 * we don't need it at all.
+	 */
 	return 0;
 }
 
@@ -252,7 +246,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	int err;
 
 	if (!opt)
-		return handle ? -EINVAL : 0;
+		return handle ? -EINVAL : 0; /* Succeed if it is old method. */
 
 	err = nla_parse_nested(tb, TCA_FW_MAX, opt, fw_policy);
 	if (err < 0)
@@ -302,11 +296,17 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	if (!handle)
 		return -EINVAL;
 
-	if (!head->mask_set) {
-		head->mask = 0xFFFFFFFF;
+	if (!head) {
+		u32 mask = 0xFFFFFFFF;
 		if (tb[TCA_FW_MASK])
-			head->mask = nla_get_u32(tb[TCA_FW_MASK]);
-		head->mask_set = true;
+			mask = nla_get_u32(tb[TCA_FW_MASK]);
+
+		head = kzalloc(sizeof(*head), GFP_KERNEL);
+		if (!head)
+			return -ENOBUFS;
+		head->mask = mask;
+
+		rcu_assign_pointer(tp->root, head);
 	}
 
 	f = kzalloc(sizeof(struct fw_filter), GFP_KERNEL);
-- 
2.1.0


From 7a084c0513beddf19a13b4b29a53c2bd8e1511b7 Mon Sep 17 00:00:00 2001
From: Wilson Kok <wkok@cumulusnetworks.com>
Date: Tue, 22 Sep 2015 21:40:22 -0700
Subject: [PATCH 24/30] fib_rules: fix fib rule dumps across multiple skbs

[ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/fib_rules.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..0ad144f 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -615,15 +615,17 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int idx = 0;
 	struct fib_rule *rule;
+	int err = 0;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
 		if (idx < cb->args[1])
 			goto skip;
 
-		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
-				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
-				     NLM_F_MULTI, ops) < 0)
+		err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
+				       cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				       NLM_F_MULTI, ops);
+		if (err)
 			break;
 skip:
 		idx++;
@@ -632,7 +634,7 @@ skip:
 	cb->args[1] = idx;
 	rules_ops_put(ops);
 
-	return skb->len;
+	return err;
 }
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
@@ -648,7 +650,9 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
-		return dump_rules(skb, cb, ops);
+		dump_rules(skb, cb, ops);
+
+		return skb->len;
 	}
 
 	rcu_read_lock();
-- 
2.1.0


From e3dc378e623e1a12f3ff19e19310b7b8571b910a Mon Sep 17 00:00:00 2001
From: Guillaume Nault <g.nault@alphalink.fr>
Date: Thu, 24 Sep 2015 12:54:01 +0200
Subject: [PATCH 25/30] ppp: fix lockdep splat in ppp_dev_uninit()

[ Upstream commit 58a89ecaca53736aa465170530acea4f8be34ab4 ]

ppp_dev_uninit() locks all_ppp_mutex while under rtnl mutex protection.
ppp_create_interface() must then lock these mutexes in that same order
to avoid possible deadlock.

[  120.880011] ======================================================
[  120.880011] [ INFO: possible circular locking dependency detected ]
[  120.880011] 4.2.0 #1 Not tainted
[  120.880011] -------------------------------------------------------
[  120.880011] ppp-apitest/15827 is trying to acquire lock:
[  120.880011]  (&pn->all_ppp_mutex){+.+.+.}, at: [<ffffffffa0145f56>] ppp_dev_uninit+0x64/0xb0 [ppp_generic]
[  120.880011]
[  120.880011] but task is already holding lock:
[  120.880011]  (rtnl_mutex){+.+.+.}, at: [<ffffffff812e4255>] rtnl_lock+0x12/0x14
[  120.880011]
[  120.880011] which lock already depends on the new lock.
[  120.880011]
[  120.880011]
[  120.880011] the existing dependency chain (in reverse order) is:
[  120.880011]
[  120.880011] -> #1 (rtnl_mutex){+.+.+.}:
[  120.880011]        [<ffffffff81073a6f>] lock_acquire+0xcf/0x10e
[  120.880011]        [<ffffffff813ab18a>] mutex_lock_nested+0x56/0x341
[  120.880011]        [<ffffffff812e4255>] rtnl_lock+0x12/0x14
[  120.880011]        [<ffffffff812d9d94>] register_netdev+0x11/0x27
[  120.880011]        [<ffffffffa0147b17>] ppp_ioctl+0x289/0xc98 [ppp_generic]
[  120.880011]        [<ffffffff8113b367>] do_vfs_ioctl+0x4ea/0x532
[  120.880011]        [<ffffffff8113b3fd>] SyS_ioctl+0x4e/0x7d
[  120.880011]        [<ffffffff813ad7d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[  120.880011]
[  120.880011] -> #0 (&pn->all_ppp_mutex){+.+.+.}:
[  120.880011]        [<ffffffff8107334e>] __lock_acquire+0xb07/0xe76
[  120.880011]        [<ffffffff81073a6f>] lock_acquire+0xcf/0x10e
[  120.880011]        [<ffffffff813ab18a>] mutex_lock_nested+0x56/0x341
[  120.880011]        [<ffffffffa0145f56>] ppp_dev_uninit+0x64/0xb0 [ppp_generic]
[  120.880011]        [<ffffffff812d5263>] rollback_registered_many+0x19e/0x252
[  120.880011]        [<ffffffff812d5381>] rollback_registered+0x29/0x38
[  120.880011]        [<ffffffff812d53fa>] unregister_netdevice_queue+0x6a/0x77
[  120.880011]        [<ffffffffa0146a94>] ppp_release+0x42/0x79 [ppp_generic]
[  120.880011]        [<ffffffff8112d9f6>] __fput+0xec/0x192
[  120.880011]        [<ffffffff8112dacc>] ____fput+0x9/0xb
[  120.880011]        [<ffffffff8105447a>] task_work_run+0x66/0x80
[  120.880011]        [<ffffffff81001801>] prepare_exit_to_usermode+0x8c/0xa7
[  120.880011]        [<ffffffff81001900>] syscall_return_slowpath+0xe4/0x104
[  120.880011]        [<ffffffff813ad931>] int_ret_from_sys_call+0x25/0x9f
[  120.880011]
[  120.880011] other info that might help us debug this:
[  120.880011]
[  120.880011]  Possible unsafe locking scenario:
[  120.880011]
[  120.880011]        CPU0                    CPU1
[  120.880011]        ----                    ----
[  120.880011]   lock(rtnl_mutex);
[  120.880011]                                lock(&pn->all_ppp_mutex);
[  120.880011]                                lock(rtnl_mutex);
[  120.880011]   lock(&pn->all_ppp_mutex);
[  120.880011]
[  120.880011]  *** DEADLOCK ***

Fixes: 8cb775bc0a34 ("ppp: fix device unregistration upon netns deletion")
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ppp/ppp_generic.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index fa8f504..487be20 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -2742,6 +2742,7 @@ static struct ppp *ppp_create_interface(struct net *net, int unit,
 	 */
 	dev_net_set(dev, net);
 
+	rtnl_lock();
 	mutex_lock(&pn->all_ppp_mutex);
 
 	if (unit < 0) {
@@ -2772,7 +2773,7 @@ static struct ppp *ppp_create_interface(struct net *net, int unit,
 	ppp->file.index = unit;
 	sprintf(dev->name, "ppp%d", unit);
 
-	ret = register_netdev(dev);
+	ret = register_netdevice(dev);
 	if (ret != 0) {
 		unit_put(&pn->units_idr, unit);
 		netdev_err(ppp->dev, "PPP: couldn't register device %s (%d)\n",
@@ -2784,6 +2785,7 @@ static struct ppp *ppp_create_interface(struct net *net, int unit,
 
 	atomic_inc(&ppp_unit_count);
 	mutex_unlock(&pn->all_ppp_mutex);
+	rtnl_unlock();
 
 	*retp = 0;
 	return ppp;
-- 
2.1.0


From d82d92e35535ebaafef4bad60493d1e2873fd7a9 Mon Sep 17 00:00:00 2001
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 20 Jul 2015 17:49:55 -0700
Subject: [PATCH 26/30] net: dsa: bcm_sf2: Do not override speed settings

[ Upstream d2eac98f7d1b950b762a7eca05a9ce0ea1d878d2 in net-next tree,
  will be pushed to Linus very soon. ]

The SF2 driver currently overrides speed settings for its port
configured using a fixed PHY, this is both unnecessary and incorrect,
because we keep feedback to the hardware parameters that we read from
the PHY device, which in the case of a fixed PHY cannot possibly change
speed.

This is a required change to allow the fixed PHY code to allow
registering a PHY with a link configured as DOWN by default and avoid
some sort of circular dependency where we require the link_update
callback to run to program the hardware, and we then utilize the fixed
PHY parameters to program the hardware with the same settings.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/bcm_sf2.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 691caea..9d56515 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -905,15 +905,11 @@ static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
 					 struct fixed_phy_status *status)
 {
 	struct bcm_sf2_priv *priv = ds_to_priv(ds);
-	u32 duplex, pause, speed;
+	u32 duplex, pause;
 	u32 reg;
 
 	duplex = core_readl(priv, CORE_DUPSTS);
 	pause = core_readl(priv, CORE_PAUSESTS);
-	speed = core_readl(priv, CORE_SPDSTS);
-
-	speed >>= (port * SPDSTS_SHIFT);
-	speed &= SPDSTS_MASK;
 
 	status->link = 0;
 
@@ -948,18 +944,6 @@ static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
 		reg &= ~LINK_STS;
 	core_writel(priv, reg, CORE_STS_OVERRIDE_GMIIP_PORT(port));
 
-	switch (speed) {
-	case SPDSTS_10:
-		status->speed = SPEED_10;
-		break;
-	case SPDSTS_100:
-		status->speed = SPEED_100;
-		break;
-	case SPDSTS_1000:
-		status->speed = SPEED_1000;
-		break;
-	}
-
 	if ((pause & (1 << port)) &&
 	    (pause & (1 << (port + PAUSESTS_TX_PAUSE_SHIFT)))) {
 		status->asym_pause = 1;
-- 
2.1.0


From 7b3713db16ac51ad80d855bb96dddf44e77df24d Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:56 -0700
Subject: [PATCH 27/30] net: phy: fixed_phy: handle link-down case

[ Upstream 868a4215be9a6d80548ccb74763b883dc99d32a2 in net-next tree,
  will be pushed to Linus very soon. ]

fixed_phy_register() currently hardcodes the fixed PHY link to 1, and
expects to find a "speed" parameter to provide correct information
towards the fixed PHY consumer.

In a subsequent change, where we allow "managed" (e.g: (RS)GMII in-band
status auto-negotiation) fixed PHYs, none of these parameters can be
provided since they will be auto-negotiated, hence, we just provide a
zero-initialized fixed_phy_status to fixed_phy_register() which makes it
fail when we call fixed_phy_update_regs() since status.speed = 0 which
makes us hit the "default" label and error out.

Without this change, we would also see potentially inconsistent
speed/duplex parameters for fixed PHYs when the link is DOWN.

CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
[florian: add more background to why this is correct and desirable]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/fixed_phy.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index d7a6524..99d9bc1 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -52,6 +52,10 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
 	u16 lpagb = 0;
 	u16 lpa = 0;
 
+	if (!fp->status.link)
+		goto done;
+	bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
+
 	if (fp->status.duplex) {
 		bmcr |= BMCR_FULLDPLX;
 
@@ -96,15 +100,13 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
 		}
 	}
 
-	if (fp->status.link)
-		bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
-
 	if (fp->status.pause)
 		lpa |= LPA_PAUSE_CAP;
 
 	if (fp->status.asym_pause)
 		lpa |= LPA_PAUSE_ASYM;
 
+done:
 	fp->regs[MII_PHYSID1] = 0;
 	fp->regs[MII_PHYSID2] = 0;
 
-- 
2.1.0


From dec98dd4bf6107f1de79668119a3434ff12b04e4 Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:57 -0700
Subject: [PATCH 28/30] of_mdio: add new DT property 'managed' to specify the
 PHY management type

[ Upstream commit 4cba5c2103657d43d0886e4cff8004d95a3d0def in net-next tree,
  will be pushed to Linus very soon. ]

Currently the PHY management type is selected by the MAC driver arbitrary.
The decision is based on the presence of the "fixed-link" node and on a
will of the driver's authors.
This caused a regression recently, when mvneta driver suddenly started
to use the in-band status for auto-negotiation on fixed links.
It appears the auto-negotiation may not work when expected by the MAC driver.
Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206

This patch introduces the new string property 'managed' that allows
the user to set the management type explicitly.
The supported values are:
"auto" - default. Uses either MDIO or nothing, depending on the presence
of the fixed-link node
"in-band-status" - use in-band status

Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>

CC: Rob Herring <robh+dt@kernel.org>
CC: Pawel Moll <pawel.moll@arm.com>
CC: Mark Rutland <mark.rutland@arm.com>
CC: Ian Campbell <ijc+devicetree@hellion.org.uk>
CC: Kumar Gala <galak@codeaurora.org>
CC: Florian Fainelli <f.fainelli@gmail.com>
CC: Grant Likely <grant.likely@linaro.org>
CC: devicetree@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 Documentation/devicetree/bindings/net/ethernet.txt |  4 ++++
 drivers/of/of_mdio.c                               | 19 +++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 41b3f3f..5d88f37 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -25,7 +25,11 @@ The following properties are common to the Ethernet controllers:
   flow control thresholds.
 - tx-fifo-depth: the size of the controller's transmit fifo in bytes. This
   is used for components that can have configurable fifo sizes.
+- managed: string, specifies the PHY management type. Supported values are:
+  "auto", "in-band-status". "auto" is the default, it usess MDIO for
+  management if fixed-link is not specified.
 
 Child nodes of the Ethernet controller are typically the individual PHY devices
 connected via the MDIO bus (sometimes the MDIO bus controller is separate).
 They are described in the phy.txt file in this same directory.
+For non-MDIO PHY management see fixed-link.txt.
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index fdc60db..7c8c23c 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -266,7 +266,8 @@ EXPORT_SYMBOL(of_phy_attach);
 bool of_phy_is_fixed_link(struct device_node *np)
 {
 	struct device_node *dn;
-	int len;
+	int len, err;
+	const char *managed;
 
 	/* New binding */
 	dn = of_get_child_by_name(np, "fixed-link");
@@ -275,6 +276,10 @@ bool of_phy_is_fixed_link(struct device_node *np)
 		return true;
 	}
 
+	err = of_property_read_string(np, "managed", &managed);
+	if (err == 0 && strcmp(managed, "auto") != 0)
+		return true;
+
 	/* Old binding */
 	if (of_get_property(np, "fixed-link", &len) &&
 	    len == (5 * sizeof(__be32)))
@@ -289,8 +294,18 @@ int of_phy_register_fixed_link(struct device_node *np)
 	struct fixed_phy_status status = {};
 	struct device_node *fixed_link_node;
 	const __be32 *fixed_link_prop;
-	int len;
+	int len, err;
 	struct phy_device *phy;
+	const char *managed;
+
+	err = of_property_read_string(np, "managed", &managed);
+	if (err == 0) {
+		if (strcmp(managed, "in-band-status") == 0) {
+			/* status is zeroed, namely its .link member */
+			phy = fixed_phy_register(PHY_POLL, &status, np);
+			return IS_ERR(phy) ? PTR_ERR(phy) : 0;
+		}
+	}
 
 	/* New binding */
 	fixed_link_node = of_get_child_by_name(np, "fixed-link");
-- 
2.1.0


From d47cfb643b75954ad54341ee63369c59cd3bd670 Mon Sep 17 00:00:00 2001
From: Stas Sergeev <stsp@list.ru>
Date: Mon, 20 Jul 2015 17:49:58 -0700
Subject: [PATCH 29/30] mvneta: use inband status only when explicitly enabled

[ Upstream commit f8af8e6eb95093d5ce5ebcc52bd1929b0433e172 in net-next tree,
  will be pushed to Linus very soon. ]

The commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state
signaling") implemented the link parameters auto-negotiation unconditionally.
Unfortunately it appears that some HW that implements SGMII protocol,
doesn't generate the inband status, so it is not possible to auto-negotiate
anything with such HW.

This patch enables the auto-negotiation only if explicitly requested with
the 'managed' DT property.

This patch fixes the following regression:
https://lkml.org/lkml/2015/7/8/865

Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>

CC: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/marvell/mvneta.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 03e052a..09ec32e 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3029,8 +3029,8 @@ static int mvneta_probe(struct platform_device *pdev)
 	const char *dt_mac_addr;
 	char hw_mac_addr[ETH_ALEN];
 	const char *mac_from;
+	const char *managed;
 	int phy_mode;
-	int fixed_phy = 0;
 	int err;
 
 	/* Our multiqueue support is not complete, so for now, only
@@ -3064,7 +3064,6 @@ static int mvneta_probe(struct platform_device *pdev)
 			dev_err(&pdev->dev, "cannot register fixed PHY\n");
 			goto err_free_irq;
 		}
-		fixed_phy = 1;
 
 		/* In the case of a fixed PHY, the DT node associated
 		 * to the PHY is the Ethernet MAC DT node.
@@ -3088,8 +3087,10 @@ static int mvneta_probe(struct platform_device *pdev)
 	pp = netdev_priv(dev);
 	pp->phy_node = phy_node;
 	pp->phy_interface = phy_mode;
-	pp->use_inband_status = (phy_mode == PHY_INTERFACE_MODE_SGMII) &&
-				fixed_phy;
+
+	err = of_property_read_string(dn, "managed", &managed);
+	pp->use_inband_status = (err == 0 &&
+				 strcmp(managed, "in-band-status") == 0);
 
 	pp->clk = devm_clk_get(&pdev->dev, NULL);
 	if (IS_ERR(pp->clk)) {
-- 
2.1.0


From a80c11e7e00b9cfacb5d0ff49404636c9b114082 Mon Sep 17 00:00:00 2001
From: Carol L Soto <clsoto@linux.vnet.ibm.com>
Date: Thu, 27 Aug 2015 14:43:25 -0500
Subject: [PATCH 30/30] net/mlx4_core: Capping number of requested MSIXs to
 MAX_MSIX

[ Upstream commit 9293267a3e2a7a2555d8ddc8f9301525e5b03b1b ]

We currently manage IRQs in pool_bm which is a bit field
of MAX_MSIX bits. Thus, allocating more than MAX_MSIX
interrupts can't be managed in pool_bm.
Fixing this by capping number of requested MSIXs to
MAX_MSIX.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 29c2a01..a408977 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2654,9 +2654,14 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 
 	if (msi_x) {
 		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
+		bool shared_ports = false;
 
 		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
 			     nreq);
+		if (nreq > MAX_MSIX) {
+			nreq = MAX_MSIX;
+			shared_ports = true;
+		}
 
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
@@ -2679,6 +2684,9 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 		bitmap_zero(priv->eq_table.eq[MLX4_EQ_ASYNC].actv_ports.ports,
 			    dev->caps.num_ports);
 
+		if (MLX4_IS_LEGACY_EQ_MODE(dev->caps))
+			shared_ports = true;
+
 		for (i = 0; i < dev->caps.num_comp_vectors + 1; i++) {
 			if (i == MLX4_EQ_ASYNC)
 				continue;
@@ -2686,7 +2694,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 			priv->eq_table.eq[i].irq =
 				entries[i + 1 - !!(i > MLX4_EQ_ASYNC)].vector;
 
-			if (MLX4_IS_LEGACY_EQ_MODE(dev->caps)) {
+			if (shared_ports) {
 				bitmap_fill(priv->eq_table.eq[i].actv_ports.ports,
 					    dev->caps.num_ports);
 				/* We don't set affinity hint when there
-- 
2.1.0

next             reply	other threads:[~2015-09-29  4:54 UTC|newest]

Thread overview: 308+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-29  4:54 David Miller [this message]
2015-09-30  3:33 ` [PATCHES] Networking Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2020-09-24 21:40 David Miller
2020-09-25  8:38 ` Greg KH
2020-08-24 16:52 David Miller
2020-08-26 10:13 ` Greg KH
2020-08-15  1:36 David Miller
2020-08-17  9:37 ` Greg KH
2020-08-08  1:53 David Miller
2020-08-10 12:08 ` Greg KH
2020-07-29  3:12 David Miller
2020-07-29 11:42 ` Greg KH
2020-07-16 23:07 David Miller
2020-07-17  8:21 ` Greg KH
2020-06-28  0:55 David Miller
2020-06-28 14:15 ` Greg KH
2020-06-16  1:27 David Miller
2020-06-16  7:43 ` Greg KH
2020-06-17 16:16   ` Greg KH
2020-06-10  0:08 David Miller
2020-06-11 10:01 ` Greg KH
2020-06-07  2:55 David Miller
2020-06-07 13:01 ` Greg KH
2020-05-27  6:16 David Miller
2020-05-28 12:21 ` Greg KH
2020-05-16  0:44 David Miller
2020-05-16 16:14 ` Greg KH
2020-05-12  0:41 David Miller
2020-05-12  8:59 ` Greg KH
2020-04-27  1:10 David Miller
2020-04-27 11:40 ` Greg KH
2020-04-17 17:51 David Miller
2020-04-18  9:08 ` Greg KH
2020-04-08 22:05 David Miller
2020-04-09 11:17 ` Greg KH
2020-04-02 20:16 David Miller
2020-04-02 20:23 ` Greg KH
2020-03-27 23:51 David Miller
2020-03-28  8:49 ` Greg KH
2020-03-14  4:52 David Miller
2020-03-15  8:34 ` Greg KH
2020-03-01  5:11 David Miller
2020-03-01  9:27 ` Greg KH
2020-02-18 23:41 David Miller
2020-02-19 20:32 ` Greg KH
2020-02-09 21:21 David Miller
2020-02-09 21:54 ` Greg KH
2020-02-05 14:07 David Miller
2020-02-06  6:54 ` Greg KH
2020-01-30 10:12 David Miller
2020-01-30 10:22 ` Greg KH
2020-01-27 11:16 David Miller
2020-01-27 14:27 ` Greg KH
2020-01-20 19:44 David Miller
2020-01-21 15:29 ` Greg KH
2020-01-11  0:33 David Miller
2020-01-11  8:19 ` Greg KH
2020-01-01 20:13 David Miller
2020-01-01 21:37 ` Greg KH
2019-12-28  8:14 David Miller
2019-12-28 11:23 ` Greg KH
2019-12-16 22:10 David Miller
2019-12-17  7:43 ` Greg KH
2019-12-19 15:42 ` Greg KH
2019-12-03  3:21 David Miller
2019-12-03  6:46 ` Greg KH
2019-11-25  5:54 David Miller
2019-11-25 13:31 ` Greg KH
2019-11-18  8:08 David Miller
2019-11-18  8:16 ` Greg KH
2019-11-10  5:47 David Miller
2019-11-10 15:34 ` Greg KH
2019-10-24 21:30 David Miller
2019-10-25  1:38 ` Greg KH
2019-10-05 21:57 David Miller
2019-10-06  7:50 ` Greg KH
2019-09-29 22:39 David Miller
2019-10-01 13:38 ` Greg KH
2019-09-19 12:07 David Miller
2019-09-19 13:02 ` Greg KH
2019-09-15 19:37 David Miller
2019-09-16 11:05 ` Greg KH
2019-09-05  7:23 David Miller
2019-09-08 10:40 ` Greg KH
2019-08-28  0:42 David Miller
2019-09-02 16:30 ` Greg KH
2019-09-02 17:51   ` David Miller
2019-08-20 23:01 David Miller
2019-08-20 23:19 ` Greg KH
2019-08-07 23:27 David Miller
2019-08-08  6:57 ` Greg KH
2019-07-25  1:55 David Miller
2019-07-26  8:54 ` Greg KH
2019-07-02  1:52 David Miller
2019-07-02  4:40 ` Greg KH
2019-06-18  4:23 David Miller
2019-06-19 12:35 ` Greg KH
2019-06-08 23:27 David Miller
2019-06-09  7:26 ` Greg KH
2019-06-09 19:42   ` David Miller
2019-05-21  6:37 David Miller
2019-05-22  6:36 ` Greg KH
2019-05-14 19:58 David Miller
2019-05-15  6:02 ` Greg KH
2019-05-04  7:01 David Miller
2019-05-04  7:34 ` Greg KH
2019-04-30  2:06 David Miller
2019-04-30  7:53 ` Greg KH
2019-04-18 22:53 David Miller
2019-04-23 20:06 ` Greg KH
2019-04-10  3:55 David Miller
2019-04-10 15:35 ` Sasha Levin
2019-03-28 19:24 David Miller
2019-03-28 20:55 ` Greg KH
2019-03-28 21:51   ` Greg KH
2019-03-28 23:18     ` David Miller
2019-03-29  6:18       ` Greg KH
2019-03-15  1:47 David Miller
2019-03-15  6:30 ` Greg KH
2019-03-19 13:03   ` Greg KH
2019-03-07 22:47 David Miller
2019-03-08  6:38 ` Greg KH
2019-02-24  5:18 David Miller
2019-02-24  7:52 ` Greg KH
2019-02-20 20:42 David Miller
2019-02-21  3:08 ` Sasha Levin
2019-02-21  7:21 ` Greg KH
2019-02-09 23:21 David Miller
2019-02-10 12:21 ` Greg KH
2019-02-01 21:45 David Miller
2019-02-02  9:55 ` Greg KH
2019-01-26  0:18 David Miller
2019-01-26  9:29 ` Greg KH
2019-01-21 23:28 David Miller
2019-01-22  7:18 ` Greg KH
2019-01-23  7:33 ` Greg KH
2019-01-20 19:12 David Miller
2019-01-21  8:00 ` Greg KH
2019-01-04 18:17 David Miller
2019-01-04 18:48 ` Greg KH
2018-12-12  6:31 David Miller
2018-12-13  9:53 ` Greg KH
2018-12-03  7:01 David Miller
2018-12-03  9:13 ` Greg KH
2018-11-21  3:49 David Miller
2018-11-21 17:49 ` Greg KH
2018-11-02  3:55 David Miller
2018-11-02  5:27 ` Greg KH
2018-09-24 16:46 David Miller
2018-09-26  9:32 ` Greg KH
2018-09-18 16:14 David Miller
2018-09-20  5:25 ` Greg KH
2018-09-11  6:15 David Miller
2018-09-11  8:29 ` Greg KH
2018-08-17 19:32 David Miller
2018-08-18  9:43 ` Greg KH
2018-08-04  5:05 David Miller
2018-08-04  7:33 ` Greg KH
2018-08-01  5:32 David Miller
2018-08-01  6:20 ` Greg KH
2018-07-26 23:50 David Miller
2018-07-27  0:06 ` Eric Dumazet
2018-07-27  6:34 ` Greg KH
2018-07-23  3:51 David Miller
2018-07-23  6:21 ` Greg KH
2018-07-18 23:35 David Miller
2018-07-19  6:33 ` Greg KH
2018-06-20 12:37 David Miller
2018-06-21 21:10 ` Greg KH
2018-06-24 11:20   ` Greg KH
2018-06-08  2:18 David Miller
2018-06-08  4:52 ` Greg KH
2018-05-15 20:50 David Miller
2018-05-16  8:40 ` Greg KH
2018-04-26 18:38 David Miller
2018-04-26 18:50 ` Greg KH
2018-04-13 17:47 David Miller
2018-04-14 14:04 ` Greg KH
2018-04-10 19:39 David Miller
2018-04-10 21:26 ` Greg KH
2018-03-28 15:35 David Miller
2018-03-28 15:40 ` Willy Tarreau
2018-03-28 15:46   ` David Miller
2018-03-28 16:36     ` Greg KH
2018-03-28 16:49 ` Greg KH
2018-03-07  2:28 David Miller
2018-03-07  3:30 ` Greg KH
2018-02-06 20:19 David Miller
2018-02-07 19:39 ` Greg KH
2018-01-28 16:22 David Miller
2018-01-28 16:39 ` Greg KH
2018-01-12 21:12 David Miller
2018-01-13  9:54 ` Greg KH
2017-12-31  4:15 David Miller
2017-12-31 10:14 ` Greg KH
2017-12-12 15:44 David Miller
2017-12-14 17:51 ` Greg KH
2017-11-20 11:47 David Miller
2017-11-21 14:04 ` Greg KH
2017-11-14  6:36 David Miller
2017-11-16 14:12 ` Greg KH
2017-10-09  4:02 David Miller
2017-10-09  7:34 ` Greg KH
2017-10-09  7:56   ` Greg KH
2017-10-09 16:55     ` David Miller
2017-10-09 19:04       ` Greg KH
2017-10-09 22:54         ` David Miller
2017-10-10 14:10           ` Greg KH
2017-09-15  4:57 David Miller
2017-09-15  6:24 ` Greg KH
2018-06-07  7:00 ` Jiri Slaby
2018-06-07  9:21   ` Greg KH
2018-06-07 10:47   ` Ido Schimmel
2018-06-07 10:52     ` Greg KH
2018-07-05 16:15     ` Greg KH
2018-07-05 16:42       ` Ido Schimmel
2017-08-24  3:24 David Miller
2017-08-25  0:55 ` Greg KH
2017-08-11  5:25 David Miller
2017-08-11 16:22 ` Greg KH
2017-08-08 23:21 David Miller
2017-08-08 23:30 ` Greg KH
2017-07-17 16:44 David Miller
2017-07-17 19:23 ` Greg KH
2017-07-19 10:27   ` Greg KH
2017-06-29 16:19 David Miller
2017-06-29 17:34 ` Greg KH
2017-05-30 23:14 David Miller
2017-05-31  0:18 ` Greg KH
2017-05-11  2:41 David Miller
2017-05-11 13:10 ` Greg KH
2017-05-22 10:16 ` Greg KH
2017-04-28 19:41 David Miller
2017-04-29  6:23 ` Greg KH
2017-03-25  7:53 David Miller
2017-03-25  9:26 ` Thomas Backlund
2017-03-25 17:38   ` David Miller
2017-03-26 18:47     ` Thomas Backlund
2017-03-27 16:19     ` Greg KH
2017-03-17  1:48 David Miller
2017-03-18 14:13 ` Greg KH
2017-02-23 19:54 David Miller
2017-02-23 20:19 ` Greg KH
2017-02-13 17:15 David Miller
2017-02-15 17:21 ` Greg KH
2017-01-31 21:50 [PATCHES] networking David Miller
2017-02-01  8:10 ` Greg KH
2017-01-12 18:55 [PATCHES] Networking David Miller
2017-01-12 20:40 ` Greg KH
2016-12-07 23:43 David Miller
2016-12-08  6:34 ` Greg KH
2016-11-18  2:59 David Miller
2016-11-18 10:36 ` Greg KH
2016-11-09 17:19 David Miller
2016-11-10 15:50 ` Greg KH
2016-09-21  5:07 David Miller
2016-09-21  9:23 ` Greg KH
2016-08-12  0:50 David Miller
2016-08-12  7:37 ` Greg KH
2016-07-13 21:43 David Miller
2016-07-13 22:38 ` Greg KH
2016-07-06  5:02 David Miller
2016-07-07  0:35 ` Greg KH
2016-06-17  7:03 David Miller
2016-06-18  1:01 ` Greg KH
2016-05-16 16:35 David Miller
2016-05-16 21:50 ` Greg KH
2016-04-15  4:45 David Miller
2016-04-16 17:49 ` Greg KH
2016-02-29 21:56 David Miller
2016-02-29 22:45 ` Greg KH
2016-01-27  2:00 David Miller
2016-01-27  6:35 ` Greg KH
2015-12-22 21:51 David Miller
2016-01-19  5:20 ` Greg KH
2016-01-19 12:00   ` Josh Boyer
2016-01-19 13:29     ` Josh Boyer
2016-01-19 17:39       ` Greg KH
2016-01-19 17:41         ` Josh Boyer
2015-12-10 19:37 David Miller
2015-12-11 16:49 ` Greg KH
2015-11-13 21:38 David Miller
2015-11-14 15:59 ` Jiri Slaby
2015-11-15 17:55   ` David Miller
2015-12-06  5:25 ` Greg KH
2015-10-21  3:51 David Miller
2015-10-23 16:25 ` Greg KH
2015-08-27  6:05 David Miller
2015-08-27  7:29 ` Jiri Slaby
2015-08-27 13:35 ` Luis Henriques
2015-08-27 16:34   ` David Miller
2015-09-28 14:04     ` Greg KH
2015-09-26 19:21 ` Greg KH
2015-07-03 22:31 David Miller
2015-07-04  3:04 ` Greg KH
2015-06-10  3:01 David Miller
2015-06-10 13:26 ` Jiri Slaby
2015-06-19 18:03 ` Greg KH
2015-05-05 17:34 [PATCHES] NETWORKING David Miller
2015-05-06  6:57 ` Jiri Slaby
2015-05-08 11:14 ` Greg KH
2015-05-08 14:42 ` Greg KH
2015-04-29  4:48 [PATCHES] Networking David Miller
2015-04-29 11:09 ` Greg KH
2015-04-29 16:03   ` David Miller
2015-04-30 12:25 ` Jiri Slaby
     [not found] <20150421.143012.2106864724544609194.davem@davemloft.net>
2015-04-27  9:23 ` Jiri Slaby
2015-05-04 19:53   ` Ben Hutchings

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:4a230b1 dfblob:baffa3b dfblob:f9e96c4 dfblob:23dc131
dfblob:447a7fb dfblob:f5e2ba1 dfblob:8737400 dfblob:821d8df
dfblob:a0b0ea9 dfblob:28119f6 dfblob:acbd774 dfblob:dcc89c7
dfblob:a62a215 dfblob:1237a23 dfblob:270b77d dfblob:8bb1a5a
dfblob:3c268b3 dfblob:4877d5a dfblob:baaeb10 dfblob:82e64a9
dfblob:185c341 dfblob:aeedc3a dfblob:0e32d2e dfblob:fe85d46
dfblob:e7ed251 dfblob:7a59893 dfblob:447a7fb dfblob:f5e2ba1
dfblob:96ba23e dfblob:56fa3d5 dfblob:4efc435 dfblob:2eb6404
dfblob:1a01d79 dfblob:0d58542 dfblob:ee9f650 dfblob:3ecfda8
dfblob:c82b2e3 dfblob:da0b73a dfblob:b20a173 dfblob:3951874
dfblob:8f34b27 dfblob:143c4eb dfblob:28213df dfblob:acf6b2e
dfblob:cf2d853 dfblob:740041a dfblob:5918bff dfblob:2f0cf20
dfblob:dc9f925 dfblob:9c7d888 dfblob:185c341 dfblob:aeedc3a
dfblob:a38d3ac dfblob:69f4f68 dfblob:21a0fbf dfblob:0085b8d
dfblob:cab9e9b dfblob:4fbb674 dfblob:3c86b10 dfblob:e049857
dfblob:74dddf8 dfblob:556ecf9 dfblob:447a7fb dfblob:f5e2ba1
dfblob:66d47e4 dfblob:570390b dfblob:da48e66 dfblob:8207877
dfblob:74ceb73 dfblob:5f36266 dfblob:cedb572 dfblob:f24c4fe
dfblob:c73ae50 dfblob:f371fef dfblob:22e2ebf dfblob:789d7b7
dfblob:bf6e766 dfblob:ca08ffb dfblob:8900840 dfblob:df9a060
dfblob:53b7acd dfblob:e13c3c3 dfblob:ff667e1 dfblob:9ba383f
dfblob:eab4e08 dfblob:80aac20 dfblob:8c350c5 dfblob:58858c5
dfblob:ca08ffb dfblob:7d20abe dfblob:df9a060 dfblob:80b2b75
dfblob:7d20abe dfblob:7d824ac dfblob:80b2b75 dfblob:14437d9
dfblob:096c627 dfblob:27e1496 dfblob:4613df8 dfblob:aa34951
dfblob:616eda1 dfblob:2dd9900 dfblob:a369e8a dfblob:986440b
dfblob:715e01e dfblob:f23a3b6 dfblob:9a12668 dfblob:0ad144f
dfblob:f24c4fe dfblob:db9ebbc dfblob:1960b46 dfblob:479b93f
dfblob:41b3f3f dfblob:5d88f37 dfblob:0c06448 dfblob:bec8ec2
dfblob:74d0389 dfblob:4d608f0 dfblob:7d824ac dfblob:c87bcf9
dfblob:46a14cb dfblob:02a4615 dfblob:d79866c dfblob:817622f
dfblob:447a7fb dfblob:f5e2ba1 dfblob:b349e6f dfblob:de63266
dfblob:da48e66 dfblob:8207877 dfblob:74ceb73 dfblob:5f36266
dfblob:079897b dfblob:691caea dfblob:d155864 dfblob:00b64d4
dfblob:22e2ebf dfblob:789d7b7 dfblob:a774985 dfblob:bbd5d43
dfblob:8900840 dfblob:df9a060 dfblob:59e8035 dfblob:3ac604f
dfblob:1285eaf dfblob:03cdb9e dfblob:62e48bc dfblob:03e052a
dfblob:dc004b1 dfblob:0861018 dfblob:9c145dd dfblob:4f95fa7
dfblob:edd7734 dfblob:248478c dfblob:bbd5d43 dfblob:42de385
dfblob:df9a060 dfblob:80b2b75 dfblob:42de385 dfblob:0857f72
dfblob:80b2b75 dfblob:14437d9 dfblob:5613424 dfblob:26ec2fb
dfblob:ff8c4a4 dfblob:ff332d1 dfblob:6552394 dfblob:b5c3bba
dfblob:616eda1 dfblob:2dd9900 dfblob:b1c218d dfblob:b7dedd9
dfblob:ed458b3 dfblob:7851b12 dfblob:715e01e dfblob:f23a3b6
dfblob:9a12668 dfblob:0ad144f dfblob:fa8f504 dfblob:487be20
dfblob:691caea dfblob:9d56515 dfblob:d7a6524 dfblob:99d9bc1
dfblob:41b3f3f dfblob:5d88f37 dfblob:fdc60db dfblob:7c8c23c
dfblob:03e052a dfblob:09ec32e dfblob:29c2a01 dfblob:a408977 )
 OR (
bs:"net: dsa: actually force the speed on the CPU port" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150928.215408.458905239999743126.davem@davemloft.net \
    --to=davem@davemloft.net \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).