Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next v3] bridge: trigger RTM_NEWLINK when interface is modified by bridge ioctl
From: David Miller @ 2017-09-21 22:45 UTC (permalink / raw)
  To: vincent; +Cc: stephen, dsahern, bridge, netdev
In-Reply-To: <20170921100525.20395-1-vincent@bernat.im>

From: Vincent Bernat <vincent@bernat.im>
Date: Thu, 21 Sep 2017 12:05:25 +0200

> Currently, there is a difference in netlink events received when an
> interface is modified through bridge ioctl() or through netlink. This
> patch generates additional events when an interface is added to or
> removed from a bridge via ioctl().
...
> Signed-off-by: Vincent Bernat <vincent@bernat.im>

Applied.

^ permalink raw reply

* [PATCH net-next v2] bpf: Optimize lpm trie delete
From: Craig Gallek @ 2017-09-21 22:43 UTC (permalink / raw)
  To: Daniel Mack, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller; +Cc: netdev

From: Craig Gallek <kraig@google.com>

Before the delete operator was added, this datastructure maintained
an invariant that intermediate nodes were only present when necessary
to build the tree.  This patch updates the delete operation to reinstate
that invariant by removing unnecessary intermediate nodes after a node is
removed and thus keeping the tree structure at a minimal size.

Suggested-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: Craig Gallek <kraig@google.com>
---
v2:
  I really failed the interview on this one.  v1 did not collapse all
  extra intermediate node possibilities.  I believe this one does by
	additionally tracking node parent information.  See comments.

 kernel/bpf/lpm_trie.c | 71 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 28 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9d58a576b2ae..34d8a690ea05 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -394,8 +394,8 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 {
 	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
 	struct bpf_lpm_trie_key *key = _key;
-	struct lpm_trie_node __rcu **trim;
-	struct lpm_trie_node *node;
+	struct lpm_trie_node __rcu **trim, **trim2;
+	struct lpm_trie_node *node, *parent;
 	unsigned long irq_flags;
 	unsigned int next_bit;
 	size_t matchlen = 0;
@@ -407,31 +407,26 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 	raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Walk the tree looking for an exact key/length match and keeping
-	 * track of where we could begin trimming the tree.  The trim-point
-	 * is the sub-tree along the walk consisting of only single-child
-	 * intermediate nodes and ending at a leaf node that we want to
-	 * remove.
+	 * track of the path we traverse.  We will need to know the node
+	 * we wish to delete, and the slot that points to the node we want
+	 * to delete.  We may also need to know the nodes parent and the
+	 * slot that contains it.
 	 */
 	trim = &trie->root;
-	node = rcu_dereference_protected(
-		trie->root, lockdep_is_held(&trie->lock));
-	while (node) {
+	trim2 = trim;
+	parent = NULL;
+	while ((node = rcu_dereference_protected(
+		       *trim, lockdep_is_held(&trie->lock)))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
 		    node->prefixlen == key->prefixlen)
 			break;
 
+		parent = node;
+		trim2 = trim;
 		next_bit = extract_bit(key->data, node->prefixlen);
-		/* If we hit a node that has more than one child or is a valid
-		 * prefix itself, do not remove it. Reset the root of the trim
-		 * path to its descendant on our path.
-		 */
-		if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
-		    (node->child[0] && node->child[1]))
-			trim = &node->child[next_bit];
-		node = rcu_dereference_protected(
-			node->child[next_bit], lockdep_is_held(&trie->lock));
+		trim = &node->child[next_bit];
 	}
 
 	if (!node || node->prefixlen != key->prefixlen ||
@@ -442,27 +437,47 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 
 	trie->n_entries--;
 
-	/* If the node we are removing is not a leaf node, simply mark it
+	/* If the node we are removing has two children, simply mark it
 	 * as intermediate and we are done.
 	 */
-	if (rcu_access_pointer(node->child[0]) ||
+	if (rcu_access_pointer(node->child[0]) &&
 	    rcu_access_pointer(node->child[1])) {
 		node->flags |= LPM_TREE_NODE_FLAG_IM;
 		goto out;
 	}
 
-	/* trim should now point to the slot holding the start of a path from
-	 * zero or more intermediate nodes to our leaf node for deletion.
+	/* If the parent of the node we are about to delete is an intermediate
+	 * node, and the deleted node doesn't have any children, we can delete
+	 * the intermediate parent as well and promote its other child
+	 * up the tree.  Doing this maintains the invariant that all
+	 * intermediate nodes have exactly 2 children and that there are no
+	 * unnecessary intermediate nodes in the tree.
 	 */
-	while ((node = rcu_dereference_protected(
-			*trim, lockdep_is_held(&trie->lock)))) {
-		RCU_INIT_POINTER(*trim, NULL);
-		trim = rcu_access_pointer(node->child[0]) ?
-			&node->child[0] :
-			&node->child[1];
+	if (parent && (parent->flags & LPM_TREE_NODE_FLAG_IM) &&
+	    !node->child[0] && !node->child[1]) {
+		if (node == rcu_access_pointer(parent->child[0]))
+			rcu_assign_pointer(
+				*trim2, rcu_access_pointer(parent->child[1]));
+		else
+			rcu_assign_pointer(
+				*trim2, rcu_access_pointer(parent->child[0]));
+		kfree_rcu(parent, rcu);
 		kfree_rcu(node, rcu);
+		goto out;
 	}
 
+	/* The node we are removing has either zero or one child. If there
+	 * is a child, move it into the removed node's slot then delete
+	 * the node.  Otherwise just clear the slot and delete the node.
+	 */
+	if (node->child[0])
+		rcu_assign_pointer(*trim, rcu_access_pointer(node->child[0]));
+	else if (node->child[1])
+		rcu_assign_pointer(*trim, rcu_access_pointer(node->child[1]));
+	else
+		RCU_INIT_POINTER(*trim, NULL);
+	kfree_rcu(node, rcu);
+
 out:
 	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
 
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* Re: [PATCH net-next 12/14] gtp: Configuration for zero UDP checksum
From: Tom Herbert @ 2017-09-21 22:41 UTC (permalink / raw)
  To: Harald Welte
  Cc: Tom Herbert, David Miller, Linux Kernel Network Developers,
	Pablo Neira Ayuso, Rohit Seth
In-Reply-To: <20170921015522.sv5netrlcj5vebys@nataraja>

On Wed, Sep 20, 2017 at 6:55 PM, Harald Welte <laforge@gnumonks.org> wrote:
> Hi Tom,
>
> On Wed, Sep 20, 2017 at 11:09:29AM -0700, Tom Herbert wrote:
>> On Mon, Sep 18, 2017 at 9:24 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Tom Herbert <tom@quantonium.net>
>> >> Add configuration to control use of zero checksums on transmit for both
>> >> IPv4 and IPv6, and control over accepting zero IPv6 checksums on
>> >> receive.
>> >
>> > I thought we were trying to move away from this special case of allowing
>> > zero UDP checksums with tunnels, especially for ipv6.
>>
>> I don't have a strong preference either way. I like consistency with
>> VXLAN and foo/UDP, but I guess it's not required. Interestingly, since
>> GTP only carries IP, IPv6 zero checksums are actually safer here than
>> VXLAN or GRE/UDP.
>
> Just for the record: I don't care either way and I defer to the kernel
> networking developers to decide if they want to have zero UDP checksum
> in GTP or not.
>
> The 3GPP specs don't say anything about UDP checksums.  So there's no
> requirement to use them, and hence operation without UDP checksums
> should be compliant.  Cisco GTP implementation has udp checksumming
> configurable, so other implementations also seem to provide both ways.
>
> In general, I would argue one wants UDP checksumming of GTP in all
> setups, as while the inner IP packet might be protected, the GTP header
> itself is not, and that's what contains important data suhc as the TEID
> (Tunnel Endpoint ID).  But that's of course just my personal opinion,
> and I'm not saying we should prevent people from using lower protection
> if that's what they want.
>
The tradeoffs and requirements of zero UDP6 checksums are discussed at
length in RFC6935 and RFC6936. Given other implementations make it
configurable it should also be here.

Tom

> --
> - Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
> ============================================================================
> "Privacy in residential applications is a desirable marketing option."
>                                                   (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: [PATCH 2/2] ip_tunnel: add mpls over gre encapsulation
From: Roopa Prabhu @ 2017-09-21 22:35 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: netdev@vger.kernel.org, xeb, David Lamparter
In-Reply-To: <1505985924-12479-3-git-send-email-amine.kherbouche@6wind.com>

On Thu, Sep 21, 2017 at 2:25 AM, Amine Kherbouche
<amine.kherbouche@6wind.com> wrote:
> This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
> API.
>
> Encap:
>   - Add a new iptunnel type mpls.
>
> Decap:
>   - pull gre hdr and call mpls_forward().
>
> Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
> ---
>  include/net/gre.h              |  3 +++
>  include/uapi/linux/if_tunnel.h |  1 +
>  net/ipv4/gre_demux.c           | 22 ++++++++++++++++++++++
>  net/ipv4/ip_gre.c              |  9 +++++++++
>  net/ipv6/ip6_gre.c             |  7 +++++++
>  net/mpls/af_mpls.c             | 37 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 79 insertions(+)
>
> diff --git a/include/net/gre.h b/include/net/gre.h
> index d25d836..88a8343 100644
> --- a/include/net/gre.h
> +++ b/include/net/gre.h
> @@ -35,6 +35,9 @@ struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
>                                        u8 name_assign_type);
>  int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>                      bool *csum_err, __be16 proto, int nhs);
> +#if IS_ENABLED(CONFIG_MPLS)
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
> +#endif
>
>  static inline int gre_calc_hlen(__be16 o_flags)
>  {
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 2e52088..a2f48c0 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -84,6 +84,7 @@ enum tunnel_encap_types {
>         TUNNEL_ENCAP_NONE,
>         TUNNEL_ENCAP_FOU,
>         TUNNEL_ENCAP_GUE,
> +       TUNNEL_ENCAP_MPLS,
>  };
>
>  #define TUNNEL_ENCAP_FLAG_CSUM         (1<<0)
> diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
> index b798862..a6a937e 100644
> --- a/net/ipv4/gre_demux.c
> +++ b/net/ipv4/gre_demux.c
> @@ -23,6 +23,9 @@
>  #include <linux/netdevice.h>
>  #include <linux/if_tunnel.h>
>  #include <linux/spinlock.h>
> +#if IS_ENABLED(CONFIG_MPLS)
> +#include <linux/mpls.h>
> +#endif
>  #include <net/protocol.h>
>  #include <net/gre.h>
>
> @@ -122,6 +125,25 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>  }
>  EXPORT_SYMBOL(gre_parse_header);
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +       if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
> +               goto drop;
> +
> +       /* Pop GRE hdr and reset the skb */
> +       skb_pull(skb, gre_hdr_len);
> +       skb_reset_network_header(skb);
> +
> +       mpls_forward(skb, skb->dev, NULL, NULL);
> +
> +       return 0;
> +drop:
> +       return NET_RX_DROP;
> +}
> +EXPORT_SYMBOL(mpls_gre_rcv);
> +#endif
> +
>  static int gre_rcv(struct sk_buff *skb)
>  {
>         const struct gre_protocol *proto;
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 9cee986..dd4431c 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -412,10 +412,19 @@ static int gre_rcv(struct sk_buff *skb)
>                         return 0;
>         }
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC))) {
> +               if (mpls_gre_rcv(skb, hdr_len))
> +                       goto drop;
> +               return 0;
> +       }
> +#endif
> +
>         if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
>                 return 0;
>
>         icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
> +
>  drop:
>         kfree_skb(skb);
>         return 0;
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index c82d41e..e52396d 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -476,6 +476,13 @@ static int gre_rcv(struct sk_buff *skb)
>         if (hdr_len < 0)
>                 goto drop;
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC))) {
> +               if (mpls_gre_rcv(skb, hdr_len))
> +                       goto drop;
> +               return 0;
> +       }
> +#endif
>         if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
>                 goto drop;
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 36ea2ad..060ed07 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -16,6 +16,7 @@
>  #include <net/arp.h>
>  #include <net/ip_fib.h>
>  #include <net/netevent.h>
> +#include <net/ip_tunnels.h>
>  #include <net/netns/generic.h>
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include <net/ipv6.h>
> @@ -39,6 +40,40 @@ static int one = 1;
>  static int label_limit = (1 << 20) - 1;
>  static int ttl_max = 255;
>
> +size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
> +{
> +       return sizeof(struct mpls_shim_hdr);
> +}
> +
> +int ipgre_mpls_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
> +                           u8 *protocol, struct flowi4 *fl4)
> +{
> +       return 0;
> +}


any reason you are supporting only rx ?


> +
> +static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
> +       .encap_hlen = ipgre_mpls_encap_hlen,
> +       .build_header = ipgre_mpls_build_header,
> +};
> +
> +int ipgre_tunnel_encap_add_mpls_ops(void)
> +{
> +       int ret;
> +
> +       ret = ip_tunnel_encap_add_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +       if (ret < 0) {
> +               pr_err("can't add mplsgre ops\n");
> +               return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static void ipgre_tunnel_encap_del_mpls_ops(void)
> +{
> +       ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +}
> +
>  static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>                        struct nlmsghdr *nlh, struct net *net, u32 portid,
>                        unsigned int nlm_flags);
> @@ -2486,6 +2521,7 @@ static int __init mpls_init(void)
>                       0);
>         rtnl_register(PF_MPLS, RTM_GETNETCONF, mpls_netconf_get_devconf,
>                       mpls_netconf_dump_devconf, 0);
> +       ipgre_tunnel_encap_add_mpls_ops();
>         err = 0;
>  out:
>         return err;
> @@ -2503,6 +2539,7 @@ static void __exit mpls_exit(void)
>         dev_remove_pack(&mpls_packet_type);
>         unregister_netdevice_notifier(&mpls_dev_notifier);
>         unregister_pernet_subsys(&mpls_net_ops);
> +       ipgre_tunnel_encap_del_mpls_ops();
>  }
>  module_exit(mpls_exit);
>
> --
> 2.1.4
>

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: avoid stall while shutting down the adapter
From: David Miller @ 2017-09-21 22:35 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh
In-Reply-To: <1505978447-14944-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu, 21 Sep 2017 12:50:47 +0530

> do not wait for completion while deleting the filters
> when the adapter is shutting down because we may not get
> the response as interrupts will be disabled.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/1] net/smc: parameter cleanup in smc_cdc_get_free_slot()
From: David Miller @ 2017-09-21 22:33 UTC (permalink / raw)
  To: ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA,
	jwi-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	schwidefsky-tA70FqPdS9bQT0dZR+AlfA,
	heiko.carstens-tA70FqPdS9bQT0dZR+AlfA,
	raspl-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
In-Reply-To: <20170921071734.16977-1-ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

From: Ursula Braun <ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Date: Thu, 21 Sep 2017 09:17:34 +0200

> Use the smc_connection as first parameter with smc_cdc_get_free_slot().
> This is just a small code cleanup, no functional change.
> 
> Signed-off-by: Ursula Braun <ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Applied.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net 0/9] net/smc: bug fixes 2017-09-20
From: David Miller @ 2017-09-21 22:31 UTC (permalink / raw)
  To: ubraun
  Cc: netdev, linux-rdma, linux-s390, jwi, schwidefsky, heiko.carstens,
	raspl
In-Reply-To: <20170921071634.16883-1-ubraun@linux.vnet.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date: Thu, 21 Sep 2017 09:16:25 +0200

> here is a collection of small smc-patches built for net fixing
> smc problems in different areas.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next ] net: Remove useless function skb_header_release
From: David Miller @ 2017-09-21 22:25 UTC (permalink / raw)
  To: gfree.wind; +Cc: netdev
In-Reply-To: <1505968771-92232-1-git-send-email-gfree.wind@vip.163.com>

From: gfree.wind@vip.163.com
Date: Thu, 21 Sep 2017 12:39:31 +0800

> From: Gao Feng <gfree.wind@vip.163.com>
> 
> There is no one which would invokes the function skb_header_release.
> So just remove it now.
> 
> Signed-off-by: Gao Feng <gfree.wind@vip.163.com>

As Joe Perches mentioned, there are comment references remaining.

This should be removed too because they can be confusing for people.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: add port fdb dump
From: David Miller @ 2017-09-21 22:24 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <20170920233214.6936-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Wed, 20 Sep 2017 19:32:14 -0400

> Dumping a DSA port's FDB entries is not specific to a DSA slave, so add
> a dsa_port_fdb_dump function, similarly to dsa_port_fdb_add and
> dsa_port_fdb_del.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: better scoping of slave functions
From: David Miller @ 2017-09-21 22:24 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <20170920233157.6832-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Wed, 20 Sep 2017 19:31:57 -0400

> A few DSA slave functions take a dsa_slave_priv pointer as first
> argument, whereas the scope of the slave.c functions is the slave
> net_device structure. Fix this and rename dsa_netpoll_send_skb to
> dsa_slave_netpoll_send_skb.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied.

^ permalink raw reply

* Re: [PATCH v5 net 0/3] lan78xx: This series of patches are for lan78xx driver.
From: David Miller @ 2017-09-21 22:23 UTC (permalink / raw)
  To: Nisar.Sayed; +Cc: UNGLinuxDriver, netdev
In-Reply-To: <20170920210638.11150-1-Nisar.Sayed@microchip.com>

From: Nisar Sayed <Nisar.Sayed@microchip.com>
Date: Thu, 21 Sep 2017 02:36:35 +0530

> This series of patches are for lan78xx driver.
> 
> These patches fixes potential issues associated with lan78xx driver.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net 0/2] Bring back transceiver type for PHYLIB
From: David Miller @ 2017-09-21 22:20 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, linville, decot, tremyfr, andrew
In-Reply-To: <20170920225214.21974-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Wed, 20 Sep 2017 15:52:12 -0700

> With the introduction of the xLINKSETTINGS ethtool APIs, the transceiver type
> was deprecated, but in that process we lost some useful information that PHYLIB
> was consistently reporting about internal vs. external PHYs.
> 
> This brings back transceiver as a read-only field that is only consumed in the
> legacy path where ETHTOOL_GET is called but the underlying drivers implement the
> new style klink_settings API.

Series applied, thanks Florian.

^ permalink raw reply

* Re: [PATCH] [RESEND][for 4.14] net: qcom/emac: add software control for pause frame mode
From: David Miller @ 2017-09-21 22:19 UTC (permalink / raw)
  To: timur; +Cc: netdev
In-Reply-To: <1505939573-17335-1-git-send-email-timur@codeaurora.org>

From: Timur Tabi <timur@codeaurora.org>
Date: Wed, 20 Sep 2017 15:32:53 -0500

> The EMAC has the option of sending only a single pause frame when
> flow control is enabled and the RX queue is full.  Although sending
> only one pause frame has little value, this would allow admins to
> enable automatic flow control without having to worry about the EMAC
> flooding nearby switches with pause frames if the kernel hangs.
> 
> The option is enabled by using the single-pause-mode private flag.
> 
> Signed-off-by: Timur Tabi <timur@codeaurora.org>

I guess this is fine.  Applied, thanks.

^ permalink raw reply

* Re: [PATCH] hv_netvsc: fix send buffer failure on MTU change
From: David Miller @ 2017-09-21 22:18 UTC (permalink / raw)
  To: stephen; +Cc: kys, netdev, alexng, sthemmin
In-Reply-To: <20170920181735.31316-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed, 20 Sep 2017 11:17:35 -0700

> From: Alex Ng <alexng@microsoft.com>
> 
> If MTU is changed the host would reject the send buffer change.
> This problem is result of recent change to allow changing send
> buffer size.
> 
> Every time we change the MTU, we store the previous net_device section
> count before destroying the buffer, but we don’t store the previous
> section size. When we reinitialize the buffer, its size is calculated
> by multiplying the previous count and previous size. Since we
> continuously increase the MTU, the host returns us a decreasing count
> value while the section size is reinitialized to 1728 bytes every
> time.
> 
> This eventually leads to a condition where the calculated buf_size is
> so small that the host rejects it.
> 
> Fixes: 8b5327975ae1 ("netvsc: allow controlling send/recv buffer size")
> Signed-off-by: Alex Ng <alexng@microsoft.com>
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: use dedicated CPU port
From: David Miller @ 2017-09-21 22:16 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <20170920162805.29503-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Wed, 20 Sep 2017 12:28:05 -0400

> Each port in DSA has its own dedicated CPU port currently available in
> its parent switch's ds->ports[port].cpu_dp. Use it instead of getting
> the unique tree CPU port, which will be deprecated soon.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: avoid a full fib lookup when rp_filter is disabled.
From: David Miller @ 2017-09-21 22:15 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, edumazet, hannes
In-Reply-To: <597db26b5e57eb89d2ff27454cce997fa6c0f5aa.1505924560.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 20 Sep 2017 18:26:53 +0200

> Since commit 1dced6a85482 ("ipv4: Restore accept_local behaviour
> in fib_validate_source()") a full fib lookup is needed even if
> the rp_filter is disabled, if accept_local is false - which is
> the default.
> 
> What we really need in the above scenario is just checking
> that the source IP address is not local, and in most case we
> can do that is a cheaper way looking up the ifaddr hash table.
> 
> This commit adds a helper for such lookup, and uses it to
> validate the src address when rp_filter is disabled and no
> 'local' routes are created by the user space in the relevant
> namespace.
> 
> A new ipv4 netns flag is added to account for such routes.
> We need that to preserve the same behavior we had before this
> patch.
> 
> It also drops the checks to bail early from __fib_validate_source,
> added by the commit 1dced6a85482 ("ipv4: Restore accept_local
> behaviour in fib_validate_source()") they do not give any
> measurable performance improvement: if we do the lookup with are
> on a slower path.
> 
> This improves UDP performances for unconnected sockets
> when rp_filter is disabled by 5% and also gives small but
> measurable performance improvement for TCP flood scenarios.
> 
> v1 -> v2:
>  - use the ifaddr lookup helper in __ip_dev_find(), as suggested
>    by Eric
>  - fall-back to full lookup if custom local routes are present
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Looks good, applied, thanks!

^ permalink raw reply

* Re: [Patch net] net_sched: remove cls_flower idr on failure
From: David Miller @ 2017-09-21 22:14 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, chrism, jiri
In-Reply-To: <20170920161845.28753-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed, 20 Sep 2017 09:18:45 -0700

> Fixes: c15ab236d69d ("net/sched: Change cls_flower to use IDR")
> Cc: Chris Mi <chrism@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
From: Florian Fainelli @ 2017-09-21 22:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paweł Staszewski, Paolo Abeni, Jesper Dangaard Brouer,
	Linux Kernel Network Developers, Alexander Duyck
In-Reply-To: <1506030895.29839.153.camel@edumazet-glaptop3.roam.corp.google.com>

On 09/21/2017 02:54 PM, Eric Dumazet wrote:
> On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
> 
>> Would not this apply to pretty much any stacked device setup though? It
>> seems like any network device that just queues up its packet on another
>> physical device for actual transmission may need that (e.g: DSA, bond,
>> team, more.?)
> 
> We support bonding and team already.

Right, so that seems to mostly leave us with DSA at least. What about
other devices that also have IFF_NO_QUEUE set?
-- 
Florian

^ permalink raw reply

* [PATCH] e1000: avoid null pointer dereference on invalid stat type
From: Colin King @ 2017-09-21 22:01 UTC (permalink / raw)
  To: Jeff Kirsher, intel-wired-lan, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Currently if the stat type is invalid then data[i] is being set
either by dereferencing a null pointer p, or it is reading from
an incorrect previous location if we had a valid stat type
previously.  Fix this by nullify pointer p if a stat type is
invalid and only setting data if p is not null.

Detected by CoverityScan, CID#113385 ("Explicit null dereferenced")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/intel/e1000/e1000_ethtool.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
index ec8aa4562cc9..2ef6f08b580b 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
@@ -1824,11 +1824,12 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	int i;
-	char *p = NULL;
 	const struct e1000_stats *stat = e1000_gstrings_stats;
 
 	e1000_update_stats(adapter);
 	for (i = 0; i < E1000_GLOBAL_STATS_LEN; i++) {
+		char *p;
+
 		switch (stat->type) {
 		case NETDEV_STATS:
 			p = (char *)netdev + stat->stat_offset;
@@ -1837,12 +1838,13 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
 			p = (char *)adapter + stat->stat_offset;
 			break;
 		default:
+			p = NULL;
 			WARN_ONCE(1, "Invalid E1000 stat type: %u index %d\n",
 				  stat->type, i);
 			break;
 		}
 
-		if (stat->sizeof_stat == sizeof(u64))
+		if (p && stat->sizeof_stat == sizeof(u64))
 			data[i] = *(u64 *)p;
 		else
 			data[i] = *(u32 *)p;
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH 2/2] userns: control capabilities of some user namespaces
From: Mahesh Bandewar @ 2017-09-21 21:56 UTC (permalink / raw)
  To: LKML, Netdev
  Cc: Kees Cook, Serge Hallyn, Eric W . Biederman, Eric Dumazet,
	David Miller, Mahesh Bandewar, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

With this new notion of "controlled" user-namespaces, the controlled
user-namespaces are marked at the time of their creation while the
capabilities of processes that belong to them are controlled using the
global mask.

Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
that belongs to uncontrolled user-ns can create another (child) user-
namespace that is uncontrolled. Any other process (that either does
not have SYS_ADMIN or belongs to a controlled user-ns) can only
create a user-ns that is controlled.

global-capability-whitelist (controlled_userns_caps_whitelist) is used
at the capability check-time and keeps the semantics for the processes
that belong to uncontrolled user-ns as it is. Processes that belong to
controlled user-ns however are subjected to different checks-

   (a) if the capability in question is controlled and process belongs
       to controlled user-ns, then it's always denied.
   (b) if the capability in question is NOT controlled then fall back
       to the traditional check.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 include/linux/capability.h     |  1 +
 include/linux/user_namespace.h | 20 ++++++++++++++++++++
 kernel/capability.c            |  5 +++++
 kernel/user_namespace.c        |  3 +++
 security/commoncap.c           |  8 ++++++++
 5 files changed, 37 insertions(+)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6c0b9677c03f..b8c6cac18658 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 				 void __user *buff, size_t *lenp, loff_t *ppos);
+bool is_capability_controlled(int cap);
 
 extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
 
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..e890fe81b47e 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -22,6 +22,7 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 };
 
 #define USERNS_SETGROUPS_ALLOWED 1UL
+#define USERNS_CONTROLLED	 2UL
 
 #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
 
@@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
 		__put_user_ns(ns);
 }
 
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+	return ns->flags & USERNS_CONTROLLED;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+	ns->flags |= USERNS_CONTROLLED;
+}
+
 struct seq_operations;
 extern const struct seq_operations proc_uid_seq_operations;
 extern const struct seq_operations proc_gid_seq_operations;
@@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
 {
 	return ERR_PTR(-EPERM);
 }
+
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+	return false;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+}
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/capability.c b/kernel/capability.c
index 62dbe3350c1b..40a38cc4ff43 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
 }
 
 /* Controlled-userns capabilities routines */
+bool is_capability_controlled(int cap)
+{
+	return !cap_raised(controlled_userns_caps_whitelist, cap);
+}
+
 #ifdef CONFIG_SYSCTL
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 				 void __user *buff, size_t *lenp, loff_t *ppos)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..f393ea5108f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	cred->cap_effective = CAP_FULL_SET;
 	cred->cap_ambient = CAP_EMPTY_SET;
 	cred->cap_bset = CAP_FULL_SET;
+	if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
+	    is_user_ns_controlled(user_ns->parent))
+		mark_user_ns_controlled(user_ns);
 #ifdef CONFIG_KEYS
 	key_put(cred->request_key_auth);
 	cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index 6bf72b175b49..26f41602da10 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
 {
 	struct user_namespace *ns = targ_ns;
 
+	/* If the capability is controlled and user-ns that process
+	 * belongs-to is 'controlled' then return EPERM and no need
+	 * to check the user-ns hierarchy.
+	 */
+	if (is_user_ns_controlled(cred->user_ns) &&
+	    is_capability_controlled(cap))
+		return -EPERM;
+
 	/* See if cred has the capability in the target user namespace
 	 * by examining the target user namespace and all of the target
 	 * user namespace's parents.
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* [RFC PATCH 1/2] capability: introduce sysctl for controlled user-ns capability whitelist
From: Mahesh Bandewar @ 2017-09-21 21:56 UTC (permalink / raw)
  To: LKML, Netdev
  Cc: Kees Cook, Serge Hallyn, Eric W . Biederman, Eric Dumazet,
	David Miller, Mahesh Bandewar, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
takes input as capability mask expressed as two comma separated hex
u32 words. The mask, however, is stored in kernel as kernel_cap_t type.

Any capabilities that are not part of this mask will be controlled and
will not be allowed to processes in controlled user-ns.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Serge Hallyn <serge@hallyn.com>
CC: Kees Cook <keescook@chromium.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>

---
 Documentation/sysctl/kernel.txt | 21 ++++++++++++++++++
 include/linux/capability.h      |  3 +++
 kernel/capability.c             | 47 +++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c                 |  5 +++++
 4 files changed, 76 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ce61d1fe08ca..ec0d74476f48 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
 - bootloader_version	     [ X86 only ]
 - callhome		     [ S390 only ]
 - cap_last_cap
+- controlled_userns_caps_whitelist
 - core_pattern
 - core_pipe_limit
 - core_uses_pid
@@ -186,6 +187,26 @@ CAP_LAST_CAP from the kernel.
 
 ==============================================================
 
+controlled_userns_caps_whitelist
+
+Capability mask that is whitelisted for "controlled" user namespaces.
+Any capability that is missing from this mask will not be allowed to
+any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
+is not part of this mask, then processes running inside any controlled
+userns's will not be allowed to perform action that needs CAP_NET_RAW
+capability. However, processes that are attached to a parent user-ns
+hierarchy that is *not* controlled and has CAP_NET_RAW can continue
+performing those actions. User-namespaces are marked "controlled" at
+the time of their creation based on the capabilities of the creator.
+A process that does not have CAP_SYS_ADMIN will create user-namespaces
+that are controlled.
+
+The value is expressed as two comma separated hex words (u32). This
+sysctl is avaialble in init-ns and users with CAP_SYS_ADMIN in init-ns
+are allowed to make changes.
+
+==============================================================
+
 core_pattern:
 
 core_pattern is used to specify a core dumpfile pattern name.
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b52e278e4744..6c0b9677c03f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,6 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
+#include <linux/sysctl.h>
 
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
@@ -247,6 +248,8 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+				 void __user *buff, size_t *lenp, loff_t *ppos);
 
 extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
 
diff --git a/kernel/capability.c b/kernel/capability.c
index f97fe77ceb88..62dbe3350c1b 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -28,6 +28,8 @@ EXPORT_SYMBOL(__cap_empty_set);
 
 int file_caps_enabled = 1;
 
+kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
+
 static int __init file_caps_disable(char *str)
 {
 	file_caps_enabled = 0;
@@ -506,3 +508,48 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
 	rcu_read_unlock();
 	return (ret == 0);
 }
+
+/* Controlled-userns capabilities routines */
+#ifdef CONFIG_SYSCTL
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+				 void __user *buff, size_t *lenp, loff_t *ppos)
+{
+	DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
+	struct ctl_table caps_table;
+	char tbuf[NAME_MAX];
+	int ret;
+
+	ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
+				   controlled_userns_caps_whitelist.cap,
+				   _KERNEL_CAPABILITY_U32S);
+	if (ret != CAP_LAST_CAP)
+		return -1;
+
+	scnprintf(tbuf, NAME_MAX, "%*pb", CAP_LAST_CAP, caps_bitmap);
+
+	caps_table.data = tbuf;
+	caps_table.maxlen = NAME_MAX;
+	caps_table.mode = table->mode;
+	ret = proc_dostring(&caps_table, write, buff, lenp, ppos);
+	if (ret)
+		return ret;
+	if (write) {
+		kernel_cap_t tmp;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		ret = bitmap_parse_user(buff, *lenp, caps_bitmap, CAP_LAST_CAP);
+		if (ret)
+			return ret;
+
+		ret = bitmap_to_u32array(tmp.cap, _KERNEL_CAPABILITY_U32S,
+					 caps_bitmap, CAP_LAST_CAP);
+		if (ret != CAP_LAST_CAP)
+			return -1;
+
+		controlled_userns_caps_whitelist = tmp;
+	}
+	return 0;
+}
+#endif /* CONFIG_SYSCTL */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..9903cf0de287 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,11 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "controlled_userns_caps_whitelist",
+		.mode		= 0644,
+		.proc_handler	= proc_douserns_caps_whitelist,
+	},
 	{ }
 };
 
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* [RFC PATCH 0/2] capability controlled user-namespaces
From: Mahesh Bandewar @ 2017-09-21 21:56 UTC (permalink / raw)
  To: LKML, Netdev
  Cc: Kees Cook, Serge Hallyn, Eric W . Biederman, Eric Dumazet,
	David Miller, Mahesh Bandewar, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

TL;DR version
-------------
Creating a sandbox environment with namespaces is challenging
considering what these sandboxed processes can engage into. e.g.
CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
Current form of user-namespaces, however, if changed a bit can allow
us to create a sandbox environment without locking down user-
namespaces.

Detailed version
----------------

Problem
-------
User-namespaces in the current form have increased the attack surface as
any process can acquire capabilities which are not available to them (by
default) by performing combination of clone()/unshare()/setns() syscalls.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <netinet/in.h>

    int main(int ac, char **av)
    {
        int sock = -1;

        printf("Attempting to open RAW socket before unshare()...\n");
        sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
        if (sock < 0) {
            perror("socket() SOCK_RAW failed: ");
        } else {
            printf("Successfully opened RAW-Sock before unshare().\n");
            close(sock);
            sock = -1;
        }

        if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
            perror("unshare() failed: ");
            return 1;
        }

        printf("Attempting to open RAW socket after unshare()...\n");
        sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
        if (sock < 0) {
            perror("socket() SOCK_RAW failed: ");
        } else {
            printf("Successfully opened RAW-Sock after unshare().\n");
            close(sock);
            sock = -1;
        }

        return 0;
    }

The above example shows how easy it is to acquire NET_RAW capabilities
and once acquired, these processes could take benefit of above mentioned
or similar issues discovered/undiscovered with malicious intent. Note
that this is just an example and the problem/solution is not limited
to NET_RAW capability *only*. 

The easiest fix one can apply here is to lock-down user-namespaces which
many of the distros do (i.e. don't allow users to create user namespaces),
but unfortunately that prevents everyone from using them.

Approach
--------
Introduce a notion of 'controlled' user-namespaces. Every process on
the host is allowed to create user-namespaces (governed by the limit
imposed by per-ns sysctl) however, mark user-namespaces created by
sandboxed processes as 'controlled'. Use this 'mark' at the time of
capability check in conjunction with a global capability whitelist.
If the capability is not whitelisted, processes that belong to 
controlled user-namespaces will not be allowed.

Once a user-ns is marked as 'controlled'; all its child user-
namespaces are marked as 'controlled' too.

A global whitelist is list of capabilities governed by the
sysctl which is available to (privileged) user in init-ns to modify
while it's applicable to all controlled user-namespaces on the host.

Marking user-namespaces controlled without modifying the whitelist is
equivalent of the current behavior. The default value of whitelist includes
all capabilities so that the compatibility is maintained. However it gives
admins fine-grained ability to control various capabilities system wide
without locking down user-namespaces.

Please see individual patches in this series.

Mahesh Bandewar (2):
  capability: introduce sysctl for controlled user-ns capability
    whitelist
  userns: control capabilities of some user namespaces

 Documentation/sysctl/kernel.txt | 21 +++++++++++++++++
 include/linux/capability.h      |  4 ++++
 include/linux/user_namespace.h  | 20 ++++++++++++++++
 kernel/capability.c             | 52 +++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c                 |  5 ++++
 kernel/user_namespace.c         |  3 +++
 security/commoncap.c            |  8 +++++++
 7 files changed, 113 insertions(+)

-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply

* Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
From: Eric Dumazet @ 2017-09-21 21:54 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Paweł Staszewski, Paolo Abeni, Jesper Dangaard Brouer,
	Linux Kernel Network Developers, Alexander Duyck
In-Reply-To: <6607c631-580d-825b-6205-6f6ee688ce32@gmail.com>

On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:

> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g: DSA, bond,
> team, more.?)

We support bonding and team already.

^ permalink raw reply

* Re: [PATCH net] bpf: one perf event close won't free bpf program attached by another perf event
From: Alexei Starovoitov @ 2017-09-21 21:53 UTC (permalink / raw)
  To: Peter Zijlstra, Yonghong Song; +Cc: Steven Rostedt, daniel, netdev, kernel-team
In-Reply-To: <20170921111706.343om7252gcagco6@hirez.programming.kicks-ass.net>

On 9/21/17 4:17 AM, Peter Zijlstra wrote:
> On Wed, Sep 20, 2017 at 10:20:13PM -0700, Yonghong Song wrote:
>>> (2). trace_event_call->perf_events are per cpu data structure, that
>>> means, some filtering logic is needed to avoid the same perf_event prog
>>> is executing twice.
>>
>> What I mean here is that the trace_event_call->perf_events need to be
>> checked on ALL cpus since bpf prog should be executed regardless of
>> cpu affiliation. It is possible that the same perf_event in different
>> per_cpu bucket and hence filtering is needed to avoid the same
>> perf_event bpf_prog is executed twice.
>
> An event will only ever be on a single CPU's list at any one time IIRC.

yes, but doing for_each_cpu there is not an option. too slow.
struct trace_event_call is the only stable argument in
perf_trace_##call(), so we gotta have a pointer there for stuff
we need to run.
This patch added another annoying pointer, since it's the simplest
bugfix for stable. For net-next we're going to remove it, since
we're working on multi-prog support for kprobes/tracepoints.
(right now there is only one prog allowed and that's very limiting)
With multi-prog that bpf_prog_owner pointer will be removed and
existing 'struct bpf_prog *prog' pointer will be replaced with
something else.

> Now, hysterically perf_event_set_bpf_prog used the tracepoint crud
> because that already had bpf bits in. But it might make sense to look at
> unifying the bpf stuff across all the different event types. Have them
> all use event->prog.

it sounds good in theory, but in practice we need a separate
'stuff to run' pointer in both perf_event and trace_even_call,
since that's what being passed to overflow_handle and perf_trace_##call.

> I suspect that would break a fair bunch of bpf proglets, since the data
> access to the trace data would be completely different, but it would be
> much nicer to not have this distinction based on event type.

such things are certainly an abi.
kprobe+bpf has to see struct pt_regs
perf_event+bpf has to see struct bpf_perf_event_data and
tracepoint+bpf has to see struct foo { fields }
The fields will change every time tracepoint is changed.
That's fine.
But we cannot unify kprobe with tracepoints with perf_event prog types.
And frankly I don't see the need.
Note that in pt_regs we don't need to populate everything.
The 'optimized fprobe' we were talking about at plumbers we
would populate di,si,dx,cx,sp since most of the kprobe+bpf progs
don't care about the other regs and especially cpu flags.
So plenty of room for tweaks and optimizations.

^ permalink raw reply

* Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
From: Paweł Staszewski @ 2017-09-21 21:43 UTC (permalink / raw)
  To: Florian Fainelli, Eric Dumazet
  Cc: Paolo Abeni, Jesper Dangaard Brouer,
	Linux Kernel Network Developers, Alexander Duyck
In-Reply-To: <6607c631-580d-825b-6205-6f6ee688ce32@gmail.com>



W dniu 2017-09-21 o 23:41, Florian Fainelli pisze:
> On 09/21/2017 02:26 PM, Paweł Staszewski wrote:
>>
>> W dniu 2017-08-15 o 11:11, Paweł Staszewski pisze:
>>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>>> index
>>> 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
>>> 100644
>>> --- a/net/8021q/vlan_netlink.c
>>> +++ b/net/8021q/vlan_netlink.c
>>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net,
>>> struct net_device *dev,
>>>        vlan->vlan_proto = proto;
>>>        vlan->vlan_id     = nla_get_u16(data[IFLA_VLAN_ID]);
>>>        vlan->real_dev     = real_dev;
>>> +    dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>>        vlan->flags     = VLAN_FLAG_REORDER_HDR;
>>>          err = vlan_check_real_dev(real_dev, vlan->vlan_proto,
>>> vlan->vlan_id);
>> Any plans for this patch to go normal into the kernel ?
> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g: DSA, bond,
> team, more.?)
Some devices libe bond have it.

Just maybee when there was first patch vlans were not taken into account.
Did not checked all :)

But I know Eric will do :)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox