Netdev List

* [PATCH net-next 15/17] net: Enable some sysctls that are safe for the userns root
From: Eric W. Biederman @ 2012-11-16 13:03 UTC
  To: David Miller
  Cc: netdev, Linux Containers,
	Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

- Enable the per device ipv4 sysctls:
   net/ipv4/conf/<if>/forwarding
   net/ipv4/conf/<if>/mc_forwarding
   net/ipv4/conf/<if>/accept_redirects
   net/ipv4/conf/<if>/secure_redirects
   net/ipv4/conf/<if>/shared_media
   net/ipv4/conf/<if>/rp_filter
   net/ipv4/conf/<if>/send_redirects
   net/ipv4/conf/<if>/accept_source_route
   net/ipv4/conf/<if>/accept_local
   net/ipv4/conf/<if>/src_valid_mark
   net/ipv4/conf/<if>/proxy_arp
   net/ipv4/conf/<if>/medium_id
   net/ipv4/conf/<if>/bootp_relay
   net/ipv4/conf/<if>/log_martians
   net/ipv4/conf/<if>/tag
   net/ipv4/conf/<if>/arp_filter
   net/ipv4/conf/<if>/arp_announce
   net/ipv4/conf/<if>/arp_ignore
   net/ipv4/conf/<if>/arp_accept
   net/ipv4/conf/<if>/arp_notify
   net/ipv4/conf/<if>/proxy_arp_pvlan
   net/ipv4/conf/<if>/disable_xfrm
   net/ipv4/conf/<if>/disable_policy
   net/ipv4/conf/<if>/force_igmp_version
   net/ipv4/conf/<if>/promote_secondaries
   net/ipv4/conf/<if>/route_localnet

- Enable the global ipv4 sysctl:
   net/ipv4/ip_forward

- Enable the per device ipv6 sysctls:
   net/ipv6/conf/<if>/forwarding
   net/ipv6/conf/<if>/hop_limit
   net/ipv6/conf/<if>/mtu
   net/ipv6/conf/<if>/accept_ra
   net/ipv6/conf/<if>/accept_redirects
   net/ipv6/conf/<if>/autoconf
   net/ipv6/conf/<if>/dad_transmits
   net/ipv6/conf/<if>/router_solicitations
   net/ipv6/conf/<if>/router_solicitation_interval
   net/ipv6/conf/<if>/router_solicitation_delay
   net/ipv6/conf/<if>/force_mld_version
   net/ipv6/conf/<if>/use_tempaddr
   net/ipv6/conf/<if>/temp_valid_lft
   net/ipv6/conf/<if>/temp_prefered_lft
   net/ipv6/conf/<if>/regen_max_retry
   net/ipv6/conf/<if>/max_desync_factor
   net/ipv6/conf/<if>/max_addresses
   net/ipv6/conf/<if>/accept_ra_defrtr
   net/ipv6/conf/<if>/accept_ra_pinfo
   net/ipv6/conf/<if>/accept_ra_rtr_pref
   net/ipv6/conf/<if>/router_probe_interval
   net/ipv6/conf/<if>/accept_ra_rt_info_max_plen
   net/ipv6/conf/<if>/proxy_ndp
   net/ipv6/conf/<if>/accept_source_route
   net/ipv6/conf/<if>/optimistic_dad
   net/ipv6/conf/<if>/mc_forwarding
   net/ipv6/conf/<if>/disable_ipv6
   net/ipv6/conf/<if>/accept_dad
   net/ipv6/conf/<if>/force_tllao

- Enable the global ipv6 sysctls:
   net/ipv6/bindv6only
   net/ipv6/icmp/ratelimit

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/devinet.c         |    8 --------
 net/ipv6/addrconf.c        |    4 ----
 net/ipv6/icmp.c            |    7 +------
 net/ipv6/sysctl_net_ipv6.c |    4 ----
 4 files changed, 1 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index f75f4f6..446b1b9 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1643,10 +1643,6 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
 		t->devinet_vars[i].extra2 = net;
 	}
 
-	/* Don't export sysctls to unprivileged users */
-	if (net->user_ns != &init_user_ns)
-		t->devinet_vars[0].procname = NULL;
-
 	snprintf(path, sizeof(path), "net/ipv4/conf/%s", dev_name);
 
 	t->sysctl_header = register_net_sysctl(net, path, t->devinet_vars);
@@ -1732,10 +1728,6 @@ static __net_init int devinet_init_net(struct net *net)
 		tbl[0].data = &all->data[IPV4_DEVCONF_FORWARDING - 1];
 		tbl[0].extra1 = all;
 		tbl[0].extra2 = net;
-
-		/* Don't export sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns)
-			tbl[0].procname = NULL;
 #endif
 	}
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b8e0a62..5f1967b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4588,10 +4588,6 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
 		t->addrconf_vars[i].extra2 = net;
 	}
 
-	/* Don't export sysctls to unprivileged users */
-	if (net->user_ns != &init_user_ns)
-		t->addrconf_vars[0].procname = NULL;
-
 	snprintf(path, sizeof(path), "net/ipv6/conf/%s", dev_name);
 
 	t->sysctl_header = register_net_sysctl(net, path, t->addrconf_vars);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index db9df8a..24d69db 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -967,14 +967,9 @@ struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
 			sizeof(ipv6_icmp_table_template),
 			GFP_KERNEL);
 
-	if (table) {
+	if (table)
 		table[0].data = &net->ipv6.sysctl.icmpv6_time;
 
-		/* Don't export sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns)
-			table[0].procname = NULL;
-	}
-
 	return table;
 }
 #endif
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index b06fd07..e85c48b 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -52,10 +52,6 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 		goto out;
 	ipv6_table[0].data = &net->ipv6.sysctl.bindv6only;
 
-	/* Don't export sysctls to unprivileged users */
-	if (net->user_ns != &init_user_ns)
-		ipv6_table[0].procname = NULL;
-
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
 		goto out_ipv6_table;
-- 
1.7.5.4
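
Editor's sketch (not part of the patch): the removed checks hid a whole sysctl table by clearing the name of its first entry. The userspace model below assumes, as was the convention for sysctl tables of this era, that a table is an array terminated by an entry whose procname is NULL, so clearing entry 0 leaves nothing to register. The names toy_ctl_table, toy_visible_entries and toy_hide_if_unprivileged are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct ctl_table. */
struct toy_ctl_table {
	const char *procname;	/* NULL marks the end of the table */
	int *data;
};

/* Count the entries a registration walk would see before the sentinel. */
size_t toy_visible_entries(const struct toy_ctl_table *table)
{
	size_t n = 0;

	assert(table != NULL);
	while (table[n].procname != NULL)
		n++;
	return n;
}

/*
 * Mimic the check this patch removes: outside the initial user
 * namespace, turn entry 0 into the sentinel so the table looks empty.
 */
void toy_hide_if_unprivileged(struct toy_ctl_table *table, int in_init_user_ns)
{
	assert(table != NULL);
	if (!in_init_user_ns)
		table[0].procname = NULL;
}
```

With the check gone, the table registered for a network namespace owned by a non-initial user namespace keeps all of its entries, which is what makes the sysctls listed in the commit message visible there.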


* [PATCH net-next 16/17] net: Enable userns root rtnl calls that are safe for unprivileged users
From: Eric W. Biederman @ 2012-11-16 13:03 UTC
  To: David Miller
  Cc: netdev, Linux Containers,
	Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

- Only allow moving network devices to network namespaces you have
  CAP_NET_ADMIN privileges over.

- Enable creating/deleting/modifying interfaces
- Enable adding/deleting addresses
- Enable adding/setting/deleting neighbour entries
- Enable adding/removing routes
- Enable adding/removing fib rules
- Enable setting the forwarding state
- Enable adding/removing ipv6 address labels
- Enable setting bridge parameters

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/bridge/br_netlink.c |    3 ---
 net/core/fib_rules.c    |    6 ------
 net/core/neighbour.c    |    9 ---------
 net/core/rtnetlink.c    |   13 ++++---------
 net/ipv4/devinet.c      |    6 ------
 net/ipv4/fib_frontend.c |    6 ------
 net/ipv6/addrconf.c     |    6 ------
 net/ipv6/addrlabel.c    |    3 ---
 net/ipv6/route.c        |    6 ------
 9 files changed, 4 insertions(+), 54 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 251d558..093f527 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -153,9 +153,6 @@ static int br_rtm_setlink(struct sk_buff *skb,  struct nlmsghdr *nlh, void *arg)
 	struct net_bridge_port *p;
 	u8 new_state;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (nlmsg_len(nlh) < sizeof(*ifm))
 		return -EINVAL;
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index bf5b5b8..58a4ba2 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -275,9 +275,6 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 	struct nlattr *tb[FRA_MAX+1];
 	int err = -EINVAL, unresolved = 0;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
@@ -427,9 +424,6 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 	struct nlattr *tb[FRA_MAX+1];
 	int err = -EINVAL;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7adcdaf..f1c0c2e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1620,9 +1620,6 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct net_device *dev = NULL;
 	int err = -EINVAL;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	ASSERT_RTNL();
 	if (nlmsg_len(nlh) < sizeof(*ndm))
 		goto out;
@@ -1687,9 +1684,6 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct net_device *dev = NULL;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	ASSERT_RTNL();
 	err = nlmsg_parse(nlh, sizeof(*ndm), tb, NDA_MAX, NULL);
 	if (err < 0)
@@ -1968,9 +1962,6 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct nlattr *tb[NDTA_MAX+1];
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ndtmsg), tb, NDTA_MAX,
 			  nl_neightbl_policy);
 	if (err < 0)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5d55c30..06dcf44 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1316,6 +1316,10 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 			err = PTR_ERR(net);
 			goto errout;
 		}
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
+			err = -EPERM;
+			goto errout;
+		}
 		err = dev_change_net_namespace(dev, net, ifname);
 		put_net(net);
 		if (err)
@@ -1547,9 +1551,6 @@ static int rtnl_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct nlattr *tb[IFLA_MAX+1];
 	char ifname[IFNAMSIZ];
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
 	if (err < 0)
 		goto errout;
@@ -1593,9 +1594,6 @@ static int rtnl_dellink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	int err;
 	LIST_HEAD(list_kill);
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
 	if (err < 0)
 		return err;
@@ -1726,9 +1724,6 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct nlattr *linkinfo[IFLA_INFO_MAX+1];
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 #ifdef CONFIG_MODULES
 replay:
 #endif
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 446b1b9..7059d6f 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -538,9 +538,6 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg
 
 	ASSERT_RTNL();
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy);
 	if (err < 0)
 		goto errout;
@@ -648,9 +645,6 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg
 
 	ASSERT_RTNL();
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	ifa = rtm_to_ifaddr(net, nlh);
 	if (IS_ERR(ifa))
 		return PTR_ERR(ifa);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 784716a..5cd75e2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -613,9 +613,6 @@ static int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh, void *ar
 	struct fib_table *tb;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = rtm_to_fib_config(net, skb, nlh, &cfg);
 	if (err < 0)
 		goto errout;
@@ -638,9 +635,6 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, void *ar
 	struct fib_table *tb;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = rtm_to_fib_config(net, skb, nlh, &cfg);
 	if (err < 0)
 		goto errout;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5f1967b..27b1e8f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3369,9 +3369,6 @@ inet6_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct in6_addr *pfx;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv6_policy);
 	if (err < 0)
 		return err;
@@ -3442,9 +3439,6 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	u8 ifa_flags;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv6_policy);
 	if (err < 0)
 		return err;
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index b106f80..ff76eec 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -425,9 +425,6 @@ static int ip6addrlbl_newdel(struct sk_buff *skb, struct nlmsghdr *nlh,
 	u32 label;
 	int err = 0;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = nlmsg_parse(nlh, sizeof(*ifal), tb, IFAL_MAX, ifal_policy);
 	if (err < 0)
 		return err;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index be2c173..c7f7fda 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2336,9 +2336,6 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *a
 	struct fib6_config cfg;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = rtm_to_fib6_config(skb, nlh, &cfg);
 	if (err < 0)
 		return err;
@@ -2351,9 +2348,6 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *a
 	struct fib6_config cfg;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	err = rtm_to_fib6_config(skb, nlh, &cfg);
 	if (err < 0)
 		return err;
-- 
1.7.5.4


* [PATCH net-next 17/17] net: Make CAP_NET_BIND_SERVICE per user namespace
From: Eric W. Biederman @ 2012-11-16 13:03 UTC
  To: David Miller
  Cc: netdev, Linux Containers,
	Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

Allow privileged users in any user namespace to bind to
privileged sockets in network namespaces they control.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/af_inet.c  |    6 ++++--
 net/ipv6/af_inet6.c |    2 +-
 net/sctp/socket.c   |    8 +++++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7449bcf..6a76956 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -474,6 +474,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
 	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
 	unsigned short snum;
 	int chk_addr_ret;
 	int err;
@@ -497,7 +498,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}
 
-	chk_addr_ret = inet_addr_type(sock_net(sk), addr->sin_addr.s_addr);
+	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
 
 	/* Not specified by any standard per-se, however it breaks too
 	 * many applications when removed.  It is unfortunate since
@@ -517,7 +518,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 
 	snum = ntohs(addr->sin_port);
 	err = -EACCES;
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK &&
+	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		goto out;
 
 	/*      We keep a pair of addresses. rcv_saddr is the one
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 19f68b2..5d4e45e 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -283,7 +283,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 59d16ea..e4a362d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -336,6 +336,7 @@ static struct sctp_af *sctp_sockaddr_af(struct sctp_sock *opt,
 /* Bind a local address either to an endpoint or to an association.  */
 SCTP_STATIC int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 {
+	struct net *net = sock_net(sk);
 	struct sctp_sock *sp = sctp_sk(sk);
 	struct sctp_endpoint *ep = sp->ep;
 	struct sctp_bind_addr *bp = &ep->base.bind_addr;
@@ -379,7 +380,8 @@ SCTP_STATIC int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 		}
 	}
 
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK &&
+	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	/* See if the address matches any of the addresses we may have
@@ -1162,7 +1164,7 @@ static int __sctp_connect(struct sock* sk,
 				 * be permitted to open new associations.
 				 */
 				if (ep->base.bind_addr.port < PROT_SOCK &&
-				    !capable(CAP_NET_BIND_SERVICE)) {
+				    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
 					err = -EACCES;
 					goto out_free;
 				}
@@ -1791,7 +1793,7 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 			 * associations.
 			 */
 			if (ep->base.bind_addr.port < PROT_SOCK &&
-			    !capable(CAP_NET_BIND_SERVICE)) {
+			    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
 				err = -EACCES;
 				goto out_unlock;
 			}
-- 
1.7.5.4


* [PATCH net-next 02/17] userns: make each net (net_ns) belong to a user_ns
From: Eric W. Biederman @ 2012-11-16 13:02 UTC
  To: David Miller; +Cc: netdev, Serge Hallyn, Linux Containers, Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

The user namespace which creates a new network namespace owns that
namespace and all resources created in it.  This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives.  Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.

This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/net_namespace.h |    9 +++++++--
 kernel/nsproxy.c            |    2 +-
 net/core/net_namespace.c    |   16 ++++++++++++----
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 32dcb60..c5a43f5 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -23,6 +23,7 @@
 #endif
 #include <net/netns/xfrm.h>
 
+struct user_namespace;
 struct proc_dir_entry;
 struct net_device;
 struct sock;
@@ -53,6 +54,8 @@ struct net {
 	struct list_head	cleanup_list;	/* namespaces on death row */
 	struct list_head	exit_list;	/* Use only net_mutex */
 
+	struct user_namespace   *user_ns;	/* Owning user namespace */
+
 	struct proc_dir_entry 	*proc_net;
 	struct proc_dir_entry 	*proc_net_stat;
 
@@ -127,12 +130,14 @@ struct net {
 extern struct net init_net;
 
 #ifdef CONFIG_NET_NS
-extern struct net *copy_net_ns(unsigned long flags, struct net *net_ns);
+extern struct net *copy_net_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct net *old_net);
 
 #else /* CONFIG_NET_NS */
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
-static inline struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+static inline struct net *copy_net_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct net *old_net)
 {
 	if (flags & CLONE_NEWNET)
 		return ERR_PTR(-EINVAL);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b576f7f..7e1c3de 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -90,7 +90,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
-	new_nsp->net_ns = copy_net_ns(flags, tsk->nsproxy->net_ns);
+	new_nsp->net_ns = copy_net_ns(flags, task_cred_xxx(tsk, user_ns), tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2c1c590..6456439 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -13,6 +13,7 @@
 #include <linux/proc_fs.h>
 #include <linux/file.h>
 #include <linux/export.h>
+#include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -145,7 +146,7 @@ static void ops_free_list(const struct pernet_operations *ops,
 /*
  * setup_net runs the initializers for the network namespace object.
  */
-static __net_init int setup_net(struct net *net)
+static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 {
 	/* Must be called with net_mutex held */
 	const struct pernet_operations *ops, *saved_ops;
@@ -155,6 +156,7 @@ static __net_init int setup_net(struct net *net)
 	atomic_set(&net->count, 1);
 	atomic_set(&net->passive, 1);
 	net->dev_base_seq = 1;
+	net->user_ns = user_ns;
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -232,7 +234,8 @@ void net_drop_ns(void *p)
 		net_free(ns);
 }
 
-struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+struct net *copy_net_ns(unsigned long flags,
+			struct user_namespace *user_ns, struct net *old_net)
 {
 	struct net *net;
 	int rv;
@@ -243,8 +246,11 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	net = net_alloc();
 	if (!net)
 		return ERR_PTR(-ENOMEM);
+
+	get_user_ns(user_ns);
+
 	mutex_lock(&net_mutex);
-	rv = setup_net(net);
+	rv = setup_net(net, user_ns);
 	if (rv == 0) {
 		rtnl_lock();
 		list_add_tail_rcu(&net->list, &net_namespace_list);
@@ -252,6 +258,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	}
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
+		put_user_ns(user_ns);
 		net_drop_ns(net);
 		return ERR_PTR(rv);
 	}
@@ -308,6 +315,7 @@ static void cleanup_net(struct work_struct *work)
 	/* Finally it is safe to free my network namespace structure */
 	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
 		list_del_init(&net->exit_list);
+		put_user_ns(net->user_ns);
 		net_drop_ns(net);
 	}
 }
@@ -395,7 +403,7 @@ static int __init net_ns_init(void)
 	rcu_assign_pointer(init_net.gen, ng);
 
 	mutex_lock(&net_mutex);
-	if (setup_net(&init_net))
+	if (setup_net(&init_net, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
 	rtnl_lock();
-- 
1.7.5.4
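
Editor's sketch (not part of the patch): the ownership rule in the commit message, where privilege in the owning user namespace or any of its parents carries over to the network namespace, can be modelled as a walk up the parent chain. This is a deliberately simplified userspace model: it tracks a single capability bit per task and leaves out any rule about the creator of a child namespace. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial user namespace */
};

struct toy_net {
	struct toy_user_ns *user_ns;	/* owning user namespace, as in struct net */
};

struct toy_task {
	struct toy_user_ns *user_ns;	/* namespace the task's credentials live in */
	bool cap_net_admin;		/* capability held within that namespace */
};

/*
 * Walk from the target namespace towards the root. If the task's own
 * namespace is the target or one of its ancestors, the task's capability
 * bit decides; otherwise the task has no privilege over the target.
 */
bool toy_ns_capable(const struct toy_task *task, const struct toy_user_ns *target)
{
	const struct toy_user_ns *ns;

	assert(task != NULL && target != NULL);
	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == task->user_ns)
			return task->cap_net_admin;
	}
	return false;
}
```

In this model a privileged task in the initial user namespace passes the check for every network namespace, while root inside a child user namespace passes it only for network namespaces owned by that child or its descendants.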


* [PATCH net-next 09/17] net: Allow userns root control of the core of the network stack.
From: Eric W. Biederman @ 2012-11-16 13:03 UTC
  To: David Miller; +Cc: netdev, Serge Hallyn, Linux Containers, Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

Allow an unprivileged user who has created a user namespace, and then
created a network namespace, to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.

Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference, or the network device is a hardware NIC
that has been explicitly moved from the initial network namespace.

In general, policy and network stack state changes are allowed
while resource control is left unchanged.

Allow ethtool ioctls.

Allow binding to network devices.
Allow setting the socket mark.
Allow setting the socket priority.

Allow setting the network device alias via sysfs.
Allow setting the mtu via sysfs.
Allow changing the network device flags via sysfs.
Allow setting the network device group via sysfs.

Allow the following network device ioctls:
SIOCGMIIPHY
SIOCGMIIREG
SIOCSIFNAME
SIOCSIFFLAGS
SIOCSIFMETRIC
SIOCSIFMTU
SIOCSIFHWADDR
SIOCSIFSLAVE
SIOCADDMULTI
SIOCDELMULTI
SIOCSIFHWBROADCAST
SIOCSMIIREG
SIOCBONDENSLAVE
SIOCBONDRELEASE
SIOCBONDSETHWADDR
SIOCBONDCHANGEACTIVE
SIOCBRADDIF
SIOCBRDELIF
SIOCSHWTSTAMP

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/core/dev.c       |   17 +++++++++++++----
 net/core/ethtool.c   |    2 +-
 net/core/net-sysfs.c |   15 ++++++++++-----
 net/core/sock.c      |    7 ++++---
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 09cb3f6..7150ea9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5200,7 +5200,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSIFNAME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		dev_load(net, ifr.ifr_name);
 		rtnl_lock();
@@ -5221,16 +5221,25 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	 *	- require strict serialization.
 	 *	- do not return a value
 	 */
+	case SIOCSIFMAP:
+	case SIOCSIFTXQLEN:
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+		/* fall through */
+	/*
+	 *	These ioctl calls:
+	 *	- require local superuser power.
+	 *	- require strict serialization.
+	 *	- do not return a value
+	 */
 	case SIOCSIFFLAGS:
 	case SIOCSIFMETRIC:
 	case SIOCSIFMTU:
-	case SIOCSIFMAP:
 	case SIOCSIFHWADDR:
 	case SIOCSIFSLAVE:
 	case SIOCADDMULTI:
 	case SIOCDELMULTI:
 	case SIOCSIFHWBROADCAST:
-	case SIOCSIFTXQLEN:
 	case SIOCSMIIREG:
 	case SIOCBONDENSLAVE:
 	case SIOCBONDRELEASE:
@@ -5239,7 +5248,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCBRADDIF:
 	case SIOCBRDELIF:
 	case SIOCSHWTSTAMP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		/* fall through */
 	case SIOCBONDSLAVEINFOQUERY:
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 4d64cc2..a870543 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1460,7 +1460,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_GEEE:
 		break;
 	default:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	}
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bcf02f6..c66b8c2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -73,11 +73,12 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 			    const char *buf, size_t len,
 			    int (*set)(struct net_device *, unsigned long))
 {
-	struct net_device *net = to_net_dev(dev);
+	struct net_device *netdev = to_net_dev(dev);
+	struct net *net = dev_net(netdev);
 	unsigned long new;
 	int ret = -EINVAL;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	ret = kstrtoul(buf, 0, &new);
@@ -87,8 +88,8 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (dev_isalive(net)) {
-		if ((ret = (*set)(net, new)) == 0)
+	if (dev_isalive(netdev)) {
+		if ((ret = (*set)(netdev, new)) == 0)
 			ret = len;
 	}
 	rtnl_unlock();
@@ -264,6 +265,9 @@ static ssize_t store_tx_queue_len(struct device *dev,
 				  struct device_attribute *attr,
 				  const char *buf, size_t len)
 {
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
 	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
 }
 
@@ -271,10 +275,11 @@ static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct net *net = dev_net(netdev);
 	size_t count = len;
 	ssize_t ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* ignore trailing newline */
diff --git a/net/core/sock.c b/net/core/sock.c
index 8a146cf..85d75cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -515,7 +515,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
 
 	/* Sorry... */
 	ret = -EPERM;
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out;
 
 	ret = -EINVAL;
@@ -696,7 +696,8 @@ set_rcvbuf:
 		break;
 
 	case SO_PRIORITY:
-		if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN))
+		if ((val >= 0 && val <= 6) ||
+		    ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
@@ -813,7 +814,7 @@ set_rcvbuf:
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			ret = -EPERM;
 		else
 			sk->sk_mark = val;
-- 
1.7.5.4


* [PATCH net-next 13/17] net: Allow userns root to control the network bridge code.
From: Eric W. Biederman @ 2012-11-16 13:03 UTC
  To: David Miller; +Cc: netdev, Serge Hallyn, Linux Containers, Eric W. Biederman
In-Reply-To: <1353070992-5552-1-git-send-email-ebiederm@xmission.com>

From: "Eric W. Biederman" <ebiederm@xmission.com>

Allow an unprivileged user who has created a user namespace, and then
created a network namespace, to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.

Allow setting bridge parameters via sysfs.

Allow all of the bridge ioctls:
BRCTL_ADD_IF
BRCTL_DEL_IF
BRCTL_SET_BRIDGE_FORWARD_DELAY
BRCTL_SET_BRIDGE_HELLO_TIME
BRCTL_SET_BRIDGE_MAX_AGE
BRCTL_SET_AGEING_TIME
BRCTL_SET_BRIDGE_STP_STATE
BRCTL_SET_BRIDGE_PRIORITY
BRCTL_SET_PORT_PRIORITY
BRCTL_SET_PATH_COST
BRCTL_ADD_BRIDGE
BRCTL_DEL_BRIDGE

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/bridge/br_ioctl.c    |   25 +++++++++++++------------
 net/bridge/br_sysfs_br.c |   10 +++++-----
 net/bridge/br_sysfs_if.c |    2 +-
 3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 7222fe1..cd8c3a4 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -85,13 +85,14 @@ static int get_fdb_entries(struct net_bridge *br, void __user *userbuf,
 /* called with RTNL */
 static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
 {
+	struct net *net = dev_net(br->dev);
 	struct net_device *dev;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
-	dev = __dev_get_by_index(dev_net(br->dev), ifindex);
+	dev = __dev_get_by_index(net, ifindex);
 	if (dev == NULL)
 		return -EINVAL;
 
@@ -178,25 +179,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_forward_delay(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_HELLO_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_hello_time(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_MAX_AGE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_max_age(br, args[1]);
 
 	case BRCTL_SET_AGEING_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -236,14 +237,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_STP_STATE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br_stp_set_enabled(br, args[1]);
 		return 0;
 
 	case BRCTL_SET_BRIDGE_PRIORITY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -256,7 +257,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -273,7 +274,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -330,7 +331,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ))
@@ -360,7 +361,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, uarg, IFNAMSIZ))
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index c5c0593..53dc9cb 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d,
 	unsigned long val;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -165,7 +165,7 @@ static ssize_t store_group_fwd_mask(struct device *d,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -300,7 +300,7 @@ static ssize_t store_group_addr(struct device *d,
 	unsigned int new_addr[6];
 	int i;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
@@ -337,7 +337,7 @@ static ssize_t store_flush(struct device *d,
 {
 	struct net_bridge *br = to_bridge(d);
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	br_fdb_flush(br);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 13b36bd..7844478 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(p->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
-- 
1.7.5.4

^ permalink raw reply related

* Re: [PATCH net-next 09/17] net: Allow userns root control of the core of the network stack.
From: Glauber Costa @ 2012-11-16 13:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, David Miller
In-Reply-To: <1353070992-5552-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

On 11/16/2012 05:03 PM, Eric W. Biederman wrote:
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
>  	return netdev_store(dev, attr, buf, len, change_tx_queue_len);

You mean ns_capable here?

^ permalink raw reply

* losing addresses if reusing interface names too fast
From: Jan Dvořák @ 2012-11-16 13:43 UTC (permalink / raw)
  To: netdev

Hello,

I have noticed a surprising behavior when removing virtual interfaces.
For example, removing a bridge with configured IP addresses and then
quickly creating another interface with identical name will cause the
addresses to be removed from the new interface a second or so later.

To demonstrate the issue:

brctl addbr foobar
ip addr add 10.20.30.40/24 dev foobar
ip addr show dev foobar

# 40: foobar: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
#     link/ether 12:12:f5:1b:c6:8e brd ff:ff:ff:ff:ff:ff
#     inet 10.20.30.40/24 scope global foobar

sleep 1

ip link set down dev foobar
brctl delbr foobar
ip link add name foobar link em1 type vlan id 42
ip addr add 10.20.30.40/24 dev foobar
ip addr show dev foobar

# 41: foobar@em1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
#     link/ether 00:21:cc:63:ec:3c brd ff:ff:ff:ff:ff:ff
#     inet 10.20.30.40/24 scope global foobar

sleep 3

ip addr show dev foobar
# 41: foobar@em1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
#     link/ether 00:21:cc:63:ec:3c brd ff:ff:ff:ff:ff:ff

ip link del foobar type vlan

If you monitor udev during the course of this script, you will get the
following sequence (blank lines mark where the sleeps were):

KERNEL[457506.286789] add      /devices/virtual/net/foobar (net)

KERNEL[457507.298630] remove   /devices/virtual/net/foobar (net)
KERNEL[457507.309597] add      /devices/virtual/net/foobar (net)

KERNEL[457510.324433] remove   /devices/virtual/net/foobar (net)

This means that I can't even wait for the bridge to be completely
removed; the kernel tells me it's gone as soon as I delete it.

Could you please advise a workaround or confirm this as a bug?

Best regards,
	Jan Dvorak

^ permalink raw reply

* Re: [PATCH net-next 09/17] net: Allow userns root control of the core of the network stack.
From: Eric W. Biederman @ 2012-11-16 14:32 UTC (permalink / raw)
  To: Glauber Costa; +Cc: David Miller, netdev, Linux Containers
In-Reply-To: <50A645C2.1000604@parallels.com>

Glauber Costa <glommer@parallels.com> writes:

> On 11/16/2012 05:03 PM, Eric W. Biederman wrote:
>> +	if (!capable(CAP_NET_ADMIN))
>> +		return -EPERM;
>> +
>>  	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
>
> You mean ns_capable here?

No.  There I meant capable.

I deliberately call capable here because I don't understand
tx_queue_len well enough to be certain it is safe to relax
that check to be just ns_capable.

My gut feel is that allowing an unprivileged user to be able to
arbitrarily change the tx_queue_len on a networking device would be a
nice way to allow queuing as many network packets as you would like with
kernel memory and DOSing the machine.

So since with a quick read of the code I could not convince myself it
was safe to allow unprivileged users to change tx_queue_len I left it
protected by capable.  While at the same time I relaxed the check in
netdev_store to be ns_capable.
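
To make the distinction concrete, here is a minimal userspace sketch, not
kernel code. It assumes a much simplified model in which a task holds
CAP_NET_ADMIN in exactly one user namespace; the names struct user_ns_model,
struct task_model, model_ns_capable and model_capable are hypothetical and
exist only for this illustration. The real check also considers namespace
ownership, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
struct user_ns_model {
	const struct user_ns_model *parent;	/* NULL for the initial namespace */
};

struct task_model {
	const struct user_ns_model *ns;	/* namespace the task's credentials live in */
	bool has_net_admin;		/* CAP_NET_ADMIN raised in that namespace */
};

static const struct user_ns_model init_ns_model = { .parent = NULL };

/*
 * Models ns_capable(target, CAP_NET_ADMIN): walk from the target namespace
 * towards the root.  The capability counts only if the task's own namespace
 * is the target or one of its ancestors.
 */
static bool model_ns_capable(const struct task_model *t,
			     const struct user_ns_model *target)
{
	const struct user_ns_model *ns;

	assert(t != NULL && target != NULL);
	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->ns)
			return t->has_net_admin;
	}
	return false;
}

/* Models capable(CAP_NET_ADMIN): the same check against the initial namespace. */
static bool model_capable(const struct task_model *t)
{
	return model_ns_capable(t, &init_ns_model);
}
```

In this model a root user inside a child user namespace passes
model_ns_capable() for that namespace but fails model_capable(), which is
the behaviour the tx_queue_len check keeps.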

Eric

^ permalink raw reply

* [net-next PATCH v2] net/ethernet: remove useless is_valid_ether_addr from drivers ndo_open
From: Joachim Eastwood @ 2012-11-16 14:47 UTC (permalink / raw)
  To: davem, nicolas.ferre, shemminger, steve.glendinning, stigge,
	msink
  Cc: netdev, Joachim Eastwood

If ndo_validate_addr is set to the generic eth_validate_addr
function there is no point in calling is_valid_ether_addr
from driver ndo_open if ndo_open is not used elsewhere in
the driver.

With this change is_valid_ether_addr will be called from the
generic eth_validate_addr function. So there should be no change
in the actual behavior.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
---
Hi,

v2: Audit changed drivers to ensure ndo_open functions are
only called from net core and not used elsewhere in the
drivers.

net/core/dev.c __dev_open does
        if (ops->ndo_validate_addr)
                ret = ops->ndo_validate_addr(dev);

        if (!ret && ops->ndo_open)
                ret = ops->ndo_open(dev);

so there shouldn't be a need for an is_valid_ether_addr check in
ndo_open if the open function is not used elsewhere in the
driver.

ndo_validate_addr is set to eth_validate_addr in all
changed drivers.
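
The ordering argument can be modelled in userspace as a rough sketch. This
is not kernel code: every model_* name below is hypothetical, and only the
call order and the address rule (not multicast, not all zero) are taken
from the kernel. It shows that a driver open routine without its own
address check is never reached when the address is invalid.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ETH_ALEN 6

struct model_dev;

/* Hypothetical stand-in for the two net_device_ops hooks involved. */
struct model_ops {
	int (*validate_addr)(struct model_dev *dev);
	int (*open)(struct model_dev *dev);
};

struct model_dev {
	unsigned char dev_addr[MODEL_ETH_ALEN];
	const struct model_ops *ops;
	bool opened;
};

/* Same rule as is_valid_ether_addr(): not multicast and not all zero. */
static bool model_is_valid_ether_addr(const unsigned char *addr)
{
	static const unsigned char zero[MODEL_ETH_ALEN];

	return !(addr[0] & 0x01) && memcmp(addr, zero, MODEL_ETH_ALEN) != 0;
}

/* Plays the role of the generic eth_validate_addr(). */
static int model_eth_validate_addr(struct model_dev *dev)
{
	return model_is_valid_ether_addr(dev->dev_addr) ? 0 : -EADDRNOTAVAIL;
}

/* A driver open routine with the redundant address check removed. */
static int model_driver_open(struct model_dev *dev)
{
	dev->opened = true;
	return 0;
}

/* Mirrors the order used by __dev_open(): validate first, then open. */
static int model_dev_open(struct model_dev *dev)
{
	const struct model_ops *ops = dev->ops;
	int ret = 0;

	assert(ops != NULL);
	if (ops->validate_addr)
		ret = ops->validate_addr(dev);

	if (!ret && ops->open)
		ret = ops->open(dev);

	return ret;
}
```

With validate_addr set, model_dev_open() on an all-zero address returns
-EADDRNOTAVAIL and leaves opened false, so the open hook never runs.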

regards
Joachim Eastwood

 drivers/net/ethernet/8390/etherh.c        |  6 ------
 drivers/net/ethernet/cadence/at91_ether.c |  3 ---
 drivers/net/ethernet/cadence/macb.c       |  3 ---
 drivers/net/ethernet/dnet.c               |  3 ---
 drivers/net/ethernet/i825xx/ether1.c      |  6 ------
 drivers/net/ethernet/micrel/ks8695net.c   |  3 ---
 drivers/net/ethernet/nxp/lpc_eth.c        |  4 +---
 drivers/net/ethernet/seeq/ether3.c        |  6 ------
 drivers/net/ethernet/smsc/smc911x.c       | 10 ----------
 drivers/net/ethernet/smsc/smc91x.c        | 10 ----------
 drivers/net/ethernet/smsc/smsc911x.c      |  5 -----
 drivers/net/ethernet/wiznet/w5100.c       |  2 --
 drivers/net/ethernet/wiznet/w5300.c       |  2 --
 13 files changed, 1 insertion(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/8390/etherh.c b/drivers/net/ethernet/8390/etherh.c
index 8322c54..6414e84 100644
--- a/drivers/net/ethernet/8390/etherh.c
+++ b/drivers/net/ethernet/8390/etherh.c
@@ -463,12 +463,6 @@ etherh_open(struct net_device *dev)
 {
 	struct ei_device *ei_local = netdev_priv(dev);
 
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, __ei_interrupt, 0, dev->name, dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/cadence/at91_ether.c b/drivers/net/ethernet/cadence/at91_ether.c
index e7a476c..716cc01 100644
--- a/drivers/net/ethernet/cadence/at91_ether.c
+++ b/drivers/net/ethernet/cadence/at91_ether.c
@@ -97,9 +97,6 @@ static int at91ether_open(struct net_device *dev)
 	u32 ctl;
 	int ret;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	/* Clear internal statistics */
 	ctl = macb_readl(lp, NCR);
 	macb_writel(lp, NCR, ctl | MACB_BIT(CLRSTAT));
diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index edb2aba..d556c52 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1218,9 +1218,6 @@ static int macb_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	err = macb_alloc_consistent(bp);
 	if (err) {
 		netdev_err(dev, "Unable to allocate DMA memory (error %d)\n",
diff --git a/drivers/net/ethernet/dnet.c b/drivers/net/ethernet/dnet.c
index 290b26f..feb5095 100644
--- a/drivers/net/ethernet/dnet.c
+++ b/drivers/net/ethernet/dnet.c
@@ -664,9 +664,6 @@ static int dnet_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	napi_enable(&bp->napi);
 	dnet_init_hw(bp);
 
diff --git a/drivers/net/ethernet/i825xx/ether1.c b/drivers/net/ethernet/i825xx/ether1.c
index 067db3f..7b9609d 100644
--- a/drivers/net/ethernet/i825xx/ether1.c
+++ b/drivers/net/ethernet/i825xx/ether1.c
@@ -638,12 +638,6 @@ ether1_txalloc (struct net_device *dev, int size)
 static int
 ether1_open (struct net_device *dev)
 {
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, ether1_interrupt, 0, "ether1", dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/micrel/ks8695net.c b/drivers/net/ethernet/micrel/ks8695net.c
index dccae1d..e62c312 100644
--- a/drivers/net/ethernet/micrel/ks8695net.c
+++ b/drivers/net/ethernet/micrel/ks8695net.c
@@ -1249,9 +1249,6 @@ ks8695_open(struct net_device *ndev)
 	struct ks8695_priv *ksp = netdev_priv(ndev);
 	int ret;
 
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	ks8695_reset(ksp);
 
 	ks8695_update_mac(ksp);
diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index af8b414..db6e101 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1219,9 +1219,6 @@ static int lpc_eth_open(struct net_device *ndev)
 	if (netif_msg_ifup(pldat))
 		dev_dbg(&pldat->pdev->dev, "enabling %s\n", ndev->name);
 
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	__lpc_eth_clock_enable(pldat, true);
 
 	/* Reset and initialize */
@@ -1301,6 +1298,7 @@ static const struct net_device_ops lpc_netdev_ops = {
 	.ndo_set_rx_mode	= lpc_eth_set_multicast_list,
 	.ndo_do_ioctl		= lpc_eth_ioctl,
 	.ndo_set_mac_address	= lpc_set_mac_address,
+	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_change_mtu		= eth_change_mtu,
 };
 
diff --git a/drivers/net/ethernet/seeq/ether3.c b/drivers/net/ethernet/seeq/ether3.c
index 6a40dd0..72a0174 100644
--- a/drivers/net/ethernet/seeq/ether3.c
+++ b/drivers/net/ethernet/seeq/ether3.c
@@ -399,12 +399,6 @@ ether3_probe_bus_16(struct net_device *dev, int val)
 static int
 ether3_open(struct net_device *dev)
 {
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, ether3_interrupt, 0, "ether3", dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/smsc/smc911x.c b/drivers/net/ethernet/smsc/smc911x.c
index 8d15f7a..990f574 100644
--- a/drivers/net/ethernet/smsc/smc911x.c
+++ b/drivers/net/ethernet/smsc/smc911x.c
@@ -1400,16 +1400,6 @@ smc911x_open(struct net_device *dev)
 
 	DBG(SMC_DEBUG_FUNC, "%s: --> %s\n", dev->name, __func__);
 
-	/*
-	 * Check that the address is valid.  If its not, refuse
-	 * to bring the device up.	 The user must specify an
-	 * address using ifconfig eth0 hw ether xx:xx:xx:xx:xx:xx
-	 */
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		PRINTK("%s: no valid ethernet hw addr\n", __func__);
-		return -EINVAL;
-	}
-
 	/* reset the hardware */
 	smc911x_reset(dev);
 
diff --git a/drivers/net/ethernet/smsc/smc91x.c b/drivers/net/ethernet/smsc/smc91x.c
index 318adc9..f516e5a 100644
--- a/drivers/net/ethernet/smsc/smc91x.c
+++ b/drivers/net/ethernet/smsc/smc91x.c
@@ -1474,16 +1474,6 @@ smc_open(struct net_device *dev)
 
 	DBG(2, "%s: %s\n", dev->name, __func__);
 
-	/*
-	 * Check that the address is valid.  If its not, refuse
-	 * to bring the device up.  The user must specify an
-	 * address using ifconfig eth0 hw ether xx:xx:xx:xx:xx:xx
-	 */
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		PRINTK("%s: no valid ethernet hw addr\n", __func__);
-		return -EINVAL;
-	}
-
 	/* Setup the default Register Modes */
 	lp->tcr_cur_mode = TCR_DEFAULT;
 	lp->rcr_cur_mode = RCR_DEFAULT;
diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 62d1baf..a088c4f 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -1463,11 +1463,6 @@ static int smsc911x_open(struct net_device *dev)
 		return -EAGAIN;
 	}
 
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		SMSC_WARN(pdata, hw, "dev_addr is not a valid MAC address");
-		return -EADDRNOTAVAIL;
-	}
-
 	/* Reset the LAN911x */
 	if (smsc911x_soft_reset(pdata)) {
 		SMSC_WARN(pdata, hw, "soft reset failed");
diff --git a/drivers/net/ethernet/wiznet/w5100.c b/drivers/net/ethernet/wiznet/w5100.c
index 2c08bf6..7daf92e 100644
--- a/drivers/net/ethernet/wiznet/w5100.c
+++ b/drivers/net/ethernet/wiznet/w5100.c
@@ -580,8 +580,6 @@ static int w5100_open(struct net_device *ndev)
 	struct w5100_priv *priv = netdev_priv(ndev);
 
 	netif_info(priv, ifup, ndev, "enabling\n");
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EINVAL;
 	w5100_hw_start(priv);
 	napi_enable(&priv->napi);
 	netif_start_queue(ndev);
diff --git a/drivers/net/ethernet/wiznet/w5300.c b/drivers/net/ethernet/wiznet/w5300.c
index 88943d9..bd9eec6 100644
--- a/drivers/net/ethernet/wiznet/w5300.c
+++ b/drivers/net/ethernet/wiznet/w5300.c
@@ -500,8 +500,6 @@ static int w5300_open(struct net_device *ndev)
 	struct w5300_priv *priv = netdev_priv(ndev);
 
 	netif_info(priv, ifup, ndev, "enabling\n");
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EINVAL;
 	w5300_hw_start(priv);
 	napi_enable(&priv->napi);
 	netif_start_queue(ndev);
-- 
1.8.0

^ permalink raw reply related

* Re: [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:18 UTC (permalink / raw)
  To: Annie Li; +Cc: xen-devel, netdev, Ian.Campbell
In-Reply-To: <1352962987-541-1-git-send-email-annie.li@oracle.com>

On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> This patch implements persistent grants for xen-netfront/netback. This
> mechanism maintains page pools in netback/netfront, these page pools is used to
> save grant pages which are mapped. This way improve performance which is wasted
> when doing grant operations.
> 
> Current netback/netfront does map/unmap grant operations frequently when
> transmitting/receiving packets, and grant operations costs much cpu clock. In
> this patch, netfront/netback maps grant pages when needed and then saves them
> into a page pool for future use. All these pages will be unmapped when
> removing/releasing the net device.
> 
> In netfront, two pools are maintained for transmitting and receiving packets.
> When new grant pages are needed, the driver gets grant pages from this pool
> first. If no free grant page exists, it allocates new page, maps it and then
> saves it into the pool. The pool size for transmit/receive is exactly tx/rx
> ring size. The driver uses memcpy(not grantcopy) to copy data grant pages.
> Here, memcpy is copying the whole page size data. I tried to copy len size data
> from offset, but network does not seem work well. I am trying to find the root
> cause now.
> 
> In netback, it also maintains two page pools for tx/rx. When netback gets a
> request, it does a search first to find out whether the grant reference of
> this request is already mapped into its page pool. If the grant ref is mapped,
> the address of this mapped page is gotten and memcpy is used to copy data
> between grant pages. However, if the grant ref is not mapped, a new page is
> allocated, mapped with this grant ref, and then saved into page pool for
> future use. Similarly, memcpy replaces grant copy to copy data between grant
> pages. In this implementation, two arrays(gnttab_tx_vif,gnttab_rx_vif) are
> used to save vif pointer for every request because current netback is not
> per-vif based. This would be changed after implementing 1:1 model in netback.
> 
> This patch supports both persistent-grant and non persistent grant. A new
> xenstore key "feature-persistent-grants" is used to represent this feature.
> 
> This patch is based on linux3.4-rc3. I hit netperf/netserver failure on
> linux latest version v3.7-rc1, v3.7-rc2 and v3.7-rc4. Not sure whether this
> netperf/netserver failure connects compound page commit in v3.7-rc1, but I did
> hit BUG_ON with debug patch from thread
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg00893.html

FYI, I get this:

[  477.814511] BUG: sleeping function called from invalid context at /home/konrad/ssd/linux/mm/page_alloc.c:2487
[  477.815281] in_atomic(): 1, irqs_disabled(): 1, pid: 3017, name: netperf
[  477.815281] Pid: 3017, comm: netperf Not tainted 3.5.0upstream-00004-g69047bb #1
[  477.815281] Call Trace:
[  477.815281]  [<ffffffff810b990a>] __might_sleep+0xda/0x100
[  477.815281]  [<ffffffff81142e93>] __alloc_pages_nodemask+0x223/0x920
[  477.815281]  [<ffffffff81158439>] ? zone_statistics+0x99/0xc0
[  477.815281]  [<ffffffff81076e79>] ? default_spin_lock_flags+0x9/0x10
[  477.815281]  [<ffffffff81615e3a>] ? _raw_spin_lock_irqsave+0x3a/0x50
[  477.815281]  [<ffffffff81076e79>] ? default_spin_lock_flags+0x9/0x10
[  477.815281]  [<ffffffff81098977>] ? lock_timer_base+0x37/0x70
[  477.815281]  [<ffffffff8109a03d>] ? mod_timer_pending+0x11d/0x230
[  477.815281]  [<ffffffff81616144>] ? _raw_spin_unlock_bh+0x24/0x30
[  477.815281]  [<ffffffff8117e7e1>] alloc_pages_current+0xb1/0x110
[  477.815281]  [<ffffffffa0034238>] xennet_alloc_tx_ref+0x78/0x1c0 [xen_netfront]
[  477.815281]  [<ffffffffa00344eb>] xennet_start_xmit+0x16b/0x9f0 [xen_netfront]
[  477.815281]  [<ffffffff814c69eb>] dev_hard_start_xmit+0x2fb/0x6f0
[  477.815281]  [<ffffffff814e4566>] sch_direct_xmit+0x116/0x1e0
[  477.815281]  [<ffffffff814c6f6a>] dev_queue_xmit+0x18a/0x6b0
[  477.815281]  [<ffffffff8151264e>] ip_finish_output+0x18e/0x300
[  477.815281]  [<ffffffff81512821>] ip_output+0x61/0xa0
[  477.815281]  [<ffffffff81511b82>] ? __ip_local_out+0xa2/0xb0
[  477.815281]  [<ffffffff81511bb4>] ip_local_out+0x24/0x30
[  477.815281]  [<ffffffff81511ffe>] ip_queue_xmit+0x15e/0x410
[  477.815281]  [<ffffffff81528354>] tcp_transmit_skb+0x424/0x8f0
[  477.815281]  [<ffffffff8152a8c2>] tcp_write_xmit+0x1f2/0x9c0
[  477.815281]  [<ffffffff81182194>] ? ksize+0x14/0x70
[  477.815281]  [<ffffffff8152b711>] __tcp_push_pending_frames+0x21/0x90
[  477.815281]  [<ffffffff8151db23>] tcp_sendmsg+0x983/0xcd0
[  477.815281]  [<ffffffff81540daf>] inet_sendmsg+0x7f/0xd0
[  477.815281]  [<ffffffff81290dde>] ? selinux_socket_sendmsg+0x1e/0x20
[  477.815281]  [<ffffffff814aed13>] sock_sendmsg+0xf3/0x120
[  477.815281]  [<ffffffff81076f48>] ? pvclock_clocksource_read+0x58/0xd0
[  477.815281]  [<ffffffff812de7c0>] ? timerqueue_add+0x60/0xb0
[  477.815281]  [<ffffffff810b0b85>] ? enqueue_hrtimer+0x25/0xb0
[  477.815281]  [<ffffffff814af4d4>] sys_sendto+0x104/0x140
[  477.815281]  [<ffffffff81041279>] ? xen_clocksource_read+0x39/0x50
[  477.815281]  [<ffffffff81041419>] ? xen_clocksource_get_cycles+0x9/0x10
[  477.815281]  [<ffffffff810d3242>] ? getnstimeofday+0x52/0xe0
[  477.815281]  [<ffffffff8161dfb9>] system_call_fastpath+0x16/0x1b

> 
> 
> Annie Li (4):
>   xen/netback: implements persistent grant with one page pool.
>   xen/netback: Split one page pool into two(tx/rx) page pool.
>   Xen/netfront: Implement persistent grant in netfront.
>   fix code indent issue in xen-netfront.
> 
>  drivers/net/xen-netback/common.h    |   24 ++-
>  drivers/net/xen-netback/interface.c |   26 +++
>  drivers/net/xen-netback/netback.c   |  215 ++++++++++++++++++--
>  drivers/net/xen-netback/xenbus.c    |   14 ++-
>  drivers/net/xen-netfront.c          |  378 +++++++++++++++++++++++++++++------
>  5 files changed, 570 insertions(+), 87 deletions(-)
> 
> -- 
> 1.7.3.4

^ permalink raw reply

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <50A4CA51.8080208@citrix.com>

On Thu, Nov 15, 2012 at 11:56:17AM +0100, Roger Pau Monné wrote:
> On 15/11/12 09:38, ANNIE LI wrote:
> > 
> > 
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> >>> save grant pages which are mapped. This way improve performance which is wasted
> >>> when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> >>> this patch, netfront/netback maps grant pages when needed and then saves them
> >>> into a page pool for future use. All these pages will be unmapped when
> >>> removing/releasing the net device.
> >>>
> >> Do you have performance numbers available already? with/without persistent grants?
> > I have some simple netperf/netserver test result with/without persistent 
> > grants,
> > 
> > Following is result of with persistent grant patch,
> > 
> > Guests, Sum,      Avg,     Min,     Max
> >   1,  15106.4,  15106.4, 15106.36, 15106.36
> >   2,  13052.7,  6526.34,  6261.81,  6790.86
> >   3,  12675.1,  6337.53,  6220.24,  6454.83
> >   4,  13194,  6596.98,  6274.70,  6919.25
> > 
> > 
> > Following are result of without persistent patch
> > 
> > Guests, Sum,     Avg,    Min,        Max
> >   1,  10864.1,  10864.1, 10864.10, 10864.10
> >   2,  10898.5,  5449.24,  4862.08,  6036.40
> >   3,  10734.5,  5367.26,  5261.43,  5473.08
> >   4,  10924,    5461.99,  5314.84,  5609.14
> 
> In the block case, performance improvement is seen when using a large
> number of guests, could you perform the same benchmark increasing the
> number of guests to 15?

Keep in mind that one of the things that is limiting these numbers is
that netback is very CPU intensive. So I think it could get much much
faster - but netback pegs at 100%. With Wei Liu's patches the CPU usage
did drop by 40% (this is when I tested the old netback with Wei's
netback patches), so we should see an even further speed increase.
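
For reference, the relative gain in the "Sum" column quoted above can be
worked out with a few lines of C. This is only a sketch: the figures are
copied from Annie's tables, while struct netperf_sample, netperf_samples
and gain_percent are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* "Sum" column from the quoted netperf tables, one row per guest count. */
struct netperf_sample {
	int guests;
	double with_persistent;		/* with the persistent grant patch */
	double without_persistent;	/* without the patch */
};

static const struct netperf_sample netperf_samples[] = {
	{ 1, 15106.4, 10864.1 },
	{ 2, 13052.7, 10898.5 },
	{ 3, 12675.1, 10734.5 },
	{ 4, 13194.0, 10924.0 },
};

/* Relative improvement over the unpatched baseline, in percent. */
static double gain_percent(double with_persistent, double without_persistent)
{
	assert(without_persistent > 0.0);
	return (with_persistent - without_persistent) / without_persistent * 100.0;
}
```

Calling gain_percent() on each row gives roughly 39% for one guest and
between 18% and 21% for two to four guests.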

^ permalink raw reply

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:23 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Roger Pau Monne, ANNIE LI, Pasi Kärkkäinen,
	netdev@vger.kernel.org, xen-devel@lists.xensource.com
In-Reply-To: <1353006667.26243.6.camel@dagon.hellion.org.uk>

On Thu, Nov 15, 2012 at 07:11:07PM +0000, Ian Campbell wrote:
> On Thu, 2012-11-15 at 18:29 +0000, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 15, 2012 at 11:15:06AM +0000, Ian Campbell wrote:
> > > On Thu, 2012-11-15 at 10:56 +0000, Roger Pau Monne wrote:
> > > > On 15/11/12 09:38, ANNIE LI wrote:
> > > > > 
> > > > > 
> > > > > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> > > > >>> This patch implements persistent grants for xen-netfront/netback. This
> > > > >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> > > > >>> save grant pages which are mapped. This way improve performance which is wasted
> > > > >>> when doing grant operations.
> > > > >>>
> > > > >>> Current netback/netfront does map/unmap grant operations frequently when
> > > > >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> > > > >>> this patch, netfront/netback maps grant pages when needed and then saves them
> > > > >>> into a page pool for future use. All these pages will be unmapped when
> > > > >>> removing/releasing the net device.
> > > > >>>
> > > > >> Do you have performance numbers available already? with/without persistent grants?
> > > > > I have some simple netperf/netserver test result with/without persistent 
> > > > > grants,
> > > > > 
> > > > > Following is result of with persistent grant patch,
> > > > > 
> > > > > Guests, Sum,      Avg,     Min,     Max
> > > > >   1,  15106.4,  15106.4, 15106.36, 15106.36
> > > > >   2,  13052.7,  6526.34,  6261.81,  6790.86
> > > > >   3,  12675.1,  6337.53,  6220.24,  6454.83
> > > > >   4,  13194,  6596.98,  6274.70,  6919.25
> > > > > 
> > > > > 
> > > > > Following are result of without persistent patch
> > > > > 
> > > > > Guests, Sum,     Avg,    Min,        Max
> > > > >   1,  10864.1,  10864.1, 10864.10, 10864.10
> > > > >   2,  10898.5,  5449.24,  4862.08,  6036.40
> > > > >   3,  10734.5,  5367.26,  5261.43,  5473.08
> > > > >   4,  10924,    5461.99,  5314.84,  5609.14
> > > > 
> > > > In the block case, performance improvement is seen when using a large
> > > > number of guests, could you perform the same benchmark increasing the
> > > > number of guests to 15?
> > > 
> > > It would also be nice to see some analysis of the numbers which justify
> > > why this change is a good one without every reviewer having to evaluate
> > > the raw data themselves. In fact this should really be part of the
> > > commit message.
> > 
> > You mean like a nice graph, eh?
> 
> Together with an analysis of what it means and why it is a good thing,
> yes.

OK, let's put that on the TODO list for the next posting. In the meantime,
it sounds like you (the maintainer) are happy with the direction this is going.

The other thing we want to do _after_ these patches is to look at the Wei
Liu patches and try to address the different reviewers' comments.

The neat thing about them is that they have a concept of a page pool system.
And with persistent pages in both blkback and netback this gets more exciting.


> 
> Ian.
> 
> > 
> > I will run these patches on my 32GB box and see if I can give you
> > a nice PDF/jpg.
> > 
> > > 
> > > Ian.
> > > 
> 

^ permalink raw reply

* Re: [PATCH] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
From: Eric Dumazet @ 2012-11-16 15:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jln
In-Reply-To: <20121116.013940.813652515905883288.davem@davemloft.net>

On Fri, 2012-11-16 at 01:39 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 15 Nov 2012 15:41:04 -0800
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > order-5 allocations can fail with current kernels, we should
> > try to reduce allocation sizes to allow network namespace
> > creation.
> > 
> > Reported-by: Julien Tinnes <jln@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Indeed, this has to be done better.
> 
> But this kind of retry solution results in non-deterministic behavior.
> Yes the tcp metrics cache is best effort, but it's size can influence
> behavior in a substantial way depending upon the workload.
> 
> I would suggest that we instead use different limits, ones which the
> page allocator will satisfy for us always with GFP_KERNEL.
> 
> 1) include linux/mmzone.h
> 
> 2) Make the two limits based upon PAGE_ALLOC_COSTLY_ORDER.
> 
> That is, make the larger table size PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER
> and the smaller one PAGE_SIZE << (PAGE_ALLOC_COSTLY_ORDER - 1).

Well, we don't really know what the size needs to be, and your proposal
reduces the size by a factor of 4, even for the initial namespace.
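
The arithmetic behind that factor, as a small sketch. It assumes 4 KiB
pages (the page size is architecture dependent) and PAGE_ALLOC_COSTLY_ORDER
equal to 3, its value in include/linux/mmzone.h; the MODEL_* macros and
order_to_bytes are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE			4096UL	/* assumption: 4 KiB pages */
#define MODEL_PAGE_ALLOC_COSTLY_ORDER	3	/* assumption: value from mmzone.h */

/* Size in bytes of one physically contiguous allocation of a given order. */
static unsigned long order_to_bytes(int order)
{
	assert(order >= 0 && order < 20);
	return MODEL_PAGE_SIZE << order;
}
```

Under these assumptions the current order-5 table is 128 KiB, the proposed
larger limit PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER is 32 KiB, and the smaller
one is 16 KiB, so the larger table shrinks by 128 / 32 = 4.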

Julien's report was about the Chrome browser's own netns, on a suspend/resume
cycle (or something like that).

If size can influence behavior, we could try a vmalloc() if kmalloc()
fails...

Thanks

[PATCH v3] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures

order-5 allocations can fail with current kernels, we should
try vmalloc() as well.

Reported-by: Julien Tinnes <jln@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_metrics.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 53bc584..f696d7c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1,7 +1,6 @@
 #include <linux/rcupdate.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
-#include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/cache.h>
 #include <linux/slab.h>
@@ -9,6 +8,7 @@
 #include <linux/tcp.h>
 #include <linux/hash.h>
 #include <linux/tcp_metrics.h>
+#include <linux/vmalloc.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -1034,7 +1034,10 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL);
+	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!net->ipv4.tcp_metrics_hash)
+		net->ipv4.tcp_metrics_hash = vzalloc(size);
+
 	if (!net->ipv4.tcp_metrics_hash)
 		return -ENOMEM;
 
@@ -1055,7 +1058,10 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
 			tm = next;
 		}
 	}
-	kfree(net->ipv4.tcp_metrics_hash);
+	if (is_vmalloc_addr(net->ipv4.tcp_metrics_hash))
+		vfree(net->ipv4.tcp_metrics_hash);
+	else
+		kfree(net->ipv4.tcp_metrics_hash);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:34 UTC (permalink / raw)
  To: Wei Liu
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <CAOsiSVU+9fkGQhVhnrx=xxUD8hej55XXJGpKGWAxe1USPEhiEQ@mail.gmail.com>

On Thu, Nov 15, 2012 at 05:35:13PM +0800, Wei Liu wrote:
> On Thu, Nov 15, 2012 at 4:38 PM, ANNIE LI <annie.li@oracle.com> wrote:
> 
> >
> >
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is
> >>> used to save grant pages which are mapped. This way improve performance
> >>> which is wasted when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu
> >>> clock. In this patch, netfront/netback maps grant pages when needed and
> >>> then saves them into a page pool for future use. All these pages will be
> >>> unmapped when removing/releasing the net device.
> >>>
> >> Do you have performance numbers available already? with/without
> >> persistent grants?
> >>
> > I have some simple netperf/netserver test result with/without persistent
> > grants,
> >
> > Following is result of with persistent grant patch,
> >
> > Guests, Sum,      Avg,     Min,     Max
> >  1,  15106.4,  15106.4, 15106.36, 15106.36
> >  2,  13052.7,  6526.34,  6261.81,  6790.86
> >  3,  12675.1,  6337.53,  6220.24,  6454.83
> >  4,  13194,  6596.98,  6274.70,  6919.25
> >
> >
> > Following are result of without persistent patch
> >
> > Guests, Sum,     Avg,    Min,        Max
> >  1,  10864.1,  10864.1, 10864.10, 10864.10
> >  2,  10898.5,  5449.24,  4862.08,  6036.40
> >  3,  10734.5,  5367.26,  5261.43,  5473.08
> >  4,  10924,    5461.99,  5314.84,  5609.14
> >
> >
> >
> Interesting results. Have you tested how good it is on a 10G nic, i.e.
> guest sending packets
> through physical network to another host.

Not yet. This was done with two guests pounding each other. I am
setting two machines up for Annie so she can do that type of testing
and also with more guests.

^ permalink raw reply

* [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Vlad Yasevich @ 2012-11-16 15:43 UTC (permalink / raw)
  To: netdev
In-Reply-To: <50a5c2e5.lgIvZwesNLp78CVD%fengguang.wu@intel.com>

Openvswitch attempts to use IPv6 packet parsing functions without
any dependency on IPv6 (unlike every other place in kernel).  Pull
the IPv6 code in openvswitch together and put a conditional that's
dependent on CONFIG_IPV6.

Resolves:
net/built-in.o: In function `ovs_flow_extract':
(.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/openvswitch/flow.c |  168 ++++++++++++++++++++++++-----------------------
 1 files changed, 86 insertions(+), 82 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 98c7063..6dfaf60 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -124,6 +124,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
 	(offsetof(struct sw_flow_key, field) +	\
 	 FIELD_SIZEOF(struct sw_flow_key, field))
 
+#if IS_ENABLED(CONFIG_IPV6)
 static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key,
 			 int *key_lenp)
 {
@@ -175,6 +176,89 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 				  sizeof(struct icmp6hdr));
 }
 
+static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
+			int *key_lenp, int nh_len)
+{
+	struct icmp6hdr *icmp = icmp6_hdr(skb);
+	int error = 0;
+	int key_len;
+
+	/* The ICMPv6 type and code fields use the 16-bit transport port
+	 * fields, so we need to store them in 16-bit network byte order.
+	 */
+	key->ipv6.tp.src = htons(icmp->icmp6_type);
+	key->ipv6.tp.dst = htons(icmp->icmp6_code);
+	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
+
+	if (icmp->icmp6_code == 0 &&
+	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
+	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
+		int icmp_len = skb->len - skb_transport_offset(skb);
+		struct nd_msg *nd;
+		int offset;
+
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		/* In order to process neighbor discovery options, we need the
+		 * entire packet.
+		 */
+		if (unlikely(icmp_len < sizeof(*nd)))
+			goto out;
+		if (unlikely(skb_linearize(skb))) {
+			error = -ENOMEM;
+			goto out;
+		}
+
+		nd = (struct nd_msg *)skb_transport_header(skb);
+		key->ipv6.nd.target = nd->target;
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		icmp_len -= sizeof(*nd);
+		offset = 0;
+		while (icmp_len >= 8) {
+			struct nd_opt_hdr *nd_opt =
+				 (struct nd_opt_hdr *)(nd->opt + offset);
+			int opt_len = nd_opt->nd_opt_len * 8;
+
+			if (unlikely(!opt_len || opt_len > icmp_len))
+				goto invalid;
+
+			/* Store the link layer address if the appropriate
+			 * option is provided.  It is considered an error if
+			 * the same link layer option is specified twice.
+			 */
+			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
+			    && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.sll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
+				   && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.tll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			}
+
+			icmp_len -= opt_len;
+			offset += opt_len;
+		}
+	}
+
+	goto out;
+
+invalid:
+	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
+	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
+	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
+
+out:
+	*key_lenp = key_len;
+	return error;
+}
+#endif
+
 #define TCP_FLAGS_OFFSET 13
 #define TCP_FLAG_MASK 0x3f
 
@@ -487,88 +571,6 @@ static __be16 parse_ethertype(struct sk_buff *skb)
 	return llc->ethertype;
 }
 
-static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
-			int *key_lenp, int nh_len)
-{
-	struct icmp6hdr *icmp = icmp6_hdr(skb);
-	int error = 0;
-	int key_len;
-
-	/* The ICMPv6 type and code fields use the 16-bit transport port
-	 * fields, so we need to store them in 16-bit network byte order.
-	 */
-	key->ipv6.tp.src = htons(icmp->icmp6_type);
-	key->ipv6.tp.dst = htons(icmp->icmp6_code);
-	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
-
-	if (icmp->icmp6_code == 0 &&
-	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
-	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
-		int icmp_len = skb->len - skb_transport_offset(skb);
-		struct nd_msg *nd;
-		int offset;
-
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		/* In order to process neighbor discovery options, we need the
-		 * entire packet.
-		 */
-		if (unlikely(icmp_len < sizeof(*nd)))
-			goto out;
-		if (unlikely(skb_linearize(skb))) {
-			error = -ENOMEM;
-			goto out;
-		}
-
-		nd = (struct nd_msg *)skb_transport_header(skb);
-		key->ipv6.nd.target = nd->target;
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		icmp_len -= sizeof(*nd);
-		offset = 0;
-		while (icmp_len >= 8) {
-			struct nd_opt_hdr *nd_opt =
-				 (struct nd_opt_hdr *)(nd->opt + offset);
-			int opt_len = nd_opt->nd_opt_len * 8;
-
-			if (unlikely(!opt_len || opt_len > icmp_len))
-				goto invalid;
-
-			/* Store the link layer address if the appropriate
-			 * option is provided.  It is considered an error if
-			 * the same link layer option is specified twice.
-			 */
-			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
-			    && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.sll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
-				   && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.tll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			}
-
-			icmp_len -= opt_len;
-			offset += opt_len;
-		}
-	}
-
-	goto out;
-
-invalid:
-	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
-	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
-	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
-
-out:
-	*key_lenp = key_len;
-	return error;
-}
-
 /**
  * ovs_flow_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -712,6 +714,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 				key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
 			}
 		}
+#if IS_ENABLED(CONFIG_IPV6)
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
@@ -752,6 +755,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 					goto out;
 			}
 		}
+#endif
 	}
 
 out:
-- 
1.7.7.6

^ permalink raw reply related
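
The patch above guards the IPv6 code with `#if IS_ENABLED(CONFIG_IPV6)` and not with a plain `#ifdef`. The difference matters when an option is built as a module, because then only the `_MODULE` variant of the symbol is defined. The sketch below is a simplified userspace imitation of that idea, written for illustration only: the `DEMO_` macros and both functions are hypothetical, and the kernel's real definition differs in detail.

```c
/* Simplified imitation of the "is this option defined to 1?" macro trick. */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED_3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED_2(val) DEMO_IS_DEFINED_3(DEMO_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED(x) DEMO_IS_DEFINED_2(x)

/* True when the option is built in (OPT) or modular (OPT_MODULE). */
#define DEMO_IS_ENABLED(opt) \
	(DEMO_IS_DEFINED(opt) || DEMO_IS_DEFINED(opt##_MODULE))

/* Pretend IPv6 was configured as a module: only the _MODULE symbol exists. */
#define DEMO_CONFIG_IPV6_MODULE 1

/* Returns 1 when the guarded code would be compiled in. */
static int ipv6_parsing_compiled(void)
{
#if DEMO_IS_ENABLED(DEMO_CONFIG_IPV6)
	return 1;
#else
	return 0;
#endif
}

/* Returns 1 only when the built-in symbol itself is defined. */
static int ipv6_seen_by_plain_ifdef(void)
{
#ifdef DEMO_CONFIG_IPV6
	return 1;
#else
	return 0;
#endif
}
```

With the modular configuration assumed here, the first function takes the enabled branch while the plain `#ifdef` variant does not.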

* [PATCH net-next 2/4] sit: allow to configure 6rd tunnels via netlink
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

This patch adds support for managing 6RD tunnels via netlink.
Note that netdev_state_change() is now called when 6RD parameters are updated.

6RD parameters are updated only if there is at least one 6RD attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/if_tunnel.h |   4 ++
 net/ipv6/sit.c                 | 149 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 128 insertions(+), 25 deletions(-)

diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 5ab0c8d..aee73d0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -49,6 +49,10 @@ enum {
 	IFLA_IPTUN_FLAGS,
 	IFLA_IPTUN_PROTO,
 	IFLA_IPTUN_PMTUDISC,
+	IFLA_IPTUN_6RD_PREFIX,
+	IFLA_IPTUN_6RD_RELAY_PREFIX,
+	IFLA_IPTUN_6RD_PREFIXLEN,
+	IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
 	__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX	(__IFLA_IPTUN_MAX - 1)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index ca6c2c8..504422d 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -936,6 +936,38 @@ static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 	netdev_state_change(t->dev);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+static int ipip6_tunnel_update_6rd(struct ip_tunnel *t,
+				   struct ip_tunnel_6rd *ip6rd)
+{
+	struct in6_addr prefix;
+	__be32 relay_prefix;
+
+	if (ip6rd->relay_prefixlen > 32 ||
+	    ip6rd->prefixlen + (32 - ip6rd->relay_prefixlen) > 64)
+		return -EINVAL;
+
+	ipv6_addr_prefix(&prefix, &ip6rd->prefix, ip6rd->prefixlen);
+	if (!ipv6_addr_equal(&prefix, &ip6rd->prefix))
+		return -EINVAL;
+	if (ip6rd->relay_prefixlen)
+		relay_prefix = ip6rd->relay_prefix &
+			       htonl(0xffffffffUL <<
+				     (32 - ip6rd->relay_prefixlen));
+	else
+		relay_prefix = 0;
+	if (relay_prefix != ip6rd->relay_prefix)
+		return -EINVAL;
+
+	t->ip6rd.prefix = prefix;
+	t->ip6rd.relay_prefix = relay_prefix;
+	t->ip6rd.prefixlen = ip6rd->prefixlen;
+	t->ip6rd.relay_prefixlen = ip6rd->relay_prefixlen;
+	netdev_state_change(t->dev);
+	return 0;
+}
+#endif
+
 static int
 ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 {
@@ -1105,31 +1137,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		t = netdev_priv(dev);
 
 		if (cmd != SIOCDEL6RD) {
-			struct in6_addr prefix;
-			__be32 relay_prefix;
-
-			err = -EINVAL;
-			if (ip6rd.relay_prefixlen > 32 ||
-			    ip6rd.prefixlen + (32 - ip6rd.relay_prefixlen) > 64)
-				goto done;
-
-			ipv6_addr_prefix(&prefix, &ip6rd.prefix,
-					 ip6rd.prefixlen);
-			if (!ipv6_addr_equal(&prefix, &ip6rd.prefix))
+			err = ipip6_tunnel_update_6rd(t, &ip6rd);
+			if (err < 0)
 				goto done;
-			if (ip6rd.relay_prefixlen)
-				relay_prefix = ip6rd.relay_prefix &
-					       htonl(0xffffffffUL <<
-						     (32 - ip6rd.relay_prefixlen));
-			else
-				relay_prefix = 0;
-			if (relay_prefix != ip6rd.relay_prefix)
-				goto done;
-
-			t->ip6rd.prefix = prefix;
-			t->ip6rd.relay_prefix = relay_prefix;
-			t->ip6rd.prefixlen = ip6rd.prefixlen;
-			t->ip6rd.relay_prefixlen = ip6rd.relay_prefixlen;
 		} else
 			ipip6_tunnel_clone_6rd(dev, sitn);
 
@@ -1261,11 +1271,53 @@ static void ipip6_netlink_parms(struct nlattr *data[],
 		parms->i_flags = nla_get_be16(data[IFLA_IPTUN_FLAGS]);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* This function returns true when 6RD attributes are present in the nl msg */
+static bool ipip6_netlink_6rd_parms(struct nlattr *data[],
+				    struct ip_tunnel_6rd *ip6rd)
+{
+	bool ret = false;
+	memset(ip6rd, 0, sizeof(*ip6rd));
+
+	if (!data)
+		return ret;
+
+	if (data[IFLA_IPTUN_6RD_PREFIX]) {
+		ret = true;
+		nla_memcpy(&ip6rd->prefix, data[IFLA_IPTUN_6RD_PREFIX],
+			   sizeof(struct in6_addr));
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIX]) {
+		ret = true;
+		ip6rd->relay_prefix =
+			nla_get_be32(data[IFLA_IPTUN_6RD_RELAY_PREFIX]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_PREFIXLEN]) {
+		ret = true;
+		ip6rd->prefixlen = nla_get_u16(data[IFLA_IPTUN_6RD_PREFIXLEN]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]) {
+		ret = true;
+		ip6rd->relay_prefixlen =
+			nla_get_u16(data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+	}
+
+	return ret;
+}
+#endif
+
 static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 			 struct nlattr *tb[], struct nlattr *data[])
 {
 	struct net *net = dev_net(dev);
 	struct ip_tunnel *nt;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
+	int err;
 
 	nt = netdev_priv(dev);
 	ipip6_netlink_parms(data, &nt->parms);
@@ -1273,7 +1325,16 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 	if (ipip6_tunnel_locate(net, &nt->parms, 0))
 		return -EEXIST;
 
-	return ipip6_tunnel_create(dev);
+	err = ipip6_tunnel_create(dev);
+	if (err < 0)
+		return err;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		err = ipip6_tunnel_update_6rd(nt, &ip6rd);
+#endif
+
+	return err;
 }
 
 static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
@@ -1283,6 +1344,9 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct ip_tunnel_parm p;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 
 	if (dev == sitn->fb_tunnel_dev)
 		return -EINVAL;
@@ -1302,6 +1366,12 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 		t = netdev_priv(dev);
 
 	ipip6_tunnel_update(t, &p);
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		return ipip6_tunnel_update_6rd(t, &ip6rd);
+#endif
+
 	return 0;
 }
 
@@ -1322,6 +1392,16 @@ static size_t ipip6_get_size(const struct net_device *dev)
 		nla_total_size(1) +
 		/* IFLA_IPTUN_FLAGS */
 		nla_total_size(2) +
+#ifdef CONFIG_IPV6_SIT_6RD
+		/* IFLA_IPTUN_6RD_PREFIX */
+		nla_total_size(sizeof(struct in6_addr)) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIX */
+		nla_total_size(4) +
+		/* IFLA_IPTUN_6RD_PREFIXLEN */
+		nla_total_size(2) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIXLEN */
+		nla_total_size(2) +
+#endif
 		0;
 }
 
@@ -1339,6 +1419,19 @@ static int ipip6_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		       !!(parm->iph.frag_off & htons(IP_DF))) ||
 	    nla_put_be16(skb, IFLA_IPTUN_FLAGS, parm->i_flags))
 		goto nla_put_failure;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (nla_put(skb, IFLA_IPTUN_6RD_PREFIX, sizeof(struct in6_addr),
+		    &tunnel->ip6rd.prefix) ||
+	    nla_put_be32(skb, IFLA_IPTUN_6RD_RELAY_PREFIX,
+			 tunnel->ip6rd.relay_prefix) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_PREFIXLEN,
+			tunnel->ip6rd.prefixlen) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
+			tunnel->ip6rd.relay_prefixlen))
+		goto nla_put_failure;
+#endif
+
 	return 0;
 
 nla_put_failure:
@@ -1353,6 +1446,12 @@ static const struct nla_policy ipip6_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_TOS]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_PMTUDISC]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_FLAGS]		= { .type = NLA_U16 },
+#ifdef CONFIG_IPV6_SIT_6RD
+	[IFLA_IPTUN_6RD_PREFIX]		= { .len = sizeof(struct in6_addr) },
+	[IFLA_IPTUN_6RD_RELAY_PREFIX]	= { .type = NLA_U32 },
+	[IFLA_IPTUN_6RD_PREFIXLEN]	= { .type = NLA_U16 },
+	[IFLA_IPTUN_6RD_RELAY_PREFIXLEN] = { .type = NLA_U16 },
+#endif
 };
 
 static struct rtnl_link_ops sit_link_ops __read_mostly = {
-- 
1.7.12

^ permalink raw reply related
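
The checks that ipip6_tunnel_update_6rd() performs in the patch above can be tried out in userspace. This is a minimal sketch under stated assumptions: the relay prefix is handled in host byte order (the kernel code works on a __be32 and uses htonl()), and all function names are hypothetical. It also shows why the zero-length case is handled separately: shifting a 32-bit value by 32 is undefined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Netmask for an IPv4 relay prefix length of 0..32, host byte order. */
static uint32_t relay_mask(unsigned int relay_prefixlen)
{
	if (relay_prefixlen == 0)
		return 0;	/* avoid an undefined shift by 32 bits */
	return UINT32_C(0xffffffff) << (32 - relay_prefixlen);
}

/* The 6RD prefix plus the embedded IPv4 bits must fit in 64 bits. */
static bool sixrd_lengths_ok(unsigned int prefixlen,
			     unsigned int relay_prefixlen)
{
	if (relay_prefixlen > 32)
		return false;
	return prefixlen + (32 - relay_prefixlen) <= 64;
}

/* The relay prefix must have no bits set beyond its prefix length. */
static bool sixrd_relay_prefix_ok(uint32_t relay_prefix,
				  unsigned int relay_prefixlen)
{
	return relay_prefixlen <= 32 &&
	       (relay_prefix & relay_mask(relay_prefixlen)) == relay_prefix;
}
```

For example, a relay prefix of 10.0.0.0 with length 8 passes, while 10.0.0.1 with length 8 is rejected because host bits are set.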

* [PATCH net-next 3/4] sit: allow to deactivate the creation of fb device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative: the fb device is still created unless setup_fb
is set to 0.

The fb device was also used by 6RD as a template for the default 6RD parameters.
Users could update this template for each netns, so when the fb device is not
created, this option does not exist. However, users can now set 6RD parameters
in the same netlink message that creates the tunnel (previously two ioctls were
needed), so they can set the right values directly.

The last point is about ISATAP. The potential routers list (prl) management has
not been converted to netlink, but the ioctl is performed directly on the
related device anyway, not on the fb device.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/sit.c | 62 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 504422d..d789a55 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -65,6 +65,15 @@
 #define HASH_SIZE  16
 #define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
+#ifdef CONFIG_IPV6_SIT_6RD
+static struct ip_tunnel_6rd_parm ip6rd_template;
+#endif
+
 static int ipip6_tunnel_init(struct net_device *dev);
 static void ipip6_tunnel_setup(struct net_device *dev);
 static void ipip6_dev_free(struct net_device *dev);
@@ -204,12 +213,9 @@ static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
 #ifdef CONFIG_IPV6_SIT_6RD
 	struct ip_tunnel *t = netdev_priv(dev);
 
-	if (t->dev == sitn->fb_tunnel_dev) {
-		ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0);
-		t->ip6rd.relay_prefix = 0;
-		t->ip6rd.prefixlen = 16;
-		t->ip6rd.relay_prefixlen = 0;
-	} else {
+	if (t->dev == sitn->fb_tunnel_dev || setup_fb == false)
+		memcpy(&t->ip6rd, &ip6rd_template, sizeof(t->ip6rd));
+	else {
 		struct ip_tunnel *t0 = netdev_priv(sitn->fb_tunnel_dev);
 		memcpy(&t->ip6rd, &t0->ip6rd, sizeof(t->ip6rd));
 	}
@@ -1501,26 +1507,30 @@ static int __net_init sit_init_net(struct net *net)
 	sitn->tunnels[2] = sitn->tunnels_r;
 	sitn->tunnels[3] = sitn->tunnels_r_l;
 
-	sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "sit0",
-					   ipip6_tunnel_setup);
-	if (!sitn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(sitn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						   "sit0", ipip6_tunnel_setup);
+		if (!sitn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(sitn->fb_tunnel_dev, net);
 
-	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
-	if (err)
-		goto err_dev_free;
+		err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_dev_free;
 
-	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
+		ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
 
-	if ((err = register_netdev(sitn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(sitn->fb_tunnel_dev);
+		t = netdev_priv(sitn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+	} else
+		sitn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1538,7 +1548,8 @@ static void __net_exit sit_exit_net(struct net *net)
 
 	rtnl_lock();
 	sit_destroy_tunnels(sitn, &list);
-	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
@@ -1565,6 +1576,13 @@ static int __init sit_init(void)
 
 	pr_info("IPv6 over IPv4 tunneling driver\n");
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	ipv6_addr_set(&ip6rd_template.prefix, htonl(0x20020000), 0, 0, 0);
+	ip6rd_template.relay_prefix = 0;
+	ip6rd_template.prefixlen = 16;
+	ip6rd_template.relay_prefixlen = 0;
+#endif
+
 	err = register_pernet_device(&sit_net_ops);
 	if (err < 0)
 		return err;
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 4/4] ip6tnl: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative: the fb device is still created unless setup_fb
is set to 0.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index bf3a549..fe2028e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -62,6 +62,11 @@ MODULE_DESCRIPTION("IPv6 tunneling device");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_NETDEV("ip6tnl0");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 #ifdef IP6_TNL_DEBUG
 #define IP6_TNL_TRACE(x...) pr_debug("%s:" x "\n", __func__)
 #else
@@ -1711,8 +1716,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 	}
 
 	t = rtnl_dereference(ip6n->tnls_wc[0]);
-	unregister_netdevice_queue(t->dev, &list);
-	unregister_netdevice_many(&list);
+	if (t) {
+		unregister_netdevice_queue(t->dev, &list);
+		unregister_netdevice_many(&list);
+	}
 }
 
 static int __net_init ip6_tnl_init_net(struct net *net)
@@ -1724,25 +1731,29 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	ip6n->tnls[0] = ip6n->tnls_wc;
 	ip6n->tnls[1] = ip6n->tnls_r_l;
 
-	err = -ENOMEM;
-	ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6tnl0",
-				      ip6_tnl_dev_setup);
+	if (setup_fb) {
+		err = -ENOMEM;
+		ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl),
+						"ip6tnl0", ip6_tnl_dev_setup);
 
-	if (!ip6n->fb_tnl_dev)
-		goto err_alloc_dev;
-	dev_net_set(ip6n->fb_tnl_dev, net);
+		if (!ip6n->fb_tnl_dev)
+			goto err_alloc_dev;
+		dev_net_set(ip6n->fb_tnl_dev, net);
 
-	err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
 
-	err = register_netdev(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = register_netdev(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
+
+		t = netdev_priv(ip6n->fb_tnl_dev);
 
-	t = netdev_priv(ip6n->fb_tnl_dev);
+		strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
+	} else
+		ip6n->fb_tnl_dev = NULL;
 
-	strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
 	return 0;
 
 err_register:
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative: the fb device is still created unless setup_fb
is set to 0.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv4/ipip.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index c26c171..9e11633 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -124,6 +124,11 @@ static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 static int ipip_net_id __read_mostly;
 struct ipip_net {
 	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
@@ -1022,25 +1027,29 @@ static int __net_init ipip_init_net(struct net *net)
 	ipn->tunnels[2] = ipn->tunnels_r;
 	ipn->tunnels[3] = ipn->tunnels_r_l;
 
-	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
-					   "tunl0",
-					   ipip_tunnel_setup);
-	if (!ipn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(ipn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						  "tunl0",
+						  ipip_tunnel_setup);
+		if (!ipn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(ipn->fb_tunnel_dev, net);
 
-	err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
-	if (err)
-		goto err_reg_dev;
+		err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	if ((err = register_netdev(ipn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(ipn->fb_tunnel_dev);
+		t = netdev_priv(ipn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+	} else
+		ipn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1057,7 +1066,8 @@ static void __net_exit ipip_exit_net(struct net *net)
 
 	rtnl_lock();
 	ipip_destroy_tunnels(ipn, &list);
-	unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 0/4] Allow to deactivate fb tunnels device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem

This series proposes a module option to avoid creating the fb tunnel device
for sit/ipip/ip6tnl.
It also adds netlink management for 6RD tunnels. The last piece of information
that still requires an ioctl is the management of the potential routers list
(prl) for ISATAP (note that this ioctl is not performed on the fb device).

As usual, the patch against iproute2 will be sent once these patches are
included and net-next is merged. I can send it on demand.

 include/uapi/linux/if_tunnel.h |   4 +
 net/ipv4/ipip.c                |  42 ++++----
 net/ipv6/ip6_tunnel.c          |  43 +++++----
 net/ipv6/sit.c                 | 211 ++++++++++++++++++++++++++++++++---------
 4 files changed, 221 insertions(+), 79 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply
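
The pattern the series describes, an optional fallback device that is created at init time only when a flag is set, can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel code: the struct and function names are hypothetical, plain heap allocation stands in for alloc_netdev()/register_netdev(), and teardown is keyed on the stored pointer (similar in spirit to the `if (t)` check in the ip6tnl patch) so that it does not depend on the flag's current value.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a net_device and the per-netns state. */
struct fb_dev {
	char name[16];
};

struct tunnel_net {
	struct fb_dev *fb_dev;	/* NULL when the fb device was not created */
};

/* Create the fallback device only when setup_fb is true. */
static int tunnel_net_init(struct tunnel_net *tn, bool setup_fb)
{
	tn->fb_dev = NULL;
	if (!setup_fb)
		return 0;

	tn->fb_dev = calloc(1, sizeof(*tn->fb_dev));
	if (!tn->fb_dev)
		return -1;
	strcpy(tn->fb_dev->name, "tunl0");
	return 0;
}

/* Tear down whatever init actually created. */
static void tunnel_net_exit(struct tunnel_net *tn)
{
	if (tn->fb_dev) {
		free(tn->fb_dev);
		tn->fb_dev = NULL;
	}
}
```

Any code path that used to dereference the fallback device unconditionally has to tolerate a NULL pointer once the device becomes optional.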

* Re: [PATCH iproute2 3/3] ip/ip6tunnel: fix update of tclass and flowlabel
From: Stephen Hemminger @ 2012-11-16 16:16 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1352906966-12932-3-git-send-email-nicolas.dichtel@6wind.com>

On Wed, 14 Nov 2012 16:29:26 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> When tclass or flowlabel field were updated, we only performed an OR with the
> new value. For example, it was not possible to reset tclass:
>   ip -6 tunnel change ip6tnl2 tclass 0
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
>  ip/ip6tunnel.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> index 7aaac61..fcc9f33 100644
> --- a/ip/ip6tunnel.c
> +++ b/ip/ip6tunnel.c
> @@ -173,6 +173,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   matches(*argv, "dsfield") == 0) {
>  			__u8 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_TCLASS;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_TCLASS;
>  			else {
> @@ -185,6 +186,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   strcmp(*argv, "fl") == 0) {
>  			__u32 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_FLOWLABEL;
>  			else {

All applied thanks.

^ permalink raw reply
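
The bug fixed by the quoted patch, updating a bit field with OR only, is easy to reproduce outside iproute2. The sketch below is a minimal illustration under assumptions: it works in host byte order, the mask value is a hypothetical stand-in for IP6_FLOWINFO_TCLASS (which iproute2 defines in network byte order), and the function names are invented.

```c
#include <stdint.h>

/* Hypothetical host-order mask for the traffic class bits of flowinfo. */
#define DEMO_TCLASS_MASK UINT32_C(0x0ff00000)

/* Old behaviour: bits from a previous value can never be cleared. */
static uint32_t set_tclass_or_only(uint32_t flowinfo, uint8_t tclass)
{
	return flowinfo | ((uint32_t)tclass << 20);
}

/* Fixed behaviour: clear the field first, then OR in the new value. */
static uint32_t set_tclass(uint32_t flowinfo, uint8_t tclass)
{
	flowinfo &= ~DEMO_TCLASS_MASK;
	return flowinfo | (((uint32_t)tclass << 20) & DEMO_TCLASS_MASK);
}
```

With the OR-only version, asking for tclass 0 leaves the old traffic class in place, which matches the `ip -6 tunnel change ip6tnl2 tclass 0` symptom described in the commit message.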

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Stephen Hemminger @ 2012-11-16 16:29 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem
In-Reply-To: <1353082456-21234-2-git-send-email-nicolas.dichtel@6wind.com>

On Fri, 16 Nov 2012 17:14:13 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
> The default is conservative.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

I am in favor of reducing clutter; we even have to put in special-case
code to ignore these stub devices in the Vyatta scripts. Module parameters are
a bit of a nuisance to deal with, but they may be the only way to do this kind
of thing and keep the required ABI.

I am not sure I can fully endorse this. The device may still have uses.
It is still useful for capturing "none of the above" packets
and is used to auto-load the module via module aliases.

^ permalink raw reply

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Neil Horman @ 2012-11-16 16:39 UTC (permalink / raw)
  To: Michele Baldessari
  Cc: linux-sctp, Thomas Graf, Vlad Yasevich, netdev, David S. Miller
In-Reply-To: <1352991680-12289-1-git-send-email-michele@acksyn.org>

On Thu, Nov 15, 2012 at 04:01:20PM +0100, Michele Baldessari wrote:
> The current SCTP stack is lacking a mechanism to have per association
> statistics. This is an implementation modeled after OpenSolaris'
> SCTP_GET_ASSOC_STATS.
> 
> Userspace part will follow on lksctp if/when there is a general ACK on
> this.
> 
> V2)
>   - Implement partial retrieval of stat struct to cope for future expansion
>   - Kill the rtxpackets counter as it cannot be precise anyway
>   - Rename outseqtsns to outofseqtsns to make it clearer that these are out
>     of sequence unexpected TSNs
>   - Move asoc->ipackets++ under a lock to avoid potential miscounts
>   - Fold asoc->opackets++ into the already existing asoc check
>   - Kill unneeded (q->asoc) test when increasing rtxchunks
>   - Do not count octrlchunks if sending failed (SCTP_XMIT_OK != 0)
>   - Don't count SHUTDOWNs as SACKs
>   - Move SCTP_GET_ASSOC_STATS to the private space API
>   - Adjust the len check in sctp_getsockopt_assoc_stats() to allow for
>     future struct growth
>   - Move association statistics in their own struct
>   - Update idupchunks when we send a SACK with dup TSNs
>   - return min_rto in max_rto when RTO has not changed. Also return the
>     transport when max_rto last changed.
> 
> Signed-off: Michele Baldessari <michele@acksyn.org>
> Acked-by: Thomas Graf <tgraf@suug.ch>

Yes, I think this is good, I still don't like the idea of having to do these via
an ioctl, but I suppose it fits well enough.
Neil

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply
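
The V2 changelog items "partial retrieval of stat struct" and "adjust the len check ... to allow for future struct growth" describe a common pattern: copy only as many bytes as the caller's buffer holds, so that an older caller keeps working after fields are appended to the struct. The following is a minimal userspace sketch of that pattern, not the SCTP code. The struct, its fields and the function are hypothetical, and memcpy() stands in for copy_to_user().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical statistics record; newer fields are appended at the end. */
struct demo_assoc_stats {
	unsigned int assoc_id;		/* must always fit in the buffer */
	unsigned long long in_packets;
	unsigned long long out_packets;
	unsigned long long rtx_chunks;	/* imagine this was added later */
};

/*
 * Copy as much of *src as the caller's buffer can hold.
 * Returns the number of bytes copied, or -1 if the buffer cannot even
 * hold the identifier field.
 */
static long demo_get_stats(void *buf, size_t buf_len,
			   const struct demo_assoc_stats *src)
{
	size_t len = buf_len;

	if (len < sizeof(src->assoc_id))
		return -1;
	if (len > sizeof(*src))
		len = sizeof(*src);
	memcpy(buf, src, len);
	return (long)len;
}
```

A caller built against the shorter, older layout simply receives the leading fields, and the returned length tells it how much was filled in.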

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem
In-Reply-To: <20121116082926.1c6cccd2@nehalam.linuxnetplumber.net>

On 16/11/2012 17:29, Stephen Hemminger wrote:
> On Fri, 16 Nov 2012 17:14:13 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
>> The default is conservative.
>>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>
>
> Although I am in favor of reducing clutter, and we even have to put in special case
> code to ignore these stub devices in the Vyatta scripts. Module parameters are bit of a nuisance to deal with, but maybe
> the only way for this kind of thing and keep the required ABI.
>
> Not sure if I can fully endorse this. The device may still have uses.
> It is still useful for capturing "none of the above" packets
If you need to capture these packets, you can still create a tunnel with local
any and remote any, even if the fb device has not been created.

> and is used to auto-load module via module aliases.
Right, but if the user uses netlink, the problem exists without these patches too.

By default, the fb device is created, so nothing changes unless you explicitly
set setup_fb to 0.

^ permalink raw reply
