[PATCH 0/3] Make mark-based routing work better with multiple separate networks.

* [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
@ 2014-05-09 17:36 Lorenzo Colitti
  2014-05-09 17:36 ` [PATCH 1/3] net: add a sysctl to reflect the fwmark on replies Lorenzo Colitti
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-09 17:36 UTC (permalink / raw)
  To: netdev; +Cc: jpa, davem, ja, hannes, Lorenzo Colitti

Mark-based routing (ip rule fwmark 17 lookup 100) combined with
either iptables marking (iptables -j MARK --set-mark 17) or
application-based marking (the SO_MARK setsockopt) is a good
way to deal with connecting simultaneously to multiple networks.

Each network can be given a routing table, and ip rules can
be configured to make different fwmarks select different
networks. Applications can select networks by setting
appropriate socket marks, and iptables rules can be used to
handle non-aware applications, enforce policy, etc.

This patch series improves functionality when mark-based routing
is used in this way. Current behaviour has the following
limitations:

1. Kernel-originated replies that are not associated with a
   socket always use a mark of zero. This means that, for
   example, when the kernel sends a ping reply or a TCP reset,
   it does not send it on the network from which it received the
   original packet.
2. Path MTU discovery, which is triggered by incoming packets,
   does not always work correctly, because the routing lookups it
   uses to clone routes do not take the fwmark into account and
   thus can happen in the wrong routing table.
3. Application-based marking works well for outbound connections,
   but does not work well for incoming connections. Marking a
   listening socket causes that socket to only accept
   connections from a given network, and sockets that are
   returned by accept() are not marked (and are thus not routed
   correctly).

#1 and #2 are addressed by a new net.ipv[46].fwmark_reflect
sysctl. This causes route lookups for kernel-generated replies
and PMTUD to use the fwmark of the packet that caused them.

#3 is addressed by a new net.ipv4.tcp_fwmark_accept sysctl,
which causes TCP sockets returned by accept() to be marked with
the same mark as the initial SYN packet.
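
As a rough sketch of how the sysctls named above could be toggled
from C, assuming the usual mapping of dotted sysctl names onto
/proc/sys paths. The helpers sysctl_path() and sysctl_set_bool()
are hypothetical, and the write only succeeds as root on a kernel
that carries these patches.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert a dotted sysctl name into its /proc/sys path, e.g.
 * "net.ipv4.fwmark_reflect" -> "/proc/sys/net/ipv4/fwmark_reflect".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int sysctl_path(const char *name, char *buf, size_t buflen)
{
	int n;
	char *p;

	assert(name != NULL && buf != NULL);
	n = snprintf(buf, buflen, "/proc/sys/%s", name);
	if (n < 0 || (size_t)n >= buflen)
		return -1;
	/* Only the sysctl name itself is rewritten, not the prefix. */
	for (p = buf + strlen("/proc/sys/"); *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}

/* Write "1" or "0" to the named sysctl. Returns 0 on success, -1 on error. */
static int sysctl_set_bool(const char *name, int on)
{
	char path[256];
	FILE *f;
	int rc;

	if (sysctl_path(name, path, sizeof(path)) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	rc = fputs(on ? "1\n" : "0\n", f) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

For example, sysctl_set_bool("net.ipv4.fwmark_reflect", 1) and
sysctl_set_bool("net.ipv4.tcp_fwmark_accept", 1) would enable
both behaviours for IPv4.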

Lorenzo Colitti (3):
  net: add a sysctl to reflect the fwmark on replies
  net: Use fwmark reflection in PMTU discovery.
  net: support marking accepting TCP sockets

 include/net/inet_sock.h          | 10 ++++++++++
 include/net/ip.h                 |  3 +++
 include/net/ipv6.h               |  3 +++
 include/net/netns/ipv4.h         |  3 +++
 include/net/netns/ipv6.h         |  1 +
 net/ipv4/icmp.c                  | 11 +++++++++--
 net/ipv4/inet_connection_sock.c  |  6 ++++--
 net/ipv4/ip_output.c             |  3 ++-
 net/ipv4/route.c                 |  7 +++++++
 net/ipv4/syncookies.c            |  3 ++-
 net/ipv4/sysctl_net_ipv4.c       | 14 ++++++++++++++
 net/ipv4/tcp_ipv4.c              |  1 +
 net/ipv6/icmp.c                  |  6 ++++++
 net/ipv6/inet6_connection_sock.c |  2 +-
 net/ipv6/route.c                 |  2 +-
 net/ipv6/syncookies.c            |  4 +++-
 net/ipv6/sysctl_net_ipv6.c       |  7 +++++++
 net/ipv6/tcp_ipv6.c              |  2 ++
 18 files changed, 79 insertions(+), 9 deletions(-)

-- 
1.9.1.423.g4596e3a

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] net: add a sysctl to reflect the fwmark on replies
  2014-05-09 17:36 [PATCH 0/3] Make mark-based routing work better with multiple separate networks Lorenzo Colitti
@ 2014-05-09 17:36 ` Lorenzo Colitti
  2014-05-09 17:37 ` [PATCH 2/3] net: Use fwmark reflection in PMTU discovery Lorenzo Colitti
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-09 17:36 UTC (permalink / raw)
  To: netdev; +Cc: jpa, davem, ja, hannes, Lorenzo Colitti

Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.

This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.

Tested using user-mode linux:
 - ICMP/ICMPv6 echo replies and errors.
 - TCP RST packets (IPv4 and IPv6).

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 include/net/ip.h           |  3 +++
 include/net/ipv6.h         |  3 +++
 include/net/netns/ipv4.h   |  2 ++
 include/net/netns/ipv6.h   |  1 +
 net/ipv4/icmp.c            | 11 +++++++++--
 net/ipv4/ip_output.c       |  3 ++-
 net/ipv4/sysctl_net_ipv4.c |  7 +++++++
 net/ipv6/icmp.c            |  6 ++++++
 net/ipv6/sysctl_net_ipv6.c |  7 +++++++
 net/ipv6/tcp_ipv6.c        |  1 +
 10 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 16146b6..a4b52c4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -231,6 +231,9 @@ void ipfrag_init(void);
 
 void ip_static_sysctl_init(void);
 
+#define IP4_REPLY_MARK(net, mark) \
+	((net)->ipv4.sysctl_fwmark_reflect ? (mark) : 0)
+
 static inline bool ip_is_fragment(const struct iphdr *iph)
 {
 	return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 5b40ad2..ba810d0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -113,6 +113,9 @@ struct frag_hdr {
 #define	IP6_MF		0x0001
 #define	IP6_OFFSET	0xFFF8
 
+#define IP6_REPLY_MARK(net, mark) \
+	((net)->ipv6.sysctl.fwmark_reflect ? (mark) : 0)
+
 #include <net/sock.h>
 
 /* sysctls */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 80f500a..8e1a9c0 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -72,6 +72,8 @@ struct netns_ipv4 {
 	int sysctl_ip_no_pmtu_disc;
 	int sysctl_ip_fwd_use_pmtu;
 
+	int sysctl_fwmark_reflect;
+
 	kgid_t sysctl_ping_group_range[2];
 
 	atomic_t dev_addr_genid;
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 21edaf1..19d3446 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -30,6 +30,7 @@ struct netns_sysctl_ipv6 {
 	int flowlabel_consistency;
 	int icmpv6_time;
 	int anycast_src_echo_reply;
+	int fwmark_reflect;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index fe52666..79c3d94 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -337,6 +337,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	struct sock *sk;
 	struct inet_sock *inet;
 	__be32 daddr, saddr;
+	u32 mark = IP4_REPLY_MARK(net, skb->mark);
 
 	if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb))
 		return;
@@ -349,6 +350,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	icmp_param->data.icmph.checksum = 0;
 
 	inet->tos = ip_hdr(skb)->tos;
+	sk->sk_mark = mark;
 	daddr = ipc.addr = ip_hdr(skb)->saddr;
 	saddr = fib_compute_spec_dst(skb);
 	ipc.opt = NULL;
@@ -364,6 +366,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	memset(&fl4, 0, sizeof(fl4));
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
+	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
@@ -382,7 +385,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
 					struct flowi4 *fl4,
 					struct sk_buff *skb_in,
 					const struct iphdr *iph,
-					__be32 saddr, u8 tos,
+					__be32 saddr, u8 tos, u32 mark,
 					int type, int code,
 					struct icmp_bxm *param)
 {
@@ -394,6 +397,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->daddr = (param->replyopts.opt.opt.srr ?
 		      param->replyopts.opt.opt.faddr : iph->saddr);
 	fl4->saddr = saddr;
+	fl4->flowi4_mark = mark;
 	fl4->flowi4_tos = RT_TOS(tos);
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
@@ -491,6 +495,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	struct flowi4 fl4;
 	__be32 saddr;
 	u8  tos;
+	u32 mark;
 	struct net *net;
 	struct sock *sk;
 
@@ -592,6 +597,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	tos = icmp_pointers[type].error ? ((iph->tos & IPTOS_TOS_MASK) |
 					   IPTOS_PREC_INTERNETCONTROL) :
 					  iph->tos;
+	mark = IP4_REPLY_MARK(net, skb_in->mark);
 
 	if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb_in))
 		goto out_unlock;
@@ -608,13 +614,14 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	icmp_param->skb	  = skb_in;
 	icmp_param->offset = skb_network_offset(skb_in);
 	inet_sk(sk)->tos = tos;
+	sk->sk_mark = mark;
 	ipc.addr = iph->saddr;
 	ipc.opt = &icmp_param->replyopts.opt;
 	ipc.tx_flags = 0;
 	ipc.ttl = 0;
 	ipc.tos = -1;
 
-	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos,
+	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos, mark,
 			       type, code, icmp_param);
 	if (IS_ERR(rt))
 		goto out_unlock;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1cbeba5..dd1a0d6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1501,7 +1501,8 @@ void ip_send_unicast_reply(struct net *net, struct sk_buff *skb, __be32 daddr,
 			daddr = replyopts.opt.opt.faddr;
 	}
 
-	flowi4_init_output(&fl4, arg->bound_dev_if, 0,
+	flowi4_init_output(&fl4, arg->bound_dev_if,
+			   IP4_REPLY_MARK(net, skb->mark),
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
 			   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 44eba05..e40a738 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -838,6 +838,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "fwmark_reflect",
+		.data		= &init_net.ipv4.sysctl_fwmark_reflect,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 8d39527..f6c84a6 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -400,6 +400,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 	int len;
 	int hlimit;
 	int err = 0;
+	u32 mark = IP6_REPLY_MARK(net, skb->mark);
 
 	if ((u8 *)hdr < skb->head ||
 	    (skb_network_header(skb) + sizeof(*hdr)) > skb_tail_pointer(skb))
@@ -466,6 +467,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 	fl6.daddr = hdr->saddr;
 	if (saddr)
 		fl6.saddr = *saddr;
+	fl6.flowi6_mark = mark;
 	fl6.flowi6_oif = iif;
 	fl6.fl6_icmp_type = type;
 	fl6.fl6_icmp_code = code;
@@ -474,6 +476,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 	sk = icmpv6_xmit_lock(net);
 	if (sk == NULL)
 		return;
+	sk->sk_mark = mark;
 	np = inet6_sk(sk);
 
 	if (!icmpv6_xrlim_allow(sk, type, &fl6))
@@ -551,6 +554,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	int err = 0;
 	int hlimit;
 	u8 tclass;
+	u32 mark = IP6_REPLY_MARK(net, skb->mark);
 
 	saddr = &ipv6_hdr(skb)->daddr;
 
@@ -569,11 +573,13 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 		fl6.saddr = *saddr;
 	fl6.flowi6_oif = skb->dev->ifindex;
 	fl6.fl6_icmp_type = ICMPV6_ECHO_REPLY;
+	fl6.flowi6_mark = mark;
 	security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
 
 	sk = icmpv6_xmit_lock(net);
 	if (sk == NULL)
 		return;
+	sk->sk_mark = mark;
 	np = inet6_sk(sk);
 
 	if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 7f405a1..058f3ec 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -38,6 +38,13 @@ static struct ctl_table ipv6_table_template[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "fwmark_reflect",
+		.data		= &init_net.ipv6.sysctl.fwmark_reflect,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7fa6743..994572c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -802,6 +802,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 		fl6.flowi6_oif = inet6_iif(skb);
 	else
 		fl6.flowi6_oif = oif;
+	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
 	fl6.fl6_dport = t1->dest;
 	fl6.fl6_sport = t1->source;
 	security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
-- 
1.9.1.423.g4596e3a

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] net: Use fwmark reflection in PMTU discovery.
  2014-05-09 17:36 [PATCH 0/3] Make mark-based routing work better with multiple separate networks Lorenzo Colitti
  2014-05-09 17:36 ` [PATCH 1/3] net: add a sysctl to reflect the fwmark on replies Lorenzo Colitti
@ 2014-05-09 17:37 ` Lorenzo Colitti
  2014-05-09 17:37 ` [PATCH 3/3] net: support marking accepting TCP sockets Lorenzo Colitti
  2014-05-12 12:21 ` [PATCH 0/3] Make mark-based routing work better with multiple separate networks sowmini varadhan
  3 siblings, 0 replies; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-09 17:37 UTC (permalink / raw)
  To: netdev; +Cc: jpa, davem, ja, hannes, Lorenzo Colitti

Currently, routing lookups used for Path MTU Discovery in the
absence of a socket or on unmarked sockets use a mark of 0.
This causes PMTUD not to work when using routing based on
netfilter fwmark mangling and fwmark ip rules, such as:

  iptables -j MARK --set-mark 17
  ip rule add fwmark 17 lookup 100

This patch causes these route lookups to use the fwmark from the
received ICMP error when the fwmark_reflect sysctl is enabled.
This allows the administrator to make PMTUD work by configuring
appropriate fwmark rules to mark the inbound ICMP packets.
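
A small userspace model of the mark selection, for illustration.
The function pmtu_lookup_mark() is a hypothetical name; it only
restates the precedence used in the diff below, where a nonzero
mark from the caller or socket wins and the ICMP error's mark is
used as a fallback when reflection is enabled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose the mark for the PMTU route lookup:
 *  - a nonzero caller/socket mark is used as-is;
 *  - otherwise the mark of the received ICMP error is used,
 *    but only if fwmark_reflect is enabled;
 *  - otherwise the lookup uses a mark of 0, as before.
 */
static uint32_t pmtu_lookup_mark(int fwmark_reflect, uint32_t caller_mark,
				 uint32_t icmp_mark)
{
	if (caller_mark)
		return caller_mark;
	return fwmark_reflect ? icmp_mark : 0;
}

/* Worked cases for the three branches above. */
static void pmtu_mark_examples(void)
{
	assert(pmtu_lookup_mark(1, 0, 17) == 17); /* reflected */
	assert(pmtu_lookup_mark(0, 0, 17) == 0);  /* sysctl off */
	assert(pmtu_lookup_mark(1, 5, 17) == 5);  /* explicit mark wins */
}
```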

Black-box tested using user-mode linux by pointing different
fwmarks at routing tables egressing on different interfaces, and
using iptables mangling to mark packets inbound on each interface
with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
work as expected when mark reflection is enabled and fail when
it is disabled.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 net/ipv4/route.c | 7 +++++++
 net/ipv6/route.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index db1e0da..50e1e0f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -993,6 +993,9 @@ void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
 	struct flowi4 fl4;
 	struct rtable *rt;

+	if (!mark)
+		mark = IP4_REPLY_MARK(net, skb->mark);
+
 	__build_flow_key(&fl4, NULL, iph, oif,
 			 RT_TOS(iph->tos), protocol, mark, flow_flags);
 	rt = __ip_route_output_key(net, &fl4);
@@ -1010,6 +1013,10 @@ static void __ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	struct rtable *rt;

 	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
+
+	if (!fl4.flowi4_mark)
+		fl4.flowi4_mark = IP4_REPLY_MARK(sock_net(sk), skb->mark);
+
 	rt = __ip_route_output_key(sock_net(sk), &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_rt_update_pmtu(rt, &fl4, mtu);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4011617..63fbddb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1176,7 +1176,7 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,

 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = oif;
-	fl6.flowi6_mark = mark;
+	fl6.flowi6_mark = mark ? mark : IP6_REPLY_MARK(net, skb->mark);
 	fl6.daddr = iph->daddr;
 	fl6.saddr = iph->saddr;
 	fl6.flowlabel = ip6_flowinfo(iph);
-- 
1.9.1.423.g4596e3a

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] net: support marking accepting TCP sockets
  2014-05-09 17:36 [PATCH 0/3] Make mark-based routing work better with multiple separate networks Lorenzo Colitti
  2014-05-09 17:36 ` [PATCH 1/3] net: add a sysctl to reflect the fwmark on replies Lorenzo Colitti
  2014-05-09 17:37 ` [PATCH 2/3] net: Use fwmark reflection in PMTU discovery Lorenzo Colitti
@ 2014-05-09 17:37 ` Lorenzo Colitti
  2014-05-09 18:05   ` Eric Dumazet
  2014-05-12 12:21 ` [PATCH 0/3] Make mark-based routing work better with multiple separate networks sowmini varadhan
  3 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-09 17:37 UTC (permalink / raw)
  To: netdev; +Cc: jpa, davem, ja, hannes, Lorenzo Colitti

When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.

This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.

This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established.  If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
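
The decision described above can be sketched in userspace as
follows. request_mark() is a hypothetical name that mirrors the
inet_request_mark() helper added by this patch, with the kernel
structures replaced by plain integers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the mark stored in the request sock (and later copied to
 * the accepted socket): an unmarked listener inherits the SYN's
 * fwmark when tcp_fwmark_accept is enabled; in every other case
 * the listener's own mark is used.
 */
static uint32_t request_mark(uint32_t listener_mark, int tcp_fwmark_accept,
			     uint32_t syn_mark)
{
	if (!listener_mark && tcp_fwmark_accept)
		return syn_mark;
	return listener_mark;
}

/* Worked cases for the behaviour described in the commit message. */
static void request_mark_examples(void)
{
	assert(request_mark(0, 1, 17) == 17); /* unmarked listener, sysctl on */
	assert(request_mark(0, 0, 17) == 0);  /* sysctl off: behaviour as today */
	assert(request_mark(5, 1, 17) == 5);  /* marked listener keeps its mark */
}
```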

Black-box tested using user-mode linux:

- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
  mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
  incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 include/net/inet_sock.h          | 10 ++++++++++
 include/net/netns/ipv4.h         |  1 +
 net/ipv4/inet_connection_sock.c  |  6 ++++--
 net/ipv4/syncookies.c            |  3 ++-
 net/ipv4/sysctl_net_ipv4.c       |  7 +++++++
 net/ipv4/tcp_ipv4.c              |  1 +
 net/ipv6/inet6_connection_sock.c |  2 +-
 net/ipv6/syncookies.c            |  4 +++-
 net/ipv6/tcp_ipv6.c              |  1 +
 9 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1833c3f..b1edf17 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -90,6 +90,7 @@ struct inet_request_sock {
 	kmemcheck_bitfield_end(flags);
 	struct ip_options_rcu	*opt;
 	struct sk_buff		*pktopts;
+	u32                     ir_mark;
 };
 
 static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
@@ -97,6 +98,15 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
 	return (struct inet_request_sock *)sk;
 }
 
+static inline u32 inet_request_mark(struct sock *sk, struct sk_buff *skb)
+{
+	if (!sk->sk_mark && sock_net(sk)->ipv4.sysctl_tcp_fwmark_accept) {
+		return skb->mark;
+	} else {
+		return sk->sk_mark;
+	}
+}
+
 struct inet_cork {
 	unsigned int		flags;
 	__be32			addr;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e1a9c0..c701843 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -73,6 +73,7 @@ struct netns_ipv4 {
 	int sysctl_ip_fwd_use_pmtu;
 
 	int sysctl_fwmark_reflect;
+	int sysctl_tcp_fwmark_accept;
 
 	kgid_t sysctl_ping_group_range[2];
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0d1e2cb..5ae71ec 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -408,7 +408,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	struct net *net = sock_net(sk);
 	int flags = inet_sk_flowi_flags(sk);
 
-	flowi4_init_output(fl4, sk->sk_bound_dev_if, sk->sk_mark,
+	flowi4_init_output(fl4, sk->sk_bound_dev_if, ireq->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol,
 			   flags,
@@ -445,7 +445,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 
 	rcu_read_lock();
 	opt = rcu_dereference(newinet->inet_opt);
-	flowi4_init_output(fl4, sk->sk_bound_dev_if, sk->sk_mark,
+	flowi4_init_output(fl4, sk->sk_bound_dev_if, inet_rsk(req)->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol, inet_sk_flowi_flags(sk),
 			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->ir_rmt_addr,
@@ -680,6 +680,8 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 		inet_sk(newsk)->inet_sport = htons(inet_rsk(req)->ir_num);
 		newsk->sk_write_space = sk_stream_write_space;
 
+		newsk->sk_mark = inet_rsk(req)->ir_mark;
+
 		newicsk->icsk_retransmits = 0;
 		newicsk->icsk_backoff	  = 0;
 		newicsk->icsk_probes_out  = 0;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index f2ed13c..c86624b 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -303,6 +303,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->ir_rmt_port	= th->source;
 	ireq->ir_loc_addr	= ip_hdr(skb)->daddr;
 	ireq->ir_rmt_addr	= ip_hdr(skb)->saddr;
+	ireq->ir_mark		= inet_request_mark(sk, skb);
 	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
@@ -339,7 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	 * hasn't changed since we received the original syn, but I see
 	 * no easy way to do this.
 	 */
-	flowi4_init_output(&fl4, sk->sk_bound_dev_if, sk->sk_mark,
+	flowi4_init_output(&fl4, sk->sk_bound_dev_if, ireq->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE, IPPROTO_TCP,
 			   inet_sk_flowi_flags(sk),
 			   (opt && opt->srr) ? opt->faddr : ireq->ir_rmt_addr,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e40a738..6480281 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -845,6 +845,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_fwmark_accept",
+		.data		= &init_net.ipv4.sysctl_tcp_fwmark_accept,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad166dc..6c1b6f6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1507,6 +1507,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	ireq->ir_rmt_addr = saddr;
 	ireq->no_srccheck = inet_sk(sk)->transparent;
 	ireq->opt = tcp_v4_save_options(skb);
+	ireq->ir_mark = inet_request_mark(sk, skb);
 
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index d4ade34..a245e5d 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -81,7 +81,7 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
 	final_p = fl6_update_dst(fl6, np->opt, &final);
 	fl6->saddr = ireq->ir_v6_loc_addr;
 	fl6->flowi6_oif = ireq->ir_iif;
-	fl6->flowi6_mark = sk->sk_mark;
+	fl6->flowi6_mark = ireq->ir_mark;
 	fl6->fl6_dport = ireq->ir_rmt_port;
 	fl6->fl6_sport = htons(ireq->ir_num);
 	security_req_classify_flow(req, flowi6_to_flowi(fl6));
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index bb53a5e7..a822b88 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -216,6 +216,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	    ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
 		ireq->ir_iif = inet6_iif(skb);
 
+	ireq->ir_mark = inet_request_mark(sk, skb);
+
 	req->expires = 0UL;
 	req->num_retrans = 0;
 	ireq->ecn_ok		= ecn_ok;
@@ -242,7 +244,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		final_p = fl6_update_dst(&fl6, np->opt, &final);
 		fl6.saddr = ireq->ir_v6_loc_addr;
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
-		fl6.flowi6_mark = sk->sk_mark;
+		fl6.flowi6_mark = ireq->ir_mark;
 		fl6.fl6_dport = ireq->ir_rmt_port;
 		fl6.fl6_sport = inet_sk(sk)->inet_sport;
 		security_req_classify_flow(req, flowi6_to_flowi(&fl6));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 994572c..37c8334 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1017,6 +1017,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		TCP_ECN_create_request(req, skb, sock_net(sk));
 
 	ireq->ir_iif = sk->sk_bound_dev_if;
+	ireq->ir_mark = inet_request_mark(sk, skb);
 
 	/* So that link locals have meaning */
 	if (!sk->sk_bound_dev_if &&
-- 
1.9.1.423.g4596e3a

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] net: support marking accepting TCP sockets
  2014-05-09 17:37 ` [PATCH 3/3] net: support marking accepting TCP sockets Lorenzo Colitti
@ 2014-05-09 18:05   ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2014-05-09 18:05 UTC (permalink / raw)
  To: Lorenzo Colitti; +Cc: netdev, jpa, davem, ja, hannes

On Sat, 2014-05-10 at 02:37 +0900, Lorenzo Colitti wrote:
> When using mark-based routing, sockets returned from accept()
> may need to be marked differently depending on the incoming
> connection request.
> 
> This is the case, for example, if different socket marks identify
> different networks: a listening socket may want to accept
> connections from all networks, but each connection should be
> marked with the network that the request came in on, so that
> subsequent packets are sent on the correct network.
> 
> This patch adds a sysctl to mark TCP sockets based on the fwmark
> of the incoming SYN packet. If enabled, and an unmarked socket
> receives a SYN, then the SYN packet's fwmark is written to the
> connection's inet_request_sock, and later written back to the
> accepted socket when the connection is established.  If the
> socket already has a nonzero mark, then the behaviour is the same
> as it is today, i.e., the listening socket's fwmark is used.
> 
> Black-box tested using user-mode linux:
> 
> - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
>   mark of the incoming SYN packet.
> - The socket returned by accept() is marked with the mark of the
>   incoming SYN packet.
> - Tested with syncookies=1 and syncookies=2.
> 
> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
> ---
>  include/net/inet_sock.h          | 10 ++++++++++
>  include/net/netns/ipv4.h         |  1 +
>  net/ipv4/inet_connection_sock.c  |  6 ++++--
>  net/ipv4/syncookies.c            |  3 ++-
>  net/ipv4/sysctl_net_ipv4.c       |  7 +++++++
>  net/ipv4/tcp_ipv4.c              |  1 +
>  net/ipv6/inet6_connection_sock.c |  2 +-
>  net/ipv6/syncookies.c            |  4 +++-
>  net/ipv6/tcp_ipv6.c              |  1 +
>  9 files changed, 30 insertions(+), 5 deletions(-)
> 

Patch looks mostly OK.

> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index 1833c3f..b1edf17 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -90,6 +90,7 @@ struct inet_request_sock {
>  	kmemcheck_bitfield_end(flags);
>  	struct ip_options_rcu	*opt;
>  	struct sk_buff		*pktopts;
> +	u32                     ir_mark;

Move this before the *opt field to avoid an extra 4byte hole on 64bit
arches.
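
To illustrate the padding effect in general terms, here is a
simplified userspace sketch. The two structs are hypothetical
stand-ins, not the real struct inet_request_sock, so the exact
number of bytes saved will differ from the kernel's layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* As posted: the u32 is appended after the two pointers. */
struct rsk_mark_last {
	uint16_t flags;		/* stands in for the bitfield block */
	void *opt;
	void *pktopts;
	uint32_t ir_mark;
};

/* As suggested: the u32 sits before *opt, inside the existing hole. */
struct rsk_mark_before_opt {
	uint16_t flags;
	uint32_t ir_mark;
	void *opt;
	void *pktopts;
};

/* Bytes saved by the reordering on the current ABI (may be 0). */
static size_t bytes_saved(void)
{
	assert(sizeof(struct rsk_mark_before_opt) <=
	       sizeof(struct rsk_mark_last));
	return sizeof(struct rsk_mark_last) -
	       sizeof(struct rsk_mark_before_opt);
}
```

On ABIs with 8-byte pointers the second layout is typically
smaller, because the u32 fills alignment padding that already
exists in front of the first pointer.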

>  };
>  
>  static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
> @@ -97,6 +98,15 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
>  	return (struct inet_request_sock *)sk;
>  }
>  
> +static inline u32 inet_request_mark(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (!sk->sk_mark && sock_net(sk)->ipv4.sysctl_tcp_fwmark_accept) {
> +		return skb->mark;
> +	} else {
> +		return sk->sk_mark;
> +	}

No need for {} blocks:

	if (!sk->sk_mark && sock_net(sk)->ipv4.sysctl_tcp_fwmark_accept)
		return skb->mark;

	return sk->sk_mark;

>  struct inet_cork {
>  	unsigned int		flags;
>  	__be32			addr;

...
> index e40a738..6480281 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -845,6 +845,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "tcp_fwmark_accept",
> +		.data		= &init_net.ipv4.sysctl_tcp_fwmark_accept,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  	{ }
>  };
>  

Please add a relevant section to Documentation/networking/ip-sysctl.txt

Thanks

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-09 17:36 [PATCH 0/3] Make mark-based routing work better with multiple separate networks Lorenzo Colitti
                   ` (2 preceding siblings ...)
  2014-05-09 17:37 ` [PATCH 3/3] net: support marking accepting TCP sockets Lorenzo Colitti
@ 2014-05-12 12:21 ` sowmini varadhan
  2014-05-12 19:58   ` Lorenzo Colitti
  3 siblings, 1 reply; 14+ messages in thread
From: sowmini varadhan @ 2014-05-12 12:21 UTC (permalink / raw)
  To: Lorenzo Colitti; +Cc: netdev, jpa, David Miller, ja, hannes

Hi,

I haven't read your patch, but the description triggers
some questions in my mind - what is the relationship
between this proposal and the current linux vrf implementation
(which afaik is done using network namespaces)?  Multiple
IP routing tables seem to be motivated by forwarding policy;
isn't your use case really a VRF one?

--Sowmini

On Fri, May 9, 2014 at 1:36 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Mark-based routing (ip rule fwmark 17 lookup 100) combined with
> either iptables marking (iptables -j MARK --set-mark 17) or
> application-based marking (the SO_MARK setsockopt) is a good
> way to deal with connecting simultaneously to multiple networks.
>
> Each network can be given a routing table, and ip rules can
> be configured to make different fwmarks select different
> networks. Applications can select networks by setting
> appropriate socket marks, and iptables rules can be used to
> handle non-aware applications, enforce policy, etc.
>
> This patch series improves functionality when mark-based routing
> is used in this way. Current behaviour has the following
> limitations:
>
> 1. Kernel-originated replies that are not associated with a
>    socket always use a mark of zero. This means that, for
>    example, when the kernel sends a ping reply or a TCP reset,
>    it does not send it on the network from which it received the
>    original packet.
> 2. Path MTU discovery, which is triggered by incoming packets,
>    does not always work correctly, because the routing lookups it
>    uses to clone routes do not take the fwmark into account and
>    thus can happen in the wrong routing table.
> 3. Application-based marking works well for outbound connections,
>    but does not work well for incoming connections. Marking a
>    listening socket causes that socket to only accept
>    connections from a given network, and sockets that are
>    returned by accept() are not marked (and are thus not routed
>    correctly).
>
> #1 and #2 are addressed by a new net.ipv[46].fwmark_reflect
> sysctl. This causes route lookups for kernel-generated replies
> and PMTUD to use the fwmark of the packet that caused them.
>
> #3 is addressed by a new net.ipv4.tcp_fwmark_accept sysctl,
> which causes TCP sockets returned by accept() to be marked with
> the same mark as the initial SYN packet.
>
> Lorenzo Colitti (3):
>   net: add a sysctl to reflect the fwmark on replies
>   net: Use fwmark reflection in PMTU discovery.
>   net: support marking accepting TCP sockets
>
>  include/net/inet_sock.h          | 10 ++++++++++
>  include/net/ip.h                 |  3 +++
>  include/net/ipv6.h               |  3 +++
>  include/net/netns/ipv4.h         |  3 +++
>  include/net/netns/ipv6.h         |  1 +
>  net/ipv4/icmp.c                  | 11 +++++++++--
>  net/ipv4/inet_connection_sock.c  |  6 ++++--
>  net/ipv4/ip_output.c             |  3 ++-
>  net/ipv4/route.c                 |  7 +++++++
>  net/ipv4/syncookies.c            |  3 ++-
>  net/ipv4/sysctl_net_ipv4.c       | 14 ++++++++++++++
>  net/ipv4/tcp_ipv4.c              |  1 +
>  net/ipv6/icmp.c                  |  6 ++++++
>  net/ipv6/inet6_connection_sock.c |  2 +-
>  net/ipv6/route.c                 |  2 +-
>  net/ipv6/syncookies.c            |  4 +++-
>  net/ipv6/sysctl_net_ipv6.c       |  7 +++++++
>  net/ipv6/tcp_ipv6.c              |  2 ++
>  18 files changed, 79 insertions(+), 9 deletions(-)
>
> --
> 1.9.1.423.g4596e3a
>


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-12 12:21 ` [PATCH 0/3] Make mark-based routing work better with multiple separate networks sowmini varadhan
@ 2014-05-12 19:58   ` Lorenzo Colitti
  2014-05-12 21:09     ` sowmini varadhan
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-12 19:58 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Mon, May 12, 2014 at 9:21 PM, sowmini varadhan <sowmini05@gmail.com> wrote:
> what is the relationship
> between this proposal and the current linux vrf implementation

I'm not sure what you mean by VRF implementation, do you have a
pointer to the code?

> (which afaik is done using network namespaces)?

Socket marking is a per-socket operation, and a given process may have
many sockets with many different marks. I don't think it's possible to
do this using namespaces, because AFAIK a process can only be in one
namespace.
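
A minimal sketch of the per-socket point, assuming Linux and Python 3. The
helper name open_marked_socket and the mark values 17 and 18 are illustrative
only, and the fallback value 36 for SO_MARK is an assumption that holds on
common Linux architectures. Setting a mark needs CAP_NET_ADMIN, so the sketch
reports failure instead of raising:

```python
import socket

# SO_MARK is Linux-specific; fall back to the usual Linux value (36 on
# common architectures) if this Python build does not export the constant.
SO_MARK = getattr(socket, "SO_MARK", 36)

def open_marked_socket(mark):
    """Return (sock, marked): a TCP socket and whether the mark was applied."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_MARK, mark)
        return sock, True
    except OSError:
        # Without CAP_NET_ADMIN the kernel refuses; the socket stays unmarked.
        return sock, False

# One process, two sockets, two different marks (hypothetical values).
socks = [open_marked_socket(m) for m in (17, 18)]
for s, marked in socks:
    print(s.fileno(), "marked" if marked else "unmarked (no CAP_NET_ADMIN?)")
```

Both sockets live in the same process and the same namespace; only the mark,
and therefore the ip rule that matches, differs.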


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-12 19:58   ` Lorenzo Colitti
@ 2014-05-12 21:09     ` sowmini varadhan
  2014-05-12 22:53       ` Lorenzo Colitti
  0 siblings, 1 reply; 14+ messages in thread
From: sowmini varadhan @ 2014-05-12 21:09 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Mon, May 12, 2014 at 3:58 PM, Lorenzo Colitti <lorenzo@google.com> wrote:

> I'm not sure what you mean by VRF implementation, do you have a
> pointer to the code?
>
>> (which afaik is done using network namespaces)?

See http://lists.openvz.org/pipermail/devel/2008-October/015055.html

I'm not sure what the current status of that patch is (perhaps someone on
this list can comment? I thought it had been merged into kernel.git
at some point, but evidently not), but after
http://lwn.net/Articles/407495/, a single
process should be able to open sockets in different namespaces.

> Socket marking is a per-socket operation, and a given process may have
> many sockets with many different marks. I don't think it's possible to
> do this using namespaces, because AFAIK a process can only be in one
> namespace.


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-12 21:09     ` sowmini varadhan
@ 2014-05-12 22:53       ` Lorenzo Colitti
  2014-05-13 10:49         ` sowmini varadhan
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-12 22:53 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Tue, May 13, 2014 at 6:09 AM, sowmini varadhan <sowmini05@gmail.com> wrote:
> http://lwn.net/Articles/407495/, a single
> process should be able to open sockets in different namespaces.

Other things that you can't do with namespaces are have the same physical
interface (and the same IP address?) in two different namespaces, or
have the same listening socket in two different namespaces. Namespaces
are not a panacea.


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-12 22:53       ` Lorenzo Colitti
@ 2014-05-13 10:49         ` sowmini varadhan
  2014-05-13 15:28           ` Lorenzo Colitti
  2014-05-13 17:12           ` David Ahern
  0 siblings, 2 replies; 14+ messages in thread
From: sowmini varadhan @ 2014-05-13 10:49 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Mon, May 12, 2014 at 6:53 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> On Tue, May 13, 2014 at 6:09 AM, sowmini varadhan <sowmini05@gmail.com> wrote:
>> http://lwn.net/Articles/407495/, a single
>> process should be able to open sockets in different namespaces.
>
> Other things that you can't do with namespaces are have the same physical
> interface (and the same IP address?) in two different namespaces, or
> have the same listening socket in two different namespaces. Namespaces
> are not a panacea.

So this thread got unintentionally cut off by my not selecting Reply-All
in the google gui.

But to summarize a couple of private exchanges between Lorenzo and
me, it still appears to me that the use-case here is what routers
consider a "VRF". Thus it makes sense to add code (if/as needed)
to fix the VRF support in linux, rather than adding yet-another-one-off
feature with socket marking.

Specifically addressing the two issues raised above:
- yes, it is true that an interface can exist in only one netns at a time.
  But the same ip address can exist in multiple netns-es. If the
  app wants to listen to a proper-subset of networks that go in/out
  a single physical interface, you can use macvlan, and assign the
  macvlans to the desired netns.
- "same listening socket for multiple namespaces". Clearly that problem
  also exists for the socket-marks approach. But again this can actually
  be solved (for both netns and sock-marks) by having the application
  set up separate sockets for each netns (netns or whatever) of interest,
  and build an epoll fd over that set of sockets. No need for any kernel
  code for this.

  Or you can optimize this by building infra in the kernel to support the
  Wildcard ALL_VRFS notion. Or add even more code to support something
  less than ALL_VRFS.
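
A rough sketch of the "separate sockets plus an epoll fd" idea, assuming Linux
(select.epoll) and Python 3. Two loopback listeners stand in for "one socket
per network"; in the netns case each socket would have to be created inside
its own namespace, which is not shown here. The helper names make_listener and
accept_ready are illustrative only:

```python
import select
import socket

def make_listener(addr):
    """Non-blocking listening TCP socket on addr with a kernel-chosen port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((addr, 0))
    s.listen(8)
    s.setblocking(False)
    return s

def accept_ready(ep, listeners, timeout=1.0):
    """Accept one connection from each listener that epoll reports readable."""
    accepted = []
    for fd, events in ep.poll(timeout):
        if events & select.EPOLLIN:
            conn, peer = listeners[fd].accept()
            accepted.append((fd, conn, peer))
    return accepted

# Register every listener with a single epoll object, keyed by fd so the
# application can tell which network a connection arrived on.
listeners = {}
ep = select.epoll()
for _ in range(2):
    s = make_listener("127.0.0.1")
    listeners[s.fileno()] = s
    ep.register(s.fileno(), select.EPOLLIN)
```

The fd returned alongside each accepted connection identifies the listener,
and hence the network, without any extra lookup.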

My point is: what is the real networking construct that this use-case needs?
Isn't it what routers describe as the VRF? If yes, then shouldn't
we have one single way of supporting that in linux, instead of having
a little-bit-here and a little-bit-there?

--Sowmini


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-13 10:49         ` sowmini varadhan
@ 2014-05-13 15:28           ` Lorenzo Colitti
  2014-05-13 15:38             ` sowmini varadhan
  2014-05-13 17:12           ` David Ahern
  1 sibling, 1 reply; 14+ messages in thread
From: Lorenzo Colitti @ 2014-05-13 15:28 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Tue, May 13, 2014 at 3:49 AM, sowmini varadhan <sowmini05@gmail.com> wrote:
> Specifically addressing the two issues raised above:
> - yes, it is true that an interface can exist in only one netns at a time.
>   But the same ip address can exist in multiple netns-es. If the
>   app wants to listen to a proper-subset of networks that go in/out
>   a single physical interface, you can use macvlan, and assign the
>   macvlans to the desired netns.

You can't use macvlan if you're using interfaces that don't have
MAC addresses, such as tun devices, 4G interfaces, and so on.

> - "same listening socket for multiple namespaces". Clearly that problem
>   also exists for the socket-marks approach. But again this can actually
>   be solved (for both netns and sock-marks) by having the application
>   set up separate sockets for each netns (netns or whatever) of interest,
>   and build an epoll fd over that set of sockets. No need for any kernel
>   code for this.

IMO forcing an application to open one socket per namespace, and
constantly be listening for any namespace changes as networks come and
go, is an unreasonable burden when all the application wants to do is
"accept connections on port 80 from everywhere". It's true that it
requires no kernel code, but you could also observe that no kernel
code is required for TCP, since it can all be done by the app using
raw sockets.

> My point is: what is the real networking construct that this use-case needs?
> Isn't it what routers describe as the VRF? If yes, then shouldn't
> we have one single way of supporting that in linux, instead of having
> a little-bit-here and a little-bit-there?

This is not a little-bit-here and a little-bit there. Socket marking
is an existing feature and this patch makes it work better when
different fwmarks identify different networks, without needing
namespace isolation, moving sockets between namespaces, etc.
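
As a small illustration of "different fwmarks identify different networks",
the sketch below builds one ip rule per network in the form used in the cover
letter (ip rule fwmark 17 lookup 100). The function name fwmark_rules, the
network names, and the mark/table numbers are hypothetical; the commands are
only printed, not executed:

```python
def fwmark_rules(networks):
    """Build one `ip rule` command per network from {name: (fwmark, table)}."""
    cmds = []
    for mark, table in sorted(networks.values()):
        cmds.append(["ip", "rule", "add", "fwmark", str(mark),
                     "lookup", str(table)])
    return cmds

# Hypothetical networks: each gets its own mark and routing table.
networks = {"wifi": (17, 100), "cell": (18, 101)}
for cmd in fwmark_rules(networks):
    print(" ".join(cmd))
```

The rule count grows with the number of networks, not with the number of
sockets or flows.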


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-13 15:28           ` Lorenzo Colitti
@ 2014-05-13 15:38             ` sowmini varadhan
  2014-05-13 16:09               ` Ben Greear
  0 siblings, 1 reply; 14+ messages in thread
From: sowmini varadhan @ 2014-05-13 15:38 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On Tue, May 13, 2014 at 11:28 AM, Lorenzo Colitti <lorenzo@google.com> wrote:

> You can't use macvlan if you're using interfaces that don't have
> MAC addresses, such as tun devices, 4G interfaces, and so on.

So to repeat, "what problem do you need to solve?" You indicated
that
" As described in the patch cover letter, one of the things I'm trying
  to do is have fwmarks select between multiple separate networks, which
  may be on multiple physical interfaces. I also want applications to be
  able to listen for connections from all networks using a single
  listening socket. I don't care about network isolation."

To do that, you just have to bind() the under-socket to the desired address,
no need for macvlans etc.

You'd only need macvlans to solve the rest of the problems in your
list, such as getting ping to work right, which involves network isolation,
aka vrf. (fwiw, the ping etc will JustWork with network-namespaces)

> IMO forcing an application to open one socket per namespace, and
> constantly be listening for any namespace changes as networks come and
> go, is an unreasonable burden when all the application wants to do is

your typical network app is going to have to open a netlink socket
and listen for state changes (intf events, address events etc) anyway.
And once you receive a packet on your wildcard socket, I suspect that
you need to do some extra work anyway to figure out which
vrf/netns/sock-mark/whatever it came in on.

But note that I'm not averse to optimizing some of this- it just
seems odd to me that you have vrfs-with-namespaces,
not-quite-vrfs-with-sock-marks,
multiple-routing-tables-for-pbr-but-we-are-not-vrfs etc.

ymmv.

--Sowmini


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-13 15:38             ` sowmini varadhan
@ 2014-05-13 16:09               ` Ben Greear
  0 siblings, 0 replies; 14+ messages in thread
From: Ben Greear @ 2014-05-13 16:09 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: Lorenzo Colitti, netdev, JP Abgrall, David Miller,
	Julian Anastasov, Hannes Frederic Sowa

On 05/13/2014 08:38 AM, sowmini varadhan wrote:
> On Tue, May 13, 2014 at 11:28 AM, Lorenzo Colitti <lorenzo@google.com> wrote:
> 
>> You can't use macvlan if you're using interfaces that don't have
>> MAC addresses, such as tun devices, 4G interfaces, and so on.
> 
> So to repeat, "what problem do you need to solve?" You indicated
> that
> " As described in the patch cover letter, one of the things I'm trying
>   to do is have fwmarks select between multiple separate networks, which
>   may be on multiple physical interfaces. I also want applications to be
>   able to listen for connections from all networks using a single
>   listening socket. I don't care about network isolation."

For what it's worth, we have had good luck doing something similar to the
marking rules with a private patch some years ago.  Then, we backed that out
and ran similar rules based on source/dest ports and IP addrs.  You can do these
routing tricks using standard kernels, but it is not very efficient because you
end up with hundreds or thousands of ip rules.

We found this approach much easier to deal with than using namespaces, though
when we started namespaces did not even exist so maybe we are biased.

If I ever have time to hack on this again, I'd like to optimize the ip rules
so that they can be bound to a specific interface so that the cost of having
lots of rules is not so high...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH 0/3] Make mark-based routing work better with multiple separate networks.
  2014-05-13 10:49         ` sowmini varadhan
  2014-05-13 15:28           ` Lorenzo Colitti
@ 2014-05-13 17:12           ` David Ahern
  1 sibling, 0 replies; 14+ messages in thread
From: David Ahern @ 2014-05-13 17:12 UTC (permalink / raw)
  To: sowmini varadhan, Lorenzo Colitti
  Cc: netdev, JP Abgrall, David Miller, Julian Anastasov,
	Hannes Frederic Sowa

On 5/13/14, 4:49 AM, sowmini varadhan wrote:
> On Mon, May 12, 2014 at 6:53 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
>> On Tue, May 13, 2014 at 6:09 AM, sowmini varadhan <sowmini05@gmail.com> wrote:
>>> http://lwn.net/Articles/407495/, a single
>>> process should be able to open sockets in different namespaces.
>>
>> Other things that you can't do with namespaces are have the same physical
>> interface (and the same IP address?) in two different namespaces, or
>> have the same listening socket in two different namespaces. Namespaces
>> are not a panacea.
>
> So this thread got unintentionally cut off by my not selecting Reply-All
> in the google gui.
>
> But to summarize a couple of private exchanges between Lorenzo and
> me, it still appears to me that the use-case here is what routers
> consider a "VRF". Thus it makes sense to add code (if/as needed)
> to fix the VRF support in linux, rather than adding yet-another-one-off
> feature with socket marking.
>
> Specifically addressing the two issues raised above:
> - yes, it is true that an interface can exist in only one netns at a time.
>    But the same ip address can exist in multiple netns-es. If the
>    app wants to listen to a proper-subset of networks that go in/out
>    a single physical interface, you can use macvlan, and assign the
>    macvlans to the desired netns.
> - "same listening socket for multiple namespaces". Clearly that problem
>    also exists for the socket-marks approach. But again this can actually
>    be solved (for both netns and sock-marks) by having the application
>    set up separate sockets for each netns (netns or whatever) of interest,
>    and build an epoll fd over that set of sockets. No need for any kernel
>    code for this.

using namespaces for VRFs has a number of problems:

1. It does not scale efficiently -- e.g., 1k VRFs.
    a. namespaces have high memory consumption. It depends on features
       enabled, but I see ~200kB/namespace. At 1024 namespaces that's a
       high memory hit.

    b. requiring separate processes/threads/sockets per namespace for a
       service to have a presence in each, i.e., the 'same listening
       socket for multiple namespaces' problem.

2. Complicates L2 apps which should be vrf agnostic.

3. Requires root (CAP_SYS_ADMIN) to use setns. If you go the
   thread/socket per namespace route, all of those processes need
   SYS_ADMIN capability, which is not the desired security posture.
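
For scale, the memory figure in 1a works out as follows. This is
back-of-the-envelope arithmetic only, taking the ~200kB/namespace estimate at
face value and assuming kB here means 1024 bytes:

```python
PER_NAMESPACE_KB = 200   # approximate per-namespace figure quoted above
NAMESPACES = 1024

total_kb = PER_NAMESPACE_KB * NAMESPACES
total_mib = total_kb / 1024   # assumes 1 kB = 1024 bytes
print(f"{total_kb} kB total, about {total_mib:.0f} MiB")
```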

>
>    Or you can optimize this by building infra in the kernel to support the
>    Wildcard ALL_VRFS notion. Or add even more code to support something
>    less than ALL_VRFS.
>
> My point is: what is the real networking construct that this use-case needs?
> Isn't it what routers describe as the VRF? If yes, then shouldn't
> we have one single way of supporting that in linux, instead of having
> a little-bit-here and a little-bit-there?

From a separation of resources perspective, why not have the
infrastructure kernel side that allows interfaces to be separated into
namespaces for isolation, and then within a namespace provide L3
abstractions that allow separate routing tables, neighbor caches, etc.,
i.e., a VRF abstraction within a network namespace? Allow apps to have a
listen socket that works across the VRFs in a namespace; connected
sockets are VRF based.

Nested network namespaces (which do not seem to work with 3.4 and 3.10
kernels) would provide that layering but still suffer from the problems
mentioned above.

David


Thread overview: 14+ messages
2014-05-09 17:36 [PATCH 0/3] Make mark-based routing work better with multiple separate networks Lorenzo Colitti
2014-05-09 17:36 ` [PATCH 1/3] net: add a sysctl to reflect the fwmark on replies Lorenzo Colitti
2014-05-09 17:37 ` [PATCH 2/3] net: Use fwmark reflection in PMTU discovery Lorenzo Colitti
2014-05-09 17:37 ` [PATCH 3/3] net: support marking accepting TCP sockets Lorenzo Colitti
2014-05-09 18:05   ` Eric Dumazet
2014-05-12 12:21 ` [PATCH 0/3] Make mark-based routing work better with multiple separate networks sowmini varadhan
2014-05-12 19:58   ` Lorenzo Colitti
2014-05-12 21:09     ` sowmini varadhan
2014-05-12 22:53       ` Lorenzo Colitti
2014-05-13 10:49         ` sowmini varadhan
2014-05-13 15:28           ` Lorenzo Colitti
2014-05-13 15:38             ` sowmini varadhan
2014-05-13 16:09               ` Ben Greear
2014-05-13 17:12           ` David Ahern
