netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 0/6] ipv4: Changes for rt_gateway
@ 2012-10-07 11:26 Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 1/6] ipv4: fix sending of redirects Julian Anastasov
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	This patchset fixes some problems for the routing caused
by the new rt_gateway semantics. What started as a fix for
IPVS-DR ended as fixes for more problems. To solve the IPVS
problem I decided to name the flag FLOWI_FLAG_KNOWN_NH, so that
we can even get route cached in FNHE or FIB NH.

	Different flag FLOWI_FLAG_RT_NOCACHE could be equally good
for IPVS, we again would be able to use data from fnhe but working
with cached routes should be preferred. If there is no FNHE, the
common case is IPVS to get uncached route, of course, IPVS caches
it itself.

	Patches 1-3 are fixes not related to IPVS problem,
4 and 5 add code that will be used by IPVS in patch 6.

Julian Anastasov (6):
  ipv4: fix sending of redirects
  ipv4: fix forwarding for strict source routes
  ipv4: add check if nh_pcpu_rth_output is allocated
  ipv4: introduce rt_uses_gateway
  ipv4: Add FLOWI_FLAG_KNOWN_NH
  ipvs: fix ARP resolving for direct routing mode

 include/net/flow.h              |    1 +
 include/net/route.h             |    3 +-
 net/ipv4/fib_frontend.c         |    3 +-
 net/ipv4/inet_connection_sock.c |    4 +-
 net/ipv4/ip_forward.c           |    2 +-
 net/ipv4/ip_output.c            |    4 +-
 net/ipv4/route.c                |   96 ++++++++++++++++++++++++---------------
 net/ipv4/xfrm4_policy.c         |    1 +
 net/netfilter/ipvs/ip_vs_xmit.c |    6 ++-
 9 files changed, 75 insertions(+), 45 deletions(-)

-- 
1.7.3.4

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net 1/6] ipv4: fix sending of redirects
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  2012-10-08 19:16   ` David Miller
  2012-10-07 11:26 ` [PATCH net 2/6] ipv4: fix forwarding for strict source routes Julian Anastasov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	After "Cache input routes in fib_info nexthops" (commit
d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
(commit 7a9bc9b81a) we can not send ICMP redirects. It seems we
should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
the same fib_info can be used for traffic that is not redirected,
eg. from other input devices or from sources that are not in same subnet.

	As result, we have to disable the caching of RTCF_DOREDIRECT
flag and to force source validation for the case when forwarding
traffic to the input device. If traffic comes from directly connected
source we allow redirection as it was done before both changes.

	After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
contain daddr instead of 0.0.0.0 when target is directly connected.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/fib_frontend.c |    3 ++-
 net/ipv4/route.c        |   28 +++++++++++++++-------------
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 68c93d1..6dacae6 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -322,7 +322,8 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 {
 	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
 
-	if (!r && !fib_num_tclassid_users(dev_net(dev))) {
+	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
+	    dev->ifindex != oif) {
 		*itag = 0;
 		return 0;
 	}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ff62206..488a8bb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -802,7 +802,8 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 	net = dev_net(rt->dst.dev);
 	peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
 	if (!peer) {
-		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, rt->rt_gateway);
+		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
+			  rt->rt_gateway ? : ip_hdr(skb)->daddr);
 		return;
 	}
 
@@ -827,7 +828,9 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 	    time_after(jiffies,
 		       (peer->rate_last +
 			(ip_rt_redirect_load << peer->rate_tokens)))) {
-		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, rt->rt_gateway);
+		__be32 gw = rt->rt_gateway ? : ip_hdr(skb)->daddr;
+
+		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
 		peer->rate_last = jiffies;
 		++peer->rate_tokens;
 #ifdef CONFIG_IP_ROUTE_VERBOSE
@@ -835,7 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 		    peer->rate_tokens == ip_rt_redirect_number)
 			net_warn_ratelimited("host %pI4/if%d ignores redirects for %pI4 to %pI4\n",
 					     &ip_hdr(skb)->saddr, inet_iif(skb),
-					     &ip_hdr(skb)->daddr, &rt->rt_gateway);
+					     &ip_hdr(skb)->daddr, &gw);
 #endif
 	}
 out_put_peer:
@@ -1439,10 +1442,13 @@ static int __mkroute_input(struct sk_buff *skb,
 		goto cleanup;
 	}
 
+	do_cache = res->fi && !itag;
 	if (out_dev == in_dev && err &&
 	    (IN_DEV_SHARED_MEDIA(out_dev) ||
-	     inet_addr_onlink(out_dev, saddr, FIB_RES_GW(*res))))
+	     inet_addr_onlink(out_dev, saddr, FIB_RES_GW(*res)))) {
 		flags |= RTCF_DOREDIRECT;
+		do_cache = false;
+	}
 
 	if (skb->protocol != htons(ETH_P_IP)) {
 		/* Not IP (i.e. ARP). Do not create route, if it is
@@ -1459,15 +1465,11 @@ static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
-	do_cache = false;
-	if (res->fi) {
-		if (!itag) {
-			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
-			if (rt_cache_valid(rth)) {
-				skb_dst_set_noref(skb, &rth->dst);
-				goto out;
-			}
-			do_cache = true;
+	if (do_cache) {
+		rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+		if (rt_cache_valid(rth)) {
+			skb_dst_set_noref(skb, &rth->dst);
+			goto out;
 		}
 	}
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 2/6] ipv4: fix forwarding for strict source routes
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 1/6] ipv4: fix sending of redirects Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated Julian Anastasov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) rt_gateway can be 0 but ip_forward() compares
it directly with nexthop. What we want here is to check if traffic
is to directly connected nexthop and to fail if using gateway.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/ip_forward.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index ab09b12..7f35ac2 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,7 +85,7 @@ int ip_forward(struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 
-	if (opt->is_strictroute && opt->nexthop != rt->rt_gateway)
+	if (opt->is_strictroute && rt->rt_gateway)
 		goto sr_failed;
 
 	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 1/6] ipv4: fix sending of redirects Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 2/6] ipv4: fix forwarding for strict source routes Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  2012-10-07 13:34   ` Eric Dumazet
  2012-10-08 19:17   ` David Miller
  2012-10-07 11:26 ` [PATCH net 4/6] ipv4: introduce rt_uses_gateway Julian Anastasov
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	Avoid NULL ptr dereference and caching if
nh_pcpu_rth_output is not allocated.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/route.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 488a8bb..0a600cc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1798,18 +1798,24 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	fnhe = NULL;
 	if (fi) {
 		struct rtable __rcu **prth;
+		struct fib_nh *nh = &FIB_RES_NH(*res);
 
-		fnhe = find_exception(&FIB_RES_NH(*res), fl4->daddr);
+		fnhe = find_exception(nh, fl4->daddr);
 		if (fnhe)
 			prth = &fnhe->fnhe_rth;
-		else
-			prth = __this_cpu_ptr(FIB_RES_NH(*res).nh_pcpu_rth_output);
+		else {
+			if (!nh->nh_pcpu_rth_output)
+				goto add;
+			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
+		}
 		rth = rcu_dereference(*prth);
 		if (rt_cache_valid(rth)) {
 			dst_hold(&rth->dst);
 			return rth;
 		}
 	}
+
+add:
 	rth = rt_dst_alloc(dev_out,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(in_dev, NOXFRM),
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 4/6] ipv4: introduce rt_uses_gateway
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
                   ` (2 preceding siblings ...)
  2012-10-07 11:26 ` [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 5/6] ipv4: Add FLOWI_FLAG_KNOWN_NH Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 6/6] ipvs: fix ARP resolving for direct routing mode Julian Anastasov
  5 siblings, 0 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	Add new flag to remember when route is via gateway.
We will use it to allow rt_gateway to contain address of
directly connected host for the cases when DST_NOCACHE is
used or when the NH exception caches per-destination route
without DST_NOCACHE flag, i.e. when routes are not used for
other destinations. By this way we force the neighbour
resolving to work with the routed destination but we
can use different address in the packet, feature needed
for IPVS-DR where original packet for virtual IP is routed
via route to real IP.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/route.h             |    3 +-
 net/ipv4/inet_connection_sock.c |    4 +-
 net/ipv4/ip_forward.c           |    2 +-
 net/ipv4/ip_output.c            |    4 +-
 net/ipv4/route.c                |   45 +++++++++++++++++++++-----------------
 net/ipv4/xfrm4_policy.c         |    1 +
 6 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index da22243..bc40b63 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -48,7 +48,8 @@ struct rtable {
 	int			rt_genid;
 	unsigned int		rt_flags;
 	__u16			rt_type;
-	__u16			rt_is_input;
+	__u8			rt_is_input;
+	__u8			rt_uses_gateway;
 
 	int			rt_iif;
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f0c5b9c..d34ce29 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -406,7 +406,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	rt = ip_route_output_flow(net, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->opt.is_strictroute && rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto route_err;
 	return &rt->dst;
 
@@ -442,7 +442,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 	rt = ip_route_output_flow(net, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->opt.is_strictroute && rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto route_err;
 	rcu_read_unlock();
 	return &rt->dst;
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 7f35ac2..694de3b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,7 +85,7 @@ int ip_forward(struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 
-	if (opt->is_strictroute && rt->rt_gateway)
+	if (opt->is_strictroute && rt->rt_uses_gateway)
 		goto sr_failed;
 
 	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 24a29a3..6537a40 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -193,7 +193,7 @@ static inline int ip_finish_output2(struct sk_buff *skb)
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt->rt_gateway ? rt->rt_gateway : ip_hdr(skb)->daddr;
+	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
 	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
 		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
@@ -371,7 +371,7 @@ int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
 	skb_dst_set_noref(skb, &rt->dst);
 
 packet_routed:
-	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_gateway)
+	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto no_route;
 
 	/* OK, we know where to send it, allocate and build IP header. */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0a600cc..eaf9575 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1123,7 +1123,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 	mtu = dst->dev->mtu;
 
 	if (unlikely(dst_metric_locked(dst, RTAX_MTU))) {
-		if (rt->rt_gateway && mtu > 576)
+		if (rt->rt_uses_gateway && mtu > 576)
 			mtu = 576;
 	}
 
@@ -1174,7 +1174,9 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 		if (fnhe->fnhe_gw) {
 			rt->rt_flags |= RTCF_REDIRECTED;
 			rt->rt_gateway = fnhe->fnhe_gw;
-		}
+			rt->rt_uses_gateway = 1;
+		} else if (!rt->rt_gateway)
+			rt->rt_gateway = daddr;
 
 		orig = rcu_dereference(fnhe->fnhe_rth);
 		rcu_assign_pointer(fnhe->fnhe_rth, rt);
@@ -1183,13 +1185,6 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 
 		fnhe->fnhe_stamp = jiffies;
 		ret = true;
-	} else {
-		/* Routes we intend to cache in nexthop exception have
-		 * the DST_NOCACHE bit clear.  However, if we are
-		 * unsuccessful at storing this route into the cache
-		 * we really need to set it.
-		 */
-		rt->dst.flags |= DST_NOCACHE;
 	}
 	spin_unlock_bh(&fnhe_lock);
 
@@ -1215,13 +1210,7 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt)
 		if (orig)
 			rt_free(orig);
 	} else {
-		/* Routes we intend to cache in the FIB nexthop have
-		 * the DST_NOCACHE bit clear.  However, if we are
-		 * unsuccessful at storing this route into the cache
-		 * we really need to set it.
-		 */
 nocache:
-		rt->dst.flags |= DST_NOCACHE;
 		ret = false;
 	}
 
@@ -1284,8 +1273,10 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 	if (fi) {
 		struct fib_nh *nh = &FIB_RES_NH(*res);
 
-		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK)
+		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) {
 			rt->rt_gateway = nh->nh_gw;
+			rt->rt_uses_gateway = 1;
+		}
 		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
@@ -1294,8 +1285,18 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			cached = rt_bind_exception(rt, fnhe, daddr);
 		else if (!(rt->dst.flags & DST_NOCACHE))
 			cached = rt_cache_route(nh, rt);
-	}
-	if (unlikely(!cached))
+		if (unlikely(!cached)) {
+			/* Routes we intend to cache in nexthop exception or
+			 * FIB nexthop have the DST_NOCACHE bit clear.
+			 * However, if we are unsuccessful at storing this
+			 * route into the cache we really need to set it.
+			 */
+			rt->dst.flags |= DST_NOCACHE;
+			if (!rt->rt_gateway)
+				rt->rt_gateway = daddr;
+			rt_add_uncached_list(rt);
+		}
+	} else
 		rt_add_uncached_list(rt);
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
@@ -1363,6 +1364,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (our) {
 		rth->dst.input= ip_local_deliver;
@@ -1432,7 +1434,6 @@ static int __mkroute_input(struct sk_buff *skb,
 		return -EINVAL;
 	}
 
-
 	err = fib_validate_source(skb, saddr, daddr, tos, FIB_RES_OIF(*res),
 				  in_dev->dev, in_dev, &itag);
 	if (err < 0) {
@@ -1488,6 +1489,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_iif 	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 
 	rth->dst.input = ip_forward;
@@ -1658,6 +1660,7 @@ local_input:
 	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (res.type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
@@ -1832,6 +1835,7 @@ add:
 	rth->rt_iif	= orig_oif ? : 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway = 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 
 	RT_CACHE_STAT_INC(out_slow_tot);
@@ -2110,6 +2114,7 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		rt->rt_flags = ort->rt_flags;
 		rt->rt_type = ort->rt_type;
 		rt->rt_gateway = ort->rt_gateway;
+		rt->rt_uses_gateway = ort->rt_uses_gateway;
 
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
@@ -2188,7 +2193,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
 		if (nla_put_be32(skb, RTA_PREFSRC, fl4->saddr))
 			goto nla_put_failure;
 	}
-	if (rt->rt_gateway &&
+	if (rt->rt_uses_gateway &&
 	    nla_put_be32(skb, RTA_GATEWAY, rt->rt_gateway))
 		goto nla_put_failure;
 
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 681ea2f..05c5ab8 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -91,6 +91,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 					      RTCF_LOCAL);
 	xdst->u.rt.rt_type = rt->rt_type;
 	xdst->u.rt.rt_gateway = rt->rt_gateway;
+	xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway;
 	xdst->u.rt.rt_pmtu = rt->rt_pmtu;
 	INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 5/6] ipv4: Add FLOWI_FLAG_KNOWN_NH
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
                   ` (3 preceding siblings ...)
  2012-10-07 11:26 ` [PATCH net 4/6] ipv4: introduce rt_uses_gateway Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  2012-10-07 11:26 ` [PATCH net 6/6] ipvs: fix ARP resolving for direct routing mode Julian Anastasov
  5 siblings, 0 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	Add flag to request that output route should be
returned with known rt_gateway, in case we want to use
it as nexthop for neighbour resolving.

	The returned route can be cached as follows:

- in NH exception: because the cached routes are not shared
	with other destinations
- in FIB NH: when using gateway because all destinations for
	NH share same gateway

	As last option, to return rt_gateway!=0 we have to
set DST_NOCACHE.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/flow.h |    1 +
 net/ipv4/route.c   |   11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index e1dd508..628e11b 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -21,6 +21,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_CAN_SLEEP		0x02
+#define FLOWI_FLAG_KNOWN_NH		0x04
 	__u32	flowic_secid;
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index eaf9575..7f20b6e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1763,6 +1763,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	struct in_device *in_dev;
 	u16 type = res->type;
 	struct rtable *rth;
+	bool do_cache;
 
 	in_dev = __in_dev_get_rcu(dev_out);
 	if (!in_dev)
@@ -1799,6 +1800,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	fnhe = NULL;
+	do_cache = fi != NULL;
 	if (fi) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);
@@ -1807,6 +1809,13 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		if (fnhe)
 			prth = &fnhe->fnhe_rth;
 		else {
+			if (unlikely(fl4->flowi4_flags &
+				     FLOWI_FLAG_KNOWN_NH &&
+				     !(nh->nh_gw &&
+				       nh->nh_scope == RT_SCOPE_LINK))) {
+				do_cache = false;
+				goto add;
+			}
 			if (!nh->nh_pcpu_rth_output)
 				goto add;
 			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
@@ -1822,7 +1831,7 @@ add:
 	rth = rt_dst_alloc(dev_out,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(in_dev, NOXFRM),
-			   fi);
+			   do_cache);
 	if (!rth)
 		return ERR_PTR(-ENOBUFS);
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 6/6] ipvs: fix ARP resolving for direct routing mode
  2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
                   ` (4 preceding siblings ...)
  2012-10-07 11:26 ` [PATCH net 5/6] ipv4: Add FLOWI_FLAG_KNOWN_NH Julian Anastasov
@ 2012-10-07 11:26 ` Julian Anastasov
  5 siblings, 0 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	After the change "Make neigh lookups directly in output packet path"
(commit a263b30936) IPVS can not reach the real server for DR mode
because we resolve the destination address from IP header, not from
route neighbour. Use the new FLOWI_FLAG_KNOWN_NH flag to request
output routes with known nexthop, so that it has preference
on resolving.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/netfilter/ipvs/ip_vs_xmit.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 56f6d5d..cc4c809 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -50,6 +50,7 @@ enum {
 				      * local
 				      */
 	IP_VS_RT_MODE_CONNECT	= 8, /* Always bind route to saddr */
+	IP_VS_RT_MODE_KNOWN_NH	= 16,/* Route via remote addr */
 };
 
 /*
@@ -113,6 +114,8 @@ static struct rtable *do_output_route4(struct net *net, __be32 daddr,
 	fl4.daddr = daddr;
 	fl4.saddr = (rt_mode & IP_VS_RT_MODE_CONNECT) ? *saddr : 0;
 	fl4.flowi4_tos = rtos;
+	fl4.flowi4_flags = (rt_mode & IP_VS_RT_MODE_KNOWN_NH) ?
+			   FLOWI_FLAG_KNOWN_NH : 0;
 
 retry:
 	rt = ip_route_output_key(net, &fl4);
@@ -1061,7 +1064,8 @@ ip_vs_dr_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	if (!(rt = __ip_vs_get_out_rt(skb, cp->dest, cp->daddr.ip,
 				      RT_TOS(iph->tos),
 				      IP_VS_RT_MODE_LOCAL |
-					IP_VS_RT_MODE_NON_LOCAL, NULL)))
+				      IP_VS_RT_MODE_NON_LOCAL |
+				      IP_VS_RT_MODE_KNOWN_NH, NULL)))
 		goto tx_error_icmp;
 	if (rt->rt_flags & RTCF_LOCAL) {
 		ip_rt_put(rt);
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
  2012-10-07 11:26 ` [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated Julian Anastasov
@ 2012-10-07 13:34   ` Eric Dumazet
  2012-10-07 17:24     ` Julian Anastasov
  2012-10-08 19:17   ` David Miller
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-10-07 13:34 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev

On Sun, 2012-10-07 at 14:26 +0300, Julian Anastasov wrote:
> 	Avoid NULL ptr dereference and caching if
> nh_pcpu_rth_output is not allocated.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
>  net/ipv4/route.c |   12 +++++++++---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 488a8bb..0a600cc 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1798,18 +1798,24 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>  	fnhe = NULL;
>  	if (fi) {
>  		struct rtable __rcu **prth;
> +		struct fib_nh *nh = &FIB_RES_NH(*res);
>  
> -		fnhe = find_exception(&FIB_RES_NH(*res), fl4->daddr);
> +		fnhe = find_exception(nh, fl4->daddr);
>  		if (fnhe)
>  			prth = &fnhe->fnhe_rth;
> -		else
> -			prth = __this_cpu_ptr(FIB_RES_NH(*res).nh_pcpu_rth_output);
> +		else {
> +			if (!nh->nh_pcpu_rth_output)
> +				goto add;
> +			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
> +		}
>  		rth = rcu_dereference(*prth);
>  		if (rt_cache_valid(rth)) {
>  			dst_hold(&rth->dst);
>  			return rth;
>  		}
>  	}
> +
> +add:
>  	rth = rt_dst_alloc(dev_out,
>  			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
>  			   IN_DEV_CONF_GET(in_dev, NOXFRM),

Alternative would be to make sure the allocation succeeded in
fib_create_info(), but I have no idea on the maximal number of fib_info
a complex routing setup might need ?

I guess a typical machine needs less than 30 fib_info...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
  2012-10-07 13:34   ` Eric Dumazet
@ 2012-10-07 17:24     ` Julian Anastasov
  0 siblings, 0 replies; 13+ messages in thread
From: Julian Anastasov @ 2012-10-07 17:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


	Hello,

On Sun, 7 Oct 2012, Eric Dumazet wrote:

> On Sun, 2012-10-07 at 14:26 +0300, Julian Anastasov wrote:
> > 	Avoid NULL ptr dereference and caching if
> > nh_pcpu_rth_output is not allocated.
> > 
> > Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > ---
> >  net/ipv4/route.c |   12 +++++++++---
> >  1 files changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 488a8bb..0a600cc 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1798,18 +1798,24 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
> >  	fnhe = NULL;
> >  	if (fi) {
> >  		struct rtable __rcu **prth;
> > +		struct fib_nh *nh = &FIB_RES_NH(*res);
> >  
> > -		fnhe = find_exception(&FIB_RES_NH(*res), fl4->daddr);
> > +		fnhe = find_exception(nh, fl4->daddr);
> >  		if (fnhe)
> >  			prth = &fnhe->fnhe_rth;
> > -		else
> > -			prth = __this_cpu_ptr(FIB_RES_NH(*res).nh_pcpu_rth_output);
> > +		else {
> > +			if (!nh->nh_pcpu_rth_output)
> > +				goto add;
> > +			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
> > +		}
> >  		rth = rcu_dereference(*prth);
> >  		if (rt_cache_valid(rth)) {
> >  			dst_hold(&rth->dst);
> >  			return rth;
> >  		}
> >  	}
> > +
> > +add:
> >  	rth = rt_dst_alloc(dev_out,
> >  			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
> >  			   IN_DEV_CONF_GET(in_dev, NOXFRM),
> 
> Alternative would be to make sure the allocation succeeded in
> fib_create_info(), but I have no idea on the maximal number of fib_info
> a complex routing setup might need ?

	I too preferred not to touch there because I don't
know how much we can want from alloc_percpu.

> I guess a typical machine needs less than 30 fib_info...

	May be, I guess it all depends on the present primary
addresses, gateways and devices. Thousands of routes 
do not look so scary because they will use small number
of fib_infos.

	But it is a problem that we allocate memory
that may never be used, there is no chance to expire it.
Not sure how good is the idea to create fib_infos without
percpu allocations and the first traffic to notify some
thread to attach such arrays? We can safe memory if the
local/broadcast fib_infos are not used for output routes.
But such idea will cost many checks in fast path.
Another hack could be to use 1 entry for local/broadcast/mcast
and array for unicast. Type RTN_BROADCAST is actually not
cached at all, so we do not need to allocate array.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/6] ipv4: fix sending of redirects
  2012-10-07 11:26 ` [PATCH net 1/6] ipv4: fix sending of redirects Julian Anastasov
@ 2012-10-08 19:16   ` David Miller
  2012-10-08 20:43     ` Julian Anastasov
  0 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2012-10-08 19:16 UTC (permalink / raw)
  To: ja; +Cc: netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Sun,  7 Oct 2012 14:26:03 +0300

> @@ -322,7 +322,8 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
>  {
>  	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
>  
> -	if (!r && !fib_num_tclassid_users(dev_net(dev))) {
> +	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
> +	    dev->ifindex != oif) {
>  		*itag = 0;
>  		return 0;
>  	}

Hmmm, won't this cause the slow path to be taken for locally
destined traffic?

> +			  rt->rt_gateway ? : ip_hdr(skb)->daddr);

Please use the rt_nexthop() helper.

> +		__be32 gw = rt->rt_gateway ? : ip_hdr(skb)->daddr;

Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
  2012-10-07 11:26 ` [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated Julian Anastasov
  2012-10-07 13:34   ` Eric Dumazet
@ 2012-10-08 19:17   ` David Miller
  1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2012-10-08 19:17 UTC (permalink / raw)
  To: ja; +Cc: netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Sun,  7 Oct 2012 14:26:05 +0300

> 	Avoid NULL ptr dereference and caching if
> nh_pcpu_rth_output is not allocated.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Ugh.

Checking for alloc_percpu() failures in fib_create_info() is 1,000
times better than doing stuff like this in the fast path.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/6] ipv4: fix sending of redirects
  2012-10-08 20:43     ` Julian Anastasov
@ 2012-10-08 20:41       ` David Miller
  0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2012-10-08 20:41 UTC (permalink / raw)
  To: ja; +Cc: netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Mon, 8 Oct 2012 23:43:45 +0300 (EEST)

> On Mon, 8 Oct 2012, David Miller wrote:
> 
>> From: Julian Anastasov <ja@ssi.bg>
>> Date: Sun,  7 Oct 2012 14:26:03 +0300
>> 
>> > @@ -322,7 +322,8 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
>> >  {
>> >  	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
>> >  
>> > -	if (!r && !fib_num_tclassid_users(dev_net(dev))) {
>> > +	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
>> > +	    dev->ifindex != oif) {
>> >  		*itag = 0;
>> >  		return 0;
>> >  	}
>> 
>> Hmmm, won't this cause the slow path to be taken for locally
>> destined traffic?
> 
> 	In this case idev=eth0 and oif=lo. The only case
> where we can see same input and output device is for
> forwarding and loopback (but only output routes where
> there is no such validation).

Ok, now I understand.  This added condition is fine.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/6] ipv4: fix sending of redirects
  2012-10-08 19:16   ` David Miller
@ 2012-10-08 20:43     ` Julian Anastasov
  2012-10-08 20:41       ` David Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Julian Anastasov @ 2012-10-08 20:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


	Hello,

On Mon, 8 Oct 2012, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Sun,  7 Oct 2012 14:26:03 +0300
> 
> > @@ -322,7 +322,8 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> >  {
> >  	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
> >  
> > -	if (!r && !fib_num_tclassid_users(dev_net(dev))) {
> > +	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
> > +	    dev->ifindex != oif) {
> >  		*itag = 0;
> >  		return 0;
> >  	}
> 
> Hmmm, won't this cause the slow path to be taken for locally
> destined traffic?

	In this case idev=eth0 and oif=lo. The only case
where we can see same input and output device is for
forwarding and loopback (but only output routes where
there is no such validation).

	Of course, it slows down on traffic that ignores
our redirects. We can save some cycles if IN_DEV_TX_REDIRECTS
is checked before setting RTCF_DOREDIRECT:

	(dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev)) {

	RTCF_DOREDIRECT is used just to call ip_rt_send_redirect
where all depends on IN_DEV_TX_REDIRECTS.

	The only difference can be that user space will
not see RTCF_DOREDIRECT flag if IN_DEV_TX_REDIRECTS is false.
But may be it is better to show "redirect" in ip route
only when IN_DEV_TX_REDIRECTS are enabled. The old way
is preferred only when there is routing cache and changes
in IN_DEV_TX_REDIRECTS do not flush cache in devinet_conf_proc().

> > +			  rt->rt_gateway ? : ip_hdr(skb)->daddr);
> 
> Please use the rt_nexthop() helper.
> 
> > +		__be32 gw = rt->rt_gateway ? : ip_hdr(skb)->daddr;

	Good idea.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-10-08 20:42 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-07 11:26 [PATCH net 0/6] ipv4: Changes for rt_gateway Julian Anastasov
2012-10-07 11:26 ` [PATCH net 1/6] ipv4: fix sending of redirects Julian Anastasov
2012-10-08 19:16   ` David Miller
2012-10-08 20:43     ` Julian Anastasov
2012-10-08 20:41       ` David Miller
2012-10-07 11:26 ` [PATCH net 2/6] ipv4: fix forwarding for strict source routes Julian Anastasov
2012-10-07 11:26 ` [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated Julian Anastasov
2012-10-07 13:34   ` Eric Dumazet
2012-10-07 17:24     ` Julian Anastasov
2012-10-08 19:17   ` David Miller
2012-10-07 11:26 ` [PATCH net 4/6] ipv4: introduce rt_uses_gateway Julian Anastasov
2012-10-07 11:26 ` [PATCH net 5/6] ipv4: Add FLOWI_FLAG_KNOWN_NH Julian Anastasov
2012-10-07 11:26 ` [PATCH net 6/6] ipvs: fix ARP resolving for direct routing mode Julian Anastasov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).