* [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops. @ 2012-07-12 17:47 David Miller 2012-07-13 0:52 ` Vijay Subramanian 0 siblings, 1 reply; 5+ messages in thread From: David Miller @ 2012-07-12 17:47 UTC (permalink / raw) To: netdev Signed-off-by: David S. Miller <davem@davemloft.net> --- If you really feel like playing with fire, try this patch on top of the routing cache removal patches. It gets the output route lookup down to 888 cycles for me. Something is flaky about it, when I ssh into my test system for the first time after a boot there is a strange delay of some sort. It's as if the SYN-ACK is dropped on the way out of the test machine, and my desktop has to retry the initial SYN. I plan to investigate this after some sleep. diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index e91fedd..d133110 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -45,6 +45,7 @@ struct fib_config { }; struct fib_info; +struct rtable; struct fib_nh { struct net_device *nh_dev; @@ -63,6 +64,8 @@ struct fib_nh { __be32 nh_gw; __be32 nh_saddr; int nh_saddr_genid; + + struct rtable *rth; }; /* diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index d71bfbd..d1240a0 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -148,6 +148,8 @@ static void free_fib_info_rcu(struct rcu_head *head) change_nexthops(fi) { if (nexthop_nh->nh_dev) dev_put(nexthop_nh->nh_dev); + if (nexthop_nh->rth) + dst_release(&nexthop_nh->rth->dst); } endfor_nexthops(fi); release_net(fi->fib_net); diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c4b2df6..53b006a 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -983,6 +983,8 @@ static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *fl4, #ifdef CONFIG_IP_ROUTE_CLASSID rt->dst.tclassid = FIB_RES_NH(*res).nh_tclassid; #endif + FIB_RES_NH(*res).rth = rt; + dst_clone(&rt->dst); } #ifdef CONFIG_IP_ROUTE_CLASSID @@ -1468,6 +1470,13 @@ static struct rtable *__mkroute_output(const struct fib_result *res, fi = NULL; } + if (fi) { + rth = FIB_RES_NH(*res).rth; + if (rth) { + dst_use(&rth->dst, jiffies); + return rth; + } + } rth = rt_dst_alloc(dev_out, IN_DEV_CONF_GET(in_dev, NOPOLICY), IN_DEV_CONF_GET(in_dev, NOXFRM)); ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops. 2012-07-12 17:47 [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops David Miller @ 2012-07-13 0:52 ` Vijay Subramanian 2012-07-13 1:06 ` David Miller 2012-07-13 11:10 ` David Miller 0 siblings, 2 replies; 5+ messages in thread From: Vijay Subramanian @ 2012-07-13 0:52 UTC (permalink / raw) To: David Miller, netdev > > Something is flaky about it, when I ssh into my test system > for the first time after a boot there is a strange delay of > some sort. It's as if the SYN-ACK is dropped on the way out > of the test machine, and my desktop has to retry the initial > SYN. I plan to investigate this after some sleep. Dave, I applied these patches and got the same symptoms. It takes a long time for ssh to work right after boot but it starts working after about a minute. (I am sshing into the machine with the patches applied). My knowledge of routing code is rudimentary but I traced the following. I think this is because the SYN packets do not even reach the TCP handler. It looks like ip_route_input_slow() sets the dst.input function to ip_error(). The code path I saw was as follows: ip_rcv_finish() eventually calls ip_route_input_slow() wherein fib_lookup() initially returns a res.type of RTN_UNICAST (why not RTN_LOCAL?). However, the following code if (!IN_DEV_FORWARD(in_dev)) goto no_route; is executed and sets the res.type to RTN_UNREACHABLE. After the jump to local_input, rth->dst.input is set first to ip_local_deliver() but again to ip_error(). Due to this, the SYN packet does not even make it to ip_local_deliver and so the TCP handler is never called. I did not get a chance to see why it suddenly starts working. Hope this helps. I will dig around more. Thanks, Vijay ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops. 2012-07-13 0:52 ` Vijay Subramanian @ 2012-07-13 1:06 ` David Miller 2012-07-13 11:10 ` David Miller 1 sibling, 0 replies; 5+ messages in thread From: David Miller @ 2012-07-13 1:06 UTC (permalink / raw) To: subramanian.vijay; +Cc: netdev From: Vijay Subramanian <subramanian.vijay@gmail.com> Date: Thu, 12 Jul 2012 17:52:54 -0700 > ip_rcv_finish() eventually calls ip_route_input_slow() wherein > fib_lookup() initially returns a res.type of RTN_UNICAST (why not > RTN_LOCAL?). Strange, since only this specific patch causes the problem there must be some issue with the size of struct fib_info or something like that triggering the problem. Something might be corrupted in the nexthops. I wonder if there is some uninitialized data somewhere, or something silly like that. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops. 2012-07-13 0:52 ` Vijay Subramanian 2012-07-13 1:06 ` David Miller @ 2012-07-13 11:10 ` David Miller 2012-07-13 20:03 ` Vijay Subramanian 1 sibling, 1 reply; 5+ messages in thread From: David Miller @ 2012-07-13 11:10 UTC (permalink / raw) To: subramanian.vijay; +Cc: netdev From: Vijay Subramanian <subramanian.vijay@gmail.com> Date: Thu, 12 Jul 2012 17:52:54 -0700 > I did not get a chance to see why it suddenly starts working. Hope > this helps. I will dig around more. The problem is the setting of ->rt_gateway for local subnet routes. In order for this to work with fib_info cached routes we have to convert rt_gateway to be set to zero when there is no explicit nexthop (local subnet) and to the non-zero gateway address otherwise. This would be applied right before the "DANGER" patch: diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index c38293f..672d6f3 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -476,7 +476,8 @@ int arp_find(unsigned char *haddr, struct sk_buff *skb) } paddr = skb_rtable(skb)->rt_gateway; - + if (!paddr) + paddr = ip_hdr(skb)->daddr; if (arp_set_predefined(inet_addr_type(dev_net(dev), paddr), haddr, paddr, dev)) return 0; diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 76825be..18f9854 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -389,7 +389,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk, rt = ip_route_output_flow(net, fl4, sk); if (IS_ERR(rt)) goto no_route; - if (opt && opt->opt.is_strictroute && fl4->daddr != rt->rt_gateway) + if (opt && opt->opt.is_strictroute && rt->rt_gateway) goto route_err; return &rt->dst; @@ -422,7 +422,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk, rt = ip_route_output_flow(net, fl4, sk); if (IS_ERR(rt)) goto no_route; - if (opt && opt->opt.is_strictroute && fl4->daddr != rt->rt_gateway) + if (opt && opt->opt.is_strictroute && rt->rt_gateway) goto route_err; return &rt->dst; diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 0c31235..5b77c2c 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -767,6 +767,8 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev if (skb->protocol == htons(ETH_P_IP)) { rt = skb_rtable(skb); dst = rt->rt_gateway; + if (!dst) + dst = old_iph->daddr; } #if IS_ENABLED(CONFIG_IPV6) else if (skb->protocol == htons(ETH_P_IPV6)) { diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index cc52679..6b805e0 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -371,7 +371,7 @@ int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl) skb_dst_set_noref(skb, &rt->dst); packet_routed: - if (inet_opt && inet_opt->opt.is_strictroute && fl4->daddr != rt->rt_gateway) + if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_gateway) goto no_route; /* OK, we know where to send it, allocate and build IP header. */ diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index c2d0e6d..095fec0 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -488,6 +488,8 @@ static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev) goto tx_error; } dst = rt->rt_gateway; + if (!dst) + dst = old_iph->daddr; } rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, diff --git a/net/ipv4/netfilter/ipt_MASQUERADE.c b/net/ipv4/netfilter/ipt_MASQUERADE.c index 2f210c7..b99746b 100644 --- a/net/ipv4/netfilter/ipt_MASQUERADE.c +++ b/net/ipv4/netfilter/ipt_MASQUERADE.c @@ -52,7 +52,7 @@ masquerade_tg(struct sk_buff *skb, const struct xt_action_param *par) struct nf_nat_ipv4_range newrange; const struct nf_nat_ipv4_multi_range_compat *mr; const struct rtable *rt; - __be32 newsrc; + __be32 newsrc, nh; NF_CT_ASSERT(par->hooknum == NF_INET_POST_ROUTING); @@ -70,7 +70,10 @@ masquerade_tg(struct sk_buff *skb, const struct xt_action_param *par) mr = par->targinfo; rt = skb_rtable(skb); - newsrc = inet_select_addr(par->out, rt->rt_gateway, RT_SCOPE_UNIVERSE); + nh = rt->rt_gateway; + if (!nh) + nh = ip_hdr(skb)->daddr; + newsrc = inet_select_addr(par->out, nh, RT_SCOPE_UNIVERSE); if (!newsrc) { pr_info("%s ate my IP address\n", par->out->name); return NF_DROP; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c4b2df6..f594e4a 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -904,8 +904,10 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt) if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0) src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res); else - src = inet_select_addr(rt->dst.dev, rt->rt_gateway, - RT_SCOPE_UNIVERSE); + src = inet_select_addr(rt->dst.dev, (rt->rt_gateway ? + rt->rt_gateway : + iph->daddr), + RT_SCOPE_UNIVERSE); rcu_read_unlock(); } memcpy(addr, &src, 4); @@ -951,7 +953,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst) mtu = dst->dev->mtu; if (unlikely(dst_metric_locked(dst, RTAX_MTU))) { - if (rt->rt_gateway != 0 && mtu > 576) + if (rt->rt_gateway && mtu > 576) mtu = 576; } @@ -1050,7 +1052,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, rth->rt_iif = dev->ifindex; rth->rt_oif = 0; rth->rt_pmtu = 0; - rth->rt_gateway = daddr; + rth->rt_gateway = 0; rth->fi = NULL; if (our) { rth->dst.input= ip_local_deliver; @@ -1168,7 +1170,7 @@ static int __mkroute_input(struct sk_buff *skb, rth->rt_iif = in_dev->dev->ifindex; rth->rt_oif = 0; rth->rt_pmtu = 0; - rth->rt_gateway = daddr; + rth->rt_gateway = 0; rth->fi = NULL; rth->dst.input = ip_forward; @@ -1333,7 +1335,7 @@ local_input: rth->rt_iif = dev->ifindex; rth->rt_oif = 0; rth->rt_pmtu = 0; - rth->rt_gateway = daddr; + rth->rt_gateway = 0; rth->fi = NULL; if (res.type == RTN_UNREACHABLE) { rth->dst.input= ip_error; @@ -1483,7 +1485,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res, rth->rt_iif = orig_oif ? : dev_out->ifindex; rth->rt_oif = orig_oif; rth->rt_pmtu = 0; - rth->rt_gateway = fl4->daddr; + rth->rt_gateway = 0; rth->fi = NULL; RT_CACHE_STAT_INC(out_slow_tot); @@ -1845,7 +1847,7 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src, if (nla_put_be32(skb, RTA_PREFSRC, fl4->saddr)) goto nla_put_failure; } - if (fl4->daddr != rt->rt_gateway && + if (rt->rt_gateway && nla_put_be32(skb, RTA_GATEWAY, rt->rt_gateway)) goto nla_put_failure; ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops. 2012-07-13 11:10 ` David Miller @ 2012-07-13 20:03 ` Vijay Subramanian 0 siblings, 0 replies; 5+ messages in thread From: Vijay Subramanian @ 2012-07-13 20:03 UTC (permalink / raw) To: David Miller; +Cc: netdev On 13 July 2012 04:10, David Miller <davem@davemloft.net> wrote: > From: Vijay Subramanian <subramanian.vijay@gmail.com> > Date: Thu, 12 Jul 2012 17:52:54 -0700 > >> I did not get a chance to see why it suddenly starts working. Hope >> this helps. I will dig around more. > > The problem is the setting of ->rt_gateway for local subnet routes. > When I tested this yesterday, the peer was actually not on the same subnet and I saw this problem. I think the problem was also present for a peer on the same subnet. Anyway, I applied this patch and the DANGER one after this and the problem has disappeared, fib_lookup() now returns RTN_LOCAL and ssh responds immediately. I tested with 2 peers, one on the same subnet and one on a different one. So far, it looks good. Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Thanks, Vijay ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2012-07-13 20:03 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-07-12 17:47 [DANGER 8/7]: ipv4: Cache output routes in fib_info nexthops David Miller 2012-07-13 0:52 ` Vijay Subramanian 2012-07-13 1:06 ` David Miller 2012-07-13 11:10 ` David Miller 2012-07-13 20:03 ` Vijay Subramanian
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).