netdev.vger.kernel.org archive mirror
* [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception
@ 2015-04-11  1:54 Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Hi,

This series avoids creating an RTF_CACHE route every time the fib6 tree is
consulted with a new destination.  Instead, an RTF_CACHE route is created only
when we see a pmtu exception.

Of all the ipv6 RTF_CACHE routes that are created, only a very small percentage
has a different mtu.  On one of our end-user-facing proxy servers, only 1k out
of 80k RTF_CACHE routes have a smaller MTU.  For our DC traffic, there is no
mtu exception.

A large fib6 tree causes problems: 'ip -6 r show' takes a long time, and gc
may kick in too often.  Also, when a service has restarted and a lot of new
TCP connection requests come in, the tree comes under pressure because many
RTF_CACHE routes are inserted in a short time, and each insertion currently
requires a write lock.

The first few patches are prep work to remove the assumption that the
returned rt is always RTF_CACHE.

The patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception'
does the lazy RTF_CACHE route creation.
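
As a rough illustration of the lazy approach, here is a userspace toy model.
It is not kernel code: toy_route, toy_lookup_mtu() and toy_pmtu_exception()
are hypothetical names, the destination is a plain 32-bit id, and the
exceptions live in a small array instead of the fib6 tree.  The point is only
that lookups never allocate, and a per-destination entry appears only after a
pmtu exception.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EXCEPTIONS 8

/* Toy stand-in for one per-destination cache entry. */
struct toy_exception {
	uint32_t daddr;	/* destination key (toy: 32-bit id, not an IPv6 address) */
	uint32_t mtu;	/* path MTU learned for that destination */
};

/* Toy stand-in for a shared route plus the exceptions hanging off it. */
struct toy_route {
	uint32_t mtu;	/* MTU of the shared route */
	struct toy_exception exc[TOY_MAX_EXCEPTIONS];
	size_t n_exc;	/* number of cache entries created so far */
};

/* Lookup never allocates: return the exception MTU if one exists,
 * otherwise the MTU of the shared route.
 */
static uint32_t toy_lookup_mtu(const struct toy_route *rt, uint32_t daddr)
{
	size_t i;

	for (i = 0; i < rt->n_exc; i++)
		if (rt->exc[i].daddr == daddr)
			return rt->exc[i].mtu;
	return rt->mtu;
}

/* Only a pmtu exception creates (or updates) a per-destination entry.
 * Returns 0 on success and -1 if the toy table is full.
 */
static int toy_pmtu_exception(struct toy_route *rt, uint32_t daddr,
			      uint32_t mtu)
{
	size_t i;

	if (mtu >= toy_lookup_mtu(rt, daddr))
		return 0;	/* not smaller than what is already in use */

	for (i = 0; i < rt->n_exc; i++) {
		if (rt->exc[i].daddr == daddr) {
			rt->exc[i].mtu = mtu;
			return 0;
		}
	}
	if (rt->n_exc == TOY_MAX_EXCEPTIONS)
		return -1;
	rt->exc[rt->n_exc].daddr = daddr;
	rt->exc[rt->n_exc].mtu = mtu;
	rt->n_exc++;
	return 0;
}
```

In this model, looking up thousands of distinct destinations leaves n_exc at 0;
it grows only by the number of destinations that actually reported a smaller
MTU.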

The next few patches fix the /128 via gateway route issue.  One of them
is by "Steffen Klassert <steffen.klassert@secunet.com>" and was picked up
from netdev.

The last two patches add a percpu rt to compensate for the performance loss
caused by the lazy RTF_CACHE creation.

Here are some numbers from the udpflood test.  udpflood has been
slightly modified to take a time limit instead of a count limit.

A /64 via gateway route is used for the test. Each udpflood uses 10000 dst
addresses.  The dst addresses of different udpflood processes do not overlap
with each other.

# of udpflood        # of trans (patched)        # of trans (upstream)

1                    16M                          15M
10                   61M                          61M
20                   65M                          62M
40                   88M                          83M
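
One way to read the table is as the relative change of the patched count over
the upstream count.  The helper below is only a convenience sketch
(relative_gain_percent() is a made-up name), and since the table values are
rounded to whole millions the results are approximate: about 6.7% for 1
process, 0% for 10, 4.8% for 20 and 6.0% for 40.

```c
/* Relative change of the patched count over the upstream count, in percent.
 * Both arguments are in the same unit (here: millions of transactions).
 */
static double relative_gain_percent(double patched, double upstream)
{
	return (patched - upstream) / upstream * 100.0;
}
```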


Many thanks to "Hannes Frederic Sowa <hannes@stressinduktion.org>" for
reviewing the patches and giving advice.

--Martin


* [RFC PATCH 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch removes the assumption that the returned rt is always
an RTF_CACHE entry whose rt6i_dst and rt6i_src contain the
destination and source addresses.  The dst and src can be recovered at the
call site.

We may consider renaming (rt6i_dst, rt6i_src) to (rt6i_key_dst, rt6i_key_src)
later.
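
A userspace sketch of the resulting calling convention, using
ipv6_select_ident() as the example: the identifier is derived only from the two
addresses the caller hands in, so no route object is involved.  The names
toy_hash_bytes() and toy_select_ident() are hypothetical, and the FNV-1a hash
is just a placeholder; it is NOT the hash the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>

/* Toy hash (FNV-1a over the bytes); a placeholder, not the kernel's hash. */
static uint32_t toy_hash_bytes(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* The identifier depends only on the addresses supplied by the caller,
 * so neither rt6i_dst nor rt6i_src has to be consulted.
 */
static uint32_t toy_select_ident(const struct in6_addr *daddr,
				 const struct in6_addr *saddr)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */

	h = toy_hash_bytes(h, daddr, sizeof(*daddr));
	h = toy_hash_bytes(h, saddr, sizeof(*saddr));
	return h;
}
```

A caller that has a packet would pass the addresses from its IPv6 header; a
caller that has a flow would pass the addresses from the flow, as the hunks
below do with ipv6_hdr(skb) and fl6.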

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/scsi/cxgbi/libcxgbi.c   |  2 +-
 include/net/ipv6.h              |  3 ++-
 net/ipv6/icmp.c                 |  2 +-
 net/ipv6/ip6_output.c           | 22 +++++++++++-----------
 net/ipv6/ndisc.c                |  2 +-
 net/ipv6/output_core.c          |  9 +++++----
 net/ipv6/tcp_ipv6.c             |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |  4 ++--
 net/sctp/ipv6.c                 |  3 ++-
 9 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index eb58afc..45d3039 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -728,7 +728,7 @@ static struct cxgbi_sock *cxgbi_check_route6(struct sockaddr *dst_addr)
 	}
 	ndev = n->dev;
 
-	if (ipv6_addr_is_multicast(&rt->rt6i_dst.addr)) {
+	if (ipv6_addr_is_multicast(&daddr6->sin6_addr)) {
 		pr_info("multi-cast route %pI6 port %u, dev %s.\n",
 			daddr6->sin6_addr.s6_addr,
 			ntohs(daddr6->sin6_port), ndev->name);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index eec8ad3..a0890d6 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -670,7 +670,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add
 }
 
 void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt);
+		       const struct in6_addr *daddr,
+		       const struct in6_addr *saddr);
 void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb);
 
 int ip6_dst_hoplimit(struct dst_entry *dst);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 2c2b5d5..24b359d 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -207,7 +207,7 @@ static bool icmpv6_xrlim_allow(struct sock *sk, u8 type,
 			struct inet_peer *peer;
 
 			peer = inet_getpeer_v6(net->ipv6.peers,
-					       &rt->rt6i_dst.addr, 1);
+					       &fl6->daddr, 1);
 			res = inet_peer_xrlim_allow(peer, tmo);
 			if (peer)
 				inet_putpeer(peer);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..b987fbf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -459,7 +459,7 @@ int ip6_forward(struct sk_buff *skb)
 		else
 			target = &hdr->daddr;
 
-		peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+		peer = inet_getpeer_v6(net->ipv6.peers, &hdr->daddr, 1);
 
 		/* Limit redirects both by destination (here)
 		   and by source (inside ndisc_send_redirect)
@@ -549,6 +549,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 				inet6_sk(skb->sk) : NULL;
 	struct ipv6hdr *tmp_hdr;
 	struct frag_hdr *fh;
+	struct frag_hdr tmp_fh;
 	unsigned int mtu, hlen, left, len;
 	int hroom, troom;
 	__be32 frag_id = 0;
@@ -584,6 +585,10 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	}
 	mtu -= hlen + sizeof(struct frag_hdr);
 
+	ipv6_select_ident(net, &tmp_fh, &ipv6_hdr(skb)->daddr,
+			  &ipv6_hdr(skb)->saddr);
+	frag_id = tmp_fh.identification;
+
 	if (skb_has_frag_list(skb)) {
 		int first_len = skb_pagelen(skb);
 		struct sk_buff *frag2;
@@ -632,11 +637,10 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 		skb_reset_network_header(skb);
 		memcpy(skb_network_header(skb), tmp_hdr, hlen);
 
-		ipv6_select_ident(net, fh, rt);
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
 		fh->frag_off = htons(IP6_MF);
-		frag_id = fh->identification;
+		fh->identification = frag_id;
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
@@ -778,11 +782,7 @@ slow_path:
 		 */
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
-		if (!frag_id) {
-			ipv6_select_ident(net, fh, rt);
-			frag_id = fh->identification;
-		} else
-			fh->identification = frag_id;
+		fh->identification = frag_id;
 
 		/*
 		 *	Copy a block of the IP datagram.
@@ -1037,7 +1037,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 			int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
 			int transhdrlen, int mtu, unsigned int flags,
-			struct rt6_info *rt)
+			const struct flowi6 *fl6)
 
 {
 	struct sk_buff *skb;
@@ -1083,7 +1083,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 	skb_shinfo(skb)->gso_size = (mtu - fragheaderlen -
 				     sizeof(struct frag_hdr)) & ~7;
 	skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-	ipv6_select_ident(sock_net(sk), &fhdr, rt);
+	ipv6_select_ident(sock_net(sk), &fhdr, &fl6->daddr, &fl6->saddr);
 	skb_shinfo(skb)->ip6_frag_id = fhdr.identification;
 
 append:
@@ -1307,7 +1307,7 @@ emsgsize:
 	    (sk->sk_type == SOCK_DGRAM)) {
 		err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
 					  hh_len, fragheaderlen,
-					  transhdrlen, mtu, flags, rt);
+					  transhdrlen, mtu, flags, fl6);
 		if (err)
 			goto error;
 		return 0;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 96f153c..0a05b35 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1506,7 +1506,7 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 			  "Redirect: destination is not a neighbour\n");
 		goto release;
 	}
-	peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+	peer = inet_getpeer_v6(net->ipv6.peers, &ipv6_hdr(skb)->saddr, 1);
 	ret = inet_peer_xrlim_allow(peer, 1*HZ);
 	if (peer)
 		inet_putpeer(peer);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 85892af..f37cfa9 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -10,7 +10,8 @@
 #include <net/secure_seq.h>
 
 static u32 __ipv6_select_ident(struct net *net, u32 hashrnd,
-			       struct in6_addr *dst, struct in6_addr *src)
+			       const struct in6_addr *dst,
+			       const struct in6_addr *src)
 {
 	u32 hash, id;
 
@@ -61,15 +62,15 @@ void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb)
 EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
 
 void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt)
+		       const struct in6_addr *daddr,
+		       const struct in6_addr *saddr)
 {
 	static u32 ip6_idents_hashrnd __read_mostly;
 	u32 id;
 
 	net_get_random_once(&ip6_idents_hashrnd, sizeof(ip6_idents_hashrnd));
 
-	id = __ipv6_select_ident(net, ip6_idents_hashrnd, &rt->rt6i_dst.addr,
-				 &rt->rt6i_src.addr);
+	id = __ipv6_select_ident(net, ip6_idents_hashrnd, daddr, saddr);
 	fhdr->identification = htonl(id);
 }
 EXPORT_SYMBOL(ipv6_select_ident);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f73a97f..dfcca70 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -262,7 +262,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	rt = (struct rt6_info *) dst;
 	if (tcp_death_row.sysctl_tw_recycle &&
 	    !tp->rx_opt.ts_recent_stamp &&
-	    ipv6_addr_equal(&rt->rt6i_dst.addr, &sk->sk_v6_daddr))
+	    ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr))
 		tcp_fetch_timewait_stamp(sk, dst);
 
 	icsk->icsk_ext_hdr_len = 0;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 19986ec..38f8627 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -781,7 +781,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG_RL_PKT(1, AF_INET6, pp, skb, 0,
 				 "ip_vs_nat_xmit_v6(): "
 				 "stopping DNAT to loopback address");
@@ -1346,7 +1346,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG(1, "%s(): "
 			  "stopping DNAT to loopback %pI6\n",
 			  __func__, &cp->daddr.in6);
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 0e4198e..9fa13f6 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -332,7 +332,8 @@ out:
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
 		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
-		pr_debug("rt6_dst:%pI6 rt6_src:%pI6\n", &rt->rt6i_dst.addr,
+		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
+			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
 	} else {
 		t->dst = NULL;
-- 
1.8.1


* [RFC PATCH 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

When creating an RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop, while the nexthop
could be a gateway or the rt6i_dst.addr.

After removing the rt6i_dst and rt6i_src dependency in the previous patch, we
also need to stop callers from depending on rt6i_gateway and RTF_ANYCAST.
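
The two helpers changed by this patch can be modelled in userspace roughly as
follows.  This is a sketch under assumptions: struct toy_rt and the TOY_* flag
bits are illustrative stand-ins (the bit values are placeholders, not the
kernel's), but the two conditions mirror the new rt6_nexthop() and
ipv6_anycast_destination() in the diff below.

```c
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* Illustrative flag bits; placeholder values, not the kernel's. */
#define TOY_RTF_GATEWAY 0x1u
#define TOY_RTF_ANYCAST 0x2u

struct toy_rt {
	unsigned int flags;
	struct in6_addr gateway;	/* meaningful only with TOY_RTF_GATEWAY */
	struct in6_addr dst;		/* prefix the route was installed with */
	int plen;			/* prefix length of dst */
};

/* The nexthop is the gateway for gateway routes, otherwise it is the
 * destination address supplied by the caller.
 */
static const struct in6_addr *toy_nexthop(const struct toy_rt *rt,
					  const struct in6_addr *daddr)
{
	return (rt->flags & TOY_RTF_GATEWAY) ? &rt->gateway : daddr;
}

/* Anycast if the route is flagged as such, or if daddr equals the prefix
 * address of a route that is not a /128.
 */
static bool toy_anycast_destination(const struct toy_rt *rt,
				    const struct in6_addr *daddr)
{
	return (rt->flags & TOY_RTF_ANYCAST) ||
	       (rt->plen != 128 &&
		memcmp(&rt->dst, daddr, sizeof(*daddr)) == 0);
}
```

Both helpers now take the destination address as an argument, which is why
every call site in the diff gains a daddr parameter.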

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_route.h                | 14 +++++++++-----
 net/bluetooth/6lowpan.c                |  2 +-
 net/ipv6/icmp.c                        |  4 ++--
 net/ipv6/ip6_output.c                  |  5 +++--
 net/ipv6/route.c                       |  6 +-----
 net/netfilter/nf_conntrack_h323_main.c |  4 ++--
 net/netfilter/xt_addrtype.c            |  2 +-
 7 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 5e19206..0e4d170 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -163,11 +163,14 @@ static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 	return rt->rt6i_flags & RTF_LOCAL;
 }
 
-static inline bool ipv6_anycast_destination(const struct sk_buff *skb)
+static inline bool ipv6_anycast_destination(const struct dst_entry *dst,
+					    const struct in6_addr *daddr)
 {
-	struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
+	struct rt6_info *rt = (struct rt6_info *)dst;
 
-	return rt->rt6i_flags & RTF_ANYCAST;
+	return rt->rt6i_flags & RTF_ANYCAST ||
+		(rt->rt6i_dst.plen != 128 &&
+		 ipv6_addr_equal(&rt->rt6i_dst.addr, daddr));
 }
 
 int ip6_fragment(struct sock *sk, struct sk_buff *skb,
@@ -194,9 +197,10 @@ static inline bool ip6_sk_ignore_df(const struct sock *sk)
 	       inet6_sk(sk)->pmtudisc == IPV6_PMTUDISC_OMIT;
 }
 
-static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt)
+static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt,
+					   struct in6_addr *daddr)
 {
-	return &rt->rt6i_gateway;
+	return (rt->rt6i_flags & RTF_GATEWAY) ? &rt->rt6i_gateway : daddr;
 }
 
 #endif
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 1742b84..f3d6046 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -192,7 +192,7 @@ static inline struct lowpan_peer *peer_lookup_dst(struct lowpan_dev *dev,
 		if (ipv6_addr_any(nexthop))
 			return NULL;
 	} else {
-		nexthop = rt6_nexthop(rt);
+		nexthop = rt6_nexthop(rt, daddr);
 
 		/* We need to remember the address because it is needed
 		 * by bt_xmit() when sending the packet. In bt_xmit(), the
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 24b359d..713d743 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -337,7 +337,7 @@ static struct dst_entry *icmpv6_route_lookup(struct net *net,
 	 * We won't send icmp if the destination is known
 	 * anycast.
 	 */
-	if (((struct rt6_info *)dst)->rt6i_flags & RTF_ANYCAST) {
+	if (ipv6_anycast_destination(dst, &fl6->daddr)) {
 		net_dbg_ratelimited("icmp6_send: acast source\n");
 		dst_release(dst);
 		return ERR_PTR(-EINVAL);
@@ -564,7 +564,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 
 	if (!ipv6_unicast_destination(skb) &&
 	    !(net->ipv6.sysctl.anycast_src_echo_reply &&
-	      ipv6_anycast_destination(skb)))
+	      ipv6_anycast_destination(skb_dst(skb), saddr)))
 		saddr = NULL;
 
 	memcpy(&tmp_hdr, icmph, sizeof(tmp_hdr));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b987fbf..e58e402 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -105,7 +105,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
 	if (unlikely(!neigh))
 		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
@@ -913,7 +913,8 @@ static int ip6_dst_lookup_tail(struct sock *sk,
 	 */
 	rt = (struct rt6_info *) *dst;
 	rcu_read_lock_bh();
-	n = __ipv6_neigh_lookup_noref(rt->dst.dev, rt6_nexthop(rt));
+	n = __ipv6_neigh_lookup_noref(rt->dst.dev,
+				      rt6_nexthop(rt, &fl6->daddr));
 	err = n && !(n->nud_state & NUD_VALID) ? -EINVAL : 0;
 	rcu_read_unlock_bh();
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5c48293..0ccb8ec 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1946,11 +1946,7 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 		if (rt->rt6i_idev)
 			in6_dev_hold(rt->rt6i_idev);
 		rt->dst.lastuse = jiffies;
-
-		if (ort->rt6i_flags & RTF_GATEWAY)
-			rt->rt6i_gateway = ort->rt6i_gateway;
-		else
-			rt->rt6i_gateway = *dest;
+		rt->rt6i_gateway = ort->rt6i_gateway;
 		rt->rt6i_flags = ort->rt6i_flags;
 		rt6_set_from(rt, ort);
 		rt->rt6i_metric = 0;
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 1d69f5b..9511af0 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -779,8 +779,8 @@ static int callforward_do_filter(struct net *net,
 				   flowi6_to_flowi(&fl1), false)) {
 			if (!afinfo->route(net, (struct dst_entry **)&rt2,
 					   flowi6_to_flowi(&fl2), false)) {
-				if (ipv6_addr_equal(rt6_nexthop(rt1),
-						    rt6_nexthop(rt2)) &&
+				if (ipv6_addr_equal(rt6_nexthop(rt1, &fl1.daddr),
+						    rt6_nexthop(rt2, &fl2.daddr)) &&
 				    rt1->dst.dev == rt2->dst.dev)
 					ret = 1;
 				dst_release(&rt2->dst);
diff --git a/net/netfilter/xt_addrtype.c b/net/netfilter/xt_addrtype.c
index fab6eea..5b4743c 100644
--- a/net/netfilter/xt_addrtype.c
+++ b/net/netfilter/xt_addrtype.c
@@ -73,7 +73,7 @@ static u32 match_lookup_rt6(struct net *net, const struct net_device *dev,
 
 	if (dev == NULL && rt->rt6i_flags & RTF_LOCAL)
 		ret |= XT_ADDRTYPE_LOCAL;
-	if (rt->rt6i_flags & RTF_ANYCAST)
+	if (ipv6_anycast_destination((struct dst_entry *)rt, addr))
 		ret |= XT_ADDRTYPE_ANYCAST;
 
 	dst_release(&rt->dst);
-- 
1.8.1


* [RFC PATCH 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Prep work for creating RTF_CACHE routes on exception only.  After this
patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))
is checked twice.  This redundancy is removed in a later patch.
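
To see that merging the two allocators does not change when a clone is
allocated, the old two-branch test and the new combined test can be compared
over all flag combinations.  The sketch below is a userspace model with
placeholder TOY_* bit values and hypothetical function names; only the shape of
the conditions is taken from the diff.

```c
#include <stdbool.h>

/* Illustrative bits; placeholder values, not the kernel's. */
#define TOY_RTF_GATEWAY   0x1u
#define TOY_RTF_NONEXTHOP 0x2u
#define TOY_DST_HOST      0x1u

/* Before: two branches, each calling its own allocator. */
static bool toy_allocates_before(unsigned int rt_flags, unsigned int dst_flags)
{
	if (!(rt_flags & (TOY_RTF_NONEXTHOP | TOY_RTF_GATEWAY)))
		return true;	/* rt6_alloc_cow() path */
	else if (!(dst_flags & TOY_DST_HOST))
		return true;	/* rt6_alloc_clone() path */
	return false;		/* use the looked-up route as is */
}

/* After: one combined test in front of the single allocator. */
static bool toy_allocates_after(unsigned int rt_flags, unsigned int dst_flags)
{
	return !(rt_flags & (TOY_RTF_NONEXTHOP | TOY_RTF_GATEWAY)) ||
	       !(dst_flags & TOY_DST_HOST);
}

/* Inside the merged allocator, the former cow-only work (anycast flag,
 * source key) is guarded by the first half of the condition again; this
 * is the second evaluation of the same check.
 */
static bool toy_does_cow_work(unsigned int rt_flags)
{
	return !(rt_flags & (TOY_RTF_NONEXTHOP | TOY_RTF_GATEWAY));
}
```

Looping rt_flags over 0..3 and dst_flags over 0..1 shows the two predicates
agree in all eight cases; the only combination that allocates nothing is a
DST_HOST route with a gateway or nonexthop flag set.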

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0ccb8ec..f753a67 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -872,9 +872,9 @@ int ip6_ins_rt(struct rt6_info *rt)
 	return __ip6_ins_rt(rt, &info, &mxc);
 }
 
-static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
-				      const struct in6_addr *daddr,
-				      const struct in6_addr *saddr)
+static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
+						const struct in6_addr *daddr,
+						const struct in6_addr *saddr)
 {
 	struct rt6_info *rt;
 
@@ -885,33 +885,24 @@ static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
 	rt = ip6_rt_copy(ort, daddr);
 
 	if (rt) {
-		if (ort->rt6i_dst.plen != 128 &&
-		    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
-			rt->rt6i_flags |= RTF_ANYCAST;
-
 		rt->rt6i_flags |= RTF_CACHE;
 
+		if (!(ort->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))) {
+			if (ort->rt6i_dst.plen != 128 &&
+			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
+				rt->rt6i_flags |= RTF_ANYCAST;
 #ifdef CONFIG_IPV6_SUBTREES
-		if (rt->rt6i_src.plen && saddr) {
-			rt->rt6i_src.addr = *saddr;
-			rt->rt6i_src.plen = 128;
-		}
+			if (rt->rt6i_src.plen && saddr) {
+				rt->rt6i_src.addr = *saddr;
+				rt->rt6i_src.plen = 128;
+			}
 #endif
+		}
 	}
 
 	return rt;
 }
 
-static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort,
-					const struct in6_addr *daddr)
-{
-	struct rt6_info *rt = ip6_rt_copy(ort, daddr);
-
-	if (rt)
-		rt->rt6i_flags |= RTF_CACHE;
-	return rt;
-}
-
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
@@ -957,10 +948,9 @@ redo_rt6_select:
 	if (rt->rt6i_flags & RTF_CACHE)
 		goto out2;
 
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)))
-		nrt = rt6_alloc_cow(rt, &fl6->daddr, &fl6->saddr);
-	else if (!(rt->dst.flags & DST_HOST))
-		nrt = rt6_alloc_clone(rt, &fl6->daddr);
+	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
+	    !(rt->dst.flags & DST_HOST))
+		nrt = ip6_pmtu_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
 	else
 		goto out2;
 
-- 
1.8.1


* [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-20 18:27   ` David Miller
  2015-04-20 18:28   ` David Miller
  2015-04-11  1:54 ` [RFC PATCH 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch creates an RTF_CACHE route only after encountering a pmtu exception.

After ip6_rt_update_pmtu() has inserted the RTF_CACHE route into the fib6 tree,
rt->rt6i_node->fn_sernum is bumped, which fails ip6_dst_check() and
triggers a relookup.
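
The invalidation mechanism described above can be pictured with a small
userspace model.  All names (toy_node, toy_dst_cache and the three functions)
are hypothetical; the model only captures the idea that a cached route
remembers the node's serial number and becomes stale once an insertion bumps
it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy tree node: its serial number changes whenever a route is inserted. */
struct toy_node {
	uint32_t sernum;
};

/* Toy socket-side cache: remembers the serial number seen at lookup time. */
struct toy_dst_cache {
	const struct toy_node *node;
	uint32_t cookie;
};

/* Models caching the result of a lookup together with its cookie. */
static void toy_cache_route(struct toy_dst_cache *c, const struct toy_node *n)
{
	c->node = n;
	c->cookie = n->sernum;
}

/* Models the side effect of inserting a cache route under the node. */
static void toy_insert_cache_route(struct toy_node *n)
{
	n->sernum++;
}

/* Models the validity check: a stale cookie means "look the route up again". */
static bool toy_dst_check(const struct toy_dst_cache *c)
{
	return c->node->sernum == c->cookie;
}
```

After toy_insert_cache_route() the check fails once; the caller then looks the
route up again, which in the real code is where the newly inserted RTF_CACHE
route is found.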

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 92 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 43 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f753a67..1b57bc9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -907,16 +907,13 @@ static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 				      struct flowi6 *fl6, int flags)
 {
 	struct fib6_node *fn, *saved_fn;
-	struct rt6_info *rt, *nrt;
+	struct rt6_info *rt;
 	int strict = 0;
-	int attempts = 3;
-	int err;
 
 	strict |= flags & RT6_LOOKUP_F_IFACE;
 	if (net->ipv6.devconf_all->forwarding == 0)
 		strict |= RT6_LOOKUP_F_REACHABLE;
 
-redo_fib6_lookup_lock:
 	read_lock_bh(&table->tb6_lock);
 
 	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
@@ -935,46 +932,12 @@ redo_rt6_select:
 			strict &= ~RT6_LOOKUP_F_REACHABLE;
 			fn = saved_fn;
 			goto redo_rt6_select;
-		} else {
-			dst_hold(&rt->dst);
-			read_unlock_bh(&table->tb6_lock);
-			goto out2;
 		}
 	}
 
 	dst_hold(&rt->dst);
 	read_unlock_bh(&table->tb6_lock);
 
-	if (rt->rt6i_flags & RTF_CACHE)
-		goto out2;
-
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
-	    !(rt->dst.flags & DST_HOST))
-		nrt = ip6_pmtu_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
-	else
-		goto out2;
-
-	ip6_rt_put(rt);
-	rt = nrt ? : net->ipv6.ip6_null_entry;
-
-	dst_hold(&rt->dst);
-	if (nrt) {
-		err = ip6_ins_rt(nrt);
-		if (!err)
-			goto out2;
-	}
-
-	if (--attempts <= 0)
-		goto out2;
-
-	/*
-	 * Race condition! In the gap, when table->tb6_lock was
-	 * released someone could insert this route.  Relookup.
-	 */
-	ip6_rt_put(rt);
-	goto redo_fib6_lookup_lock;
-
-out2:
 	rt->dst.lastuse = jiffies;
 	rt->dst.__use++;
 
@@ -1144,13 +1107,49 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
 
 	dst_confirm(dst);
-	if (mtu < dst_mtu(dst) && rt6->rt6i_dst.plen == 128) {
+	mtu = max_t(u32, mtu, IPV6_MIN_MTU);
+	if (mtu >= dst_mtu(dst))
+		return;
+
+	if (!(rt6->rt6i_flags & RTF_CACHE) &&
+	    (!(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
+	     !(rt6->dst.flags & DST_HOST))) {
+		const struct in6_addr *daddr, *saddr;
+		struct rt6_info *nrt6;
+
+		if (skb) {
+			const struct ipv6hdr *iph = ipv6_hdr(skb);
+
+			daddr = &iph->daddr;
+			saddr = &iph->saddr;
+		} else if (sk) {
+			daddr = &sk->sk_v6_daddr;
+			saddr = &inet6_sk(sk)->saddr;
+		} else {
+			return;
+		}
+		nrt6 = ip6_pmtu_rt_cache_alloc(rt6, daddr, saddr);
+		if (!nrt6)
+			return;
+		/* ip6_ins_rt(nrt6) will bump the rt6->rt6i_node->fn_sernum
+		 * which will fail the next rt6_check() and invalidate the
+		 * sk->sk_dst_cache.
+		 */
+		if (ip6_ins_rt(nrt6)) {
+			dst_destroy(&nrt6->dst);
+			return;
+		}
+
+		rt6 = nrt6;
+		dst = &nrt6->dst;
+	} else {
+		rt6 = (struct rt6_info *)dst;
+	}
+
+	if (rt6->rt6i_dst.plen == 128) {
 		struct net *net = dev_net(dst->dev);
 
 		rt6->rt6i_flags |= RTF_MODIFIED;
-		if (mtu < IPV6_MIN_MTU)
-			mtu = IPV6_MIN_MTU;
-
 		dst_metric_set(dst, RTAX_MTU, mtu);
 		rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
 	}
@@ -1171,8 +1170,15 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
 	fl6.flowlabel = ip6_flowinfo(iph);
 
 	dst = ip6_route_output(net, NULL, &fl6);
-	if (!dst->error)
+	if (!dst->error) {
+		unsigned char *outer_network_header = skb_network_header(skb);
+		int offset;
+
+		skb_reset_network_header(skb);
+		offset = outer_network_header - skb_network_header(skb);
 		ip6_rt_update_pmtu(dst, NULL, skb, ntohl(mtu));
+		skb_set_network_header(skb, offset);
+	}
 	dst_release(dst);
 }
 EXPORT_SYMBOL_GPL(ip6_update_pmtu);
-- 
1.8.1


* [RFC PATCH 05/10] ipv6: Allow pmtu update on /128 via gateway route
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Consider a permanent /128 via gateway route (DST_HOST) in the route table.
When there is a pmtu update, the permanent DST_HOST route itself is updated
and RTF_EXPIRES is set on it, so the permanent route is removed after
expiration.

While we are at it, this patch also simplifies some of the checks in
ip6_rt_update_pmtu().

1. !(rt6->rt6i_flags & RTF_CACHE) is used to decide when
an RTF_CACHE route needs to be created for a pmtu update.

2. Remove the rt6->rt6i_dst.plen == 128 check, since the RTF_CACHE route will
be created (if it is needed) before updating the mtu.

3. Add a check to ensure there is no pmtu update on an RTF_LOCAL route.
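
Put together, the decision logic in ip6_rt_update_pmtu() after this patch can
be sketched as a pure function.  This is a userspace model: the enum, the
function name and the TOY_* flag values are hypothetical, while the order of
the checks follows the diff below and 1280 is the IPv6 minimum MTU.

```c
#include <stdint.h>

/* Illustrative flag bits; placeholder values, not the kernel's. */
#define TOY_RTF_LOCAL    0x1u
#define TOY_RTF_CACHE    0x2u
#define TOY_IPV6_MIN_MTU 1280u	/* IPv6 minimum link MTU */

enum toy_pmtu_action {
	TOY_PMTU_IGNORE,		/* nothing to do */
	TOY_PMTU_CLONE_THEN_SET,	/* create a cache route, set MTU on it */
	TOY_PMTU_SET_IN_PLACE,		/* route is already a cache route */
};

/* Decide what a pmtu update does.  *mtu is clamped to the IPv6 minimum
 * before it is compared with the MTU currently in use.
 */
static enum toy_pmtu_action toy_pmtu_decide(unsigned int rt_flags,
					    uint32_t cur_mtu, uint32_t *mtu)
{
	if (rt_flags & TOY_RTF_LOCAL)
		return TOY_PMTU_IGNORE;		/* item 3 */
	if (*mtu < TOY_IPV6_MIN_MTU)
		*mtu = TOY_IPV6_MIN_MTU;
	if (*mtu >= cur_mtu)
		return TOY_PMTU_IGNORE;		/* not a smaller MTU */
	if (!(rt_flags & TOY_RTF_CACHE))
		return TOY_PMTU_CLONE_THEN_SET;	/* item 1 */
	return TOY_PMTU_SET_IN_PLACE;		/* item 2: no plen check */
}
```

With this shape, a permanent /128 via gateway route takes the
clone-then-set branch, so the expiry is set on the clone and the permanent
route is left untouched.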

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1b57bc9..75f3b5d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1105,15 +1105,17 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 			       struct sk_buff *skb, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
+	struct net *net;
+
+	if (rt6->rt6i_flags & RTF_LOCAL)
+		return;
 
 	dst_confirm(dst);
 	mtu = max_t(u32, mtu, IPV6_MIN_MTU);
 	if (mtu >= dst_mtu(dst))
 		return;
 
-	if (!(rt6->rt6i_flags & RTF_CACHE) &&
-	    (!(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
-	     !(rt6->dst.flags & DST_HOST))) {
+	if (!(rt6->rt6i_flags & RTF_CACHE)) {
 		const struct in6_addr *daddr, *saddr;
 		struct rt6_info *nrt6;
 
@@ -1146,13 +1148,10 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 		rt6 = (struct rt6_info *)dst;
 	}
 
-	if (rt6->rt6i_dst.plen == 128) {
-		struct net *net = dev_net(dst->dev);
-
-		rt6->rt6i_flags |= RTF_MODIFIED;
-		dst_metric_set(dst, RTAX_MTU, mtu);
-		rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
-	}
+	net = dev_net(rt6->dst.dev);
+	rt6->rt6i_flags |= RTF_MODIFIED;
+	dst_metric_set(dst, RTAX_MTU, mtu);
+	rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
 }
 
 void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
-- 
1.8.1


* [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del()
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-20 18:23   ` David Miller
  2015-04-11  1:54 ` [RFC PATCH 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Before the patch 'Allow pmtu update on /128 via gateway route', an
RTF_CACHE route was not created for DST_HOST.  Creating one also requires
changes to both the delete code path and the rt6_select() code path.

This patch fixes the delete code path so that 'ip -6 r del...' does not
delete the RTF_CACHE route.
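
The filter added to the delete loop amounts to the following predicate.  It is
shown here as a userspace sketch with a placeholder flag value and a
hypothetical function name; in the diff, the request gets RTF_CACHE set in
cfg->fc_flags when the netlink message carries RTM_F_CLONED.

```c
#include <stdbool.h>

/* Illustrative flag bit; placeholder value, not the kernel's. */
#define TOY_RTF_CACHE 0x1u

/* A delete request considers a route only if the route is not a cache
 * route, or if the request itself asked for cache routes.
 */
static bool toy_del_considers(unsigned int rt_flags, unsigned int cfg_flags)
{
	if ((rt_flags & TOY_RTF_CACHE) && !(cfg_flags & TOY_RTF_CACHE))
		return false;
	return true;
}
```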

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 75f3b5d..5d0fd6c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1780,6 +1780,9 @@ static int ip6_route_del(struct fib6_config *cfg)
 
 	if (fn) {
 		for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
+			if ((rt->rt6i_flags & RTF_CACHE) &&
+			    !(cfg->fc_flags & RTF_CACHE))
+				continue;
 			if (cfg->fc_ifindex &&
 			    (!rt->dst.dev ||
 			     rt->dst.dev->ifindex != cfg->fc_ifindex))
@@ -2424,6 +2427,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (rtm->rtm_type == RTN_LOCAL)
 		cfg->fc_flags |= RTF_LOCAL;
 
+	if (rtm->rtm_flags & RTM_F_CLONED)
+		cfg->fc_flags |= RTF_CACHE;
+
 	cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
 	cfg->fc_nlinfo.nlh = nlh;
 	cfg->fc_nlinfo.nl_net = sock_net(skb->sk);
-- 
1.8.1


* [RFC PATCH 07/10] ipv6: Extend the route lookups to low priority metrics.
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team, Steffen Klassert

From: Steffen Klassert <steffen.klassert@secunet.com>

We search only for routes with the highest priority metric in
find_rr_leaf(). However, if one of these routes is marked
as invalid, we may fail to find a route even if there is
an appropriate route with lower priority. Then we lose
connectivity until the garbage collector deletes the
invalid route. This typically happens if a host route
expires after a pmtu event. Fix this by searching also
for routes with a lower priority metric.
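
A simplified userspace sketch of this fallback, not the kernel code.  It
assumes a list sorted by ascending metric, stands in a hypothetical
'usable' flag for the scoring done by find_match(), returns the first
usable route rather than the best-scoring one, and omits the round-robin
wrap from rr_head back to fn->leaf.  struct route and pick_route() are
illustrative names.

```c
#include <stdbool.h>
#include <stddef.h>

struct route {
	unsigned int metric;	/* lower value means higher priority */
	bool usable;		/* stand-in for find_match() accepting it */
	struct route *next;	/* list is sorted by ascending metric */
};

/* Return the first usable route, preferring the head's metric group. */
static struct route *pick_route(struct route *leaf)
{
	struct route *rt, *cont = NULL;

	if (!leaf)
		return NULL;

	/* First pass: only routes sharing the highest priority metric. */
	for (rt = leaf; rt; rt = rt->next) {
		if (rt->metric != leaf->metric) {
			cont = rt;	/* remember where lower priority starts */
			break;
		}
		if (rt->usable)
			return rt;
	}

	/* Second pass: fall back to the lower priority routes, if any. */
	for (rt = cont; rt; rt = rt->next)
		if (rt->usable)
			return rt;

	return NULL;
}
```

Before the fix, the equivalent of the second pass did not exist, so an
invalid entry in the first group could hide every usable route behind it.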

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5d0fd6c..91c80bc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -652,15 +652,33 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 				     u32 metric, int oif, int strict,
 				     bool *do_rr)
 {
-	struct rt6_info *rt, *match;
+	struct rt6_info *rt, *match, *cont;
 	int mpri = -1;
 
 	match = NULL;
-	for (rt = rr_head; rt && rt->rt6i_metric == metric;
-	     rt = rt->dst.rt6_next)
+	cont = NULL;
+	for (rt = rr_head; rt; rt = rt->dst.rt6_next) {
+		if (rt->rt6i_metric != metric) {
+			cont = rt;
+			break;
+		}
+
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
+	}
+
+	for (rt = fn->leaf; rt && rt != rr_head; rt = rt->dst.rt6_next) {
+		if (rt->rt6i_metric != metric) {
+			cont = rt;
+			break;
+		}
+
 		match = find_match(rt, oif, strict, &mpri, match, do_rr);
-	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
-	     rt = rt->dst.rt6_next)
+	}
+
+	if (match || !cont)
+		return match;
+
+	for (rt = cont; rt; rt = rt->dst.rt6_next)
 		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 
 	return match;
-- 
1.8.1


* [RFC PATCH 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

When there is a pmtu exception on a /128 via gateway route, we need to
create a separate metrics copy for the newly created RTF_CACHE route instead
of reusing the inetpeer cache.  Otherwise, the original mtu will be
overwritten and the mtu update will persist after the expiration.
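
The aliasing problem can be shown with a small userspace sketch.  This is
an analogy under stated assumptions, not the kernel's metrics code: the
shared array stands in for the inetpeer-backed metrics, and struct route,
NMETRICS, METRIC_MTU, clone_shared() and clone_private() are all
hypothetical names.

```c
#include <string.h>

#define NMETRICS	4	/* hypothetical array size */
#define METRIC_MTU	0	/* hypothetical index of the mtu metric */

struct route {
	unsigned int *metrics;		/* metrics currently in effect */
	unsigned int own[NMETRICS];	/* private storage, if any */
};

/* Share the parent's metrics: a write through the clone hits the parent. */
static void clone_shared(struct route *clone, const struct route *parent)
{
	clone->metrics = parent->metrics;
}

/* Take a private copy first: a write through the clone stays local. */
static void clone_private(struct route *clone, const struct route *parent)
{
	memcpy(clone->own, parent->metrics, sizeof(clone->own));
	clone->metrics = clone->own;
}
```

A pmtu update written through a clone_shared() clone changes the parent's
mtu too, and nothing restores it when the clone goes away; a
clone_private() clone avoids that.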

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 91c80bc..61ce45e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -322,8 +322,16 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	struct rt6_info *rt = (struct rt6_info *)dst;
 	struct inet6_dev *idev = rt->rt6i_idev;
 	struct dst_entry *from = dst->from;
+	unsigned long peer_metrics = 0;
 
-	if (!(rt->dst.flags & DST_HOST))
+	if (rt6_has_peer(rt)) {
+		struct inet_peer *peer = rt6_peer_ptr(rt);
+
+		peer_metrics = (unsigned long)peer->metrics;
+		inet_putpeer(peer);
+	}
+
+	if (peer_metrics != dst->_metrics)
 		dst_destroy_metrics_generic(dst);
 
 	if (idev) {
@@ -333,11 +341,6 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 
 	dst->from = NULL;
 	dst_release(from);
-
-	if (rt6_has_peer(rt)) {
-		struct inet_peer *peer = rt6_peer_ptr(rt);
-		inet_putpeer(peer);
-	}
 }
 
 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -1956,6 +1959,8 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 
 		rt->rt6i_dst.addr = *dest;
 		rt->rt6i_dst.plen = 128;
+		if (ort->dst.flags & DST_HOST)
+			dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
 		dst_copy_metrics(&rt->dst, &ort->dst);
 		rt->dst.error = ort->dst.error;
 		rt->rt6i_idev = ort->rt6i_idev;
-- 
1.8.1


* [RFC PATCH 09/10] ipv6: Break up ip6_rt_copy()
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-11  1:54 ` [RFC PATCH 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
  2015-04-20 18:29 ` [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and
ip6_rt_cache_alloc().

A later patch needs to create a percpu rt6_info copy.  Hence, refactor
the common rt6_info init code into ip6_rt_copy_init().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 76 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 61ce45e..665e41c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -72,8 +72,11 @@ enum rt6_nud_state {
 	RT6_NUD_SUCCEED = 1
 };
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest);
+static void ip6_rt_copy_init(struct rt6_info *rt,
+			     struct rt6_info *ort,
+			     const struct in6_addr *dest);
+static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
+					   const struct in6_addr *dest);
 static struct dst_entry	*ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ip6_default_advmss(const struct dst_entry *dst);
 static unsigned int	 ip6_mtu(const struct dst_entry *dst);
@@ -903,11 +906,9 @@ static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
 	 *	Clone the route.
 	 */
 
-	rt = ip6_rt_copy(ort, daddr);
+	rt = ip6_rt_cache_alloc(ort, daddr);
 
 	if (rt) {
-		rt->rt6i_flags |= RTF_CACHE;
-
 		if (!(ort->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))) {
 			if (ort->rt6i_dst.plen != 128 &&
 			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
@@ -1913,7 +1914,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 				     NEIGH_UPDATE_F_ISROUTER))
 		     );
 
-	nrt = ip6_rt_copy(rt, &msg->dest);
+	nrt = ip6_rt_cache_alloc(rt, &msg->dest);
 	if (!nrt)
 		goto out;
 
@@ -1945,39 +1946,50 @@ out:
  *	Misc support functions
  */
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest)
+static void ip6_rt_copy_init(struct rt6_info *rt,
+			     struct rt6_info *ort,
+			     const struct in6_addr *dest)
 {
-	struct net *net = dev_net(ort->dst.dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0,
-					    ort->rt6i_table);
-
-	if (rt) {
-		rt->dst.input = ort->dst.input;
-		rt->dst.output = ort->dst.output;
+	if (dest) {
 		rt->dst.flags |= DST_HOST;
-
 		rt->rt6i_dst.addr = *dest;
 		rt->rt6i_dst.plen = 128;
-		if (ort->dst.flags & DST_HOST)
-			dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
-		dst_copy_metrics(&rt->dst, &ort->dst);
-		rt->dst.error = ort->dst.error;
-		rt->rt6i_idev = ort->rt6i_idev;
-		if (rt->rt6i_idev)
-			in6_dev_hold(rt->rt6i_idev);
-		rt->dst.lastuse = jiffies;
-		rt->rt6i_gateway = ort->rt6i_gateway;
-		rt->rt6i_flags = ort->rt6i_flags;
-		rt6_set_from(rt, ort);
-		rt->rt6i_metric = 0;
+	} else {
+		memcpy(&rt->rt6i_dst, &ort->rt6i_dst, sizeof(rt->rt6i_dst));
+	}
 
+	rt->dst.input = ort->dst.input;
+	rt->dst.output = ort->dst.output;
+	rt->dst.error = ort->dst.error;
+	rt->rt6i_idev = ort->rt6i_idev;
+	if (rt->rt6i_idev)
+		in6_dev_hold(rt->rt6i_idev);
+	rt->dst.lastuse = jiffies;
+	rt->rt6i_gateway = ort->rt6i_gateway;
+	rt->rt6i_flags = ort->rt6i_flags;
+	rt->rt6i_metric = ort->rt6i_metric;
 #ifdef CONFIG_IPV6_SUBTREES
-		memcpy(&rt->rt6i_src, &ort->rt6i_src, sizeof(struct rt6key));
+	rt->rt6i_src = ort->rt6i_src;
 #endif
-		memcpy(&rt->rt6i_prefsrc, &ort->rt6i_prefsrc, sizeof(struct rt6key));
-		rt->rt6i_table = ort->rt6i_table;
-	}
+	rt->rt6i_prefsrc = ort->rt6i_prefsrc;
+	rt->rt6i_table = ort->rt6i_table;
+}
+
+static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
+					   const struct in6_addr *dest)
+{
+	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+					    0, ort->rt6i_table);
+
+	if (!rt)
+		return NULL;
+	ip6_rt_copy_init(rt, ort, dest);
+	if (ort->dst.flags & DST_HOST)
+		dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
+	dst_copy_metrics(&rt->dst, &ort->dst);
+	rt->rt6i_flags |= RTF_CACHE;
+	rt6_set_from(rt, ort);
+	rt->rt6i_metric = 0;
 	return rt;
 }
 
-- 
1.8.1


* [RFC PATCH 10/10] ipv6: Create percpu rt6_info
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
@ 2015-04-11  1:54 ` Martin KaFai Lau
  2015-04-20 18:29 ` [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:54 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate for the performance hit (bouncing dst->__refcnt)
by creating a percpu rt6_info copy.
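
The patch publishes each per-CPU copy with cmpxchg() in
rt6_get_pcpu_route().  A rough userspace analogue of that install step,
using C11 atomics instead of the kernel primitives, might look like the
sketch below.  struct pcpu_rt, pcpu_slot_t and install_pcpu() are
hypothetical names, and RCU and reference counting are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct pcpu_rt {
	int id;		/* placeholder payload */
};

/* One slot per CPU in the real code; a single slot is enough here. */
typedef _Atomic(struct pcpu_rt *) pcpu_slot_t;

/*
 * Install 'fresh' only if the slot still holds 'expected'.  Returns true
 * when the slot was updated.  On false another writer got there first,
 * and the caller should treat 'fresh' as an uncached one-off copy (the
 * patch marks it DST_NOCACHE in that case).
 */
static bool install_pcpu(pcpu_slot_t *slot, struct pcpu_rt *expected,
			 struct pcpu_rt *fresh)
{
	return atomic_compare_exchange_strong(slot, &expected, fresh);
}
```

The compare-and-swap lets a lookup populate the slot without taking the
table write lock.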

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_fib.h           |   8 ++
 include/net/ip6_route.h         |   2 +-
 include/uapi/linux/ipv6_route.h |   1 +
 net/ipv6/ip6_fib.c              |  22 +++++-
 net/ipv6/ip6_tunnel.c           |   2 +-
 net/ipv6/route.c                | 163 +++++++++++++++++++++++++++++++++++-----
 net/ipv6/tcp_ipv6.c             |   3 +-
 net/ipv6/xfrm6_policy.c         |   4 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   2 +-
 net/sctp/ipv6.c                 |   2 +-
 10 files changed, 182 insertions(+), 27 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 20e80fa..65702c5 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -124,6 +124,7 @@ struct rt6_info {
 	unsigned long			_rt6i_peer;
 
 	u32				rt6i_metric;
+	struct rt6_info __rcu * __percpu	*rt6i_pcpu;
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 	u8				rt6i_protocol;
@@ -198,6 +199,13 @@ static inline void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
 	rt->dst.from = new;
 }
 
+static inline u32 rt6_get_cookie(const struct rt6_info *rt)
+{
+	if (rt->rt6i_flags & RTF_PCPU)
+		rt = (struct rt6_info *)(rt->dst.from);
+	return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+}
+
 static inline void ip6_rt_put(struct rt6_info *rt)
 {
 	/* dst_release() accepts a NULL parameter.
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 0e4d170..397dd3a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -145,7 +145,7 @@ static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 #ifdef CONFIG_IPV6_SUBTREES
 	np->saddr_cache = saddr;
 #endif
-	np->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	np->dst_cookie = rt6_get_cookie(rt);
 }
 
 static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
index 2be7bd1..f6598d1 100644
--- a/include/uapi/linux/ipv6_route.h
+++ b/include/uapi/linux/ipv6_route.h
@@ -34,6 +34,7 @@
 #define RTF_PREF(pref)	((pref) << 27)
 #define RTF_PREF_MASK	0x18000000
 
+#define RTF_PCPU	0x40000000
 #define RTF_LOCAL	0x80000000
 
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 96dbfff..6aa9b80 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -154,10 +154,30 @@ static void node_free(struct fib6_node *fn)
 	kmem_cache_free(fib6_node_kmem, fn);
 }
 
+static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct rt6_info **ppcpu_rt;
+		struct rt6_info *pcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
+		pcpu_rt = rcu_dereference_protected(*ppcpu_rt,
+			lockdep_is_held(&non_pcpu_rt->rt6i_table->tb6_lock));
+		if (pcpu_rt) {
+			dst_free(&pcpu_rt->dst);
+			*ppcpu_rt = NULL;
+		}
+	}
+}
+
 static void rt6_release(struct rt6_info *rt)
 {
-	if (atomic_dec_and_test(&rt->rt6i_ref))
+	if (atomic_dec_and_test(&rt->rt6i_ref)) {
+		rt6_free_pcpu(rt);
 		dst_free(&rt->dst);
+	}
 }
 
 static void fib6_link_table(struct net *net, struct fib6_table *tb)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 5cafd92..2e67b66 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -151,7 +151,7 @@ EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
-	t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	t->dst_cookie = rt6_get_cookie(rt);
 	dst_release(t->dst_cache);
 	t->dst_cache = dst;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 665e41c..14f99c1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -137,9 +137,16 @@ static struct inet_peer *rt6_get_peer_create(struct rt6_info *rt)
 	return __rt6_get_peer(rt, 1);
 }
 
-static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
+static u32 *rt6_pcpu_cow_metrics(struct rt6_info *rt)
 {
-	struct rt6_info *rt = (struct rt6_info *) dst;
+	rt = (struct rt6_info *)rt->dst.from;
+	BUG_ON(rt->rt6i_flags & RTF_PCPU);
+	return dst_metrics_write_ptr(&rt->dst);
+}
+
+static u32 *rt6_cow_metrics(struct rt6_info *rt, unsigned long old)
+{
+	struct dst_entry *dst = &rt->dst;
 	struct inet_peer *peer;
 	u32 *p = NULL;
 
@@ -168,6 +175,16 @@ static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 	return p;
 }
 
+static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
+{
+	struct rt6_info *rt = (struct rt6_info *)dst;
+
+	if (rt->rt6i_flags & RTF_PCPU)
+		return rt6_pcpu_cow_metrics(rt);
+	else
+		return rt6_cow_metrics(rt, old);
+}
+
 static inline const void *choose_neigh_daddr(struct rt6_info *rt,
 					     struct sk_buff *skb,
 					     const void *daddr)
@@ -302,10 +319,10 @@ static const struct rt6_info ip6_blk_hole_entry_template = {
 #endif
 
 /* allocate dst with ip6_dst_ops */
-static inline struct rt6_info *ip6_dst_alloc(struct net *net,
-					     struct net_device *dev,
-					     int flags,
-					     struct fib6_table *table)
+static struct rt6_info *__ip6_dst_alloc(struct net *net,
+					struct net_device *dev,
+					int flags,
+					struct fib6_table *table)
 {
 	struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
 					0, DST_OBSOLETE_FORCE_CHK, flags);
@@ -320,6 +337,34 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 	return rt;
 }
 
+static struct rt6_info *ip6_dst_alloc(struct net *net,
+				      struct net_device *dev,
+				      int flags,
+				      struct fib6_table *table)
+{
+	struct rt6_info *rt = __ip6_dst_alloc(net, dev, flags, table);
+
+	if (rt) {
+		rt->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, GFP_ATOMIC);
+		if (rt->rt6i_pcpu) {
+			int cpu;
+
+			for_each_possible_cpu(cpu) {
+				struct rt6_info **p;
+
+				p = per_cpu_ptr(rt->rt6i_pcpu, cpu);
+				/* no one shares rt */
+				*p =  NULL;
+			}
+		} else {
+			dst_destroy((struct dst_entry *)rt);
+			return NULL;
+		}
+	}
+
+	return rt;
+}
+
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
@@ -337,6 +382,9 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	if (peer_metrics != dst->_metrics)
 		dst_destroy_metrics_generic(dst);
 
+	if (rt->rt6i_pcpu)
+		free_percpu(rt->rt6i_pcpu);
+
 	if (idev) {
 		rt->rt6i_idev = NULL;
 		in6_dev_put(idev);
@@ -925,11 +973,68 @@ static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
 	return rt;
 }
 
+static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt = __ip6_dst_alloc(dev_net(rt->dst.dev),
+						   rt->dst.dev, rt->dst.flags,
+						   rt->rt6i_table);
+
+	if (!pcpu_rt)
+		return NULL;
+	ip6_rt_copy_init(pcpu_rt, rt, NULL);
+	pcpu_rt->dst._metrics = (rt->dst._metrics | DST_METRICS_READ_ONLY);
+	rt6_set_from(pcpu_rt, rt);
+	pcpu_rt->rt6i_metric = rt->rt6i_metric;
+	pcpu_rt->rt6i_protocol = rt->rt6i_protocol;
+	pcpu_rt->rt6i_flags |= RTF_PCPU;
+	return pcpu_rt;
+}
+
+static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt, *orig, *prev, **p;
+	struct net *net = dev_net(rt->dst.dev);
+
+	if (rt->rt6i_flags & RTF_CACHE || rt == net->ipv6.ip6_null_entry)
+		goto done;
+
+	rcu_read_lock();
+	p = raw_cpu_ptr(rt->rt6i_pcpu);
+	orig = rcu_dereference_check(*p,
+				     lockdep_is_held(&rt->rt6i_table->tb6_lock));
+	if (orig &&
+	    dst_metrics_ptr(orig->dst.from) == dst_metrics_ptr(&orig->dst)) {
+		dst_hold(&orig->dst);
+		rcu_read_unlock();
+		return orig;
+	}
+	rcu_read_unlock();
+
+	pcpu_rt = ip6_rt_pcpu_alloc(rt);
+	if (!pcpu_rt) {
+		rt = net->ipv6.ip6_null_entry;
+		goto done;
+	}
+
+	prev = cmpxchg(p, orig, pcpu_rt);
+	if (prev == orig) {
+		if (orig)
+			call_rcu(&orig->dst.rcu_head, dst_rcu_free);
+	} else {
+		pcpu_rt->dst.flags |= DST_NOCACHE;
+	}
+	rt = pcpu_rt;
+
+done:
+	dst_hold(&rt->dst);
+	return rt;
+}
+
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
 	struct fib6_node *fn, *saved_fn;
-	struct rt6_info *rt;
+	struct rt6_info *rt, *pcpu_rt;
 	int strict = 0;
 
 	strict |= flags & RT6_LOOKUP_F_IFACE;
@@ -957,13 +1062,13 @@ redo_rt6_select:
 		}
 	}
 
-	dst_hold(&rt->dst);
+	pcpu_rt = rt6_get_pcpu_route(rt);
 	read_unlock_bh(&table->tb6_lock);
 
 	rt->dst.lastuse = jiffies;
 	rt->dst.__use++;
 
-	return rt;
+	return pcpu_rt;
 }
 
 static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *table,
@@ -1068,6 +1173,26 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
  *	Destination cache support functions
  */
 
+static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
+{
+	if (!rt->rt6i_node || rt->rt6i_node->fn_sernum != cookie)
+		return NULL;
+
+	if (rt6_check_expired(rt))
+		return NULL;
+
+	return &rt->dst;
+}
+
+static struct dst_entry *rt6_pcpu_check(struct rt6_info *rt, u32 cookie)
+{
+	if (rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
+	    dst_metrics_ptr(rt->dst.from) == dst_metrics_ptr(&rt->dst))
+		return rt6_check((struct rt6_info *)(rt->dst.from), cookie);
+	else
+		return NULL;
+}
+
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct rt6_info *rt;
@@ -1078,13 +1203,10 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 */
-	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
-		return NULL;
-
-	if (rt6_check_expired(rt))
-		return NULL;
-
-	return dst;
+	if (rt->rt6i_flags & RTF_PCPU)
+		return rt6_pcpu_check(rt, cookie);
+	else
+		return rt6_check(rt, cookie);
 }
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
@@ -1978,8 +2100,13 @@ static void ip6_rt_copy_init(struct rt6_info *rt,
 static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 					   const struct in6_addr *dest)
 {
-	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
-					    0, ort->rt6i_table);
+	struct rt6_info *rt;
+
+	if (ort->rt6i_flags & RTF_PCPU)
+		ort = (struct rt6_info *)ort->dst.from;
+
+	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+			     0, ort->rt6i_table);
 
 	if (!rt)
 		return NULL;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index dfcca70..e2e9576 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -99,8 +99,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 		dst_hold(dst);
 		sk->sk_rx_dst = dst;
 		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
-		if (rt->rt6i_node)
-			inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index f337a90..e818c61 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -84,7 +84,7 @@ static int xfrm6_init_path(struct xfrm_dst *path, struct dst_entry *dst,
 	if (dst->ops->family == AF_INET6) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_node)
-			path->path_cookie = rt->rt6i_node->fn_sernum;
+			path->path_cookie = rt6_get_cookie(rt);
 	}
 
 	path->u.rt6.rt6i_nfheader_len = nfheader_len;
@@ -115,7 +115,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.rt6.rt6i_metric = rt->rt6i_metric;
 	xdst->u.rt6.rt6i_node = rt->rt6i_node;
 	if (rt->rt6i_node)
-		xdst->route_cookie = rt->rt6i_node->fn_sernum;
+		xdst->route_cookie = rt6_get_cookie(rt);
 	xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway;
 	xdst->u.rt6.rt6i_dst = rt->rt6i_dst;
 	xdst->u.rt6.rt6i_src = rt->rt6i_src;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 38f8627..5eff9f6 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -435,7 +435,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 				goto err_unreach;
 			}
 			rt = (struct rt6_info *) dst;
-			cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+			cookie = rt6_get_cookie(rt);
 			__ip_vs_dst_set(dest, dest_dst, &rt->dst, cookie);
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n",
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 9fa13f6..d012834 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -331,7 +331,7 @@ out:
 
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
-		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+		t->dst_cookie = rt6_get_cookie(rt);
 		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
 			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
-- 
1.8.1


* Re: [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del()
  2015-04-11  1:54 ` [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
@ 2015-04-20 18:23   ` David Miller
  2015-04-20 19:33     ` Martin KaFai Lau
  0 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2015-04-20 18:23 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 10 Apr 2015 18:54:09 -0700

> Before patch 'Allow pmtu update on /128 via gateway route',
> RTF_CACHE route was not created for DST_HOST.  It also requires changes on both
> delete code path and rt6_select() code path.
> 
> This patch fixes the delete code path to avoid deleting the RTF_CACHE
> route by 'ip -6 r del...'
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

If a cached route was created in response to say a PMTU event, and
it's a clone/copy/cow of the route we are being asked to delete,
it absolutely should be removed.

In fact this is a critically important aspect of removing routes
from the table.

So this change does not seem correct.


* Re: [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception
  2015-04-11  1:54 ` [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
@ 2015-04-20 18:27   ` David Miller
  2015-04-20 18:28   ` David Miller
  1 sibling, 0 replies; 17+ messages in thread
From: David Miller @ 2015-04-20 18:27 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 10 Apr 2015 18:54:07 -0700

> @@ -1171,8 +1170,15 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
>  	fl6.flowlabel = ip6_flowinfo(iph);
>  
>  	dst = ip6_route_output(net, NULL, &fl6);
> -	if (!dst->error)
> +	if (!dst->error) {
> +		unsigned char *outer_network_header = skb_network_header(skb);
> +		int offset;
> +
> +		skb_reset_network_header(skb);
> +		offset = outer_network_header - skb_network_header(skb);
>  		ip6_rt_update_pmtu(dst, NULL, skb, ntohl(mtu));
> +		skb_set_network_header(skb, offset);
> +	}

I seriously object to adjusting then restoring the location of the SKB
network header in this kind of code path.

Instead, adjust the interfaces to the code doing the packet header
inspection so that it can accommodate an offset or something like that.


* Re: [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception
  2015-04-11  1:54 ` [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
  2015-04-20 18:27   ` David Miller
@ 2015-04-20 18:28   ` David Miller
  1 sibling, 0 replies; 17+ messages in thread
From: David Miller @ 2015-04-20 18:28 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 10 Apr 2015 18:54:07 -0700

> +	if (!(rt6->rt6i_flags & RTF_CACHE) &&
> +	    (!(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
> +	     !(rt6->dst.flags & DST_HOST))) {

These big convoluted tests are tiring to read over and over again.

At the very least, "(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))"
deserves to be a descriptively named inline function.


* Re: [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (9 preceding siblings ...)
  2015-04-11  1:54 ` [RFC PATCH 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
@ 2015-04-20 18:29 ` David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2015-04-20 18:29 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 10 Apr 2015 18:54:03 -0700

> This series is to avoid creating a RTF_CACHE route whenever we are consulting
> the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
> when we see a pmtu exception.

Please separate out the pure bug fixes from this series and submit them for
inclusion into 'net', thanks.


* Re: [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del()
  2015-04-20 18:23   ` David Miller
@ 2015-04-20 19:33     ` Martin KaFai Lau
  2015-04-20 19:37       ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2015-04-20 19:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, hannes, kernel-team

On Mon, Apr 20, 2015 at 02:23:05PM -0400, David Miller wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Fri, 10 Apr 2015 18:54:09 -0700
> 
> > Before patch 'Allow pmtu update on /128 via gateway route',
> > RTF_CACHE route was not created for DST_HOST.  It also requires changes on both
> > delete code path and rt6_select() code path.
> > 
> > This patch fixes the delete code path to avoid deleting the RTF_CACHE
> > route by 'ip -6 r del...'
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> If a cached route was created in response to say a PMTU event, and
> it's a clone/copy/cow of the route we are being asked to delete,
> it absolutely should be removed.
> 
> In fact this is a critically important aspect of removing routes
> from the table.
When a non-clone route is removed, its clones are removed together with it
by fib6_prune_clones() in fib6_del().

Hence, 'ip -6 r del' will remove a route and its clones.
'ip -6 r flush table cache' will only remove RTF_CACHE routes.

I will fix up the commit message.

Thanks,
--Martin


* Re: [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del()
  2015-04-20 19:33     ` Martin KaFai Lau
@ 2015-04-20 19:37       ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2015-04-20 19:37 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Mon, 20 Apr 2015 12:33:05 -0700

> On Mon, Apr 20, 2015 at 02:23:05PM -0400, David Miller wrote:
>> From: Martin KaFai Lau <kafai@fb.com>
>> Date: Fri, 10 Apr 2015 18:54:09 -0700
>> 
>> > Before patch 'Allow pmtu update on /128 via gateway route',
>> > RTF_CACHE route was not created for DST_HOST.  It also requires changes on both
>> > delete code path and rt6_select() code path.
>> > 
>> > This patch fixes the delete code path to avoid deleting the RTF_CACHE
>> > route by 'ip -6 r del...'
>> > 
>> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> > Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> 
>> If a cached route was created in response to say a PMTU event, and
>> it's a clone/copy/cow of the route we are being asked to delete,
>> it absolutely should be removed.
>> 
>> In fact this is a critically important aspect of removing routes
>> from the table.
> When a non-clone route is removed, its clones are removed together with it
> by fib6_prune_clones() in fib6_del().
> 
> Hence, 'ip -6 r del' will remove a route and its clones.
> 'ip -6 r flush table cache' will only remove RTF_CACHE routes.
> 
> I will fix up the commit message.

Ok, thanks.


end of thread, other threads:[~2015-04-20 19:37 UTC | newest]

Thread overview: 17+ messages
2015-04-11  1:54 [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
2015-04-20 18:27   ` David Miller
2015-04-20 18:28   ` David Miller
2015-04-11  1:54 ` [RFC PATCH 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
2015-04-20 18:23   ` David Miller
2015-04-20 19:33     ` Martin KaFai Lau
2015-04-20 19:37       ` David Miller
2015-04-11  1:54 ` [RFC PATCH 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
2015-04-11  1:54 ` [RFC PATCH 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
2015-04-20 18:29 ` [RFC PATCH 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
