netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception
@ 2015-04-11  1:59 Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

[Just a resend of the last one, with the net-next tag]

Hi,

This series avoids creating a RTF_CACHE route whenever we consult the fib6
tree with a new destination.  Instead, a RTF_CACHE route is created only
when we see a pmtu exception.

Out of all the ipv6 RTF_CACHE routes that are created, the percentage that
has a different mtu is very small.  On one of our end-user facing proxy
servers, only 1k out of 80k RTF_CACHE routes have a smaller MTU.  For our DC
traffic, there is no mtu exception.

A large fib6 tree has problems: 'ip -6 r show' takes a long time, and gc may
kick in too often.  Also, when a service has restarted and a lot of new TCP
conn requests come in, it creates pressure on the tree by inserting a lot of
RTF_CACHE routes in a short time, and doing that currently requires a write
lock.

The first few patches are prep work to remove the assumption that the
returned rt is always RTF_CACHE.

The patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception'
does the lazy RTF_CACHE route creation.
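
As a rough user-space illustration of the lazy idea (not the kernel code
in this series), the sketch below keeps one shared MTU and only records a
per-destination entry when a smaller path MTU is reported.  All demo_*
names, the fixed-size table and its capacity are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_EXCEPTIONS 8	/* hypothetical capacity, for the sketch only */

struct demo_exception {
	bool     used;
	uint8_t  daddr[16];	/* destination this exception applies to */
	uint32_t mtu;		/* path MTU learned for that destination */
};

struct demo_table {
	uint32_t shared_mtu;	/* MTU of the shared, non-cloned route */
	struct demo_exception exc[DEMO_MAX_EXCEPTIONS];
	unsigned int nr_exceptions;
};

/* Find the exception entry for daddr, or NULL if none was recorded. */
static struct demo_exception *demo_find(struct demo_table *t,
					const uint8_t *daddr)
{
	for (size_t i = 0; i < DEMO_MAX_EXCEPTIONS; i++)
		if (t->exc[i].used && memcmp(t->exc[i].daddr, daddr, 16) == 0)
			return &t->exc[i];
	return NULL;
}

/* A plain lookup never allocates anything. */
static uint32_t demo_lookup_mtu(struct demo_table *t, const uint8_t *daddr)
{
	struct demo_exception *e = demo_find(t, daddr);

	return e ? e->mtu : t->shared_mtu;
}

/* Only a pmtu exception creates (or updates) a per-destination entry. */
static bool demo_pmtu_exception(struct demo_table *t, const uint8_t *daddr,
				uint32_t mtu)
{
	struct demo_exception *e;

	if (mtu >= demo_lookup_mtu(t, daddr))
		return false;		/* not smaller: nothing to record */

	e = demo_find(t, daddr);
	if (!e) {
		for (size_t i = 0; i < DEMO_MAX_EXCEPTIONS && !e; i++)
			if (!t->exc[i].used)
				e = &t->exc[i];
		if (!e)
			return false;	/* table full */
		e->used = true;
		memcpy(e->daddr, daddr, 16);
		t->nr_exceptions++;
	}
	e->mtu = mtu;
	return true;
}
```

In this model, looking up any number of new destinations leaves
nr_exceptions at zero; only demo_pmtu_exception() grows the table.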

The next few patches fix the /128 via gateway route issue.  One of them
is by "Steffen Klassert <steffen.klassert@secunet.com>", which I picked up
from netdev.

The last two patches add a percpu rt to compensate for the performance loss
after doing the RTF_CACHE lazy creation.

Here are some numbers from the udpflood test.  The udpflood has been
slightly modified to have a time limit instead of a count limit.

A /64 via gateway route is used for the test. Each udpflood uses 10000 dst
addresses.  The dst addresses of different udpflood processes do not overlap
with each other.

# of udpflood        # of trans (patched)        # of trans (upstream)

1                    16M                          15M
10                   61M                          61M
20                   65M                          62M
40                   88M                          83M
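
To read the table as relative changes, a small helper like the one below
can be used.  It is only a sketch; demo_pct_change is a hypothetical name,
and since the table is rounded to millions the results are approximate
(about +6.7% for 1 process, 0% for 10, +4.8% for 20 and +6.0% for 40).

```c
/* Relative change of the patched count against upstream, in percent. */
static double demo_pct_change(double patched, double upstream)
{
	return (patched - upstream) / upstream * 100.0;
}
```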


Many thanks to "Hannes Frederic Sowa <hannes@stressinduktion.org>" for
reviewing the patches and giving advice.

--Martin


* [RFC PATCH net-next 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch removes the assumption that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address.  The dst and src can be recovered from the
calling site.

We may consider renaming (rt6i_dst, rt6i_src) to (rt6i_key_dst, rt6i_key_src)
later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/scsi/cxgbi/libcxgbi.c   |  2 +-
 include/net/ipv6.h              |  3 ++-
 net/ipv6/icmp.c                 |  2 +-
 net/ipv6/ip6_output.c           | 22 +++++++++++-----------
 net/ipv6/ndisc.c                |  2 +-
 net/ipv6/output_core.c          |  9 +++++----
 net/ipv6/tcp_ipv6.c             |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |  4 ++--
 net/sctp/ipv6.c                 |  3 ++-
 9 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index eb58afc..45d3039 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -728,7 +728,7 @@ static struct cxgbi_sock *cxgbi_check_route6(struct sockaddr *dst_addr)
 	}
 	ndev = n->dev;
 
-	if (ipv6_addr_is_multicast(&rt->rt6i_dst.addr)) {
+	if (ipv6_addr_is_multicast(&daddr6->sin6_addr)) {
 		pr_info("multi-cast route %pI6 port %u, dev %s.\n",
 			daddr6->sin6_addr.s6_addr,
 			ntohs(daddr6->sin6_port), ndev->name);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index eec8ad3..a0890d6 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -670,7 +670,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add
 }
 
 void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt);
+		       const struct in6_addr *daddr,
+		       const struct in6_addr *saddr);
 void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb);
 
 int ip6_dst_hoplimit(struct dst_entry *dst);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 2c2b5d5..24b359d 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -207,7 +207,7 @@ static bool icmpv6_xrlim_allow(struct sock *sk, u8 type,
 			struct inet_peer *peer;
 
 			peer = inet_getpeer_v6(net->ipv6.peers,
-					       &rt->rt6i_dst.addr, 1);
+					       &fl6->daddr, 1);
 			res = inet_peer_xrlim_allow(peer, tmo);
 			if (peer)
 				inet_putpeer(peer);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..b987fbf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -459,7 +459,7 @@ int ip6_forward(struct sk_buff *skb)
 		else
 			target = &hdr->daddr;
 
-		peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+		peer = inet_getpeer_v6(net->ipv6.peers, &hdr->daddr, 1);
 
 		/* Limit redirects both by destination (here)
 		   and by source (inside ndisc_send_redirect)
@@ -549,6 +549,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 				inet6_sk(skb->sk) : NULL;
 	struct ipv6hdr *tmp_hdr;
 	struct frag_hdr *fh;
+	struct frag_hdr tmp_fh;
 	unsigned int mtu, hlen, left, len;
 	int hroom, troom;
 	__be32 frag_id = 0;
@@ -584,6 +585,10 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	}
 	mtu -= hlen + sizeof(struct frag_hdr);
 
+	ipv6_select_ident(net, &tmp_fh, &ipv6_hdr(skb)->daddr,
+			  &ipv6_hdr(skb)->saddr);
+	frag_id = tmp_fh.identification;
+
 	if (skb_has_frag_list(skb)) {
 		int first_len = skb_pagelen(skb);
 		struct sk_buff *frag2;
@@ -632,11 +637,10 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 		skb_reset_network_header(skb);
 		memcpy(skb_network_header(skb), tmp_hdr, hlen);
 
-		ipv6_select_ident(net, fh, rt);
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
 		fh->frag_off = htons(IP6_MF);
-		frag_id = fh->identification;
+		fh->identification = frag_id;
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
@@ -778,11 +782,7 @@ slow_path:
 		 */
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
-		if (!frag_id) {
-			ipv6_select_ident(net, fh, rt);
-			frag_id = fh->identification;
-		} else
-			fh->identification = frag_id;
+		fh->identification = frag_id;
 
 		/*
 		 *	Copy a block of the IP datagram.
@@ -1037,7 +1037,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 			int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
 			int transhdrlen, int mtu, unsigned int flags,
-			struct rt6_info *rt)
+			const struct flowi6 *fl6)
 
 {
 	struct sk_buff *skb;
@@ -1083,7 +1083,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 	skb_shinfo(skb)->gso_size = (mtu - fragheaderlen -
 				     sizeof(struct frag_hdr)) & ~7;
 	skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-	ipv6_select_ident(sock_net(sk), &fhdr, rt);
+	ipv6_select_ident(sock_net(sk), &fhdr, &fl6->daddr, &fl6->saddr);
 	skb_shinfo(skb)->ip6_frag_id = fhdr.identification;
 
 append:
@@ -1307,7 +1307,7 @@ emsgsize:
 	    (sk->sk_type == SOCK_DGRAM)) {
 		err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
 					  hh_len, fragheaderlen,
-					  transhdrlen, mtu, flags, rt);
+					  transhdrlen, mtu, flags, fl6);
 		if (err)
 			goto error;
 		return 0;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 96f153c..0a05b35 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1506,7 +1506,7 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 			  "Redirect: destination is not a neighbour\n");
 		goto release;
 	}
-	peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+	peer = inet_getpeer_v6(net->ipv6.peers, &ipv6_hdr(skb)->saddr, 1);
 	ret = inet_peer_xrlim_allow(peer, 1*HZ);
 	if (peer)
 		inet_putpeer(peer);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 85892af..f37cfa9 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -10,7 +10,8 @@
 #include <net/secure_seq.h>
 
 static u32 __ipv6_select_ident(struct net *net, u32 hashrnd,
-			       struct in6_addr *dst, struct in6_addr *src)
+			       const struct in6_addr *dst,
+			       const struct in6_addr *src)
 {
 	u32 hash, id;
 
@@ -61,15 +62,15 @@ void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb)
 EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
 
 void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt)
+		       const struct in6_addr *daddr,
+		       const struct in6_addr *saddr)
 {
 	static u32 ip6_idents_hashrnd __read_mostly;
 	u32 id;
 
 	net_get_random_once(&ip6_idents_hashrnd, sizeof(ip6_idents_hashrnd));
 
-	id = __ipv6_select_ident(net, ip6_idents_hashrnd, &rt->rt6i_dst.addr,
-				 &rt->rt6i_src.addr);
+	id = __ipv6_select_ident(net, ip6_idents_hashrnd, daddr, saddr);
 	fhdr->identification = htonl(id);
 }
 EXPORT_SYMBOL(ipv6_select_ident);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f73a97f..dfcca70 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -262,7 +262,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	rt = (struct rt6_info *) dst;
 	if (tcp_death_row.sysctl_tw_recycle &&
 	    !tp->rx_opt.ts_recent_stamp &&
-	    ipv6_addr_equal(&rt->rt6i_dst.addr, &sk->sk_v6_daddr))
+	    ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr))
 		tcp_fetch_timewait_stamp(sk, dst);
 
 	icsk->icsk_ext_hdr_len = 0;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 19986ec..38f8627 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -781,7 +781,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG_RL_PKT(1, AF_INET6, pp, skb, 0,
 				 "ip_vs_nat_xmit_v6(): "
 				 "stopping DNAT to loopback address");
@@ -1346,7 +1346,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG(1, "%s(): "
 			  "stopping DNAT to loopback %pI6\n",
 			  __func__, &cp->daddr.in6);
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 0e4198e..9fa13f6 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -332,7 +332,8 @@ out:
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
 		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
-		pr_debug("rt6_dst:%pI6 rt6_src:%pI6\n", &rt->rt6i_dst.addr,
+		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
+			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
 	} else {
 		t->dst = NULL;
-- 
1.8.1


* [RFC PATCH net-next 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.

After removing the rt6i_dst and rt6i_src dependency in the last patch, we also
need to stop the caller from depending on rt6i_gateway and RTF_ANYCAST.
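
The two helpers changed below can be modelled in user space roughly as
follows.  This is a simplified sketch of the logic in the diff, not the
kernel code: the demo_* types and flag values are placeholders chosen for
the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholder flag bits for this sketch; not the kernel's values. */
#define DEMO_RTF_GATEWAY 0x1u
#define DEMO_RTF_ANYCAST 0x2u

struct demo_addr {
	uint8_t b[16];
};

struct demo_rt {
	uint32_t         flags;
	struct demo_addr dst;		/* key address of the fib entry */
	int              plen;		/* prefix length of the key */
	struct demo_addr gateway;
};

static bool demo_addr_equal(const struct demo_addr *a,
			    const struct demo_addr *b)
{
	return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}

/* Nexthop is the gateway for a gateway route, else the destination itself. */
static const struct demo_addr *demo_nexthop(const struct demo_rt *rt,
					    const struct demo_addr *daddr)
{
	return (rt->flags & DEMO_RTF_GATEWAY) ? &rt->gateway : daddr;
}

/* Anycast if flagged, or if daddr equals the key of a non-/128 prefix. */
static bool demo_anycast_destination(const struct demo_rt *rt,
				     const struct demo_addr *daddr)
{
	return (rt->flags & DEMO_RTF_ANYCAST) ||
	       (rt->plen != 128 && demo_addr_equal(&rt->dst, daddr));
}
```

The point of the sketch is that both answers now depend on the daddr
supplied by the caller instead of on a per-destination copy of the route.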

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_route.h                | 14 +++++++++-----
 net/bluetooth/6lowpan.c                |  2 +-
 net/ipv6/icmp.c                        |  4 ++--
 net/ipv6/ip6_output.c                  |  5 +++--
 net/ipv6/route.c                       |  6 +-----
 net/netfilter/nf_conntrack_h323_main.c |  4 ++--
 net/netfilter/xt_addrtype.c            |  2 +-
 7 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 5e19206..0e4d170 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -163,11 +163,14 @@ static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 	return rt->rt6i_flags & RTF_LOCAL;
 }
 
-static inline bool ipv6_anycast_destination(const struct sk_buff *skb)
+static inline bool ipv6_anycast_destination(const struct dst_entry *dst,
+					    const struct in6_addr *daddr)
 {
-	struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
+	struct rt6_info *rt = (struct rt6_info *)dst;
 
-	return rt->rt6i_flags & RTF_ANYCAST;
+	return rt->rt6i_flags & RTF_ANYCAST ||
+		(rt->rt6i_dst.plen != 128 &&
+		 ipv6_addr_equal(&rt->rt6i_dst.addr, daddr));
 }
 
 int ip6_fragment(struct sock *sk, struct sk_buff *skb,
@@ -194,9 +197,10 @@ static inline bool ip6_sk_ignore_df(const struct sock *sk)
 	       inet6_sk(sk)->pmtudisc == IPV6_PMTUDISC_OMIT;
 }
 
-static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt)
+static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt,
+					   struct in6_addr *daddr)
 {
-	return &rt->rt6i_gateway;
+	return (rt->rt6i_flags & RTF_GATEWAY) ? &rt->rt6i_gateway : daddr;
 }
 
 #endif
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 1742b84..f3d6046 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -192,7 +192,7 @@ static inline struct lowpan_peer *peer_lookup_dst(struct lowpan_dev *dev,
 		if (ipv6_addr_any(nexthop))
 			return NULL;
 	} else {
-		nexthop = rt6_nexthop(rt);
+		nexthop = rt6_nexthop(rt, daddr);
 
 		/* We need to remember the address because it is needed
 		 * by bt_xmit() when sending the packet. In bt_xmit(), the
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 24b359d..713d743 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -337,7 +337,7 @@ static struct dst_entry *icmpv6_route_lookup(struct net *net,
 	 * We won't send icmp if the destination is known
 	 * anycast.
 	 */
-	if (((struct rt6_info *)dst)->rt6i_flags & RTF_ANYCAST) {
+	if (ipv6_anycast_destination(dst, &fl6->daddr)) {
 		net_dbg_ratelimited("icmp6_send: acast source\n");
 		dst_release(dst);
 		return ERR_PTR(-EINVAL);
@@ -564,7 +564,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 
 	if (!ipv6_unicast_destination(skb) &&
 	    !(net->ipv6.sysctl.anycast_src_echo_reply &&
-	      ipv6_anycast_destination(skb)))
+	      ipv6_anycast_destination(skb_dst(skb), saddr)))
 		saddr = NULL;
 
 	memcpy(&tmp_hdr, icmph, sizeof(tmp_hdr));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b987fbf..e58e402 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -105,7 +105,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
 	if (unlikely(!neigh))
 		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
@@ -913,7 +913,8 @@ static int ip6_dst_lookup_tail(struct sock *sk,
 	 */
 	rt = (struct rt6_info *) *dst;
 	rcu_read_lock_bh();
-	n = __ipv6_neigh_lookup_noref(rt->dst.dev, rt6_nexthop(rt));
+	n = __ipv6_neigh_lookup_noref(rt->dst.dev,
+				      rt6_nexthop(rt, &fl6->daddr));
 	err = n && !(n->nud_state & NUD_VALID) ? -EINVAL : 0;
 	rcu_read_unlock_bh();
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5c48293..0ccb8ec 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1946,11 +1946,7 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 		if (rt->rt6i_idev)
 			in6_dev_hold(rt->rt6i_idev);
 		rt->dst.lastuse = jiffies;
-
-		if (ort->rt6i_flags & RTF_GATEWAY)
-			rt->rt6i_gateway = ort->rt6i_gateway;
-		else
-			rt->rt6i_gateway = *dest;
+		rt->rt6i_gateway = ort->rt6i_gateway;
 		rt->rt6i_flags = ort->rt6i_flags;
 		rt6_set_from(rt, ort);
 		rt->rt6i_metric = 0;
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 1d69f5b..9511af0 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -779,8 +779,8 @@ static int callforward_do_filter(struct net *net,
 				   flowi6_to_flowi(&fl1), false)) {
 			if (!afinfo->route(net, (struct dst_entry **)&rt2,
 					   flowi6_to_flowi(&fl2), false)) {
-				if (ipv6_addr_equal(rt6_nexthop(rt1),
-						    rt6_nexthop(rt2)) &&
+				if (ipv6_addr_equal(rt6_nexthop(rt1, &fl1.daddr),
+						    rt6_nexthop(rt2, &fl2.daddr)) &&
 				    rt1->dst.dev == rt2->dst.dev)
 					ret = 1;
 				dst_release(&rt2->dst);
diff --git a/net/netfilter/xt_addrtype.c b/net/netfilter/xt_addrtype.c
index fab6eea..5b4743c 100644
--- a/net/netfilter/xt_addrtype.c
+++ b/net/netfilter/xt_addrtype.c
@@ -73,7 +73,7 @@ static u32 match_lookup_rt6(struct net *net, const struct net_device *dev,
 
 	if (dev == NULL && rt->rt6i_flags & RTF_LOCAL)
 		ret |= XT_ADDRTYPE_LOCAL;
-	if (rt->rt6i_flags & RTF_ANYCAST)
+	if (ipv6_anycast_destination((struct dst_entry *)rt, addr))
 		ret |= XT_ADDRTYPE_ANYCAST;
 
 	dst_release(&rt->dst);
-- 
1.8.1


* [RFC PATCH net-next 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Prep work for creating RTF_CACHE on exception only.  After this
patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))
is checked twice.  This redundancy will be removed in a later patch.
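
The equivalence of the old two-branch form and the combined form can be
checked with a small model.  This is a sketch only; the demo_* names and
flag bits are hypothetical and stand in for the rt6i_flags and dst.flags
tests in the diff.

```c
#include <stdint.h>

/* Placeholder bits for this sketch; not the kernel's values. */
#define DEMO_RTF_NONEXTHOP 0x1u
#define DEMO_RTF_GATEWAY   0x2u
#define DEMO_DST_HOST      0x1u

enum demo_kind { DEMO_NONE, DEMO_CLONE, DEMO_COW };

/* Before: the caller picks between two allocation helpers. */
static enum demo_kind demo_old(uint32_t rt_flags, uint32_t dst_flags)
{
	if (!(rt_flags & (DEMO_RTF_NONEXTHOP | DEMO_RTF_GATEWAY)))
		return DEMO_COW;
	else if (!(dst_flags & DEMO_DST_HOST))
		return DEMO_CLONE;
	return DEMO_NONE;
}

/* After: one helper, which re-tests the first condition internally. */
static enum demo_kind demo_new(uint32_t rt_flags, uint32_t dst_flags)
{
	if (!(rt_flags & (DEMO_RTF_NONEXTHOP | DEMO_RTF_GATEWAY)) ||
	    !(dst_flags & DEMO_DST_HOST)) {
		/* second check of the same condition (the redundancy) */
		if (!(rt_flags & (DEMO_RTF_NONEXTHOP | DEMO_RTF_GATEWAY)))
			return DEMO_COW;
		return DEMO_CLONE;
	}
	return DEMO_NONE;
}
```

Both functions return the same kind for every combination of the three
bits, which is why the merge does not change behaviour.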

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0ccb8ec..f753a67 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -872,9 +872,9 @@ int ip6_ins_rt(struct rt6_info *rt)
 	return __ip6_ins_rt(rt, &info, &mxc);
 }
 
-static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
-				      const struct in6_addr *daddr,
-				      const struct in6_addr *saddr)
+static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
+						const struct in6_addr *daddr,
+						const struct in6_addr *saddr)
 {
 	struct rt6_info *rt;
 
@@ -885,33 +885,24 @@ static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
 	rt = ip6_rt_copy(ort, daddr);
 
 	if (rt) {
-		if (ort->rt6i_dst.plen != 128 &&
-		    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
-			rt->rt6i_flags |= RTF_ANYCAST;
-
 		rt->rt6i_flags |= RTF_CACHE;
 
+		if (!(ort->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))) {
+			if (ort->rt6i_dst.plen != 128 &&
+			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
+				rt->rt6i_flags |= RTF_ANYCAST;
 #ifdef CONFIG_IPV6_SUBTREES
-		if (rt->rt6i_src.plen && saddr) {
-			rt->rt6i_src.addr = *saddr;
-			rt->rt6i_src.plen = 128;
-		}
+			if (rt->rt6i_src.plen && saddr) {
+				rt->rt6i_src.addr = *saddr;
+				rt->rt6i_src.plen = 128;
+			}
 #endif
+		}
 	}
 
 	return rt;
 }
 
-static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort,
-					const struct in6_addr *daddr)
-{
-	struct rt6_info *rt = ip6_rt_copy(ort, daddr);
-
-	if (rt)
-		rt->rt6i_flags |= RTF_CACHE;
-	return rt;
-}
-
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
@@ -957,10 +948,9 @@ redo_rt6_select:
 	if (rt->rt6i_flags & RTF_CACHE)
 		goto out2;
 
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)))
-		nrt = rt6_alloc_cow(rt, &fl6->daddr, &fl6->saddr);
-	else if (!(rt->dst.flags & DST_HOST))
-		nrt = rt6_alloc_clone(rt, &fl6->daddr);
+	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
+	    !(rt->dst.flags & DST_HOST))
+		nrt = ip6_pmtu_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
 	else
 		goto out2;
 
-- 
1.8.1


* [RFC PATCH net-next 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch creates a RTF_CACHE route only after encountering a pmtu exception.

After ip6_rt_update_pmtu() has inserted the RTF_CACHE route into the fib6
tree, the rt->rt6i_node->fn_sernum will be bumped, which fails the
ip6_dst_check() and triggers a relookup.
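
The serial-number invalidation described above can be pictured with the
following user-space sketch.  It only models the cookie comparison; the
demo_* names are hypothetical and the real check in the kernel does more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_node {
	uint32_t sernum;		/* bumped whenever the node changes */
};

struct demo_cached_dst {
	const struct demo_node *node;	/* node the cached route hangs off */
	uint32_t cookie;		/* sernum seen when the dst was cached */
};

/* Remember the route together with the serial number seen at that time. */
static void demo_cache_dst(struct demo_cached_dst *c,
			   const struct demo_node *n)
{
	c->node = n;
	c->cookie = n->sernum;
}

/* The cached dst is still usable only if the serial number is unchanged. */
static bool demo_dst_check(const struct demo_cached_dst *c)
{
	return c->node != NULL && c->node->sernum == c->cookie;
}

/* Inserting an exception route under the node bumps its serial number. */
static void demo_insert_exception(struct demo_node *n)
{
	n->sernum++;
}
```

A holder of a stale cookie fails demo_dst_check() after the insert and has
to look the route up again, at which point it can find the new entry.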

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 92 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 43 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f753a67..1b57bc9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -907,16 +907,13 @@ static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 				      struct flowi6 *fl6, int flags)
 {
 	struct fib6_node *fn, *saved_fn;
-	struct rt6_info *rt, *nrt;
+	struct rt6_info *rt;
 	int strict = 0;
-	int attempts = 3;
-	int err;
 
 	strict |= flags & RT6_LOOKUP_F_IFACE;
 	if (net->ipv6.devconf_all->forwarding == 0)
 		strict |= RT6_LOOKUP_F_REACHABLE;
 
-redo_fib6_lookup_lock:
 	read_lock_bh(&table->tb6_lock);
 
 	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
@@ -935,46 +932,12 @@ redo_rt6_select:
 			strict &= ~RT6_LOOKUP_F_REACHABLE;
 			fn = saved_fn;
 			goto redo_rt6_select;
-		} else {
-			dst_hold(&rt->dst);
-			read_unlock_bh(&table->tb6_lock);
-			goto out2;
 		}
 	}
 
 	dst_hold(&rt->dst);
 	read_unlock_bh(&table->tb6_lock);
 
-	if (rt->rt6i_flags & RTF_CACHE)
-		goto out2;
-
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
-	    !(rt->dst.flags & DST_HOST))
-		nrt = ip6_pmtu_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
-	else
-		goto out2;
-
-	ip6_rt_put(rt);
-	rt = nrt ? : net->ipv6.ip6_null_entry;
-
-	dst_hold(&rt->dst);
-	if (nrt) {
-		err = ip6_ins_rt(nrt);
-		if (!err)
-			goto out2;
-	}
-
-	if (--attempts <= 0)
-		goto out2;
-
-	/*
-	 * Race condition! In the gap, when table->tb6_lock was
-	 * released someone could insert this route.  Relookup.
-	 */
-	ip6_rt_put(rt);
-	goto redo_fib6_lookup_lock;
-
-out2:
 	rt->dst.lastuse = jiffies;
 	rt->dst.__use++;
 
@@ -1144,13 +1107,49 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
 
 	dst_confirm(dst);
-	if (mtu < dst_mtu(dst) && rt6->rt6i_dst.plen == 128) {
+	mtu = max_t(u32, mtu, IPV6_MIN_MTU);
+	if (mtu >= dst_mtu(dst))
+		return;
+
+	if (!(rt6->rt6i_flags & RTF_CACHE) &&
+	    (!(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
+	     !(rt6->dst.flags & DST_HOST))) {
+		const struct in6_addr *daddr, *saddr;
+		struct rt6_info *nrt6;
+
+		if (skb) {
+			const struct ipv6hdr *iph = ipv6_hdr(skb);
+
+			daddr = &iph->daddr;
+			saddr = &iph->saddr;
+		} else if (sk) {
+			daddr = &sk->sk_v6_daddr;
+			saddr = &inet6_sk(sk)->saddr;
+		} else {
+			return;
+		}
+		nrt6 = ip6_pmtu_rt_cache_alloc(rt6, daddr, saddr);
+		if (!nrt6)
+			return;
+		/* ip6_ins_rt(nrt6) will bump the rt6->rt6i_node->fn_sernum
+		 * which will fail the next rt6_check() and invalidate the
+		 * sk->sk_dst_cache.
+		 */
+		if (ip6_ins_rt(nrt6)) {
+			dst_destroy(&nrt6->dst);
+			return;
+		}
+
+		rt6 = nrt6;
+		dst = &nrt6->dst;
+	} else {
+		rt6 = (struct rt6_info *)dst;
+	}
+
+	if (rt6->rt6i_dst.plen == 128) {
 		struct net *net = dev_net(dst->dev);
 
 		rt6->rt6i_flags |= RTF_MODIFIED;
-		if (mtu < IPV6_MIN_MTU)
-			mtu = IPV6_MIN_MTU;
-
 		dst_metric_set(dst, RTAX_MTU, mtu);
 		rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
 	}
@@ -1171,8 +1170,15 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
 	fl6.flowlabel = ip6_flowinfo(iph);
 
 	dst = ip6_route_output(net, NULL, &fl6);
-	if (!dst->error)
+	if (!dst->error) {
+		unsigned char *outer_network_header = skb_network_header(skb);
+		int offset;
+
+		skb_reset_network_header(skb);
+		offset = outer_network_header - skb_network_header(skb);
 		ip6_rt_update_pmtu(dst, NULL, skb, ntohl(mtu));
+		skb_set_network_header(skb, offset);
+	}
 	dst_release(dst);
 }
 EXPORT_SYMBOL_GPL(ip6_update_pmtu);
-- 
1.8.1


* [RFC PATCH net-next 05/10] ipv6: Allow pmtu update on /128 via gateway route
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Consider a permanent /128 via gateway route (DST_HOST) in the route
table.  When there is a pmtu update, the DST_HOST route's pmtu is
updated and RTF_EXPIRES is set, so the permanent DST_HOST route will be
removed after expiration.

While we are at it, the patch also simplifies some of the checks in
ip6_rt_update_pmtu().

1. !(rt6->rt6i_flags & RTF_CACHE) is used to decide when
a RTF_CACHE route needs to be created for pmtu update.

2. Remove the rt6->rt6i_dst.plen == 128 check since RTF_CACHE route will
be created (if it is needed) before updating the mtu.

3. Add a check to ensure no pmtu update on RTF_LOCAL route
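
The decision order that results from the three points above can be
summarised as the sketch below.  It is a simplified model of the control
flow in the diff, with hypothetical demo_* names and placeholder flag
bits; 1280 is the IPv6 minimum MTU that IPV6_MIN_MTU stands for.

```c
#include <stdint.h>

#define DEMO_IPV6_MIN_MTU 1280u	/* IPv6 minimum link MTU */
#define DEMO_RTF_LOCAL    0x1u	/* placeholder bit, not the kernel's value */
#define DEMO_RTF_CACHE    0x2u	/* placeholder bit, not the kernel's value */

enum demo_action {
	DEMO_IGNORE,			/* leave the route untouched */
	DEMO_UPDATE_IN_PLACE,		/* route is already RTF_CACHE */
	DEMO_CLONE_THEN_UPDATE,		/* create a RTF_CACHE route first */
};

/*
 * Decide what a pmtu report should do.  *mtu is clamped to the minimum
 * before it is compared with the current MTU of the route.
 */
static enum demo_action demo_pmtu_action(uint32_t rt_flags, uint32_t cur_mtu,
					 uint32_t *mtu)
{
	if (rt_flags & DEMO_RTF_LOCAL)
		return DEMO_IGNORE;

	if (*mtu < DEMO_IPV6_MIN_MTU)
		*mtu = DEMO_IPV6_MIN_MTU;
	if (*mtu >= cur_mtu)
		return DEMO_IGNORE;

	return (rt_flags & DEMO_RTF_CACHE) ? DEMO_UPDATE_IN_PLACE
					   : DEMO_CLONE_THEN_UPDATE;
}
```

Note that the prefix length no longer appears in the decision: any route
that is not already RTF_CACHE gets a clone, including a /128 via gateway.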

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1b57bc9..75f3b5d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1105,15 +1105,17 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 			       struct sk_buff *skb, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
+	struct net *net;
+
+	if (rt6->rt6i_flags & RTF_LOCAL)
+		return;
 
 	dst_confirm(dst);
 	mtu = max_t(u32, mtu, IPV6_MIN_MTU);
 	if (mtu >= dst_mtu(dst))
 		return;
 
-	if (!(rt6->rt6i_flags & RTF_CACHE) &&
-	    (!(rt6->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) ||
-	     !(rt6->dst.flags & DST_HOST))) {
+	if (!(rt6->rt6i_flags & RTF_CACHE)) {
 		const struct in6_addr *daddr, *saddr;
 		struct rt6_info *nrt6;
 
@@ -1146,13 +1148,10 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 		rt6 = (struct rt6_info *)dst;
 	}
 
-	if (rt6->rt6i_dst.plen == 128) {
-		struct net *net = dev_net(dst->dev);
-
-		rt6->rt6i_flags |= RTF_MODIFIED;
-		dst_metric_set(dst, RTAX_MTU, mtu);
-		rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
-	}
+	net = dev_net(rt6->dst.dev);
+	rt6->rt6i_flags |= RTF_MODIFIED;
+	dst_metric_set(dst, RTAX_MTU, mtu);
+	rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
 }
 
 void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
-- 
1.8.1


* [RFC PATCH net-next 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del()
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

Before the patch 'Allow pmtu update on /128 via gateway route',
a RTF_CACHE route was not created for DST_HOST.  Creating one also requires
changes on both the delete code path and the rt6_select() code path.

This patch fixes the delete code path to avoid deleting the RTF_CACHE
route by 'ip -6 r del...'
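
The skip rule added to the delete loop can be sketched as below.  This is
a reduced model with hypothetical demo_* names: the other match conditions
of the real loop (device, gateway, metric) are left out, and the flag bit
is a placeholder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RTF_CACHE 0x1u	/* placeholder bit, not the kernel's value */

struct demo_rt {
	uint32_t flags;
};

/* A cached route is skipped unless the request explicitly asks for one. */
static bool demo_del_skips(uint32_t rt_flags, uint32_t cfg_flags)
{
	return (rt_flags & DEMO_RTF_CACHE) && !(cfg_flags & DEMO_RTF_CACHE);
}

/* Return the index of the first route the request may remove, or -1. */
static int demo_pick_for_delete(const struct demo_rt *rts, size_t n,
				uint32_t cfg_flags)
{
	for (size_t i = 0; i < n; i++) {
		if (demo_del_skips(rts[i].flags, cfg_flags))
			continue;
		return (int)i;
	}
	return -1;
}
```

With a cached route listed ahead of the permanent one, a plain delete
request picks the permanent route, while a request carrying the cache flag
picks the cached one.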

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 75f3b5d..5d0fd6c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1780,6 +1780,9 @@ static int ip6_route_del(struct fib6_config *cfg)
 
 	if (fn) {
 		for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
+			if ((rt->rt6i_flags & RTF_CACHE) &&
+			    !(cfg->fc_flags & RTF_CACHE))
+				continue;
 			if (cfg->fc_ifindex &&
 			    (!rt->dst.dev ||
 			     rt->dst.dev->ifindex != cfg->fc_ifindex))
@@ -2424,6 +2427,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (rtm->rtm_type == RTN_LOCAL)
 		cfg->fc_flags |= RTF_LOCAL;
 
+	if (rtm->rtm_flags & RTM_F_CLONED)
+		cfg->fc_flags |= RTF_CACHE;
+
 	cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
 	cfg->fc_nlinfo.nlh = nlh;
 	cfg->fc_nlinfo.nl_net = sock_net(skb->sk);
-- 
1.8.1

* [RFC PATCH net-next 07/10] ipv6: Extend the route lookups to low priority metrics.
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team, Steffen Klassert

From: Steffen Klassert <steffen.klassert@secunet.com>

We search only for routes with the highest priority metric in
find_rr_leaf(). However, if one of these routes is marked
as invalid, we may fail to find a route even if there is
an appropriate route with lower priority. Then we lose
connectivity until the garbage collector deletes the
invalid route. This typically happens if a host route
expires after a pmtu event. Fix this by also searching
for routes with a lower priority metric.
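
The fallback described above can be sketched as follows. This is a
simplified userspace model, not find_rr_leaf() itself: struct route, its
usable field and pick_route() are hypothetical, and the sketch returns the
first usable route instead of scoring candidates with find_match() or
doing round-robin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified route entry; the list is sorted by ascending metric. */
struct route {
	unsigned int metric;	/* lower value means higher priority */
	bool usable;		/* false models an expired or invalid route */
	struct route *next;
};

/*
 * Prefer a usable route that shares the metric of the list head. Only if
 * none exists, continue with the remaining (lower priority) routes.
 */
static struct route *pick_route(struct route *head)
{
	struct route *rt, *cont = NULL;

	if (!head)
		return NULL;

	for (rt = head; rt; rt = rt->next) {
		/* The sketch relies on the list being sorted by metric. */
		assert(!rt->next || rt->next->metric >= rt->metric);
		if (rt->metric != head->metric) {
			cont = rt;	/* first route with a lower priority */
			break;
		}
		if (rt->usable)
			return rt;
	}

	for (rt = cont; rt; rt = rt->next)
		if (rt->usable)
			return rt;

	return NULL;
}
```

With an expired host route at the best metric and a valid route at a
worse metric, pick_route() returns the latter instead of failing.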

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5d0fd6c..91c80bc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -652,15 +652,33 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 				     u32 metric, int oif, int strict,
 				     bool *do_rr)
 {
-	struct rt6_info *rt, *match;
+	struct rt6_info *rt, *match, *cont;
 	int mpri = -1;
 
 	match = NULL;
-	for (rt = rr_head; rt && rt->rt6i_metric == metric;
-	     rt = rt->dst.rt6_next)
+	cont = NULL;
+	for (rt = rr_head; rt; rt = rt->dst.rt6_next) {
+		if (rt->rt6i_metric != metric) {
+			cont = rt;
+			break;
+		}
+
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
+	}
+
+	for (rt = fn->leaf; rt && rt != rr_head; rt = rt->dst.rt6_next) {
+		if (rt->rt6i_metric != metric) {
+			cont = rt;
+			break;
+		}
+
 		match = find_match(rt, oif, strict, &mpri, match, do_rr);
-	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
-	     rt = rt->dst.rt6_next)
+	}
+
+	if (match || !cont)
+		return match;
+
+	for (rt = cont; rt; rt = rt->dst.rt6_next)
 		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 
 	return match;
-- 
1.8.1

* [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-13 11:06   ` Steffen Klassert
  2015-04-11  1:59 ` [RFC PATCH net-next 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
  9 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

When there is a pmtu exception on a /128 via gateway route, we need to
create a separate metrics copy for the newly created RTF_CACHE route instead
of reusing the inetpeer cache.  Otherwise, the original mtu will be
overwritten and the mtu update will persist after the expiration.
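
Why a private metrics copy matters can be shown with a small userspace
sketch. It does not use the kernel's dst metrics API: struct route,
route_cow_metrics(), route_set_mtu() and the SKETCH_* constants are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_METRICS_MAX 4
#define SKETCH_METRIC_MTU  0

/* Simplified route whose metrics may point at storage shared with others. */
struct route {
	unsigned int *metrics;
	int owns_metrics;	/* non-zero once the route has a private copy */
};

/* Give rt a private copy of its metrics. Returns 0 on success, -1 on OOM. */
static int route_cow_metrics(struct route *rt)
{
	unsigned int *copy;

	assert(rt->metrics != NULL);
	if (rt->owns_metrics)
		return 0;

	copy = malloc(SKETCH_METRICS_MAX * sizeof(*copy));
	if (!copy)
		return -1;
	memcpy(copy, rt->metrics, SKETCH_METRICS_MAX * sizeof(*copy));
	rt->metrics = copy;
	rt->owns_metrics = 1;
	return 0;
}

/* Record a learned path MTU without touching the shared metrics. */
static int route_set_mtu(struct route *rt, unsigned int mtu)
{
	if (route_cow_metrics(rt))
		return -1;
	rt->metrics[SKETCH_METRIC_MTU] = mtu;
	return 0;
}

/* Release the private copy, if any; shared storage is left alone. */
static void route_free_metrics(struct route *rt)
{
	if (rt->owns_metrics)
		free(rt->metrics);
	rt->metrics = NULL;
	rt->owns_metrics = 0;
}
```

Because the clone writes to its own copy, dropping the clone on expiry
leaves the original mtu intact.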

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 91c80bc..61ce45e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -322,8 +322,16 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	struct rt6_info *rt = (struct rt6_info *)dst;
 	struct inet6_dev *idev = rt->rt6i_idev;
 	struct dst_entry *from = dst->from;
+	unsigned long peer_metrics = 0;
 
-	if (!(rt->dst.flags & DST_HOST))
+	if (rt6_has_peer(rt)) {
+		struct inet_peer *peer = rt6_peer_ptr(rt);
+
+		peer_metrics = (unsigned long)peer->metrics;
+		inet_putpeer(peer);
+	}
+
+	if (peer_metrics != dst->_metrics)
 		dst_destroy_metrics_generic(dst);
 
 	if (idev) {
@@ -333,11 +341,6 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 
 	dst->from = NULL;
 	dst_release(from);
-
-	if (rt6_has_peer(rt)) {
-		struct inet_peer *peer = rt6_peer_ptr(rt);
-		inet_putpeer(peer);
-	}
 }
 
 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -1956,6 +1959,8 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 
 		rt->rt6i_dst.addr = *dest;
 		rt->rt6i_dst.plen = 128;
+		if (ort->dst.flags & DST_HOST)
+			dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
 		dst_copy_metrics(&rt->dst, &ort->dst);
 		rt->dst.error = ort->dst.error;
 		rt->rt6i_idev = ort->rt6i_idev;
-- 
1.8.1

* [RFC PATCH net-next 09/10] ipv6: Break up ip6_rt_copy()
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-11  1:59 ` [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
  9 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and
ip6_rt_cache_alloc().

A later patch needs to create a percpu rt6_info copy.  Hence, refactor
the common rt6_info init code into ip6_rt_copy_init().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/route.c | 76 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 61ce45e..665e41c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -72,8 +72,11 @@ enum rt6_nud_state {
 	RT6_NUD_SUCCEED = 1
 };
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest);
+static void ip6_rt_copy_init(struct rt6_info *rt,
+			     struct rt6_info *ort,
+			     const struct in6_addr *dest);
+static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
+					   const struct in6_addr *dest);
 static struct dst_entry	*ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ip6_default_advmss(const struct dst_entry *dst);
 static unsigned int	 ip6_mtu(const struct dst_entry *dst);
@@ -903,11 +906,9 @@ static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
 	 *	Clone the route.
 	 */
 
-	rt = ip6_rt_copy(ort, daddr);
+	rt = ip6_rt_cache_alloc(ort, daddr);
 
 	if (rt) {
-		rt->rt6i_flags |= RTF_CACHE;
-
 		if (!(ort->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))) {
 			if (ort->rt6i_dst.plen != 128 &&
 			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
@@ -1913,7 +1914,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 				     NEIGH_UPDATE_F_ISROUTER))
 		     );
 
-	nrt = ip6_rt_copy(rt, &msg->dest);
+	nrt = ip6_rt_cache_alloc(rt, &msg->dest);
 	if (!nrt)
 		goto out;
 
@@ -1945,39 +1946,50 @@ out:
  *	Misc support functions
  */
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest)
+static void ip6_rt_copy_init(struct rt6_info *rt,
+			     struct rt6_info *ort,
+			     const struct in6_addr *dest)
 {
-	struct net *net = dev_net(ort->dst.dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0,
-					    ort->rt6i_table);
-
-	if (rt) {
-		rt->dst.input = ort->dst.input;
-		rt->dst.output = ort->dst.output;
+	if (dest) {
 		rt->dst.flags |= DST_HOST;
-
 		rt->rt6i_dst.addr = *dest;
 		rt->rt6i_dst.plen = 128;
-		if (ort->dst.flags & DST_HOST)
-			dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
-		dst_copy_metrics(&rt->dst, &ort->dst);
-		rt->dst.error = ort->dst.error;
-		rt->rt6i_idev = ort->rt6i_idev;
-		if (rt->rt6i_idev)
-			in6_dev_hold(rt->rt6i_idev);
-		rt->dst.lastuse = jiffies;
-		rt->rt6i_gateway = ort->rt6i_gateway;
-		rt->rt6i_flags = ort->rt6i_flags;
-		rt6_set_from(rt, ort);
-		rt->rt6i_metric = 0;
+	} else {
+		memcpy(&rt->rt6i_dst, &ort->rt6i_dst, sizeof(rt->rt6i_dst));
+	}
 
+	rt->dst.input = ort->dst.input;
+	rt->dst.output = ort->dst.output;
+	rt->dst.error = ort->dst.error;
+	rt->rt6i_idev = ort->rt6i_idev;
+	if (rt->rt6i_idev)
+		in6_dev_hold(rt->rt6i_idev);
+	rt->dst.lastuse = jiffies;
+	rt->rt6i_gateway = ort->rt6i_gateway;
+	rt->rt6i_flags = ort->rt6i_flags;
+	rt->rt6i_metric = ort->rt6i_metric;
 #ifdef CONFIG_IPV6_SUBTREES
-		memcpy(&rt->rt6i_src, &ort->rt6i_src, sizeof(struct rt6key));
+	rt->rt6i_src = ort->rt6i_src;
 #endif
-		memcpy(&rt->rt6i_prefsrc, &ort->rt6i_prefsrc, sizeof(struct rt6key));
-		rt->rt6i_table = ort->rt6i_table;
-	}
+	rt->rt6i_prefsrc = ort->rt6i_prefsrc;
+	rt->rt6i_table = ort->rt6i_table;
+}
+
+static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
+					   const struct in6_addr *dest)
+{
+	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+					    0, ort->rt6i_table);
+
+	if (!rt)
+		return NULL;
+	ip6_rt_copy_init(rt, ort, dest);
+	if (ort->dst.flags & DST_HOST)
+		dst_cow_metrics_generic(&rt->dst, rt->dst._metrics);
+	dst_copy_metrics(&rt->dst, &ort->dst);
+	rt->rt6i_flags |= RTF_CACHE;
+	rt6_set_from(rt, ort);
+	rt->rt6i_metric = 0;
 	return rt;
 }
 
-- 
1.8.1

* [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info
  2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2015-04-11  1:59 ` [RFC PATCH net-next 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
@ 2015-04-11  1:59 ` Martin KaFai Lau
  2015-04-13 10:59   ` Steffen Klassert
  9 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-11  1:59 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, kernel-team

After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exceptions',
we need to compensate for the performance hit (bouncing dst->__refcnt).
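
The idea of lazily installing one copy per CPU with a compare-and-swap
can be sketched in userspace with C11 atomics. This is only an
illustration under stated assumptions: struct route, get_pcpu_route() and
SKETCH_NR_CPUS are hypothetical, there is no RCU or reference counting,
and a copy that loses the race is simply freed, whereas the patch marks
it DST_NOCACHE and still returns it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4

/* Simplified route with one lazily created copy per CPU. */
struct route {
	int is_pcpu_copy;
	struct route *from;	/* original route a copy was made from */
	_Atomic(struct route *) pcpu[SKETCH_NR_CPUS];
};

/*
 * Return the copy for the given CPU, creating and installing it on first
 * use. Returns NULL only if the allocation fails.
 */
static struct route *get_pcpu_route(struct route *rt, int cpu)
{
	struct route *copy, *expected = NULL;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);

	copy = atomic_load(&rt->pcpu[cpu]);
	if (copy)
		return copy;

	copy = calloc(1, sizeof(*copy));
	if (!copy)
		return NULL;
	copy->is_pcpu_copy = 1;
	copy->from = rt;

	/* Install only if the slot is still empty. */
	if (!atomic_compare_exchange_strong(&rt->pcpu[cpu], &expected, copy)) {
		/* Someone else installed a copy first; use theirs. */
		free(copy);
		return expected;
	}
	return copy;
}
```

Each CPU then takes references on its own copy, so the reference count
of the shared route is no longer written by every CPU on every lookup.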

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_fib.h           |   8 ++
 include/net/ip6_route.h         |   2 +-
 include/uapi/linux/ipv6_route.h |   1 +
 net/ipv6/ip6_fib.c              |  22 +++++-
 net/ipv6/ip6_tunnel.c           |   2 +-
 net/ipv6/route.c                | 163 +++++++++++++++++++++++++++++++++++-----
 net/ipv6/tcp_ipv6.c             |   3 +-
 net/ipv6/xfrm6_policy.c         |   4 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   2 +-
 net/sctp/ipv6.c                 |   2 +-
 10 files changed, 182 insertions(+), 27 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 20e80fa..65702c5 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -124,6 +124,7 @@ struct rt6_info {
 	unsigned long			_rt6i_peer;
 
 	u32				rt6i_metric;
+	struct rt6_info __rcu * __percpu	*rt6i_pcpu;
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 	u8				rt6i_protocol;
@@ -198,6 +199,13 @@ static inline void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
 	rt->dst.from = new;
 }
 
+static inline u32 rt6_get_cookie(const struct rt6_info *rt)
+{
+	if (rt->rt6i_flags & RTF_PCPU)
+		rt = (struct rt6_info *)(rt->dst.from);
+	return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+}
+
 static inline void ip6_rt_put(struct rt6_info *rt)
 {
 	/* dst_release() accepts a NULL parameter.
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 0e4d170..397dd3a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -145,7 +145,7 @@ static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 #ifdef CONFIG_IPV6_SUBTREES
 	np->saddr_cache = saddr;
 #endif
-	np->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	np->dst_cookie = rt6_get_cookie(rt);
 }
 
 static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
index 2be7bd1..f6598d1 100644
--- a/include/uapi/linux/ipv6_route.h
+++ b/include/uapi/linux/ipv6_route.h
@@ -34,6 +34,7 @@
 #define RTF_PREF(pref)	((pref) << 27)
 #define RTF_PREF_MASK	0x18000000
 
+#define RTF_PCPU	0x40000000
 #define RTF_LOCAL	0x80000000
 
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 96dbfff..6aa9b80 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -154,10 +154,30 @@ static void node_free(struct fib6_node *fn)
 	kmem_cache_free(fib6_node_kmem, fn);
 }
 
+static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct rt6_info **ppcpu_rt;
+		struct rt6_info *pcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
+		pcpu_rt = rcu_dereference_protected(*ppcpu_rt,
+			lockdep_is_held(&non_pcpu_rt->rt6i_table->tb6_lock));
+		if (pcpu_rt) {
+			dst_free(&pcpu_rt->dst);
+			*ppcpu_rt = NULL;
+		}
+	}
+}
+
 static void rt6_release(struct rt6_info *rt)
 {
-	if (atomic_dec_and_test(&rt->rt6i_ref))
+	if (atomic_dec_and_test(&rt->rt6i_ref)) {
+		rt6_free_pcpu(rt);
 		dst_free(&rt->dst);
+	}
 }
 
 static void fib6_link_table(struct net *net, struct fib6_table *tb)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 5cafd92..2e67b66 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -151,7 +151,7 @@ EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
-	t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	t->dst_cookie = rt6_get_cookie(rt);
 	dst_release(t->dst_cache);
 	t->dst_cache = dst;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 665e41c..14f99c1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -137,9 +137,16 @@ static struct inet_peer *rt6_get_peer_create(struct rt6_info *rt)
 	return __rt6_get_peer(rt, 1);
 }
 
-static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
+static u32 *rt6_pcpu_cow_metrics(struct rt6_info *rt)
 {
-	struct rt6_info *rt = (struct rt6_info *) dst;
+	rt = (struct rt6_info *)rt->dst.from;
+	BUG_ON(rt->rt6i_flags & RTF_PCPU);
+	return dst_metrics_write_ptr(&rt->dst);
+}
+
+static u32 *rt6_cow_metrics(struct rt6_info *rt, unsigned long old)
+{
+	struct dst_entry *dst = &rt->dst;
 	struct inet_peer *peer;
 	u32 *p = NULL;
 
@@ -168,6 +175,16 @@ static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 	return p;
 }
 
+static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
+{
+	struct rt6_info *rt = (struct rt6_info *)dst;
+
+	if (rt->rt6i_flags & RTF_PCPU)
+		return rt6_pcpu_cow_metrics(rt);
+	else
+		return rt6_cow_metrics(rt, old);
+}
+
 static inline const void *choose_neigh_daddr(struct rt6_info *rt,
 					     struct sk_buff *skb,
 					     const void *daddr)
@@ -302,10 +319,10 @@ static const struct rt6_info ip6_blk_hole_entry_template = {
 #endif
 
 /* allocate dst with ip6_dst_ops */
-static inline struct rt6_info *ip6_dst_alloc(struct net *net,
-					     struct net_device *dev,
-					     int flags,
-					     struct fib6_table *table)
+static struct rt6_info *__ip6_dst_alloc(struct net *net,
+					struct net_device *dev,
+					int flags,
+					struct fib6_table *table)
 {
 	struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
 					0, DST_OBSOLETE_FORCE_CHK, flags);
@@ -320,6 +337,34 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 	return rt;
 }
 
+static struct rt6_info *ip6_dst_alloc(struct net *net,
+				      struct net_device *dev,
+				      int flags,
+				      struct fib6_table *table)
+{
+	struct rt6_info *rt = __ip6_dst_alloc(net, dev, flags, table);
+
+	if (rt) {
+		rt->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, GFP_ATOMIC);
+		if (rt->rt6i_pcpu) {
+			int cpu;
+
+			for_each_possible_cpu(cpu) {
+				struct rt6_info **p;
+
+				p = per_cpu_ptr(rt->rt6i_pcpu, cpu);
+				/* no one shares rt */
+				*p =  NULL;
+			}
+		} else {
+			dst_destroy((struct dst_entry *)rt);
+			return NULL;
+		}
+	}
+
+	return rt;
+}
+
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
@@ -337,6 +382,9 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	if (peer_metrics != dst->_metrics)
 		dst_destroy_metrics_generic(dst);
 
+	if (rt->rt6i_pcpu)
+		free_percpu(rt->rt6i_pcpu);
+
 	if (idev) {
 		rt->rt6i_idev = NULL;
 		in6_dev_put(idev);
@@ -925,11 +973,68 @@ static struct rt6_info *ip6_pmtu_rt_cache_alloc(struct rt6_info *ort,
 	return rt;
 }
 
+static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt = __ip6_dst_alloc(dev_net(rt->dst.dev),
+						   rt->dst.dev, rt->dst.flags,
+						   rt->rt6i_table);
+
+	if (!pcpu_rt)
+		return NULL;
+	ip6_rt_copy_init(pcpu_rt, rt, NULL);
+	pcpu_rt->dst._metrics = (rt->dst._metrics | DST_METRICS_READ_ONLY);
+	rt6_set_from(pcpu_rt, rt);
+	pcpu_rt->rt6i_metric = rt->rt6i_metric;
+	pcpu_rt->rt6i_protocol = rt->rt6i_protocol;
+	pcpu_rt->rt6i_flags |= RTF_PCPU;
+	return pcpu_rt;
+}
+
+static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt, *orig, *prev, **p;
+	struct net *net = dev_net(rt->dst.dev);
+
+	if (rt->rt6i_flags & RTF_CACHE || rt == net->ipv6.ip6_null_entry)
+		goto done;
+
+	rcu_read_lock();
+	p = raw_cpu_ptr(rt->rt6i_pcpu);
+	orig = rcu_dereference_check(*p,
+				     lockdep_is_held(&rt->rt6i_table->tb6_lock));
+	if (orig &&
+	    dst_metrics_ptr(orig->dst.from) == dst_metrics_ptr(&orig->dst)) {
+		dst_hold(&orig->dst);
+		rcu_read_unlock();
+		return orig;
+	}
+	rcu_read_unlock();
+
+	pcpu_rt = ip6_rt_pcpu_alloc(rt);
+	if (!pcpu_rt) {
+		rt = net->ipv6.ip6_null_entry;
+		goto done;
+	}
+
+	prev = cmpxchg(p, orig, pcpu_rt);
+	if (prev == orig) {
+		if (orig)
+			call_rcu(&orig->dst.rcu_head, dst_rcu_free);
+	} else {
+		pcpu_rt->dst.flags |= DST_NOCACHE;
+	}
+	rt = pcpu_rt;
+
+done:
+	dst_hold(&rt->dst);
+	return rt;
+}
+
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
 	struct fib6_node *fn, *saved_fn;
-	struct rt6_info *rt;
+	struct rt6_info *rt, *pcpu_rt;
 	int strict = 0;
 
 	strict |= flags & RT6_LOOKUP_F_IFACE;
@@ -957,13 +1062,13 @@ redo_rt6_select:
 		}
 	}
 
-	dst_hold(&rt->dst);
+	pcpu_rt = rt6_get_pcpu_route(rt);
 	read_unlock_bh(&table->tb6_lock);
 
 	rt->dst.lastuse = jiffies;
 	rt->dst.__use++;
 
-	return rt;
+	return pcpu_rt;
 }
 
 static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *table,
@@ -1068,6 +1173,26 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
  *	Destination cache support functions
  */
 
+static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
+{
+	if (!rt->rt6i_node || rt->rt6i_node->fn_sernum != cookie)
+		return NULL;
+
+	if (rt6_check_expired(rt))
+		return NULL;
+
+	return &rt->dst;
+}
+
+static struct dst_entry *rt6_pcpu_check(struct rt6_info *rt, u32 cookie)
+{
+	if (rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
+	    dst_metrics_ptr(rt->dst.from) == dst_metrics_ptr(&rt->dst))
+		return rt6_check((struct rt6_info *)(rt->dst.from), cookie);
+	else
+		return NULL;
+}
+
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct rt6_info *rt;
@@ -1078,13 +1203,10 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 */
-	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
-		return NULL;
-
-	if (rt6_check_expired(rt))
-		return NULL;
-
-	return dst;
+	if (rt->rt6i_flags & RTF_PCPU)
+		return rt6_pcpu_check(rt, cookie);
+	else
+		return rt6_check(rt, cookie);
 }
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
@@ -1978,8 +2100,13 @@ static void ip6_rt_copy_init(struct rt6_info *rt,
 static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 					   const struct in6_addr *dest)
 {
-	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
-					    0, ort->rt6i_table);
+	struct rt6_info *rt;
+
+	if (ort->rt6i_flags & RTF_PCPU)
+		ort = (struct rt6_info *)ort->dst.from;
+
+	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+			     0, ort->rt6i_table);
 
 	if (!rt)
 		return NULL;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index dfcca70..e2e9576 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -99,8 +99,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 		dst_hold(dst);
 		sk->sk_rx_dst = dst;
 		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
-		if (rt->rt6i_node)
-			inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index f337a90..e818c61 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -84,7 +84,7 @@ static int xfrm6_init_path(struct xfrm_dst *path, struct dst_entry *dst,
 	if (dst->ops->family == AF_INET6) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_node)
-			path->path_cookie = rt->rt6i_node->fn_sernum;
+			path->path_cookie = rt6_get_cookie(rt);
 	}
 
 	path->u.rt6.rt6i_nfheader_len = nfheader_len;
@@ -115,7 +115,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.rt6.rt6i_metric = rt->rt6i_metric;
 	xdst->u.rt6.rt6i_node = rt->rt6i_node;
 	if (rt->rt6i_node)
-		xdst->route_cookie = rt->rt6i_node->fn_sernum;
+		xdst->route_cookie = rt6_get_cookie(rt);
 	xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway;
 	xdst->u.rt6.rt6i_dst = rt->rt6i_dst;
 	xdst->u.rt6.rt6i_src = rt->rt6i_src;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 38f8627..5eff9f6 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -435,7 +435,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 				goto err_unreach;
 			}
 			rt = (struct rt6_info *) dst;
-			cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+			cookie = rt6_get_cookie(rt);
 			__ip_vs_dst_set(dest, dest_dst, &rt->dst, cookie);
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n",
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 9fa13f6..d012834 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -331,7 +331,7 @@ out:
 
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
-		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+		t->dst_cookie = rt6_get_cookie(rt);
 		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
 			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
-- 
1.8.1

* Re: [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info
  2015-04-11  1:59 ` [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
@ 2015-04-13 10:59   ` Steffen Klassert
  2015-04-13 20:16     ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Steffen Klassert @ 2015-04-13 10:59 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Hannes Frederic Sowa, kernel-team

On Fri, Apr 10, 2015 at 06:59:36PM -0700, Martin KaFai Lau wrote:

> diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
> index 2be7bd1..f6598d1 100644
> --- a/include/uapi/linux/ipv6_route.h
> +++ b/include/uapi/linux/ipv6_route.h
> @@ -34,6 +34,7 @@
>  #define RTF_PREF(pref)	((pref) << 27)
>  #define RTF_PREF_MASK	0x18000000
>  
> +#define RTF_PCPU	0x40000000

This percpu flag is something internal, should IMO not be added
to the uapi.

> @@ -1978,8 +2100,13 @@ static void ip6_rt_copy_init(struct rt6_info *rt,
>  static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
>  					   const struct in6_addr *dest)
>  {
> -	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> -					    0, ort->rt6i_table);
> +	struct rt6_info *rt;
> +
> +	if (ort->rt6i_flags & RTF_PCPU)
> +		ort = (struct rt6_info *)ort->dst.from;
> +
> +	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> +			     0, ort->rt6i_table);

Why don't you allocate the percpu resources for cached routes?
I think using ip6_dst_alloc() would be better here, cached routes
could benefit from this too. And in particular rt6_release() tries
to free the percpu resources unconditionally when the route is removed
from the fib tree. This crashes if the percpu resources are not allocated.

The rest of the patchset looks good.

Thanks!

* Re: [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry
  2015-04-11  1:59 ` [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
@ 2015-04-13 11:06   ` Steffen Klassert
  2015-04-13 17:51     ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Steffen Klassert @ 2015-04-13 11:06 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Hannes Frederic Sowa, kernel-team

On Fri, Apr 10, 2015 at 06:59:34PM -0700, Martin KaFai Lau wrote:
> When there is a pmtu exception on /128 via gateway route, we need to
> create a separate metrics copy for the newly created RTF_CACHE route instead
> of reusing the inetpeer cache.

Maybe we should remove the caching of the metrics on the inetpeer
completely. After your patchset, only static host routes use this,
and this is exactly the case where it is buggy. If a second route
to the same host is added, the metrics of the first will be
overwritten.

* Re: [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry
  2015-04-13 11:06   ` Steffen Klassert
@ 2015-04-13 17:51     ` Martin KaFai Lau
  0 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-13 17:51 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Hannes Frederic Sowa, kernel-team

On Mon, Apr 13, 2015 at 01:06:32PM +0200, Steffen Klassert wrote:
> On Fri, Apr 10, 2015 at 06:59:34PM -0700, Martin KaFai Lau wrote:
> > When there is a pmtu exception on /128 via gateway route, we need to
> > create a separate metrics copy for the newly created RTF_CACHE route instead
> > of reusing the inetpeer cache.
>
> Maybe we should remove the caching of the metrics on the inetpeer
> completely. After your patchset, only static host routes use this,
The RTF_CACHE route copied from a "plen < 128 via gateway" route will also
use the inetpeer.

> and this is exactly the case where it is buggy. If a second route
> to the same host is added, the metrics of the first will be
> overwritten.
I agree.  Current upstream also has a similar bug.
I had thought about making the change you suggested, but decided to
do it in a separate patch instead.  I will try to consider it in v2.

Thanks,
--Martin

* Re: [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info
  2015-04-13 10:59   ` Steffen Klassert
@ 2015-04-13 20:16     ` Martin KaFai Lau
  2015-04-13 21:46       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2015-04-13 20:16 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Hannes Frederic Sowa, kernel-team

On Mon, Apr 13, 2015 at 12:59:44PM +0200, Steffen Klassert wrote:
> On Fri, Apr 10, 2015 at 06:59:36PM -0700, Martin KaFai Lau wrote:
> 
> > diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
> > index 2be7bd1..f6598d1 100644
> > --- a/include/uapi/linux/ipv6_route.h
> > +++ b/include/uapi/linux/ipv6_route.h
> > @@ -34,6 +34,7 @@
> >  #define RTF_PREF(pref)	((pref) << 27)
> >  #define RTF_PREF_MASK	0x18000000
> >  
> > +#define RTF_PCPU	0x40000000
> 
> This percpu flag is something internal, should IMO not be added
> to the uapi.
Makes sense.  Where would be the right place for it?

It seems 'uapi/linux/ipv6_route.h' is the one which has all IPv6 flags laid out.
It makes modification easier since it tells us which bit is still available.

How about we add a comment to 'uapi/linux/ipv6_route.h' and move the '#define'
to 'linux/ipv6_route.h'?

> 
> > @@ -1978,8 +2100,13 @@ static void ip6_rt_copy_init(struct rt6_info *rt,
> >  static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
> >  					   const struct in6_addr *dest)
> >  {
> > -	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> > -					    0, ort->rt6i_table);
> > +	struct rt6_info *rt;
> > +
> > +	if (ort->rt6i_flags & RTF_PCPU)
> > +		ort = (struct rt6_info *)ort->dst.from;
> > +
> > +	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> > +			     0, ort->rt6i_table);
> 
> Why don't you allocate the percpu resources for cached routes?
> I think using ip6_dst_alloc() would be better here, cached routes
> could benefit from this too.
I am not sure we want to open up the percpu cache to RTF_CACHE routes, which
are created after receiving ICMPv6 too-big from remote peers, and some of our
machines have a lot of CPUs (like 40), so I opted not to do it.
If we do this, we probably also need to modify the gc and its related counting.

> And in particular rt6_release() tries
> to free the percpu resources unconditionally when the route is removed
> from the fib tree. This crashes if the percpu resources are not allocated.
Good catch and will fix in v2.

Thanks,
--Martin

* Re: [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info
  2015-04-13 20:16     ` Martin KaFai Lau
@ 2015-04-13 21:46       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-13 21:46 UTC (permalink / raw)
  To: Martin KaFai Lau, Steffen Klassert; +Cc: netdev, kernel-team

On Mon, Apr 13, 2015, at 22:16, Martin KaFai Lau wrote:
> On Mon, Apr 13, 2015 at 12:59:44PM +0200, Steffen Klassert wrote:
> > On Fri, Apr 10, 2015 at 06:59:36PM -0700, Martin KaFai Lau wrote:
> > 
> > > diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
> > > index 2be7bd1..f6598d1 100644
> > > --- a/include/uapi/linux/ipv6_route.h
> > > +++ b/include/uapi/linux/ipv6_route.h
> > > @@ -34,6 +34,7 @@
> > >  #define RTF_PREF(pref)	((pref) << 27)
> > >  #define RTF_PREF_MASK	0x18000000
> > >  
> > > +#define RTF_PCPU	0x40000000
> > 
> > This percpu flag is something internal, should IMO not be added
> > to the uapi.
> Makes sense.  Where would be the right place for it?
> 
> It seems 'uapi/linux/ipv6_route.h' is the one which has all IPv6 flags
> laid out.  It makes modification easier since it tells us which bit is
> still available.

Actually, we expose those flags via /proc/net/ipv6_route. I would be
fine keeping it in uapi.
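
As a small illustration of the "which bit is still available" point,
here is a userspace sketch that checks the proposed value against the
preference mask.  The three macro values are copied from the quoted
hunk; the helpers rtf_bits_overlap() and rtf_is_single_bit() are
hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Values as they appear in the quoted diff. */
#define RTF_PREF(pref)	((pref) << 27)
#define RTF_PREF_MASK	0x18000000
#define RTF_PCPU	0x40000000

/* Nonzero if the two flag values share at least one bit. */
static int rtf_bits_overlap(uint32_t a, uint32_t b)
{
	return (a & b) != 0;
}

/* Nonzero if exactly one bit is set in v. */
static int rtf_is_single_bit(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}
```

With these values, RTF_PCPU is a single bit (bit 30) and does not
overlap RTF_PREF_MASK (bits 27 and 28).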

> How about we add a comment to 'uapi/linux/ipv6_route.h' and move the
> '#define' to 'linux/ipv6_route.h'?
> 
> > 
> > > @@ -1978,8 +2100,13 @@ static void ip6_rt_copy_init(struct rt6_info *rt,
> > >  static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
> > >  					   const struct in6_addr *dest)
> > >  {
> > > -	struct rt6_info *rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> > > -					    0, ort->rt6i_table);
> > > +	struct rt6_info *rt;
> > > +
> > > +	if (ort->rt6i_flags & RTF_PCPU)
> > > +		ort = (struct rt6_info *)ort->dst.from;
> > > +
> > > +	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
> > > +			     0, ort->rt6i_table);
> > 
> > Why don't you allocate the percpu resources for cached routes?
> > I think using ip6_dst_alloc() would be better here, cached routes
> > could benefit from this too.
> I am not sure we want to open up a percpu cache opportunity for RTF_CACHE
> routes created after receiving ICMPv6 too-big from remote peers.  Some of
> our machines have a lot of CPUs (like 40), so I opted not to do it.
> If we do this, we probably also need to modify the gc and its related
> counting.

I agree that we should handle those exceptions as a kind of "slow" path.
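
For reference, the first step of the quoted ip6_rt_cache_alloc() hunk
can be modelled in userspace roughly as below.  This is only a sketch
under the assumption that dst.from of a percpu copy points at the route
it was copied from; model_rt and model_cache_source() are hypothetical
names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RTF_PCPU	0x40000000u	/* same value as RTF_PCPU in the diff */

/* Hypothetical stand-in for struct rt6_info. */
struct model_rt {
	uint32_t flags;
	struct model_rt *from;	/* stand-in for dst.from */
};

/*
 * Pick the route that the RTF_CACHE entry should be derived from:
 * a percpu copy is replaced by the route it was copied from, any
 * other route is used as-is.
 */
static struct model_rt *model_cache_source(struct model_rt *ort)
{
	if (ort->flags & MODEL_RTF_PCPU) {
		assert(ort->from != NULL);
		ort = ort->from;
	}
	return ort;
}
```

The exception entry is then built from the shared route, so the percpu
copies stay on the fast path only.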

Bye,
Hannes


end of thread (~2015-04-13 21:46 UTC)

Thread overview: 16+ messages
2015-04-11  1:59 [RFC PATCH net-next 00/10] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 01/10] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 02/10] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 03/10] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 04/10] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 05/10] ipv6: Allow pmtu update on /128 via gateway route Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 06/10] ipv6: Avoid deleting RTF_CACHE route from ip6_route_del() Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 07/10] ipv6: Extend the route lookups to low priority metrics Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 08/10] ipv6: Do not use inetpeer when creating RTF_CACHE route for /128 via gateway entry Martin KaFai Lau
2015-04-13 11:06   ` Steffen Klassert
2015-04-13 17:51     ` Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 09/10] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
2015-04-11  1:59 ` [RFC PATCH net-next 10/10] ipv6: Create percpu rt6_info Martin KaFai Lau
2015-04-13 10:59   ` Steffen Klassert
2015-04-13 20:16     ` Martin KaFai Lau
2015-04-13 21:46       ` Hannes Frederic Sowa
