[PATCH net-next 0/6] ipv4 fixes for dmvpn

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/6] ipv4 fixes for dmvpn
@ 2013-05-27 11:16 Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed Timo Teräs
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

Collection of pending fixes which mostly hit DMVPN setups, but some
are doing good to generic setups too.

Patches 1 & 2 fix the NOARP flag change to flush nud NOARP entries
from ARP cache. The only change from previous send is updating the
commit message per request, and adding Ben's Acked-By.

Patch 3 reworks IPsec PMTU propagation to be less invasive, and fixes
certain subtle caching issues.

Patch 4 and 5 are performance improvements. Both are especially useful
on dmvpn setups, but the latter patch helps general setups too.

And finally the patch 6 is a respin of the input route caching so
we get proper pmtu and fragmentation in forwarding path (needed for
tunnel devices and TCPMSS clamp-to-pmtu). While it would be better if
we could avoid input route caching, this was the only non-intrusive way
to fix all issues I came up with.

The whole patchset is tested on top of 3.9.x and fixes the issues
I've encountered.

Please review and consider applying to net-next.

Timo Teräs (6):
  net: inform NETDEV_CHANGE callbacks which flags were changed
  arp: flush arp cache on IFF_NOARP change
  ipv4: properly refresh rtable entries on pmtu/redirect events
  ipv4: rate limit updating of next hop exceptions with same pmtu
  ipv4: use separate genid for next hop exceptions
  ipv4: use next hop exceptions also for input routes

 include/linux/netdevice.h   |   4 +-
 include/net/ip_fib.h        |   4 +-
 include/net/net_namespace.h |  11 ++++
 net/core/dev.c              |   5 +-
 net/ipv4/ah4.c              |   7 +--
 net/ipv4/arp.c              |   4 ++
 net/ipv4/esp4.c             |   7 +--
 net/ipv4/fib_semantics.c    |   3 +-
 net/ipv4/ipcomp.c           |   7 +--
 net/ipv4/route.c            | 140 +++++++++++++++++++++++++++++++-------------
 10 files changed, 132 insertions(+), 60 deletions(-)

-- 
1.8.2.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-27 11:18   ` Jiri Pirko
  2013-05-27 11:16 ` [PATCH net-next 2/6] arp: flush arp cache on IFF_NOARP change Timo Teräs
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

In certain cases (like the follow up commit to arp.c) will need to
check which flags actually changed to avoid excessive work.

Ben Hutchings nicely worded why to put these transient flags to
struct net_device for the time being:
> It's inelegant to put transient data associated with an event in a
> persistent data structure.  On the other hand, having every user cache
> the old state is pretty awful as well.
>
> Really, netdev notifiers should be changed to accept a structure that
> encapsulates the changes rather than just a pointer to the net_device.
> But making such a change would be an enormous pain and error-prone
> because notifier functions aren't type-safe.
>
> As an interim solution, I think either the general flags_changed or
> old_flags would be preferable to defining extra transient flags.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/netdevice.h | 4 +++-
 net/core/dev.c            | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ea7b6bc..f336e03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1112,7 +1112,9 @@ struct net_device {
 	/* Hardware header description */
 	const struct header_ops *header_ops;
 
-	unsigned int		flags;	/* interface flags (a la BSD)	*/
+	unsigned int		flags;		/* interface flags (a la BSD)	*/
+	unsigned int		flags_changed;	/* flags that are being changed
+						 * valid during NETDEV_CHANGE notifier */
 	unsigned int		priv_flags; /* Like 'flags' but invisible to userspace.
 					     * See if.h for definitions. */
 	unsigned short		gflags;
diff --git a/net/core/dev.c b/net/core/dev.c
index 50c02de..bbaa3c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4742,8 +4742,11 @@ void __dev_notify_flags(struct net_device *dev, unsigned int old_flags)
 	}
 
 	if (dev->flags & IFF_UP &&
-	    (changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE)))
+	    (changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
+		dev->flags_changed = changes;
 		call_netdevice_notifiers(NETDEV_CHANGE, dev);
+		dev->flags_changed = 0;
+	}
 }
 
 /**
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 2/6] arp: flush arp cache on IFF_NOARP change
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 3/6] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

IFF_NOARP affects what kind of neighbor entries are created
(nud NOARP or nud INCOMPLETE). If the flag changes, flush the arp
cache to refresh all entries.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/ipv4/arp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 247ec19..0a15fb7 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1241,6 +1241,10 @@ static int arp_netdev_event(struct notifier_block *this, unsigned long event,
 		neigh_changeaddr(&arp_tbl, dev);
 		rt_cache_flush(dev_net(dev));
 		break;
+	case NETDEV_CHANGE:
+		if (dev->flags_changed & IFF_NOARP)
+			neigh_changeaddr(&arp_tbl, dev);
+		break;
 	default:
 		break;
 	}
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 3/6] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 2/6] arp: flush arp cache on IFF_NOARP change Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 4/6] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs, Steffen Klassert

This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
IPsec pmtu events). Flushing all cached entries is not needed.

Instead, invalidate only the related next hop dsts to recheck for
the added next hop exception where needed. This also fixes a subtle
race due to bumping generation id's before updating the pmtu.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/ipv4/ah4.c    |  7 ++-----
 net/ipv4/esp4.c   |  7 ++-----
 net/ipv4/ipcomp.c |  7 ++-----
 net/ipv4/route.c  | 63 ++++++++++++++++++++++++++++++++-----------------------
 4 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 2e7f194..7179026 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -419,12 +419,9 @@ static void ah4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_AH, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 4cfe34d..ab3d814 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -502,12 +502,9 @@ static void esp4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ESP, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 59cb8c7..826be4c 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -47,12 +47,9 @@ static void ipcomp4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_COMP, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 550781a..561a378 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -594,11 +594,25 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 	return hval & (FNHE_HASH_SIZE - 1);
 }
 
+static void fill_route_from_fnhe(struct rtable *rt, struct fib_nh_exception *fnhe)
+{
+	rt->rt_pmtu = fnhe->fnhe_pmtu;
+	rt->dst.expires = fnhe->fnhe_expires;
+
+	if (fnhe->fnhe_gw) {
+		rt->rt_flags |= RTCF_REDIRECTED;
+		rt->rt_gateway = fnhe->fnhe_gw;
+		rt->rt_uses_gateway = 1;
+	}
+}
+
 static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 				  u32 pmtu, unsigned long expires)
 {
 	struct fnhe_hash_bucket *hash;
 	struct fib_nh_exception *fnhe;
+	struct rtable *rt;
+	unsigned int i;
 	int depth;
 	u32 hval = fnhe_hashfun(daddr);
 
@@ -627,8 +641,12 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_gw = gw;
 		if (pmtu) {
 			fnhe->fnhe_pmtu = pmtu;
-			fnhe->fnhe_expires = expires;
+			fnhe->fnhe_expires = max(1UL, expires);
 		}
+		/* Update all cached dsts too */
+		rt = rcu_dereference(fnhe->fnhe_rth);
+		if (rt)
+			fill_route_from_fnhe(rt, fnhe);
 	} else {
 		if (depth > FNHE_RECLAIM_DEPTH)
 			fnhe = fnhe_oldest(hash);
@@ -644,6 +662,18 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
 		fnhe->fnhe_expires = expires;
+
+		/* Exception created; mark the cached routes for the nexthop
+		 * stale, so anyone caching it rechecks if this exception
+		 * applies to them.
+		 */
+		for_each_possible_cpu(i) {
+			struct rtable __rcu **prt;
+			prt = per_cpu_ptr(nh->nh_pcpu_rth_output, i);
+			rt = rcu_dereference(*prt);
+			if (rt)
+				rt->dst.obsolete = DST_OBSOLETE_KILL;
+		}
 	}
 
 	fnhe->fnhe_stamp = jiffies;
@@ -917,13 +947,6 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 	if (mtu < ip_rt_min_pmtu)
 		mtu = ip_rt_min_pmtu;
 
-	if (!rt->rt_pmtu) {
-		dst->obsolete = DST_OBSOLETE_KILL;
-	} else {
-		rt->rt_pmtu = mtu;
-		dst->expires = max(1UL, jiffies + ip_rt_mtu_expires);
-	}
-
 	rcu_read_lock();
 	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
@@ -1063,11 +1086,11 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 *
-	 * When a PMTU/redirect information update invalidates a
-	 * route, this is indicated by setting obsolete to
-	 * DST_OBSOLETE_KILL.
+	 * When a PMTU/redirect information update invalidates a route,
+	 * this is indicated by setting obsolete to DST_OBSOLETE_KILL or
+	 * DST_OBSOLETE_DEAD by dst_free().
 	 */
-	if (dst->obsolete == DST_OBSOLETE_KILL || rt_is_expired(rt))
+	if (dst->obsolete != DST_OBSOLETE_FORCE_CHK || rt_is_expired(rt))
 		return NULL;
 	return dst;
 }
@@ -1215,20 +1238,8 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 			fnhe->fnhe_pmtu = 0;
 			fnhe->fnhe_expires = 0;
 		}
-		if (fnhe->fnhe_pmtu) {
-			unsigned long expires = fnhe->fnhe_expires;
-			unsigned long diff = expires - jiffies;
-
-			if (time_before(jiffies, expires)) {
-				rt->rt_pmtu = fnhe->fnhe_pmtu;
-				dst_set_expires(&rt->dst, diff);
-			}
-		}
-		if (fnhe->fnhe_gw) {
-			rt->rt_flags |= RTCF_REDIRECTED;
-			rt->rt_gateway = fnhe->fnhe_gw;
-			rt->rt_uses_gateway = 1;
-		} else if (!rt->rt_gateway)
+		fill_route_from_fnhe(rt, fnhe);
+		if (!rt->rt_gateway)
 			rt->rt_gateway = daddr;
 
 		rcu_assign_pointer(fnhe->fnhe_rth, rt);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 4/6] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
                   ` (2 preceding siblings ...)
  2013-05-27 11:16 ` [PATCH net-next 3/6] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-27 11:16 ` [PATCH net-next 5/6] ipv4: use separate genid for next hop exceptions Timo Teräs
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

The tunnel devices call update_pmtu for each packet sent, this causes
contention on the fnhe_lock. Ignore the pmtu update if pmtu is not
actually changed, and there is still plenty of time before the entry
expires.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/ipv4/route.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 561a378..a4082be 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 	if (mtu < ip_rt_min_pmtu)
 		mtu = ip_rt_min_pmtu;
 
+	if (rt->rt_pmtu == mtu &&
+	    time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
+		return;
+
 	rcu_read_lock();
 	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 5/6] ipv4: use separate genid for next hop exceptions
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
                   ` (3 preceding siblings ...)
  2013-05-27 11:16 ` [PATCH net-next 4/6] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-27 11:16 ` [PATCH RFC net-next 6/6] ipv4: use next hop exceptions also for input routes Timo Teräs
  2013-05-28  6:33 ` [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teras
  6 siblings, 0 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs, Steffen Klassert

commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
added the support to flush learned pmtu information.

However, using rt_genid is quite heavy as it is bumped on route
add/change and multicast events amongst other places. These can
happen quote often, especially if using dynamic routing protocols.

While this is ok with routes (as they are just recreated locally),
the pmtu information is learned from remote systems and the icmp
notification can come with long delays. It is worthy to have separate
genid to avoid excessive pmtu resets.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 include/net/ip_fib.h        |  1 +
 include/net/net_namespace.h | 11 +++++++++++
 net/ipv4/route.c            | 12 ++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e49db91..44424e9 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -51,6 +51,7 @@ struct rtable;
 
 struct fib_nh_exception {
 	struct fib_nh_exception __rcu	*fnhe_next;
+	int				fnhe_genid;
 	__be32				fnhe_daddr;
 	u32				fnhe_pmtu;
 	__be32				fnhe_gw;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index b176978..495bc57 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -118,6 +118,7 @@ struct net {
 	struct netns_ipvs	*ipvs;
 	struct sock		*diag_nlsk;
 	atomic_t		rt_genid;
+	atomic_t		fnhe_genid;
 };
 
 /*
@@ -340,4 +341,14 @@ static inline void rt_genid_bump(struct net *net)
 	atomic_inc(&net->rt_genid);
 }
 
+static inline int fnhe_genid(struct net *net)
+{
+	return atomic_read(&net->fnhe_genid);
+}
+
+static inline void fnhe_genid_bump(struct net *net)
+{
+	atomic_inc(&net->fnhe_genid);
+}
+
 #endif /* __NET_NET_NAMESPACE_H */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a4082be..403e283 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -658,6 +658,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_next = hash->chain;
 			rcu_assign_pointer(hash->chain, fnhe);
 		}
+		fnhe->fnhe_genid = fnhe_genid(dev_net(nh->nh_dev));
 		fnhe->fnhe_daddr = daddr;
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
@@ -1236,8 +1237,11 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 	spin_lock_bh(&fnhe_lock);
 
 	if (daddr == fnhe->fnhe_daddr) {
+		int genid = fnhe_genid(dev_net(rt->dst.dev));
 		struct rtable *orig = rcu_dereference(fnhe->fnhe_rth);
-		if (orig && rt_is_expired(orig)) {
+
+		if (fnhe->fnhe_genid != genid) {
+			fnhe->fnhe_genid = genid;
 			fnhe->fnhe_gw = 0;
 			fnhe->fnhe_pmtu = 0;
 			fnhe->fnhe_expires = 0;
@@ -2443,8 +2447,11 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 					void __user *buffer,
 					size_t *lenp, loff_t *ppos)
 {
+	struct net *net = (struct net *)__ctl->extra1;
+
 	if (write) {
-		rt_cache_flush((struct net *)__ctl->extra1);
+		rt_cache_flush(net);
+		fnhe_genid_bump(net);
 		return 0;
 	}
 
@@ -2619,6 +2626,7 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 static __net_init int rt_genid_init(struct net *net)
 {
 	atomic_set(&net->rt_genid, 0);
+	atomic_set(&net->fnhe_genid, 0);
 	get_random_bytes(&net->ipv4.dev_addr_genid,
 			 sizeof(net->ipv4.dev_addr_genid));
 	return 0;
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 6/6] ipv4: use next hop exceptions also for input routes
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
                   ` (4 preceding siblings ...)
  2013-05-27 11:16 ` [PATCH net-next 5/6] ipv4: use separate genid for next hop exceptions Timo Teräs
@ 2013-05-27 11:16 ` Timo Teräs
  2013-05-28  6:33 ` [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teras
  6 siblings, 0 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-27 11:16 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops)
assmued that "locally destined, and routed packets, never trigger
PMTU events or redirects that will be processed by us".

However, it seems that tunnel devices do trigger PMTU events in certain
cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's
skb_dst(skb)->ops->update_pmtu to propage mtu information from the
outer flows. These can cause the inner flow mtu to be decreased. If
next hop exceptions are not consulted for pmtu, IP fragmentation will
not be done properly for these routes.

It also seems that we really need to have the PMTU information always
for netfilter TCPMSS clamp-to-pmtu feature to work properly.

So for the time being, cache separate copies of input routes for
each next hop exception.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 include/net/ip_fib.h     |  3 ++-
 net/ipv4/fib_semantics.c |  3 ++-
 net/ipv4/route.c         | 65 +++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 44424e9..aac8553 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -56,7 +56,8 @@ struct fib_nh_exception {
 	u32				fnhe_pmtu;
 	__be32				fnhe_gw;
 	unsigned long			fnhe_expires;
-	struct rtable __rcu		*fnhe_rth;
+	struct rtable __rcu		*fnhe_rth_input;
+	struct rtable __rcu		*fnhe_rth_output;
 	unsigned long			fnhe_stamp;
 };
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 8f6cb7a..d5dbca5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -169,7 +169,8 @@ static void free_nh_exceptions(struct fib_nh *nh)
 			
 			next = rcu_dereference_protected(fnhe->fnhe_next, 1);
 
-			rt_fibinfo_free(&fnhe->fnhe_rth);
+			rt_fibinfo_free(&fnhe->fnhe_rth_input);
+			rt_fibinfo_free(&fnhe->fnhe_rth_output);
 
 			kfree(fnhe);
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 403e283..82f2074 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -565,10 +565,25 @@ static inline void rt_free(struct rtable *rt)
 
 static DEFINE_SPINLOCK(fnhe_lock);
 
+static void fnhe_flush_routes(struct fib_nh_exception *fnhe)
+{
+	struct rtable *rt;
+
+	rt = rcu_dereference(fnhe->fnhe_rth_input);
+	if (rt) {
+		RCU_INIT_POINTER(fnhe->fnhe_rth_input, NULL);
+		rt_free(rt);
+	}
+	rt = rcu_dereference(fnhe->fnhe_rth_output);
+	if (rt) {
+		RCU_INIT_POINTER(fnhe->fnhe_rth_output, NULL);
+		rt_free(rt);
+	}
+}
+
 static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash)
 {
 	struct fib_nh_exception *fnhe, *oldest;
-	struct rtable *orig;
 
 	oldest = rcu_dereference(hash->chain);
 	for (fnhe = rcu_dereference(oldest->fnhe_next); fnhe;
@@ -576,11 +591,7 @@ static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash)
 		if (time_before(fnhe->fnhe_stamp, oldest->fnhe_stamp))
 			oldest = fnhe;
 	}
-	orig = rcu_dereference(oldest->fnhe_rth);
-	if (orig) {
-		RCU_INIT_POINTER(oldest->fnhe_rth, NULL);
-		rt_free(orig);
-	}
+	fnhe_flush_routes(oldest);
 	return oldest;
 }
 
@@ -644,7 +655,10 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_expires = max(1UL, expires);
 		}
 		/* Update all cached dsts too */
-		rt = rcu_dereference(fnhe->fnhe_rth);
+		rt = rcu_dereference(fnhe->fnhe_rth_input);
+		if (rt)
+			fill_route_from_fnhe(rt, fnhe);
+		rt = rcu_dereference(fnhe->fnhe_rth_output);
 		if (rt)
 			fill_route_from_fnhe(rt, fnhe);
 	} else {
@@ -668,6 +682,10 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 		 * stale, so anyone caching it rechecks if this exception
 		 * applies to them.
 		 */
+		rt = rcu_dereference(nh->nh_rth_input);
+		if (rt)
+			rt->dst.obsolete = DST_OBSOLETE_KILL;
+
 		for_each_possible_cpu(i) {
 			struct rtable __rcu **prt;
 			prt = per_cpu_ptr(nh->nh_pcpu_rth_output, i);
@@ -1237,25 +1255,36 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 	spin_lock_bh(&fnhe_lock);
 
 	if (daddr == fnhe->fnhe_daddr) {
+		struct rtable __rcu **porig;
+		struct rtable *orig;
 		int genid = fnhe_genid(dev_net(rt->dst.dev));
-		struct rtable *orig = rcu_dereference(fnhe->fnhe_rth);
+
+		if (rt_is_input_route(rt))
+			porig = &fnhe->fnhe_rth_input;
+		else
+			porig = &fnhe->fnhe_rth_output;
+		orig = rcu_dereference(*porig);
 
 		if (fnhe->fnhe_genid != genid) {
 			fnhe->fnhe_genid = genid;
 			fnhe->fnhe_gw = 0;
 			fnhe->fnhe_pmtu = 0;
 			fnhe->fnhe_expires = 0;
+			fnhe_flush_routes(fnhe);
+			orig = NULL;
 		}
 		fill_route_from_fnhe(rt, fnhe);
 		if (!rt->rt_gateway)
 			rt->rt_gateway = daddr;
 
-		rcu_assign_pointer(fnhe->fnhe_rth, rt);
-		if (orig)
-			rt_free(orig);
+		if (!(rt->dst.flags & DST_NOCACHE)) {
+			rcu_assign_pointer(*porig, rt);
+			if (orig)
+				rt_free(orig);
+			ret = true;
+		}
 
 		fnhe->fnhe_stamp = jiffies;
-		ret = true;
 	}
 	spin_unlock_bh(&fnhe_lock);
 
@@ -1487,6 +1516,7 @@ static int __mkroute_input(struct sk_buff *skb,
 			   struct in_device *in_dev,
 			   __be32 daddr, __be32 saddr, u32 tos)
 {
+	struct fib_nh_exception *fnhe;
 	struct rtable *rth;
 	int err;
 	struct in_device *out_dev;
@@ -1533,8 +1563,13 @@ static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
+	fnhe = find_exception(&FIB_RES_NH(*res), daddr);
 	if (do_cache) {
-		rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+		if (fnhe != NULL)
+			rth = rcu_dereference(fnhe->fnhe_rth_input);
+		else
+			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+
 		if (rt_cache_valid(rth)) {
 			skb_dst_set_noref(skb, &rth->dst);
 			goto out;
@@ -1562,7 +1597,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->dst.input = ip_forward;
 	rth->dst.output = ip_output;
 
-	rt_set_nexthop(rth, daddr, res, NULL, res->fi, res->type, itag);
+	rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
 	skb_dst_set(skb, &rth->dst);
 out:
 	err = 0;
@@ -1877,7 +1912,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 		fnhe = find_exception(nh, fl4->daddr);
 		if (fnhe)
-			prth = &fnhe->fnhe_rth;
+			prth = &fnhe->fnhe_rth_output;
 		else {
 			if (unlikely(fl4->flowi4_flags &
 				     FLOWI_FLAG_KNOWN_NH &&
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed
  2013-05-27 11:16 ` [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed Timo Teräs
@ 2013-05-27 11:18   ` Jiri Pirko
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2013-05-27 11:18 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev

Mon, May 27, 2013 at 01:16:11PM CEST, timo.teras@iki.fi wrote:
>In certain cases (like the follow up commit to arp.c) will need to
>check which flags actually changed to avoid excessive work.
>
>Ben Hutchings nicely worded why to put these transient flags to
>struct net_device for the time being:
>> It's inelegant to put transient data associated with an event in a
>> persistent data structure.  On the other hand, having every user cache
>> the old state is pretty awful as well.
>>
>> Really, netdev notifiers should be changed to accept a structure that
>> encapsulates the changes rather than just a pointer to the net_device.
>> But making such a change would be an enormous pain and error-prone
>> because notifier functions aren't type-safe.
>>
>> As an interim solution, I think either the general flags_changed or
>> old_flags would be preferable to defining extra transient flags.
>
>Signed-off-by: Timo Teräs <timo.teras@iki.fi>
>Acked-by: Ben Hutchings <bhutchings@solarflare.com>
>---
> include/linux/netdevice.h | 4 +++-
> net/core/dev.c            | 5 ++++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index ea7b6bc..f336e03 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1112,7 +1112,9 @@ struct net_device {
> 	/* Hardware header description */
> 	const struct header_ops *header_ops;
> 
>-	unsigned int		flags;	/* interface flags (a la BSD)	*/
>+	unsigned int		flags;		/* interface flags (a la BSD)	*/
>+	unsigned int		flags_changed;	/* flags that are being changed
>+						 * valid during NETDEV_CHANGE notifier */
> 	unsigned int		priv_flags; /* Like 'flags' but invisible to userspace.
> 					     * See if.h for definitions. */
> 	unsigned short		gflags;
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 50c02de..bbaa3c2 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4742,8 +4742,11 @@ void __dev_notify_flags(struct net_device *dev, unsigned int old_flags)
> 	}
> 
> 	if (dev->flags & IFF_UP &&
>-	    (changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE)))
>+	    (changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
>+		dev->flags_changed = changes;
> 		call_netdevice_notifiers(NETDEV_CHANGE, dev);
>+		dev->flags_changed = 0;
>+	}
> }

Hold on with this please. I'm working on netdev notifier extension. One
would be able to pass custom structure with an event after that. I plan
to post this today/tomorrow.

> 
> /**
>-- 
>1.8.2.3
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/6] ipv4 fixes for dmvpn
  2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
                   ` (5 preceding siblings ...)
  2013-05-27 11:16 ` [PATCH RFC net-next 6/6] ipv4: use next hop exceptions also for input routes Timo Teräs
@ 2013-05-28  6:33 ` Timo Teras
  2013-05-28  6:38   ` David Miller
  6 siblings, 1 reply; 24+ messages in thread
From: Timo Teras @ 2013-05-28  6:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi Dave,

On Mon, 27 May 2013 14:16:10 +0300
Timo Teräs <timo.teras@iki.fi> wrote:

> Collection of pending fixes which mostly hit DMVPN setups, but some
> are doing good to generic setups too.
> 
> Patches 1 & 2 fix the NOARP flag change to flush nud NOARP entries
> from ARP cache. The only change from previous send is updating the
> commit message per request, and adding Ben's Acked-By.

I agree with Jiri, that these can be deferred until he gets the
notifier improvements done.

> Patch 3 reworks IPsec PMTU propagation to be less invasive, and fixes
> certain subtle caching issues.
> 
> Patch 4 and 5 are performance improvements. Both are especially useful
> on dmvpn setups, but the latter patch helps general setups too.

Seems you deferred also these patches 3-5. However, these are not
dependant on the first two patches. Could these be taken a look at?

These would be the patchwork items:
http://patchwork.ozlabs.org/patch/246579/
http://patchwork.ozlabs.org/patch/246575/
http://patchwork.ozlabs.org/patch/246577/

Or should I resubmit them later?

Thanks,
 Timo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/6] ipv4 fixes for dmvpn
  2013-05-28  6:33 ` [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teras
@ 2013-05-28  6:38   ` David Miller
  2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
  0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2013-05-28  6:38 UTC (permalink / raw)
  To: timo.teras; +Cc: netdev

From: Timo Teras <timo.teras@iki.fi>
Date: Tue, 28 May 2013 09:33:15 +0300

> Seems you deferred also these patches 3-5. However, these are not
> dependant on the first two patches. Could these be taken a look at?

Please resubmit them, thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-28  6:38   ` David Miller
@ 2013-05-28  6:46     ` Timo Teräs
  2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
                         ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-28  6:46 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs, Steffen Klassert

This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
IPsec pmtu events). Flushing all cached entries is not needed.

Instead, invalidate only the related next hop dsts to recheck for
the added next hop exception where needed. This also fixes a subtle
race due to bumping generation id's before updating the pmtu.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/ipv4/ah4.c    |  7 ++-----
 net/ipv4/esp4.c   |  7 ++-----
 net/ipv4/ipcomp.c |  7 ++-----
 net/ipv4/route.c  | 63 ++++++++++++++++++++++++++++++++-----------------------
 4 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 2e7f194..7179026 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -419,12 +419,9 @@ static void ah4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_AH, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 4cfe34d..ab3d814 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -502,12 +502,9 @@ static void esp4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ESP, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 59cb8c7..826be4c 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -47,12 +47,9 @@ static void ipcomp4_err(struct sk_buff *skb, u32 info)
 	if (!x)
 		return;
 
-	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH) {
-		atomic_inc(&flow_cache_genid);
-		rt_genid_bump(net);
-
+	if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
 		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_COMP, 0);
-	} else
+	else
 		ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
 	xfrm_state_put(x);
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 550781a..561a378 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -594,11 +594,25 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 	return hval & (FNHE_HASH_SIZE - 1);
 }
 
+static void fill_route_from_fnhe(struct rtable *rt, struct fib_nh_exception *fnhe)
+{
+	rt->rt_pmtu = fnhe->fnhe_pmtu;
+	rt->dst.expires = fnhe->fnhe_expires;
+
+	if (fnhe->fnhe_gw) {
+		rt->rt_flags |= RTCF_REDIRECTED;
+		rt->rt_gateway = fnhe->fnhe_gw;
+		rt->rt_uses_gateway = 1;
+	}
+}
+
 static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 				  u32 pmtu, unsigned long expires)
 {
 	struct fnhe_hash_bucket *hash;
 	struct fib_nh_exception *fnhe;
+	struct rtable *rt;
+	unsigned int i;
 	int depth;
 	u32 hval = fnhe_hashfun(daddr);
 
@@ -627,8 +641,12 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_gw = gw;
 		if (pmtu) {
 			fnhe->fnhe_pmtu = pmtu;
-			fnhe->fnhe_expires = expires;
+			fnhe->fnhe_expires = max(1UL, expires);
 		}
+		/* Update all cached dsts too */
+		rt = rcu_dereference(fnhe->fnhe_rth);
+		if (rt)
+			fill_route_from_fnhe(rt, fnhe);
 	} else {
 		if (depth > FNHE_RECLAIM_DEPTH)
 			fnhe = fnhe_oldest(hash);
@@ -644,6 +662,18 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
 		fnhe->fnhe_expires = expires;
+
+		/* Exception created; mark the cached routes for the nexthop
+		 * stale, so anyone caching it rechecks if this exception
+		 * applies to them.
+		 */
+		for_each_possible_cpu(i) {
+			struct rtable __rcu **prt;
+			prt = per_cpu_ptr(nh->nh_pcpu_rth_output, i);
+			rt = rcu_dereference(*prt);
+			if (rt)
+				rt->dst.obsolete = DST_OBSOLETE_KILL;
+		}
 	}
 
 	fnhe->fnhe_stamp = jiffies;
@@ -917,13 +947,6 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 	if (mtu < ip_rt_min_pmtu)
 		mtu = ip_rt_min_pmtu;
 
-	if (!rt->rt_pmtu) {
-		dst->obsolete = DST_OBSOLETE_KILL;
-	} else {
-		rt->rt_pmtu = mtu;
-		dst->expires = max(1UL, jiffies + ip_rt_mtu_expires);
-	}
-
 	rcu_read_lock();
 	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
@@ -1063,11 +1086,11 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 *
-	 * When a PMTU/redirect information update invalidates a
-	 * route, this is indicated by setting obsolete to
-	 * DST_OBSOLETE_KILL.
+	 * When a PMTU/redirect information update invalidates a route,
+	 * this is indicated by setting obsolete to DST_OBSOLETE_KILL or
+	 * DST_OBSOLETE_DEAD by dst_free().
 	 */
-	if (dst->obsolete == DST_OBSOLETE_KILL || rt_is_expired(rt))
+	if (dst->obsolete != DST_OBSOLETE_FORCE_CHK || rt_is_expired(rt))
 		return NULL;
 	return dst;
 }
@@ -1215,20 +1238,8 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 			fnhe->fnhe_pmtu = 0;
 			fnhe->fnhe_expires = 0;
 		}
-		if (fnhe->fnhe_pmtu) {
-			unsigned long expires = fnhe->fnhe_expires;
-			unsigned long diff = expires - jiffies;
-
-			if (time_before(jiffies, expires)) {
-				rt->rt_pmtu = fnhe->fnhe_pmtu;
-				dst_set_expires(&rt->dst, diff);
-			}
-		}
-		if (fnhe->fnhe_gw) {
-			rt->rt_flags |= RTCF_REDIRECTED;
-			rt->rt_gateway = fnhe->fnhe_gw;
-			rt->rt_uses_gateway = 1;
-		} else if (!rt->rt_gateway)
+		fill_route_from_fnhe(rt, fnhe);
+		if (!rt->rt_gateway)
 			rt->rt_gateway = daddr;
 
 		rcu_assign_pointer(fnhe->fnhe_rth, rt);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
@ 2013-05-28  6:46       ` Timo Teräs
  2013-05-28  8:45         ` Julian Anastasov
  2013-06-03  7:10         ` David Miller
  2013-05-28  6:46       ` [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions Timo Teräs
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: Timo Teräs @ 2013-05-28  6:46 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs

The tunnel devices call update_pmtu for each packet sent, this causes
contention on the fnhe_lock. Ignore the pmtu update if pmtu is not
actually changed, and there is still plenty of time before the entry
expires.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/ipv4/route.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 561a378..a4082be 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 	if (mtu < ip_rt_min_pmtu)
 		mtu = ip_rt_min_pmtu;
 
+	if (rt->rt_pmtu == mtu &&
+	    time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
+		return;
+
 	rcu_read_lock();
 	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions
  2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
  2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
@ 2013-05-28  6:46       ` Timo Teräs
  2013-06-03  7:10         ` David Miller
  2013-05-28  8:25       ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Julian Anastasov
  2013-06-03  7:09       ` David Miller
  3 siblings, 1 reply; 24+ messages in thread
From: Timo Teräs @ 2013-05-28  6:46 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs, Steffen Klassert

commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
added the support to flush learned pmtu information.

However, using rt_genid is quite heavy as it is bumped on route
add/change and multicast events amongst other places. These can
happen quote often, especially if using dynamic routing protocols.

While this is ok with routes (as they are just recreated locally),
the pmtu information is learned from remote systems and the icmp
notification can come with long delays. It is worthy to have separate
genid to avoid excessive pmtu resets.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 include/net/ip_fib.h        |  1 +
 include/net/net_namespace.h | 11 +++++++++++
 net/ipv4/route.c            | 12 ++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e49db91..44424e9 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -51,6 +51,7 @@ struct rtable;
 
 struct fib_nh_exception {
 	struct fib_nh_exception __rcu	*fnhe_next;
+	int				fnhe_genid;
 	__be32				fnhe_daddr;
 	u32				fnhe_pmtu;
 	__be32				fnhe_gw;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index b176978..495bc57 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -118,6 +118,7 @@ struct net {
 	struct netns_ipvs	*ipvs;
 	struct sock		*diag_nlsk;
 	atomic_t		rt_genid;
+	atomic_t		fnhe_genid;
 };
 
 /*
@@ -340,4 +341,14 @@ static inline void rt_genid_bump(struct net *net)
 	atomic_inc(&net->rt_genid);
 }
 
+static inline int fnhe_genid(struct net *net)
+{
+	return atomic_read(&net->fnhe_genid);
+}
+
+static inline void fnhe_genid_bump(struct net *net)
+{
+	atomic_inc(&net->fnhe_genid);
+}
+
 #endif /* __NET_NET_NAMESPACE_H */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a4082be..403e283 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -658,6 +658,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_next = hash->chain;
 			rcu_assign_pointer(hash->chain, fnhe);
 		}
+		fnhe->fnhe_genid = fnhe_genid(dev_net(nh->nh_dev));
 		fnhe->fnhe_daddr = daddr;
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
@@ -1236,8 +1237,11 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 	spin_lock_bh(&fnhe_lock);
 
 	if (daddr == fnhe->fnhe_daddr) {
+		int genid = fnhe_genid(dev_net(rt->dst.dev));
 		struct rtable *orig = rcu_dereference(fnhe->fnhe_rth);
-		if (orig && rt_is_expired(orig)) {
+
+		if (fnhe->fnhe_genid != genid) {
+			fnhe->fnhe_genid = genid;
 			fnhe->fnhe_gw = 0;
 			fnhe->fnhe_pmtu = 0;
 			fnhe->fnhe_expires = 0;
@@ -2443,8 +2447,11 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 					void __user *buffer,
 					size_t *lenp, loff_t *ppos)
 {
+	struct net *net = (struct net *)__ctl->extra1;
+
 	if (write) {
-		rt_cache_flush((struct net *)__ctl->extra1);
+		rt_cache_flush(net);
+		fnhe_genid_bump(net);
 		return 0;
 	}
 
@@ -2619,6 +2626,7 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 static __net_init int rt_genid_init(struct net *net)
 {
 	atomic_set(&net->rt_genid, 0);
+	atomic_set(&net->fnhe_genid, 0);
 	get_random_bytes(&net->ipv4.dev_addr_genid,
 			 sizeof(net->ipv4.dev_addr_genid));
 	return 0;
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
  2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
  2013-05-28  6:46       ` [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions Timo Teräs
@ 2013-05-28  8:25       ` Julian Anastasov
  2013-05-28  8:53         ` Timo Teras
  2013-06-03  7:09       ` David Miller
  3 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2013-05-28  8:25 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev, Steffen Klassert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1527 bytes --]


	Hello,

On Tue, 28 May 2013, Timo Teräs wrote:

> This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
> IPsec pmtu events). Flushing all cached entries is not needed.
> 
> Instead, invalidate only the related next hop dsts to recheck for
> the added next hop exception where needed. This also fixes a subtle
> race due to bumping generation id's before updating the pmtu.
> 
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>
> ---

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 550781a..561a378 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -594,11 +594,25 @@ static inline u32 fnhe_hashfun(__be32 daddr)
>  	return hval & (FNHE_HASH_SIZE - 1);
>  }
>  
> +static void fill_route_from_fnhe(struct rtable *rt, struct fib_nh_exception *fnhe)
> +{
> +	rt->rt_pmtu = fnhe->fnhe_pmtu;
> +	rt->dst.expires = fnhe->fnhe_expires;

	The 'if (time_before' ... dst_set_expires() logic from
rt_bind_exception() is removed, may be it should be moved here,
i.e. fnhe_pmtu should be ignored if expired.

> @@ -627,8 +641,12 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
>  			fnhe->fnhe_gw = gw;
>  		if (pmtu) {
>  			fnhe->fnhe_pmtu = pmtu;
> -			fnhe->fnhe_expires = expires;
> +			fnhe->fnhe_expires = max(1UL, expires);
>  		}
> +		/* Update all cached dsts too */
> +		rt = rcu_dereference(fnhe->fnhe_rth);

	rt = rcu_dereference_protected(fnhe->fnhe_rth, 1);

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
@ 2013-05-28  8:45         ` Julian Anastasov
  2013-05-28 10:07           ` Timo Teras
  2013-06-03  7:10         ` David Miller
  1 sibling, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2013-05-28  8:45 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1493 bytes --]


	Hello,

On Tue, 28 May 2013, Timo Teräs wrote:

> The tunnel devices call update_pmtu for each packet sent, this causes
> contention on the fnhe_lock. Ignore the pmtu update if pmtu is not
> actually changed, and there is still plenty of time before the entry
> expires.
> 
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>
> ---
>  net/ipv4/route.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 561a378..a4082be 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
>  	if (mtu < ip_rt_min_pmtu)
>  		mtu = ip_rt_min_pmtu;
>  
> +	if (rt->rt_pmtu == mtu &&
> +	    time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
> +		return;
> +

	Can we also add logic in this patch in
update_or_create_fnhe, so that we avoid invalidation for cached
routes when only pmtu expiration is updated (same pmtu), i.e.:

+	if (gw || pmtu != fnhe->fnhe_pmtu) {
+		/* Exception created; mark the cached routes for the nexthop
+		...
+	}

	BTW, I now see that previous patch should
call for_each_possible_cpu for the both cases, not
only when fnhe is created but also on update:

	bool invalidate;

	if (fnhe) {
		invalidate = gw || pmtu != fnhe->fnhe_pmtu;
		...
	} else {
		...
		invalidate = true;
	}

	if (invalidate) {
		/* Exception created; mark the cached routes for the nexthop
		...
	}

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-28  8:25       ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Julian Anastasov
@ 2013-05-28  8:53         ` Timo Teras
  2013-05-28 19:44           ` Julian Anastasov
  0 siblings, 1 reply; 24+ messages in thread
From: Timo Teras @ 2013-05-28  8:53 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev, Steffen Klassert

On Tue, 28 May 2013 11:25:55 +0300 (EEST)
Julian Anastasov <ja@ssi.bg> wrote:

> 
> 	Hello,
> 
> On Tue, 28 May 2013, Timo Teräs wrote:
> 
> > This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
> > IPsec pmtu events). Flushing all cached entries is not needed.
> > 
> > Instead, invalidate only the related next hop dsts to recheck for
> > the added next hop exception where needed. This also fixes a subtle
> > race due to bumping generation id's before updating the pmtu.
> > 
> > Cc: Steffen Klassert <steffen.klassert@secunet.com>
> > Signed-off-by: Timo Teräs <timo.teras@iki.fi>
> > ---
> 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 550781a..561a378 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -594,11 +594,25 @@ static inline u32 fnhe_hashfun(__be32 daddr)
> >  	return hval & (FNHE_HASH_SIZE - 1);
> >  }
> >  
> > +static void fill_route_from_fnhe(struct rtable *rt, struct
> > fib_nh_exception *fnhe) +{
> > +	rt->rt_pmtu = fnhe->fnhe_pmtu;
> > +	rt->dst.expires = fnhe->fnhe_expires;
> 
> 	The 'if (time_before' ... dst_set_expires() logic from
> rt_bind_exception() is removed, may be it should be moved here,
> i.e. fnhe_pmtu should be ignored if expired.

That code would not help much. The route's rt_pmtu is never reset to
zero after the fnhe expires, so this would not make much difference.
The old rt_pmtu and dst.expires would be left there anyway. All rt
accesses check for expires too.

This was actually intentional on the old code, as non-zero rt_pmtu
implied that we had "next hop exception route" instead of "next hop
route" and affected how the rt was invalidated in pmtu update.

If we want to clear out these fields, then it would make sense to have
rt_bind_exception() to reset these on expiry to the struct
fib_nh_exception directly and keep the direct assignments in
fill_route_from_fnhe().

- Timo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28  8:45         ` Julian Anastasov
@ 2013-05-28 10:07           ` Timo Teras
  2013-05-28 21:04             ` Julian Anastasov
  0 siblings, 1 reply; 24+ messages in thread
From: Timo Teras @ 2013-05-28 10:07 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev

On Tue, 28 May 2013 11:45:51 +0300 (EEST)
Julian Anastasov <ja@ssi.bg> wrote:

> On Tue, 28 May 2013, Timo Teräs wrote:
> 
> > The tunnel devices call update_pmtu for each packet sent, this
> > causes contention on the fnhe_lock. Ignore the pmtu update if pmtu
> > is not actually changed, and there is still plenty of time before
> > the entry expires.
> > 
> > Signed-off-by: Timo Teräs <timo.teras@iki.fi>
> > ---
> >  net/ipv4/route.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 561a378..a4082be 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct rtable
> > *rt, struct flowi4 *fl4, u32 mtu) if (mtu < ip_rt_min_pmtu)
> >  		mtu = ip_rt_min_pmtu;
> >  
> > +	if (rt->rt_pmtu == mtu &&
> > +	    time_before(jiffies, dst->expires -
> > ip_rt_mtu_expires / 2))
> > +		return;
> > +
> 
> 	Can we also add logic in this patch in
> update_or_create_fnhe, so that we avoid invalidation for cached
> routes when only pmtu expiration is updated (same pmtu), i.e.:
> 
> +	if (gw || pmtu != fnhe->fnhe_pmtu) {
> +		/* Exception created; mark the cached routes for the
> nexthop
> +		...
> +	}
> 
> 	BTW, I now see that previous patch should
> call for_each_possible_cpu for the both cases, not
> only when fnhe is created but also on update:

Why would this be needed?

The idea is that if there is no next hop exception, everyone is using
the next hop's dsts. If there is a next hop exception hashed, they will
be using those routes in the exception entry.

When the exception is created, we need to invalidate the nexthop's
routes, to force relookup of these dst's if should go to the exception.
Basically it flushes the nexthop's 'default' dst.

But if we have a valid exception, and we are just updating it. Everyone
is already using the right dst. The update_or_create_fnhe() updates all
exception's dst's that are effected. Since the set of hosts to which
that exception applies is the same, I don't see why we would need to
invalidate the nexthop's 'default' dst.

- Timo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-28  8:53         ` Timo Teras
@ 2013-05-28 19:44           ` Julian Anastasov
  0 siblings, 0 replies; 24+ messages in thread
From: Julian Anastasov @ 2013-05-28 19:44 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev, Steffen Klassert


	Hello,

On Tue, 28 May 2013, Timo Teras wrote:

> > > +static void fill_route_from_fnhe(struct rtable *rt, struct
> > > fib_nh_exception *fnhe) +{
> > > +	rt->rt_pmtu = fnhe->fnhe_pmtu;
> > > +	rt->dst.expires = fnhe->fnhe_expires;
> > 
> > 	The 'if (time_before' ... dst_set_expires() logic from
> > rt_bind_exception() is removed, may be it should be moved here,
> > i.e. fnhe_pmtu should be ignored if expired.
> 
> That code would not help much. The route's rt_pmtu is never reset to
> zero after the fnhe expires, so this would not make much difference.
> The old rt_pmtu and dst.expires would be left there anyway. All rt
> accesses check for expires too.

	I see, ipv4_mtu validates rt_pmtu expiration, so
such checks are not needed in rt_bind_exception and
fill_route_from_fnhe.

> This was actually intentional on the old code, as non-zero rt_pmtu
> implied that we had "next hop exception route" instead of "next hop
> route" and affected how the rt was invalidated in pmtu update.
> 
> If we want to clear out these fields, then it would make sense to have
> rt_bind_exception() to reset these on expiry to the struct
> fib_nh_exception directly and keep the direct assignments in
> fill_route_from_fnhe().
> 
> - Timo

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28 10:07           ` Timo Teras
@ 2013-05-28 21:04             ` Julian Anastasov
  2013-05-29  5:07               ` Timo Teras
  0 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2013-05-28 21:04 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev


	Hello,

On Tue, 28 May 2013, Timo Teras wrote:

> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct rtable
> > > *rt, struct flowi4 *fl4, u32 mtu) if (mtu < ip_rt_min_pmtu)
> > >  		mtu = ip_rt_min_pmtu;
> > >  
> > > +	if (rt->rt_pmtu == mtu &&
> > > +	    time_before(jiffies, dst->expires -
> > > ip_rt_mtu_expires / 2))
> > > +		return;
> > > +
> > 
> > 	Can we also add logic in this patch in
> > update_or_create_fnhe, so that we avoid invalidation for cached
> > routes when only pmtu expiration is updated (same pmtu), i.e.:
> > 
> > +	if (gw || pmtu != fnhe->fnhe_pmtu) {
> > +		/* Exception created; mark the cached routes for the
> > nexthop
> > +		...
> > +	}
> > 
> > 	BTW, I now see that previous patch should
> > call for_each_possible_cpu for the both cases, not
> > only when fnhe is created but also on update:
> 
> Why would this be needed?
> 
> The idea is that if there is no next hop exception, everyone is using
> the next hop's dsts. If there is a next hop exception hashed, they will
> be using those routes in the exception entry.
> 
> When the exception is created, we need to invalidate the nexthop's
> routes, to force relookup of these dst's if should go to the exception.
> Basically it flushes the nexthop's 'default' dst.
> 
> But if we have a valid exception, and we are just updating it. Everyone
> is already using the right dst. The update_or_create_fnhe() updates all
> exception's dst's that are effected. Since the set of hosts to which
> that exception applies is the same, I don't see why we would need to
> invalidate the nexthop's 'default' dst.

	Agreed for the NH route. But there is another case:
when fnhe exists for fnhe_pmtu!=0 and fnhe_gw=0 and the cached
FNHE route has just rt_pmtu!=0 and rt_uses_gateway=0. In the
event of redirect we should invalidate fnhe_rth.
Users should come again to get the new gw from the
same fnhe. As you note, nh_pcpu_rth_output does not need
to be invalidated, it was already invalidated when
fnhe was created.

	So, my idea is something like this. If cached
route detects change in gw on redirect, it calls
update_or_create_fnhe but FNHE can already know this gw,
we should invalidate the fnhe_rth only if gw changes.
As caller has some stale cache route, it should be
invalidated as before, I assume 'kill_route' is properly
provided.

	if (fnhe) {
		if (fnhe->fnhe_gw != gw && gw) {
			rt = rcu_dereference_protected(fnhe->fnhe_rth, 1);
			if (rt)
				rt->dst.obsolete = DST_OBSOLETE_KILL;
			fnhe->fnhe_gw = gw;
		}
		...
	}
	...
}

	Also, I don't know the XFRM code well but
xfrm_bundle_ok() calls dst_check. Now ipv4_dst_check
will not give any indication for recent change in
rt_pmtu. I hope this is not a problem because I see
some comparisons with cached pmtus. I.e. the main
question is: are there any dst_check and ->check users
that needs to know for changes in rt_pmtu or all
of them just use dst_mtu().

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28 21:04             ` Julian Anastasov
@ 2013-05-29  5:07               ` Timo Teras
  2013-05-29 22:49                 ` Julian Anastasov
  0 siblings, 1 reply; 24+ messages in thread
From: Timo Teras @ 2013-05-29  5:07 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev

On Wed, 29 May 2013 00:04:47 +0300 (EEST)
Julian Anastasov <ja@ssi.bg> wrote:

> 
> 	Hello,
> 
> On Tue, 28 May 2013, Timo Teras wrote:
> 
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -947,6 +947,10 @@ static void __ip_rt_update_pmtu(struct
> > > > rtable *rt, struct flowi4 *fl4, u32 mtu) if (mtu <
> > > > ip_rt_min_pmtu) mtu = ip_rt_min_pmtu;
> > > >  
> > > > +	if (rt->rt_pmtu == mtu &&
> > > > +	    time_before(jiffies, dst->expires -
> > > > ip_rt_mtu_expires / 2))
> > > > +		return;
> > > > +
> > > 
> > > 	Can we also add logic in this patch in
> > > update_or_create_fnhe, so that we avoid invalidation for cached
> > > routes when only pmtu expiration is updated (same pmtu), i.e.:
> > > 
> > > +	if (gw || pmtu != fnhe->fnhe_pmtu) {
> > > +		/* Exception created; mark the cached routes for
> > > the nexthop
> > > +		...
> > > +	}
> > > 
> > > 	BTW, I now see that previous patch should
> > > call for_each_possible_cpu for the both cases, not
> > > only when fnhe is created but also on update:
> > 
> > Why would this be needed?
> > 
> > The idea is that if there is no next hop exception, everyone is
> > using the next hop's dsts. If there is a next hop exception hashed,
> > they will be using those routes in the exception entry.
> > 
> > When the exception is created, we need to invalidate the nexthop's
> > routes, to force relookup of these dst's if should go to the
> > exception. Basically it flushes the nexthop's 'default' dst.
> > 
> > But if we have a valid exception, and we are just updating it.
> > Everyone is already using the right dst. The
> > update_or_create_fnhe() updates all exception's dst's that are
> > effected. Since the set of hosts to which that exception applies is
> > the same, I don't see why we would need to invalidate the nexthop's
> > 'default' dst.
> 
> 	Agreed for the NH route. But there is another case:
> when fnhe exists for fnhe_pmtu!=0 and fnhe_gw=0 and the cached
> FNHE route has just rt_pmtu!=0 and rt_uses_gateway=0. In the
> event of redirect we should invalidate fnhe_rth.
> Users should come again to get the new gw from the
> same fnhe. As you note, nh_pcpu_rth_output does not need
> to be invalidated, it was already invalidated when
> fnhe was created.
> 
> 	So, my idea is something like this. If cached
> route detects change in gw on redirect, it calls
> update_or_create_fnhe but FNHE can already know this gw,
> we should invalidate the fnhe_rth only if gw changes.
> As caller has some stale cache route, it should be
> invalidated as before, I assume 'kill_route' is properly
> provided.
> 
> 	if (fnhe) {
> 		if (fnhe->fnhe_gw != gw && gw) {
> 			rt =
> rcu_dereference_protected(fnhe->fnhe_rth, 1); if (rt)
> 				rt->dst.obsolete = DST_OBSOLETE_KILL;
> 			fnhe->fnhe_gw = gw;
> 		}
> 		...
> 	}
> 	...
> }

This is already done by the caller of update_or_create_fnhe for the
case of gw change. It might be worth to put all this to the same place,
but would be worth of separate patch.

> 	Also, I don't know the XFRM code well but
> xfrm_bundle_ok() calls dst_check. Now ipv4_dst_check
> will not give any indication for recent change in
> rt_pmtu. I hope this is not a problem because I see
> some comparisons with cached pmtus. I.e. the main
> question is: are there any dst_check and ->check users
> that needs to know for changes in rt_pmtu or all
> of them just use dst_mtu().

In XFRM case, xfrm_bundle_ok() will propagate the pmtu by calling
dst_mtu() for each target. All caching users need to use dst_mtu() to
check it, so the route.c code is correct - this is how it worked before
my patches too.

- Timo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-29  5:07               ` Timo Teras
@ 2013-05-29 22:49                 ` Julian Anastasov
  0 siblings, 0 replies; 24+ messages in thread
From: Julian Anastasov @ 2013-05-29 22:49 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev


	Hello,

On Wed, 29 May 2013, Timo Teras wrote:

> On Wed, 29 May 2013 00:04:47 +0300 (EEST)
> Julian Anastasov <ja@ssi.bg> wrote:
> 
> > 	So, my idea is something like this. If cached
> > route detects change in gw on redirect, it calls
> > update_or_create_fnhe but FNHE can already know this gw,
> > we should invalidate the fnhe_rth only if gw changes.
> > As caller has some stale cache route, it should be
> > invalidated as before, I assume 'kill_route' is properly
> > provided.
> > 
> > 	if (fnhe) {
> > 		if (fnhe->fnhe_gw != gw && gw) {
> > 			rt =
> > rcu_dereference_protected(fnhe->fnhe_rth, 1); if (rt)
> > 				rt->dst.obsolete = DST_OBSOLETE_KILL;
> > 			fnhe->fnhe_gw = gw;
> > 		}
> > 		...
> > 	}
> > 	...
> > }
> 
> This is already done by the caller of update_or_create_fnhe for the
> case of gw change. It might be worth to put all this to the same place,
> but would be worth of separate patch.

	Looking at do_redirect() for TCP and __sk_dst_check
may be such race is very small because first gw change
for another socket will invalidate our cached entry and
our socket will fail to deliver the new gw to FNHE
without a two-step validation as done in inet_csk_update_pmtu().
Still, it should be more safe to change and invalidate
at the same place - update_or_create_fnhe. But we can
investigate it later.

> > 	Also, I don't know the XFRM code well but
> > xfrm_bundle_ok() calls dst_check. Now ipv4_dst_check
> > will not give any indication for recent change in
> > rt_pmtu. I hope this is not a problem because I see
> > some comparisons with cached pmtus. I.e. the main
> > question is: are there any dst_check and ->check users
> > that needs to know for changes in rt_pmtu or all
> > of them just use dst_mtu().
> 
> In XFRM case, xfrm_bundle_ok() will propagate the pmtu by calling
> dst_mtu() for each target. All caching users need to use dst_mtu() to
> check it, so the route.c code is correct - this is how it worked before
> my patches too.

	Aha, so nobody needs DST_OBSOLETE_KILL when
fnhe_pmtu changes. Thanks for answering my questions!

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events
  2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
                         ` (2 preceding siblings ...)
  2013-05-28  8:25       ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Julian Anastasov
@ 2013-06-03  7:09       ` David Miller
  3 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2013-06-03  7:09 UTC (permalink / raw)
  To: timo.teras; +Cc: netdev, steffen.klassert

From: Timo Teräs <timo.teras@iki.fi>
Date: Tue, 28 May 2013 09:46:31 +0300

> This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
> IPsec pmtu events). Flushing all cached entries is not needed.
> 
> Instead, invalidate only the related next hop dsts to recheck for
> the added next hop exception where needed. This also fixes a subtle
> race due to bumping generation id's before updating the pmtu.
> 
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>

Applied.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu
  2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
  2013-05-28  8:45         ` Julian Anastasov
@ 2013-06-03  7:10         ` David Miller
  1 sibling, 0 replies; 24+ messages in thread
From: David Miller @ 2013-06-03  7:10 UTC (permalink / raw)
  To: timo.teras; +Cc: netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Tue, 28 May 2013 09:46:32 +0300

> The tunnel devices call update_pmtu for each packet sent, this causes
> contention on the fnhe_lock. Ignore the pmtu update if pmtu is not
> actually changed, and there is still plenty of time before the entry
> expires.
> 
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>

Applied.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions
  2013-05-28  6:46       ` [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions Timo Teräs
@ 2013-06-03  7:10         ` David Miller
  0 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2013-06-03  7:10 UTC (permalink / raw)
  To: timo.teras; +Cc: netdev, steffen.klassert

From: Timo Teräs <timo.teras@iki.fi>
Date: Tue, 28 May 2013 09:46:33 +0300

> commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
> added the support to flush learned pmtu information.
> 
> However, using rt_genid is quite heavy as it is bumped on route
> add/change and multicast events amongst other places. These can
> happen quote often, especially if using dynamic routing protocols.
> 
> While this is ok with routes (as they are just recreated locally),
> the pmtu information is learned from remote systems and the icmp
> notification can come with long delays. It is worthy to have separate
> genid to avoid excessive pmtu resets.
> 
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>

Applied.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-06-03  7:10 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-27 11:16 [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teräs
2013-05-27 11:16 ` [PATCH net-next 1/6] net: inform NETDEV_CHANGE callbacks which flags were changed Timo Teräs
2013-05-27 11:18   ` Jiri Pirko
2013-05-27 11:16 ` [PATCH net-next 2/6] arp: flush arp cache on IFF_NOARP change Timo Teräs
2013-05-27 11:16 ` [PATCH net-next 3/6] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
2013-05-27 11:16 ` [PATCH net-next 4/6] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
2013-05-27 11:16 ` [PATCH net-next 5/6] ipv4: use separate genid for next hop exceptions Timo Teräs
2013-05-27 11:16 ` [PATCH RFC net-next 6/6] ipv4: use next hop exceptions also for input routes Timo Teräs
2013-05-28  6:33 ` [PATCH net-next 0/6] ipv4 fixes for dmvpn Timo Teras
2013-05-28  6:38   ` David Miller
2013-05-28  6:46     ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Timo Teräs
2013-05-28  6:46       ` [PATCH net-next 2/3] ipv4: rate limit updating of next hop exceptions with same pmtu Timo Teräs
2013-05-28  8:45         ` Julian Anastasov
2013-05-28 10:07           ` Timo Teras
2013-05-28 21:04             ` Julian Anastasov
2013-05-29  5:07               ` Timo Teras
2013-05-29 22:49                 ` Julian Anastasov
2013-06-03  7:10         ` David Miller
2013-05-28  6:46       ` [PATCH net-next 3/3] ipv4: use separate genid for next hop exceptions Timo Teräs
2013-06-03  7:10         ` David Miller
2013-05-28  8:25       ` [PATCH net-next 1/3] ipv4: properly refresh rtable entries on pmtu/redirect events Julian Anastasov
2013-05-28  8:53         ` Timo Teras
2013-05-28 19:44           ` Julian Anastasov
2013-06-03  7:09       ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).