* [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
@ 2014-04-24 13:59 Chris Mason
From: Chris Mason @ 2014-04-24 13:59 UTC (permalink / raw)
To: netdev
The ipv6 code to dump routes in /proc/net/ipv6_route can hold
a read lock on the table for a very long time. This ends up blocking
writers and triggering softlockups.
This patch is a simple workaround to limit the number of entries
we'll walk while processing /proc/net/ipv6_route. It intentionally
slows down proc file reading to make sure we don't lock out the
real ipv6 traffic.
This patch is also horrible, and doesn't actually fix the entire
problem. We still have rcu_read_lock held the whole time we cat
/proc/net/ipv6_route. On an unpatched machine, I've clocked the
time required to cat /proc/net/ipv6_route at 14 minutes.
java cats this proc file on startup to search for local routes, and the
resulting contention on the table lock makes our boxes fall over.
So, I'm sending the partial fix to get discussion started.
Signed-off-by: Chris Mason <clm@fb.com>
---
net/ipv6/ip6_fib.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 87891f5..19b0f78 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1814,6 +1814,7 @@ struct ipv6_route_iter {
loff_t skip;
struct fib6_table *tbl;
__u32 sernum;
+ int max_walk;
};
static int ipv6_route_seq_show(struct seq_file *seq, void *v)
@@ -1853,8 +1854,11 @@ static int ipv6_route_yield(struct fib6_walker_t *w)
iter->skip--;
if (!iter->skip && iter->w.leaf)
return 1;
+ iter->max_walk--;
} while (iter->w.leaf);
+ if (iter->max_walk <= 0)
+ return -EAGAIN;
return 0;
}
@@ -1867,6 +1871,7 @@ static void ipv6_route_seq_setup_walk(struct ipv6_route_iter *iter)
iter->w.node = iter->w.root;
iter->w.args = iter;
iter->sernum = iter->w.root->fn_sernum;
+ iter->max_walk = 128;
INIT_LIST_HEAD(&iter->w.lh);
fib6_walker_link(&iter->w);
}
@@ -1921,7 +1926,9 @@ static void *ipv6_route_seq_next(struct seq_file *seq, void *v, loff_t *pos)
iter_table:
ipv6_route_check_sernum(iter);
+iter_again:
read_lock(&iter->tbl->tb6_lock);
+ iter->max_walk = 128;
r = fib6_walk_continue(&iter->w);
read_unlock(&iter->tbl->tb6_lock);
if (r > 0) {
@@ -1929,6 +1936,8 @@ iter_table:
++*pos;
return iter->w.leaf;
} else if (r < 0) {
+ if (r == -EAGAIN)
+ goto iter_again;
fib6_walker_unlink(&iter->w);
return NULL;
}
--
1.8.1
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Hannes Frederic Sowa @ 2014-04-24 14:20 UTC (permalink / raw)
To: Chris Mason; +Cc: netdev
Hi!
On Thu, Apr 24, 2014 at 09:59:24AM -0400, Chris Mason wrote:
> The ipv6 code to dump routes in /proc/net/ipv6_route can hold
> a read lock on the table for a very long time. This ends up blocking
> writers and triggering softlockups.
>
> This patch is a simple workaround to limit the number of entries
> we'll walk while processing /proc/net/ipv6_route. It intentionally
> slows down proc file reading to make sure we don't lock out the
> real ipv6 traffic.
I guess most time is spent in formatting and printing the rt6_info details
to the procfs file. Have you tried excluding !(rt6_info->rt6i_flags &
RTF_CACHE) routes?
Maybe this is a viable alternative. A patch could also check for
RTF_DYNAMIC and RTF_MODIFIED so we would still show redirected and
mtu-caching nodes.
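
The filter Hannes describes might look roughly like the sketch below. This is a user-space illustration, not the actual kernel change: the flag values are the ones from the uapi headers, but seq_show_route() is a made-up name for the predicate that the /proc seq_file show path would apply before formatting an entry.

```c
#include <stdbool.h>

/* Flag values as defined in <linux/ipv6_route.h> and <linux/route.h> */
#define RTF_DYNAMIC  0x0010      /* created dynamically (by redirect) */
#define RTF_MODIFIED 0x0020      /* modified dynamically (by redirect/PMTU) */
#define RTF_CACHE    0x01000000  /* cloned cache entry */

/* Hypothetical filter: skip plain cloned cache entries, but keep
 * redirected (RTF_DYNAMIC) and mtu-modified (RTF_MODIFIED) ones,
 * so the interesting per-destination state still shows up. */
static bool seq_show_route(unsigned int rt6i_flags)
{
	if ((rt6i_flags & RTF_CACHE) &&
	    !(rt6i_flags & (RTF_DYNAMIC | RTF_MODIFIED)))
		return false;
	return true;
}
```

With a fib dominated by cache clones, a filter like this shrinks the dump by roughly the clone-to-route ratio, at the cost of no longer showing every entry a lookup could hit.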
> This patch is also horrible, and doesn't actually fix the entire
> problem. We still have rcu_read_lock held the whole time we cat
> /proc/net/ipv6_route. On an unpatched machine, I've clocked the
> time required to cat /proc/net/ipv6_route at 14 minutes.
>
> java cats this proc file on startup to search for local routes, and the
> resulting contention on the table lock makes our boxes fall over.
Urks, does plain openjdk do that or is this something in your application?
>
> So, I'm sending the partial fix to get discussion started.
I am planning to submit patches which reduce the caching of DST_HOST
entries in the ipv6 fib next month which will result in a much smaller
fib to walk by then.
Greetings,
Hannes
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Eric Dumazet @ 2014-04-24 14:20 UTC (permalink / raw)
To: Chris Mason; +Cc: netdev
On Thu, 2014-04-24 at 09:59 -0400, Chris Mason wrote:
> The ipv6 code to dump routes in /proc/net/ipv6_route can hold
> a read lock on the table for a very long time. This ends up blocking
> writers and triggering softlockups.
>
> This patch is a simple workaround to limit the number of entries
> we'll walk while processing /proc/net/ipv6_route. It intentionally
> slows down proc file reading to make sure we don't lock out the
> real ipv6 traffic.
>
> This patch is also horrible, and doesn't actually fix the entire
> problem. We still have rcu_read_lock held the whole time we cat
> /proc/net/ipv6_route. On an unpatched machine, I've clocked the
> time required to cat /proc/net/ipv6_route at 14 minutes.
>
> java cats this proc file on startup to search for local routes, and the
> resulting contention on the table lock makes our boxes fall over.
>
> So, I'm sending the partial fix to get discussion started.
>
> Signed-off-by: Chris Mason <clm@fb.com>
>
> ---
It might be time to convert this to RCU and/or a better data structure.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Eric Dumazet @ 2014-04-24 14:30 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: Chris Mason, netdev
On Thu, 2014-04-24 at 16:20 +0200, Hannes Frederic Sowa wrote:
> I am planning to submit patches which reduce the caching of DST_HOST
> entries in the ipv6 fib next month which will result in a much smaller
> fib to walk by then.
Or, if caching is still needed, we might use a separate hash table for
this, as it is more scalable.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Chris Mason @ 2014-04-24 14:41 UTC (permalink / raw)
To: netdev
On 04/24/2014 10:20 AM, Hannes Frederic Sowa wrote:
> Hi!
>
> On Thu, Apr 24, 2014 at 09:59:24AM -0400, Chris Mason wrote:
>> The ipv6 code to dump routes in /proc/net/ipv6_route can hold
>> a read lock on the table for a very long time. This ends up blocking
>> writers and triggering softlockups.
>>
>> This patch is a simple workaround to limit the number of entries
>> we'll walk while processing /proc/net/ipv6_route. It intentionally
>> slows down proc file reading to make sure we don't lock out the
>> real ipv6 traffic.
>
> I guess most time is spent in formatting and printing the rt6_info details
> to the procfs file. Have you tried excluding !(rt6_info->rt6i_flags &
> RTF_CACHE) routes?
We do have a separate patch from Paul Saab that excludes the cached
routes and it has a big impact (~10x fewer entries). But the
softlockups still flow.
I was going to discuss the cache exclusion on a separate thread, but the
short version is that I don't have any clue of how many people we'd
upset by unconditionally leaving out the cached entries.
>
> Maybe this is a viable alternative. A patch could also check for
> RTF_DYNAMIC and RTF_MODIFIED so we would still show redirected and
> mtu-caching nodes.
>
>> This patch is also horrible, and doesn't actually fix the entire
>> problem. We still have rcu_read_lock held the whole time we cat
>> /proc/net/ipv6_route. On an unpatched machine, I've clocked the
>> time required to cat /proc/net/ipv6_route at 14 minutes.
>>
>> java cats this proc file on startup to search for local routes, and the
>> resulting contention on the table lock makes our boxes fall over.
>
> Urks, does plain openjdk do that or is this something in your application?
>
Seems to be built into the jdk, and not our app.
>>
>> So, I'm sending the partial fix to get discussion started.
>
> I am planning to submit patches which reduce the caching of DST_HOST
> entries in the ipv6 fib next month which will result in a much smaller
> fib to walk by then.
Great.
-chris
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: David Miller @ 2014-04-25 19:53 UTC (permalink / raw)
To: clm; +Cc: netdev
From: Chris Mason <clm@fb.com>
Date: Thu, 24 Apr 2014 09:59:24 -0400
> java cats this proc file on startup to search for local routes, and the
> resulting contention on the table lock makes our boxes fall over.
I know it's tangential, but Java should open a netlink socket and probe
for such things properly.
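
The netlink dump David suggests starts from a small RTM_GETROUTE request. A sketch of just the request construction (struct route_dump_req and build_ipv6_route_dump are made-up names; sending this on an AF_NETLINK/NETLINK_ROUTE socket and parsing the replies is omitted):

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Request asking the kernel to dump the whole IPv6 routing table via
 * rtnetlink, instead of reading /proc/net/ipv6_route. */
struct route_dump_req {
	struct nlmsghdr nlh;
	struct rtmsg    rtm;
};

static void build_ipv6_route_dump(struct route_dump_req *req)
{
	memset(req, 0, sizeof(*req));
	/* length covers the netlink header plus the rtmsg payload */
	req->nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtmsg));
	req->nlh.nlmsg_type  = RTM_GETROUTE;
	/* NLM_F_DUMP: return the full table as a multipart reply */
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->rtm.rtm_family  = AF_INET6;
}
```

The replies arrive as a stream of RTM_NEWROUTE messages terminated by NLMSG_DONE, which the application can filter for local routes without the kernel holding the proc-file locks across the whole dump.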
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: David Miller @ 2014-04-25 20:09 UTC (permalink / raw)
To: clm; +Cc: netdev
From: Chris Mason <clm@fb.com>
Date: Thu, 24 Apr 2014 09:59:24 -0400
> The ipv6 code to dump routes in /proc/net/ipv6_route can hold
> a read lock on the table for a very long time. This ends up blocking
> writers and triggering softlockups.
>
> This patch is a simple workaround to limit the number of entries
> we'll walk while processing /proc/net/ipv6_route. It intentionally
> slows down proc file reading to make sure we don't lock out the
> real ipv6 traffic.
>
> This patch is also horrible, and doesn't actually fix the entire
> problem. We still have rcu_read_lock held the whole time we cat
> /proc/net/ipv6_route. On an unpatched machine, I've clocked the
> time required to cat /proc/net/ipv6_route at 14 minutes.
There is another way to more effectively mitigate this.
Take the rtnl mutex over the traversals.
The tables cannot change if you hold it.
Then you can use rcu_dereference_rtnl() in the table traversals and
get rid of the RCU locking entirely.
Now you're only left with the read locking over the individual trees.
And as in your patch we can drop it temporarily after a limit is hit.
But yes, longer term we need to convert the ipv6 route trees over to
RCU or similar.
Even better would be to align the ipv6 routing with how ipv4 works
since the routing-cache removal.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Chris Mason @ 2014-04-25 20:27 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On 04/25/2014 04:09 PM, David Miller wrote:
> From: Chris Mason <clm@fb.com>
> Date: Thu, 24 Apr 2014 09:59:24 -0400
>
>> The ipv6 code to dump routes in /proc/net/ipv6_route can hold
>> a read lock on the table for a very long time. This ends up blocking
>> writers and triggering softlockups.
>>
>> This patch is a simple workaround to limit the number of entries
>> we'll walk while processing /proc/net/ipv6_route. It intentionally
>> slows down proc file reading to make sure we don't lock out the
>> real ipv6 traffic.
>>
>> This patch is also horrible, and doesn't actually fix the entire
>> problem. We still have rcu_read_lock held the whole time we cat
>> /proc/net/ipv6_route. On an unpatched machine, I've clocked the
>> time required to cat /proc/net/ipv6_route at 14 minutes.
>
> There is another way to more effectively mitigate this.
>
> Take the rtnl mutex over the traversals.
>
> The tables cannot change if you hold it.
>
> Then you can use rcu_dereference_rtnl() in the table traversals and
> get rid of the RCU locking entirely.
Ah ok, so the rtnl mutex can replace rcu_read_lock(). Will it end up
blocking any traffic? (sorry, filesystem guys are a little slow)
>
> Now you're only left with the read locking over the individual trees.
> And as in your patch we can drop it temporarily after a limit is hit.
That would be wonderful because I can use some cond_resched() variant,
and get rid of the max_walk counter completely.
>
> But yes, longer term we need to convert the ipv6 route trees over to
> RCU or similar.
Instead of the ->skip counter, can we get a cursor into the tree and
just resume walking at the first entry after that cursor? It would have
to be a key that we copy out instead of a pointer so we can drop the
rcu_read_lock().
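
The cursor idea generalizes to any ordered structure: copy out the last emitted key, drop every lock, and on the next read resume at the first entry strictly greater than that key. A user-space analogue over a sorted array (resume_walk is a made-up name; the kernel equivalent would re-find the node with fib6_locate() using a saved rt6key):

```c
#include <stddef.h>

/* Resumable walk using a copied key as the cursor instead of a pointer,
 * so nothing in the structure stays pinned between read() calls.
 * 'last_key' is the last entry already emitted; the walk resumes at the
 * first entry strictly greater than it. Returns the number of entries
 * left from the resume point and sets *out to the first of them. */
static size_t resume_walk(const int *table, size_t n,
			  int last_key, const int **out)
{
	size_t i;

	/* linear scan for clarity; a tree would use its key lookup here */
	for (i = 0; i < n; i++) {
		if (table[i] > last_key) {
			*out = &table[i];
			return n - i;
		}
	}
	*out = NULL;
	return 0;
}
```

If an entry is deleted between batches the walk simply resumes at its successor, which is the usual, acceptable semantics for a seq_file dump of a live table.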
>
> Even better would be to align the ipv6 routing with how ipv4 works
> since the routing-cache removal.
>
I'll shop that task around here.
-chris
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Hannes Frederic Sowa @ 2014-04-25 21:31 UTC (permalink / raw)
To: Chris Mason; +Cc: netdev
On Thu, Apr 24, 2014 at 10:41:11AM -0400, Chris Mason wrote:
> I was going to discuss the cache exclusion on a separate thread, but the
> short version is that I don't have any clue of how many people we'd
> upset by unconditionally leaving out the cached entries.
In my opinion this would be ok.
We definitely must reduce the number of entries in the fib and I don't think
it is wise to synthesize them thereafter, especially if we don't need to track
state at all, e.g. when forwarding.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Hannes Frederic Sowa @ 2014-04-25 21:52 UTC (permalink / raw)
To: Chris Mason; +Cc: David Miller, netdev
On Fri, Apr 25, 2014 at 04:27:21PM -0400, Chris Mason wrote:
> On 04/25/2014 04:09 PM, David Miller wrote:
> >From: Chris Mason <clm@fb.com>
> >Date: Thu, 24 Apr 2014 09:59:24 -0400
> >
> >>The ipv6 code to dump routes in /proc/net/ipv6_route can hold
> >>a read lock on the table for a very long time. This ends up blocking
> >>writers and triggering softlockups.
> >>
> >>This patch is a simple workaround to limit the number of entries
> >>we'll walk while processing /proc/net/ipv6_route. It intentionally
> >>slows down proc file reading to make sure we don't lock out the
> >>real ipv6 traffic.
> >>
> >>This patch is also horrible, and doesn't actually fix the entire
> >>problem. We still have rcu_read_lock held the whole time we cat
> >>/proc/net/ipv6_route. On an unpatched machine, I've clocked the
> >>time required to cat /proc/net/ipv6_route at 14 minutes.
> >
> >There is another way to more effectively mitigate this.
> >
> >Take the rtnl mutex over the traversals.
> >
> >The tables cannot change if you hold it.
> >
> >Then you can use rcu_dereference_rtnl() in the table traversals and
> >get rid of the RCU locking entirely.
That's a pretty good idea!
> Ah ok, so the rtnl mutex can replace rcu_read_lock(). Will it end up
> blocking any traffic? (sorry, filesystem guys are a little slow)
Traffic flow shouldn't get stopped, but any network stack configuration
change would block for the duration; this could also affect ipv4.
> >Now you're only left with the read locking over the individual trees.
> >And as in your patch we can drop it temporarily after a limit is hit.
>
> That would be wonderful because I can use some cond_resched() variant,
> and get rid of the max_walk counter completely.
>
> >
> >But yes, longer term we need to convert the ipv6 route trees over to
> >RCU or similar.
>
> Instead of the ->skip counter, can we get a cursor into the tree and
> just resume walking at the first entry after that cursor? It would have
> to be a key that we copy out instead of a pointer so we can drop the
> rcu_read_lock()
The keys could be struct rt6key (rt6i_dst, rt6i_src) and the table
(which won't get dropped because of rtnl lock).
The function to locate the node would be fib6_locate, because we would
have to respect the prefix lengths during lookup.
> >Even better would be to align the ipv6 routing with how ipv4 works
> >since the routing-cache removal.
> >
>
> I'll shop that task around here.
I'll send out some patches reducing DST_CACHE entries soon (hopefully
next week), so we can build on them.
I am very keen on getting this task done.
Thanks,
Hannes
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: David Miller @ 2014-04-26 4:06 UTC (permalink / raw)
To: hannes; +Cc: clm, netdev
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri, 25 Apr 2014 23:31:40 +0200
> On Thu, Apr 24, 2014 at 10:41:11AM -0400, Chris Mason wrote:
>> I was going to discuss the cache exclusion on a separate thread, but the
>> short version is that I don't have any clue of how many people we'd
>> upset by unconditionally leaving out the cached entries.
>
> In my opinion this would be ok.
>
> We definitely must reduce the number of entries in the fib and I don't think
> it is wise to synthesize them thereafter, especially if we don't need to track
> state at all, e.g. when forwarding.
I think this is a dangerous idea, to limit what is shown in the dumps.
We must always output the entry that would actually be used by a
lookup, otherwise you are asking for impossible-to-debug scenarios.
On the other side, once we change ipv6 to work like ipv4, those cached
entries simply won't be there at all and then automatically this huge
dump issue just goes away.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: David Miller @ 2014-04-26 4:11 UTC (permalink / raw)
To: clm; +Cc: netdev
From: Chris Mason <clm@fb.com>
Date: Fri, 25 Apr 2014 16:27:21 -0400
> Ah ok, so the rtnl mutex can replace rcu_read_lock(). Will it end up
> blocking any traffic? (sorry, filesystem guys are a little slow)
It stops networking configuration operations.
BTW, this means that netlink based ipv6 route dumps can be
optimized similarly. Because those code paths implicitly
hold the RTNL mutex.
* Re: [PATCH RFC] ipv6_fib limit spinlock hold times for /proc/net/ipv6_route
From: Chris Mason @ 2014-04-28 17:21 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On 04/26/2014 12:11 AM, David Miller wrote:
> From: Chris Mason <clm@fb.com>
> Date: Fri, 25 Apr 2014 16:27:21 -0400
>
>> Ah ok, so the rtnl mutex can replace rcu_read_lock(). Will it end up
>> blocking any traffic? (sorry, filesystem guys are a little slow)
>
> It stops networking configuration operations.
>
> BTW, this means that netlink based ipv6 route dumps can be
> optimized similarly. Because those code paths implicitly
> hold the RTNL mutex.
>
Ok, the rtnl mutex looks like a much better path. I'll give it a shot
this week.
If anyone working on patches in this area wants help testing, just let
me know.
-chris