Bug in net/ipv6/ip6_fib.c:fib6_dump

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
@ 2012-06-12 17:22 Debabrata Banerjee
  2012-06-21 19:35 ` Josh Hunt
  0 siblings, 1 reply; 15+ messages in thread
From: Debabrata Banerjee @ 2012-06-12 17:22 UTC (permalink / raw)
  To: netdev, davem, kaber, yoshfuji, jmorris, pekkas, kuznet
  Cc: linux-kernel, eric.dumazet, Joshua Hunt

Looks like commit 2bec5a369ee79576a3eea2c23863325089785a2c "ipv6: fib:
fix crash when changing large fib while dumping" is the culprit. The
result of this code is that if there is a tree addition while a dump
has suspended because the netlink skb is full, it will simply go back
to the top of the tree and you end up with duplicate/triplicate/etc
routes. It looks like the code attempts to count nodes, but it's a
linear count and the data structure is a tree so that's a big problem.
The net result is potentially DOSable, since if route table updates
happen often enough in proportion to table size, a dump will attempt
to return an infinite amount of routes (observed). So this commit
should be reverted. However I am interested in the problem that commit
tried to solve, if anyone has more information on that. My assumption
is the fib tree gets corrupted and eventually it crashes in
fib6_dump_table(), which I assume can still happen.

I can easily demonstrate the bug by adding cloned/cache routes while I
check the results of fib6_dump_table:

root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
593
189
root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
884
16
root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
888
78
root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
507
507
root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
533
533
root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
571
571

Thanks,
Debabrata

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-12 17:22 Bug in net/ipv6/ip6_fib.c:fib6_dump_table() Debabrata Banerjee
@ 2012-06-21 19:35 ` Josh Hunt
  2012-06-21 20:27   ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Hunt @ 2012-06-21 19:35 UTC (permalink / raw)
  To: davem@davemloft.net, kaber@trash.net
  Cc: Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org,
	eric.dumazet@gmail.com

On 06/12/2012 12:22 PM, Debabrata Banerjee wrote:
> Looks like commit 2bec5a369ee79576a3eea2c23863325089785a2c "ipv6: fib:
> fix crash when changing large fib while dumping" is the culprit. The
> result of this code is that if there is a tree addition while a dump
> has suspended because the netlink skb is full, it will simply go back
> to the top of the tree and you end up with duplicate/triplicate/etc
> routes. It looks like the code attempts to count nodes, but it's a
> linear count and the data structure is a tree so that's a big problem.
> The net result is potentially DOSable, since if route table updates
> happen often enough in proportion to table size, a dump will attempt
> to return an infinite amount of routes (observed). So this commit
> should be reverted. However I am interested in the problem that commit
> tried to solve, if anyone has more information on that. My assumption
> is the fib tree gets corrupted and eventually it crashes in
> fib6_dump_table(), which I assume can still happen.
> 
> I can easily demonstrate the bug by adding cloned/cache routes while I
> check the results of fib6_dump_table:
> 
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 593
> 189
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 884
> 16
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 888
> 78
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 507
> 507
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 533
> 533
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 571
> 571
> 
> Thanks,
> Debabrata

Ping?

Can anyone provide details of the crash which was intended to be fixed
by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
doing concurrent adds/deletes and dumping the table via netlink causes
duplicate entries to be reported. Reverting this patch causes those
problems to go away. We can provide a more detailed test if that is
needed, but so far our testing has been unable to reproduce the crash
mentioned in the above commit with it reverted.

Thanks
Josh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-21 19:35 ` Josh Hunt
@ 2012-06-21 20:27   ` Eric Dumazet
  2012-06-21 21:50     ` Alexey Kuznetsov
  2012-06-22  6:49     ` Josh Hunt
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-21 20:27 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:

> Can anyone provide details of the crash which was intended to be fixed
> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
> doing concurrent adds/deletes and dumping the table via netlink causes
> duplicate entries to be reported. Reverting this patch causes those
> problems to go away. We can provide a more detailed test if that is
> needed, but so far our testing has been unable to reproduce the crash
> mentioned in the above commit with it reverted.

A mere revert wont be enough.

Looking at this code, it lacks proper synchronization
between tree updaters and tree walkers.

fib6_walker_lock rwlock is not enough to prevent races.

Are you willing to fix this yourself ?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-21 20:27   ` Eric Dumazet
@ 2012-06-21 21:50     ` Alexey Kuznetsov
  2012-06-22  3:34       ` Gao feng
  2012-06-22  6:49     ` Josh Hunt
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Kuznetsov @ 2012-06-21 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org

On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.

Hmm. As author of this weird code, I must say I honestly believed it was correct.
At least I tried. :-)

What's about 2bec5a336.., it does not look reasonable.
The idea was that when you change tree, you fixup sleeping walkers, moving
their location in tree to correct point. So, walkers must not have any stale pointers
at any times (except when you under table write lock) and no skips/counts are required.
I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
but it is definitely worth of efforts.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-21 21:50     ` Alexey Kuznetsov
@ 2012-06-22  3:34       ` Gao feng
  0 siblings, 0 replies; 15+ messages in thread
From: Gao feng @ 2012-06-22  3:34 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Eric Dumazet, Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org

于 2012年06月22日 05:50, Alexey Kuznetsov 写道:
> On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
>> Looking at this code, it lacks proper synchronization
>> between tree updaters and tree walkers.
>>
>> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Hmm. As author of this weird code, I must say I honestly believed it was correct.
> At least I tried. :-)
> 
> 
> What's about 2bec5a336.., it does not look reasonable.
> The idea was that when you change tree, you fixup sleeping walkers, moving
> their location in tree to correct point. So, walkers must not have any stale pointers
> at any times (except when you under table write lock) and no skips/counts are required.
> I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
> so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
> but it is definitely worth of efforts.


Actually, I spent two months to try to reproduce this crash four months ago,
But finally I give up, I don't think there was any stale pointers,
we already correct it when we change the tree.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-21 20:27   ` Eric Dumazet
  2012-06-21 21:50     ` Alexey Kuznetsov
@ 2012-06-22  6:49     ` Josh Hunt
  2012-06-22  8:29       ` Eric Dumazet
  1 sibling, 1 reply; 15+ messages in thread
From: Josh Hunt @ 2012-06-22  6:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> 
>> Can anyone provide details of the crash which was intended to be fixed
>> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
>> doing concurrent adds/deletes and dumping the table via netlink causes
>> duplicate entries to be reported. Reverting this patch causes those
>> problems to go away. We can provide a more detailed test if that is
>> needed, but so far our testing has been unable to reproduce the crash
>> mentioned in the above commit with it reverted.
> 
> A mere revert wont be enough.
> 
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Are you willing to fix this yourself ?
> 

Looking through the code a bit more it seems like we would need to have
a lock in fib6_walker_t to protect its contents. Mainly for when we
update the pointers in fib6_del_route and fib6_repair_tree. Right now
there is the fib6_walker_lock, but that appears to only be protecting
the elements of the list, not their contents. Is this what you had in
mind? I just coded up something along these lines and it works for the
most part, but I also got a message about unsafe lock ordering when I
stressed it so I am messing something up. If this sounds like it's on
the right track I can work out the kinks in the morning.

Josh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-22  6:49     ` Josh Hunt
@ 2012-06-22  8:29       ` Eric Dumazet
  2012-06-22 13:44         ` Josh Hunt
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2012-06-22  8:29 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On Fri, 2012-06-22 at 01:49 -0500, Josh Hunt wrote:
> On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> > On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> > 
> >> Can anyone provide details of the crash which was intended to be fixed
> >> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
> >> doing concurrent adds/deletes and dumping the table via netlink causes
> >> duplicate entries to be reported. Reverting this patch causes those
> >> problems to go away. We can provide a more detailed test if that is
> >> needed, but so far our testing has been unable to reproduce the crash
> >> mentioned in the above commit with it reverted.
> > 
> > A mere revert wont be enough.
> > 
> > Looking at this code, it lacks proper synchronization
> > between tree updaters and tree walkers.
> > 
> > fib6_walker_lock rwlock is not enough to prevent races.
> > 
> > Are you willing to fix this yourself ?
> > 
> 
> Looking through the code a bit more it seems like we would need to have
> a lock in fib6_walker_t to protect its contents. Mainly for when we
> update the pointers in fib6_del_route and fib6_repair_tree. Right now
> there is the fib6_walker_lock, but that appears to only be protecting
> the elements of the list, not their contents. Is this what you had in
> mind? I just coded up something along these lines and it works for the
> most part, but I also got a message about unsafe lock ordering when I
> stressed it so I am messing something up. If this sounds like it's on
> the right track I can work out the kinks in the morning.

Hmm, it seems tb6_lock is held by a writer, so its safe :

a tree walker can run only holding a read_lock on tb6_lock

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-22  8:29       ` Eric Dumazet
@ 2012-06-22 13:44         ` Josh Hunt
  2012-06-22 18:13           ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Hunt @ 2012-06-22 13:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On 06/22/2012 03:29 AM, Eric Dumazet wrote:
> On Fri, 2012-06-22 at 01:49 -0500, Josh Hunt wrote:
>> On 06/21/2012 03:27 PM, Eric Dumazet wrote:
>>> On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
>>>
>>>> Can anyone provide details of the crash which was intended to be fixed
>>>> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
>>>> doing concurrent adds/deletes and dumping the table via netlink causes
>>>> duplicate entries to be reported. Reverting this patch causes those
>>>> problems to go away. We can provide a more detailed test if that is
>>>> needed, but so far our testing has been unable to reproduce the crash
>>>> mentioned in the above commit with it reverted.
>>>
>>> A mere revert wont be enough.
>>>
>>> Looking at this code, it lacks proper synchronization
>>> between tree updaters and tree walkers.
>>>
>>> fib6_walker_lock rwlock is not enough to prevent races.
>>>
>>> Are you willing to fix this yourself ?
>>>
>>
>> Looking through the code a bit more it seems like we would need to have
>> a lock in fib6_walker_t to protect its contents. Mainly for when we
>> update the pointers in fib6_del_route and fib6_repair_tree. Right now
>> there is the fib6_walker_lock, but that appears to only be protecting
>> the elements of the list, not their contents. Is this what you had in
>> mind? I just coded up something along these lines and it works for the
>> most part, but I also got a message about unsafe lock ordering when I
>> stressed it so I am messing something up. If this sounds like it's on
>> the right track I can work out the kinks in the morning.
> 
> Hmm, it seems tb6_lock is held by a writer, so its safe :
> 
> a tree walker can run only holding a read_lock on tb6_lock

Ahh. That makes sense and is what Alexey said before I just didn't put
it all together. So we are OK reverting this patch? I cannot find a path
where the walker's pointers are updated without the tb6_lock write_lock.

Josh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-22 13:44         ` Josh Hunt
@ 2012-06-22 18:13           ` Eric Dumazet
  2012-06-22 21:12             ` Debabrata Banerjee
  2012-06-23  0:02             ` David Miller
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-22 18:13 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On Fri, 2012-06-22 at 08:44 -0500, Josh Hunt wrote:

> Ahh. That makes sense and is what Alexey said before I just didn't put
> it all together. So we are OK reverting this patch? I cannot find a path
> where the walker's pointers are updated without the tb6_lock write_lock.
> 

There was a bug somewhere, not sure we want to NULL dereference again.

Following fix should at least avoid a never ending dump

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 74c21b9..6083276 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1349,8 +1349,8 @@ static int fib6_walk_continue(struct fib6_walker_t *w)
 			if (w->leaf && fn->fn_flags & RTN_RTINFO) {
 				int err;
 
-				if (w->count < w->skip) {
-					w->count++;
+				if (w->skip) {
+					w->skip--;
 					continue;
 				}
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-22 18:13           ` Eric Dumazet
@ 2012-06-22 21:12             ` Debabrata Banerjee
  2012-06-23  0:02             ` David Miller
  1 sibling, 0 replies; 15+ messages in thread
From: Debabrata Banerjee @ 2012-06-22 21:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Josh Hunt, davem@davemloft.net, kaber@trash.net,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org

On Fri, Jun 22, 2012 at 2:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-06-22 at 08:44 -0500, Josh Hunt wrote:
>
>> Ahh. That makes sense and is what Alexey said before I just didn't put
>> it all together. So we are OK reverting this patch? I cannot find a path
>> where the walker's pointers are updated without the tb6_lock write_lock.
>>
>
> There was a bug somewhere, not sure we want to NULL dereference again.
>

As you identified, the tree seems to be protected by tb6_lock. I
couldn't find a race by inspection either. If this is not the root of
the problem, how would this patch fix it? So I think it does nothing.
We are attempting to reproduce that crash to prove it, but like Gao
feng I don't think we will see it.

My current favorite theory is that inet6_dump_fib was called with a
NULL func in callback. This looks like the approximate area of the
crash, but it's impossible to say without more information from
Patrick McHardy.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-22 18:13           ` Eric Dumazet
  2012-06-22 21:12             ` Debabrata Banerjee
@ 2012-06-23  0:02             ` David Miller
  2012-06-23  5:37               ` Eric Dumazet
  1 sibling, 1 reply; 15+ messages in thread
From: David Miller @ 2012-06-23  0:02 UTC (permalink / raw)
  To: eric.dumazet
  Cc: johunt, kaber, dbavatar, netdev, yoshfuji, jmorris, pekkas,
	kuznet, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 22 Jun 2012 20:13:05 +0200

> On Fri, 2012-06-22 at 08:44 -0500, Josh Hunt wrote:
> 
>> Ahh. That makes sense and is what Alexey said before I just didn't put
>> it all together. So we are OK reverting this patch? I cannot find a path
>> where the walker's pointers are updated without the tb6_lock write_lock.
>> 
> 
> There was a bug somewhere, not sure we want to NULL dereference again.

Well:

1) Patrick McHardy has been inactive for a while, so do not expect
   any insight from him.

2) Ben Greear isn't even on the CC: list of this discussion yet he
   appears to be the person who reproduced the crash way back then
   and is listed in the Tested-by tag of the commit.

   As a result we aren't likely to get any insight from the one person
   who actually could hit the crash.

I'm inclined to just revert simply because we have people active who
can reproduce regressions introduced by this change and nobody can
understand why the change is even necessary.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-23  0:02             ` David Miller
@ 2012-06-23  5:37               ` Eric Dumazet
  2012-06-23 20:55                 ` Alexey Kuznetsov
  2012-06-25 22:40                 ` David Miller
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-23  5:37 UTC (permalink / raw)
  To: David Miller
  Cc: johunt, kaber, dbavatar, netdev, yoshfuji, jmorris, pekkas,
	kuznet, linux-kernel, Ben Greear

From: Eric Dumazet <edumazet@google.com>

> 1) Patrick McHardy has been inactive for a while, so do not expect
>    any insight from him.
> 
> 2) Ben Greear isn't even on the CC: list of this discussion yet he
>    appears to be the person who reproduced the crash way back then
>    and is listed in the Tested-by tag of the commit.
> 
>    As a result we aren't likely to get any insight from the one person
>    who actually could hit the crash.
> 
> I'm inclined to just revert simply because we have people active who
> can reproduce regressions introduced by this change and nobody can
> understand why the change is even necessary.

Well, except that :

I spent 3 hours trying to understand Alexey code and failed.

All other /proc/net files don't have a such sophisticated walkers aware
mechanism (easily DOSable by the way, if some guy opens 10.000 handles
and suspend in the middle the dumps).

cat /proc/net/tcp for example can display same socket twice or miss a
socket, because a 'suspend/restart' remembers offsets/counts in a hash
chain, not a pointer to 'next socket'

The fix I submitted is a real one, based on my analysis and tests.

Patrick patch was restarting the dump at the root of the tree, and
setting skip = count was doing nothing at all, since all entries were
dumped again.

This is more a stable candidate fix.

If someones smarter than me can find the real bug, then we certainly can
revert Patrick patch ?

[PATCH] ipv6: fib: fix fib dump restart

Commit 2bec5a369ee79576a3 (ipv6: fib: fix crash when changing large fib
while dumping it) introduced ability to restart the dump at tree root,
but failed to skip correctly a count of already dumped entries. Code
didn't match Patrick intent.

We must skip exactly the number of already dumped entries.

Note that like other /proc/net files or netlink producers, we could
still dump some duplicates entries.

Reported-by: Debabrata Banerjee <dbavatar@gmail.com>
Reported-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Ben Greear <greearb@candelatech.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
---
 net/ipv6/ip6_fib.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 74c21b9..6083276 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1349,8 +1349,8 @@ static int fib6_walk_continue(struct fib6_walker_t *w)
 			if (w->leaf && fn->fn_flags & RTN_RTINFO) {
 				int err;

-				if (w->count < w->skip) {
-					w->count++;
+				if (w->skip) {
+					w->skip--;
 					continue;
 				}

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-23  5:37               ` Eric Dumazet
@ 2012-06-23 20:55                 ` Alexey Kuznetsov
  2012-06-23 23:02                   ` David Miller
  2012-06-25 22:40                 ` David Miller
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Kuznetsov @ 2012-06-23 20:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, johunt, kaber, dbavatar, netdev, yoshfuji, jmorris,
	pekkas, linux-kernel, Ben Greear

On Sat, Jun 23, 2012 at 07:37:31AM +0200, Eric Dumazet wrote:
> All other /proc/net files don't have a such sophisticated walkers aware
> mechanism

I can explain why.

IPv6 routing table has a capital management drawback: core policy rules are mixed
with dynamic cache and addrconf routes in one structure.
(BTW it is one of reasons why I did not want to integrate routing cache to fib for IPv4)

Do you see the problem? F.e. when you do iptables-save, you do not expect
that it can occasionally miss some rules (unless you mess with it in parallel, of course)
The same is here. When you dump routing table, you are allowed to miss some cache routes,
but if you have a chance to miss at least one of important routes just because
unimportant dynamic part is alway under change, it is fatal.

There are a lot of ways to solve the problem, all of them have some flaws.
F.e. I can remember:
* atomic dump like bsd sysctl.
* keeping administrative routes in a separate list, which can be walked using skip/count
etc.

This way with walkers I chose because it looked quite optimal and because
it was an exciting little task for brains . :-)

> (easily DOSable by the way, if some guy opens 10.000 handles
> and suspend in the middle the dumps).

This is true. The easiest way to fix this is just to limit amount of readers,
putting them on hold.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-23 20:55                 ` Alexey Kuznetsov
@ 2012-06-23 23:02                   ` David Miller
  0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2012-06-23 23:02 UTC (permalink / raw)
  To: kuznet
  Cc: eric.dumazet, johunt, kaber, dbavatar, netdev, yoshfuji, jmorris,
	pekkas, linux-kernel, greearb

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Sun, 24 Jun 2012 00:55:46 +0400

> IPv6 routing table has a capital management drawback: core policy
> rules are mixed with dynamic cache and addrconf routes in one
> structure.  (BTW it is one of reasons why I did not want to
> integrate routing cache to fib for IPv4)

Yes, and this causes other problems too.  Recently I had to make the
dst cache not count pure ipv6 routes otherwise cache size limited how
many actual routes administrator could add.

I would like to eventually make ipv4 and ipv6 more similar rather than
more different.  BTW, decision to use different host models (weak vs.
strong) in the two stacks was another idiotic move which makes
consolidation and code auditing harder.

I think once my long work to kill the ipv4 routing cache is complete
and successful we can model ipv6 after the results.

Major blockers are in two areas, reliance upon rt->rt_dst and...
performance :-)

Main reliance upon rt->rt_dst are:

1) Neighbours, which I plan to move to a model where lookups are
   done on demand using RCU and lack of refcounts.

   There are a few stragglers in infiniband and elsewhere that still
   want to get a neighbour from a dst and I haven't converted over to
   a lookup-on-demand model.  I'm slowly working through those but it
   is painful and thankless work.

   It also involved trying to figure out reliable replacements for
   magic tests like:

	if (!dst_get_neighbour_noref_raw(&rt->dst) && !(rt->rt6i_flags & RTF_NONEXTHOP))

   in ipv6.  Really, the set of ipv6 dsts which have a neighbour
   pre-attached is non-trivial to describe via other means.

   dst_confirm() is left, which I'll handle by setting a "neigh
   confirm pending bit", and next packet output when we have the neigh
   looked up we'll update it's state and clear the bit in the dst.  Or
   something like this.  Maybe a u8 or an int instead of a flag so we
   don't need atomic ops.

   Divorcing neigh from dst can have another huge benefit, no more
   neighbour table overflow because small prefixed route for very
   active subnets with improperly adjusted neighbour cache sizing.
   We'll have more freedom to toss neighs because they'll be largely
   ref-less unlike now where every route to external place holds onto
   neigh.

2) Metrics, which really must be done differently.

   Currently the scheme I have in mind is:

   a) Pure TCP metrics move into RCU ref-count-free table and are
      accessed on-demand.  When TCP connection starts up, TCP fetches
      metrics block from table.  When TCP connection closes, TCP
      pushes new metrics values into table.

   b) PMTU and redirect information is moved back into route.

      We clone new routes in FIB trie when PMTU or redirects are
      received.

   Metric table will be rooted in FIB table like inetpeer is now.

   Inetpeer will become nearly orphan once more, only used for IP ID
   generation and IPv4 fragment ID wrap detection.

Then we have no more need for rt->rt_dst to point to a specific IP
address once the routing cache is removed.  It means we can use
routes constructed completely inside of FIB trie entries.

Next is worse area, performance.  I can easily make output route
lookups fast without the routing cache, but input... mama mia!

Problems are two-fold:

1) Separation of local and main table, I plan to combine them.  Well,
   this applies to output and input routes.

   This was really a terrible design decision.  Only the most obscure
   critters take advantage of this separation, yet everyone pays the
   price.  What's more their goals can be achieved by other means.

   It means that every fib_lookup() is essentially two FIB trie
   traversals instead of just one.

2) fib_validate_source(), really it is the most painful monster and
   should never have been put into the FIB.  It is really a netfilter
   service, at best.

   It means that, for forwarding global routes, we currently make 4
   FIB trie traversals.  FOUR.  So no matter how fast Robert Olsson
   made fib_trie, it still needs to be consulted 4 times.

   I've tried to come up with algorithms that do this validation
   cheaply.  Especially for default typical configuration where this
   kind of check is especially stupid and pointless.  I have not had
   any major breakthroughs.

   For a workstation or typical one-interface server, it we eliminate
   loopback anomalies earlier in the path, it can be a simple check I
   suppose.

   I plan to facilitate this also by making non-unicast specific
   destination determination on-demand.  Then there is class ID
   determination, another huge hardship on everyone created by a
   feature with a tiny class of users.

Anyways, that is brain dump.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
  2012-06-23  5:37               ` Eric Dumazet
  2012-06-23 20:55                 ` Alexey Kuznetsov
@ 2012-06-25 22:40                 ` David Miller
  1 sibling, 0 replies; 15+ messages in thread
From: David Miller @ 2012-06-25 22:40 UTC (permalink / raw)
  To: eric.dumazet
  Cc: johunt, kaber, dbavatar, netdev, yoshfuji, jmorris, pekkas,
	kuznet, linux-kernel, greearb

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 23 Jun 2012 07:37:31 +0200

> [PATCH] ipv6: fib: fix fib dump restart
> 
> Commit 2bec5a369ee79576a3 (ipv6: fib: fix crash when changing large fib
> while dumping it) introduced ability to restart the dump at tree root,
> but failed to skip correctly a count of already dumped entries. Code
> didn't match Patrick intent.
> 
> We must skip exactly the number of already dumped entries.
> 
> Note that like other /proc/net files or netlink producers, we could
> still dump some duplicates entries.
> 
> Reported-by: Debabrata Banerjee <dbavatar@gmail.com>
> Reported-by: Josh Hunt <johunt@akamai.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I've applied this.

But I wonder if it does the right thing, to be honest.

When tree change is detected, w->skip is set to w->count

But with your change, w->count won't be the number of entries to
skip from the root after the first time we handle a tree change.

So on the second tree change, we'll skip the wrong number of
entries, since the w->count we save into w->skip will be biased
by the previous w->skip value.  So we'll skip too few entries.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-25 22:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-12 17:22 Bug in net/ipv6/ip6_fib.c:fib6_dump_table() Debabrata Banerjee
2012-06-21 19:35 ` Josh Hunt
2012-06-21 20:27   ` Eric Dumazet
2012-06-21 21:50     ` Alexey Kuznetsov
2012-06-22  3:34       ` Gao feng
2012-06-22  6:49     ` Josh Hunt
2012-06-22  8:29       ` Eric Dumazet
2012-06-22 13:44         ` Josh Hunt
2012-06-22 18:13           ` Eric Dumazet
2012-06-22 21:12             ` Debabrata Banerjee
2012-06-23  0:02             ` David Miller
2012-06-23  5:37               ` Eric Dumazet
2012-06-23 20:55                 ` Alexey Kuznetsov
2012-06-23 23:02                   ` David Miller
2012-06-25 22:40                 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).