Re: conntrackd, internal cache keeps filling up

* Re: conntrackd, internal cache keeps filling up
       [not found]   ` <20140510061743.GA32197@finrod>
@ 2014-05-12 16:35     ` Pablo Neira Ayuso
  2014-05-13 11:45       ` Martin Kraus
  2014-07-11 16:27       ` Martin Kraus
  0 siblings, 2 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2014-05-12 16:35 UTC
  To: Martin Kraus; +Cc: netfilter-devel, netfilter

On Sat, May 10, 2014 at 08:17:45AM +0200, Martin Kraus wrote:
> On Fri, May 09, 2014 at 01:31:29PM +0200, Pablo Neira Ayuso wrote:
> > > There's thousands of these entries and in a few days they'll fill up the
> > > internal cache and break internal routing.
> > 
> > Could you retry with the latest conntrackd version, 1.4.2?
> 
> will try 1.4.2. we just need to package it.

OK.

> > You didn't specify your Linux kernel version either. Thanks.
> 
> current kernel is 3.13.7. 
> 
> we already hit a bug in the official 3.2 kernel packaged with wheezy where 
> our scan for heartbleed vulnerability would cause conntrackd to kernel panic
> the router.

Please, provide more information on how to reproduce the problem that
you're noticing. Thank you.


* Re: conntrackd, internal cache keeps filling up
  2014-05-12 16:35     ` conntrackd, internal cache keeps filling up Pablo Neira Ayuso
@ 2014-05-13 11:45       ` Martin Kraus
  2014-05-13 12:04         ` Florian Westphal
  2014-05-13 12:40         ` Pablo Neira Ayuso
  2014-07-11 16:27       ` Martin Kraus
  1 sibling, 2 replies; 7+ messages in thread
From: Martin Kraus @ 2014-05-13 11:45 UTC
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netfilter

On Mon, May 12, 2014 at 06:35:38PM +0200, Pablo Neira Ayuso wrote:
> > current kernel is 3.13.7. 
> > 
> > we already hit a bug in the official 3.2 kernel packaged with wheezy where 
> > our scan for heartbleed vulnerability would cause conntrackd to kernel panic
> > the router.
> 
> Please, provide more information on how to reproduce the problem that
> you're noticing. Thank you.

regarding the kernel panic on 3.2, a colleague of mine was using nmap with its
heartbleed plugin

nmap --script ssl-heartbleed -sT -oX logfile.log 10.0.0.0/20

http://nmap.org/nsedoc/scripts/ssl-heartbleed.html

it took about 30 minutes to trigger the problem.

regarding the internal cache fill up. we have two routers and some vlans using
one and some vlans using the other router as the default gateway. 

this is the conntrackd config on both routers.

Sync {
        Mode FTFW {
                ResendQueueSize 131072
                ACKWindowSize 300
                DisableExternalCache On
        }
        UDP {
                IPv4_address 192.168.100.200
                IPv4_Destination_Address 192.168.100.100
                Port 3780
                Interface eth0
                Checksum on
        }
        Options {
                TCPWindowTracking On
        }
}

General {
        Nice -20

        HashSize 65536
        HashLimit 262144

        Syslog on
        LockFile /var/lock/conntrack.lock
        UNIX {
                Path /var/run/conntrackd.ctl
                Backlog 20
        }

        NetlinkBufferSize 2097152
        NetlinkBufferSizeMaxGrowth 8388608
        NetlinkEventsReliable On
        NetlinkOverrunResync Off

        Filter From Kernelspace {
                Address Ignore {
                        IPv4_address 127.0.0.1 # loopback
                }
        }
}

We have about 80 users, some of them running Windows or Macs, so there is
plenty of multicast and broadcast traffic that fills the conntrack table. Some of
these entries then get stuck in the conntrackd internal cache. We can see
LAST_ACK tcp states stuck in the internal cache as well, but I think these are
related to TCPWindowTracking On.

mk


* Re: conntrackd, internal cache keeps filling up
  2014-05-13 11:45       ` Martin Kraus
@ 2014-05-13 12:04         ` Florian Westphal
  2014-05-13 12:55           ` Pablo Neira Ayuso
  2014-05-13 12:40         ` Pablo Neira Ayuso
  1 sibling, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2014-05-13 12:04 UTC
  To: Martin Kraus; +Cc: Pablo Neira Ayuso, netfilter-devel, netfilter

Martin Kraus <lists_mk@wujiman.net> wrote:
> On Mon, May 12, 2014 at 06:35:38PM +0200, Pablo Neira Ayuso wrote:
> > > current kernel is 3.13.7. 
> > > 
> > > we already hit a bug in the official 3.2 kernel packaged with wheezy where 
> > > our scan for heartbleed vulnerability would cause conntrackd to kernel panic
> > > the router.
> > 
> > Please, provide more information on how to reproduce the problem that
> > you're noticing. Thank you.
> 
> regarding the kernel panic on 3.2, a colleague of mine was using nmap with its
> heartbleed plugin
> 
> nmap --script ssl-heartbleed -sT -oX logfile.log 10.0.0.0/20
> 
> http://nmap.org/nsedoc/scripts/ssl-heartbleed.html
> 
> it took about 30 minutes to trigger the problem.
[..]

>         NetlinkEventsReliable On

known broken until at least Linux 3.6, see e.g.

5b423f6a40a0327f9d40bc8b97ce9be266f74368
("netfilter: nf_conntrack: fix racy timer handling with reliable events")


* Re: conntrackd, internal cache keeps filling up
  2014-05-13 11:45       ` Martin Kraus
  2014-05-13 12:04         ` Florian Westphal
@ 2014-05-13 12:40         ` Pablo Neira Ayuso
  2014-05-13 14:57           ` Martin Kraus
  1 sibling, 1 reply; 7+ messages in thread
From: Pablo Neira Ayuso @ 2014-05-13 12:40 UTC
  To: Martin Kraus; +Cc: netfilter-devel, netfilter

On Tue, May 13, 2014 at 01:45:35PM +0200, Martin Kraus wrote:
> On Mon, May 12, 2014 at 06:35:38PM +0200, Pablo Neira Ayuso wrote:
> > > current kernel is 3.13.7. 
> > > 
> > > we already hit a bug in the official 3.2 kernel packaged with wheezy where 
> > > our scan for heartbleed vulnerability would cause conntrackd to kernel panic
> > > the router.
> > 
> > Please, provide more information on how to reproduce the problem that
> > you're noticing. Thank you.
> 
> regarding the kernel panic on 3.2, a colleague of mine was using nmap with its
> heartbleed plugin
> 
> nmap --script ssl-heartbleed -sT -oX logfile.log 10.0.0.0/20
> 
> http://nmap.org/nsedoc/scripts/ssl-heartbleed.html
> 
> it took about 30 minutes to trigger the problem.

Did you annotate the kernel oops backtrace? Without that information,
this is pretty much like looking for the needle in the stack.

> regarding the internal cache fill up. we have two routers and some vlans using
> one and some vlans using the other router as the default gateway. 
> 
> this is the conntrackd config on both routers.
> 
> Sync {
>         Mode FTFW {
>                 ResendQueueSize 131072
>                 ACKWindowSize 300
>                 DisableExternalCache On
>         }
>         UDP {
>                 IPv4_address 192.168.100.200
>                 IPv4_Destination_Address 192.168.100.100
>                 Port 3780
>                 Interface eth0
>                 Checksum on
>         }
>         Options {
>                 TCPWindowTracking On
>         }
> }
> 
> General {
>         Nice -20
> 
>         HashSize 65536
>         HashLimit 262144
> 
>         Syslog on
>         LockFile /var/lock/conntrack.lock
>         UNIX {
>                 Path /var/run/conntrackd.ctl
>                 Backlog 20
>         }
> 
>         NetlinkBufferSize 2097152
>         NetlinkBufferSizeMaxGrowth 8388608
>         NetlinkEventsReliable On
>         NetlinkOverrunResync Off
> 
>         Filter From Kernelspace {
>                 Address Ignore {
>                         IPv4_address 127.0.0.1 # loopback
>                 }
>         }
> }
> 
> We have about 80 users, some of them running Windows or Macs, so there is
> plenty of multicasts and broadcasts that fill the conntrack table. some of
> these then get stuck in the conntrackd internal cache. We can see the
> LAST_ACK tcp states stuck in the internal cache as well, but I think these are
> related to TCPWindowTracking On.

Did you retry with the latest conntrack-tools version? If so, please
collect as much information as you can via all the -s options, and
check the logs as well. If you didn't retry with the latest version,
please upgrade.
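
For reference, collecting the statistics could be scripted roughly as
follows. This is only a sketch: the helper names (stats_commands, collect)
are made up for illustration, and the list of -s subsystems is an assumption
based on the conntrackd manual page, so check it against `conntrackd -h` on
the installed version before relying on it.

```python
import subprocess

# Assumption: these subsystem names are taken from the conntrackd man page;
# an empty string means plain "conntrackd -s" (general statistics).
SUBSYSTEMS = ["", "network", "cache", "runtime", "link", "rsqueue", "queue"]


def stats_commands(binary="conntrackd"):
    """Build one argument list per statistics subsystem."""
    return [[binary, "-s"] + ([name] if name else []) for name in SUBSYSTEMS]


def collect(binary="conntrackd"):
    """Run every statistics command and return {command line: output}."""
    results = {}
    for cmd in stats_commands(binary):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Keep stderr too, so an unsupported subsystem is visible in the dump.
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results
```

Calling collect() on each router and attaching the resulting text to the
report, together with the syslog excerpts, should cover what is asked for
above.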


* Re: conntrackd, internal cache keeps filling up
  2014-05-13 12:04         ` Florian Westphal
@ 2014-05-13 12:55           ` Pablo Neira Ayuso
  0 siblings, 0 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2014-05-13 12:55 UTC
  To: Florian Westphal; +Cc: Martin Kraus, netfilter-devel, netfilter

On Tue, May 13, 2014 at 02:04:00PM +0200, Florian Westphal wrote:
> Martin Kraus <lists_mk@wujiman.net> wrote:
> > On Mon, May 12, 2014 at 06:35:38PM +0200, Pablo Neira Ayuso wrote:
> > > > current kernel is 3.13.7. 
> > > > 
> > > > we already hit a bug in the official 3.2 kernel packaged with wheezy where 
> > > > our scan for heartbleed vulnerability would cause conntrackd to kernel panic
> > > > the router.
> > > 
> > > Please, provide more information on how to reproduce the problem that
> > > you're noticing. Thank you.
> > 
> > regarding the kernel panic on 3.2, a colleague of mine was using nmap with its
> > heartbleed plugin
> > 
> > nmap --script ssl-heartbleed -sT -oX logfile.log 10.0.0.0/20
> > 
> > http://nmap.org/nsedoc/scripts/ssl-heartbleed.html
> > 
> > it took about 30 minutes to trigger the problem.
> [..]
> 
> >         NetlinkEventsReliable On
> 
> known broken until at least Linux 3.6, see e.g.
> 
> 5b423f6a40a0327f9d40bc8b97ce9be266f74368
> ("netfilter: nf_conntrack: fix racy timer handling with reliable events")

If they are using latest 3.2, that patch is already there.


* Re: conntrackd, internal cache keeps filling up
  2014-05-13 12:40         ` Pablo Neira Ayuso
@ 2014-05-13 14:57           ` Martin Kraus
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Kraus @ 2014-05-13 14:57 UTC
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netfilter

On Tue, May 13, 2014 at 02:40:44PM +0200, Pablo Neira Ayuso wrote:
> > it took about 30 minutes to trigger the problem.
> 
> Did you annotate the kernel oops backtrace? Without that information,
> this is pretty much like looking for the needle in the stack.

Unfortunately we don't have any other information. I could reboot to the old
kernel and trigger it again but I'll need to do that in the evening.

Is there a better way to get the oops backtrace than a redirect to the serial
console? 

mk


* Re: conntrackd, internal cache keeps filling up
  2014-05-12 16:35     ` conntrackd, internal cache keeps filling up Pablo Neira Ayuso
  2014-05-13 11:45       ` Martin Kraus
@ 2014-07-11 16:27       ` Martin Kraus
  1 sibling, 0 replies; 7+ messages in thread
From: Martin Kraus @ 2014-07-11 16:27 UTC
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netfilter

On Mon, May 12, 2014 at 06:35:38PM +0200, Pablo Neira Ayuso wrote:
> > will try 1.4.2. we just need to package it.

Hi. 

we've been running 1.4.2 since the end of May but the problem persists. We've
managed to keep it down to a restart once a month through NOTRACK rules but 
that's not a nice solution.

> Please, provide more information on how to reproduce the problem that
> you're noticing. Thank you.

We have two office routers (router1, router2) which are connected to our internal
vlans and then to the uplinks. Both routers are active and there is a dedicated 
line used for conntrackd synchronization.

An example of state that keeps filling the conntrackd internal cache is dns
traffic to our nat64 proxy.

A user from vlan6 goes to our nat64 proxy in vlan20. vlan6 has a default
gateway to router2, vlan20 has a default gateway to router1.

User ipv6 address is 2001:1488:fffe:6:c941:fae:4505:7a22.
Nat64 ipv6 address is 2001:1488:fffe:20::34.

The dns packet from the user goes to router2 and then directly to vlan20 where the 
nat64 host is located.

The dns reply packet from nat64 then goes to router1 and then directly to
vlan6 back to the user.

Now on router2 when I run conntrackd -i I can see

udp      17 src=2001:1488:fffe:6:c941:fae:4505:7a22 dst=2001:1488:fffe:20::34 sport=6728 dport=53 [UNREPLIED] src=2001:1488:fffe:20::34 dst=2001:1488:fffe:6:c941:fae:4505:7a22 sport=53 dport=6728 [active since 192159s]
udp      17 src=2001:1488:fffe:6:c941:fae:4505:7a22 dst=2001:1488:fffe:20::34 sport=6961 dport=53 [UNREPLIED] src=2001:1488:fffe:20::34 dst=2001:1488:fffe:6:c941:fae:4505:7a22 sport=53 dport=6961 [active since 191949s]
udp      17 src=2001:1488:fffe:6:c941:fae:4505:7a22 dst=2001:1488:fffe:20::34 sport=6977 dport=53 [UNREPLIED] src=2001:1488:fffe:20::34 dst=2001:1488:fffe:6:c941:fae:4505:7a22 sport=53 dport=6977 [active since 191962s]
udp      17 src=2001:1488:fffe:6:c941:fae:4505:7a22 dst=2001:1488:fffe:20::34 sport=6979 dport=53 [UNREPLIED] src=2001:1488:fffe:20::34 dst=2001:1488:fffe:6:c941:fae:4505:7a22 sport=53 dport=6979 [active since 168352s]

and another 126000 entries like this. router1 is similar except that the state
is not UNREPLIED. The kernel conntrack table doesn't have any of these entries.

Interestingly, it is usually a single IPv4 or IPv6 address that generates most
of these stale entries.
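
A rough way to check which address dominates is to count the entries per
originating source address. The sketch below assumes the `conntrackd -i`
line format shown above, where the first src= field on each line is the
original-direction source; the function name count_by_source is made up for
illustration.

```python
import re
from collections import Counter

# The first "src=" on a line belongs to the original direction of the flow.
SRC_RE = re.compile(r"\bsrc=(\S+)")


def count_by_source(lines):
    """Count cache entries per original-direction source address."""
    counts = Counter()
    for line in lines:
        match = SRC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_sources(lines, limit=10):
    """Return the `limit` busiest source addresses as (address, count) pairs."""
    return count_by_source(lines).most_common(limit)
```

Feeding the saved output of `conntrackd -i` to top_sources(), for example
via open("dump.txt"), lists the hosts responsible for most of the entries.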

This is the config file used

Sync {
        Mode FTFW {
                ResendQueueSize 131072
                ACKWindowSize 300
                DisableExternalCache On
        }
        UDP {
                IPv4_address 192.168.100.100
                IPv4_Destination_Address 192.168.100.200
                Port 3780
                Interface eth0
                Checksum on
        }
        Options {
                TCPWindowTracking On
        }

}

General {
        Nice -20

        HashSize 65536
        HashLimit 262144

        Syslog on
        LockFile /var/lock/conntrack.lock
        UNIX {
                Path /var/run/conntrackd.ctl
                Backlog 20
        }

        NetlinkBufferSize 2097152
        NetlinkBufferSizeMaxGrowth 8388608
        NetlinkEventsReliable Off
        NetlinkOverrunResync On

        Filter From Kernelspace {
                Address Ignore {             
                        IPv4_address 127.0.0.1 # loopback
                }
        }
}

I always assumed that the internal cache is a copy of the kernel conntrack table
plus entries that have not yet been synchronized to the other router, so I
don't understand why it is getting this huge.
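
To put a number on the difference, the two dumps can be compared by flow
tuple. This is only a sketch under stated assumptions: both `conntrackd -i`
and `conntrack -L` are assumed to print the protocol name first and the
original-direction src/dst/sport/dport fields before the reply-direction
ones, and the function names are made up for illustration. Entries without
ports (for example ICMP) are skipped.

```python
import re

# First match on a line is the original-direction tuple.
TUPLE_RE = re.compile(r"src=(\S+) dst=(\S+) sport=(\d+) dport=(\d+)")


def original_tuple(line):
    """Return (proto, src, dst, sport, dport) for a dump line, or None."""
    parts = line.split()
    match = TUPLE_RE.search(line)
    if not parts or match is None:
        return None
    return (parts[0],) + match.groups()


def stale_entries(cache_lines, kernel_lines):
    """Return cache lines whose flow is absent from the kernel table dump."""
    kernel = set()
    for line in kernel_lines:
        key = original_tuple(line)
        if key is not None:
            kernel.add(key)
    stale = []
    for line in cache_lines:
        key = original_tuple(line)
        if key is not None and key not in kernel:
            stale.append(line)
    return stale
```

If the assumption about the internal cache holds, stale_entries() should
return only a handful of short-lived flows. On our routers it would return
nearly the whole cache.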

mk
