* null-pointer deref in ulogd2 @ 2009-06-23 7:27 Bernhard Schmidt 2009-06-23 8:31 ` Bernhard Schmidt 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-23 7:27 UTC (permalink / raw) To: netfilter-devel Hi, we have just now tried to migrate the flow logging of our central NAT gateway from conntrack -L | logger to ulogd2 and a PostgreSQL database. 4 CPU Xeon (64bit) SLES 11 libnfnetlink 0.0.41 libnetfilter_log 0.0.16 libnetfilter_conntrack 0.0.99 ulogd2 2.0.0beta3 The system is pretty heavily used, at the moment it does about 200 Mbps bandwidth, 30k concurrent sessions and maybe 500 new connections/s (hard to tell). ulogd is pretty much the vanilla config logging NFCT into PGSQL. Problem: ulogd crashes within seconds after the start. A gdb backtrace looks like this: Program received signal SIGSEGV, Segmentation fault. 0x00007ffff7ffc634 in ?? () (gdb) bt full #0 0x00007ffff7ffc634 in ?? () No symbol table info available. #1 0x00007ffff7ffc826 in gettimeofday () No symbol table info available. #2 0x00007ffff76f96ea in gettimeofday () from /lib64/libc.so.6 No symbol table info available. #3 0x00007ffff6e54106 in event_handler (type=NFCT_T_NEW, ct=0x72a3f0, data=0x6161f0) at ulogd_inpflow_NFCT.c:599 upi = (struct ulogd_pluginstance *) 0x6161f0 cpi = (struct nfct_pluginstance *) 0x616268 ts = (struct ct_timestamp *) 0x0 tmp = {time = {{tv_sec = 0, tv_usec = 0}, {tv_sec = 0, tv_usec = 0}}, ct = 0x72a3f0} #4 0x00007ffff6c42fb4 in __callback (nlh=0x7fffffffc1d0, nfa=0x7fffffffc0d0, data=0x620e70) at callback.c:33 ret = <value optimized out> ct = <value optimized out> #5 0x00007ffff70594b9 in nfnl_step (h=<value optimized out>, nlh=0x7fffffffc1d0) at libnfnetlink.c:1318 err = <value optimized out> type = <value optimized out> subsys_id = <value optimized out> #6 0x00007ffff705964f in nfnl_process (h=0x621c90, buf=<value optimized out>, len=196) at libnfnetlink.c:1363 ret = 76 nlh = (struct nlmsghdr *) 0x7fffffffc1d0 __PRETTY_FUNCTION__ = "nfnl_process" #7 0x00007ffff705a5d6 in nfnl_catch (h=0x621c90) at libnfnetlink.c:1517 ret = 196 __PRETTY_FUNCTION__ = "nfnl_catch" #8 0x00007ffff6e54340 in read_cb_nfct (fd=9, what=1, param=0x616268) at ulogd_inpflow_NFCT.c:664 cpi = (struct nfct_pluginstance *) 0x616268 upi = (struct ulogd_pluginstance *) 0x6161f0 #9 0x00000000004050ca in ulogd_select_main (tv=0x7fffffffe440) at select.c:110 flags = 1 ufd = (struct ulogd_fd *) 0x616280 rds_tmp = {__fds_bits = {1536, 0 <repeats 15 times>}} wrs_tmp = {__fds_bits = {0 <repeats 16 times>}} as you can see there is a nullpointer deref. We protected the two crashpoints so far with a very simple workaround diff -ur ulogd-2.0.0beta3/input/flow/ulogd_inpflow_NFCT.c ulogd-2.0.0beta3-patched/input/flow/ulogd_inpflow_NFCT.c --- ulogd-2.0.0beta3/input/flow/ulogd_inpflow_NFCT.c 2009-03-06 18:54:04.000000000 +0100 +++ ulogd-2.0.0beta3-patched/input/flow/ulogd_inpflow_NFCT.c 2009-06-23 08:51:51.912520684 +0200 @@ -596,7 +596,8 @@ switch(type) { case NFCT_T_NEW: ts = hashtable_add(cpi->ct_active, &tmp); - gettimeofday(&ts->time[START], NULL); + if (ts) + gettimeofday(&ts->time[START], NULL); return NFCT_CB_STOLEN; case NFCT_T_UPDATE: ts = hashtable_get(cpi->ct_active, &tmp); @@ -734,7 +735,8 @@ /* if it does not exist, add it */ if (!hashtable_get(cpi->ct_active, &tmp)) { ts = hashtable_add(cpi->ct_active, &tmp); - gettimeofday(&ts->time[START], NULL); /* do our best here */ + if (ts) + gettimeofday(&ts->time[START], NULL); /* do our best here */ return NFCT_CB_STOLEN; } now it seems to work okay. In the database about 90% of the flows have flow_end_sec NULL. Can anyone see at the first glance why ts isn't set here? Is this some overload issue? Thanks, Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: null-pointer deref in ulogd2 2009-06-23 7:27 null-pointer deref in ulogd2 Bernhard Schmidt @ 2009-06-23 8:31 ` Bernhard Schmidt 2009-06-23 15:40 ` Pablo Neira Ayuso 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-23 8:31 UTC (permalink / raw) To: netfilter-devel Bernhard Schmidt <berni@birkenwald.de> wrote: > now it seems to work okay. In the database about 90% of the flows have > flow_end_sec NULL. Just had a closer look, flow_start_sec can be NULL as well. ulogd=# SELECT COUNT(*) FROM ulog2_ct WHERE flow_start_sec IS NOT NULL AND flow_end_sec IS NOT NULL; count ------- 88793 (1 row) ulogd=# SELECT COUNT(*) FROM ulog2_ct WHERE flow_start_sec IS NULL AND flow_end_sec IS NOT NULL; count -------- 320560 (1 row) ulogd=# SELECT COUNT(*) FROM ulog2_ct WHERE flow_start_sec IS NOT NULL AND flow_end_sec IS NULL; count -------- 407220 (1 row) ulogd=# SELECT COUNT(*) FROM ulog2_ct WHERE flow_start_sec IS NULL AND flow_end_sec IS NULL; count ------- 0 (1 row) What is happening here? Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: null-pointer deref in ulogd2 2009-06-23 8:31 ` Bernhard Schmidt @ 2009-06-23 15:40 ` Pablo Neira Ayuso 2009-06-23 16:54 ` Bernhard Schmidt 0 siblings, 1 reply; 14+ messages in thread From: Pablo Neira Ayuso @ 2009-06-23 15:40 UTC (permalink / raw) To: Bernhard Schmidt; +Cc: netfilter-devel [-- Attachment #1: Type: text/plain, Size: 814 bytes --] Bernhard Schmidt wrote: > Bernhard Schmidt <berni@birkenwald.de> wrote: > >> now it seems to work okay. In the database about 90% of the flows have >> flow_end_sec NULL. Please, rise "netlink_socket_buffer_size" and "netlink_socket_buffer_maxsize". If you use the default buffer, it's likely to overrun and, thus, to lose events. [...] > What is happening here? I think that you're using the default "hash_max_entries" which is too small. I suggest you to rise this value. I'm going to push a patch that includes information on these parameter tweaking to the example config file. BTW, could you give a quick test to this patch, yours seems to leak memory since NFCT_CB_STOLEN means not to release the ct object (no problem, I guess that you're not familiar with libnetfilter_conntrack). Thanks for testing. [-- Attachment #2: fix.patch --] [-- Type: text/x-diff, Size: 1537 bytes --] NFCT: fix NULL dereference when hashtable is full This patch fixes a NULL dereference to the timestamp structure when hashtable_add() fails, for example, because the hashtable is full. Reported-by: Bernhard Schmidt <berni@birkenwald.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> --- input/flow/ulogd_inpflow_NFCT.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/input/flow/ulogd_inpflow_NFCT.c b/input/flow/ulogd_inpflow_NFCT.c index b8278af..b16687e 100644 --- a/input/flow/ulogd_inpflow_NFCT.c +++ b/input/flow/ulogd_inpflow_NFCT.c @@ -596,6 +596,9 @@ static int event_handler(enum nf_conntrack_msg_type type, switch(type) { case NFCT_T_NEW: ts = hashtable_add(cpi->ct_active, &tmp); + if (ts == NULL) + return NFCT_CB_CONTINUE; + gettimeofday(&ts->time[START], NULL); return NFCT_CB_STOLEN; case NFCT_T_UPDATE: @@ -604,6 +607,9 @@ static int event_handler(enum nf_conntrack_msg_type type, nfct_copy(ts->ct, ct, NFCT_CP_META); else { ts = hashtable_add(cpi->ct_active, &tmp); + if (ts == NULL) + return NFCT_CB_CONTINUE; + gettimeofday(&ts->time[START], NULL); return NFCT_CB_STOLEN; } @@ -734,6 +740,9 @@ static int overrun_handler(enum nf_conntrack_msg_type type, /* if it does not exist, add it */ if (!hashtable_get(cpi->ct_active, &tmp)) { ts = hashtable_add(cpi->ct_active, &tmp); + if (ts == NULL) + return NFCT_CB_CONTINUE; + gettimeofday(&ts->time[START], NULL); /* do our best here */ return NFCT_CB_STOLEN; } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: null-pointer deref in ulogd2 2009-06-23 15:40 ` Pablo Neira Ayuso @ 2009-06-23 16:54 ` Bernhard Schmidt 2009-06-23 22:39 ` Bernhard Schmidt 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-23 16:54 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: netfilter-devel Hi Pablo, >>> now it seems to work okay. In the database about 90% of the flows have >>> flow_end_sec NULL. > Please, rise "netlink_socket_buffer_size" and > "netlink_socket_buffer_maxsize". If you use the default buffer, it's > likely to overrun and, thus, to lose events. We had increased that in the meantime, to netlink_socket_buffer_size=10854400 netlink_socket_buffer_maxsize=20971520 That pretty much stopped the warning messages in /var/log/ulogd.log We had also figured that the hash was the problem, so we tried the hash_enable=0 and used the INSERT_OR_REPLACE_CT function. However, that was also pretty unsuccessful, right now we have 750k flows in ulog2_ct where ct_event < 4 (so, as far as I understand it, the DESTROY event has not yet been received). Which is a bit too much for a box that only has 40k-50k connections at the same time according to conntrack -L. 1.67M flows in total, I suspect that's a bit low as well. When I did 100 HTTP connections through the box I could only find ~20 flows in the database, none of them in DESTROYed state. >> What is happening here? > I think that you're using the default "hash_max_entries" which is too > small. I suggest you to rise this value. I'm going to push a patch that > includes information on these parameter tweaking to the example config file. I've now set hash_buckets=81920 hash_max_entries=327680 and went back to hash_enable=1. However, it still doesn't look too great. About five minutes after 100 TCP connects the number of flows in the ulog2_ct table for this IP address has stabilized at 116, consisting of - 9 flows with both flow_start_sec and flow_end_sec - 83 flows with only flow_start_sec - 24 flows with only flow_end_sec SELECT COUNT(DISTINCT orig_l4_sport) tells me that 92 real connections are listed in the table somehow, so 8 connections are totally lost and 24 connections are listed twice. [ half an hour later ] ARGH! I found my problem. Apparently Postgres was too slow on INSERT. Although the CPU load looked fine (and even IOWait wasn't out of the ordinary, 20% on one CPU) it seems to have blocked. Sacrificing consistency for speed by setting fsync=no in postgres the IOwait went down to 0.5% and I now have 100 flows, all of them with start and end! > BTW, could you give a quick test to this patch, yours seems to leak > memory since NFCT_CB_STOLEN means not to release the ct object (no > problem, I guess that you're not familiar with libnetfilter_conntrack). Thanks. Yes, I'm even not that familiar with C :-) Your patch compiles and runs fine. Can't tell much about memory leaks, but the system has not exploded yet. Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: null-pointer deref in ulogd2 2009-06-23 16:54 ` Bernhard Schmidt @ 2009-06-23 22:39 ` Bernhard Schmidt 2009-06-24 10:59 ` conntrack segfault (was: Re: null-pointer deref in ulogd2) Bernhard Schmidt 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-23 22:39 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: netfilter-devel Hi, > ARGH! I found my problem. Apparently Postgres was too slow on INSERT. > Although the CPU load looked fine (and even IOWait wasn't out of the > ordinary, 20% on one CPU) it seems to have blocked. Sacrificing > consistency for speed by setting fsync=no in postgres the IOwait went > down to 0.5% and I now have 100 flows, all of them with start and end! Looks like I spoke too early :-( We have now passed peak-time, which means about 450 Mbps traffic, 60k concurrent sessions and about 300 flows/s in a 1hour average. First of all, ulogd has segfaulted again. Unfortunately I didn't get a coredump, I've restarted it in gdb now. Second, the number of flow records without any time stamp is getting higher and higher again, with now 20% lacking either start or endtime ulogd=# SELECT count(*) FROM ulog2_ct; count --------- 3278208 (1 row) ulogd=# SELECT count(*) FROM ulog2_ct WHERE flow_start_sec IS NULL; count -------- 270690 (1 row) ulogd=# SELECT count(*) FROM ulog2_ct WHERE flow_end_sec IS NULL; count -------- 306740 (1 row) This seems to get worse the longer ulogd runs, shortly before the segfault there were 8000 flows without end_time in a row. The recent ones are fine again. I'm still getting (very ocasionally) Wed Jun 24 00:31:21 2009 <5> ulogd_inpflow_NFCT.c:656 Maximum buffer size (17367040) in NFCT has been reached. Please, consider rising `netlink_socket_buffer_size ` and `netlink_socket_buffer_maxsize` clauses. does it make sense to increase the buffer even more? If 17MB of buffer aren't enough I don't think it can keep up with any setting. And now that fsync is disabled in Postgres the box is really not that heavily loaded. CPUs 3&4 (serving the interrupts of the two NICs) are near 100% interrupt load at peak time, but 1&2 are >80% idle. Does anyone else run this setup with similar numbers and can shed some light on tuning? Oh, and we're dumping conntrack -L every minute. Works fine during the day with 30k connections, but starts to frequently segfault with 60k connections in the evening. Also trying to get a coredump now. Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* conntrack segfault (was: Re: null-pointer deref in ulogd2) 2009-06-23 22:39 ` Bernhard Schmidt @ 2009-06-24 10:59 ` Bernhard Schmidt 2009-06-24 11:17 ` Krzysztof Oledzki 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-24 10:59 UTC (permalink / raw) To: netfilter-devel On Wed, Jun 24, 2009 at 12:39:58AM +0200, Bernhard Schmidt wrote: Hi, > Oh, and we're dumping conntrack -L every minute. Works fine during the > day with 30k connections, but starts to frequently segfault with 60k > connections in the evening. Also trying to get a coredump now. sorry, this is slightly off-topic, but I can't decode the core dump :-( Jun 24 12:03:01 secomat2 kernel: conntrack[14117]: segfault at 7fff1ce83f34 ip 00007fff1ce83f34 sp 00007fff1ce82f20 error 15 secomat2:~ # gdb /usr/local/sbin/conntrack /root/core.14117 GNU gdb (GDB; SUSE Linux Enterprise 11) 6.8.50.20081120-cvs Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-suse-linux". For bug reporting instructions, please see: <http://bugs.opensuse.org/>... warning: Can't read pathname for load map: Input/output error. Reading symbols from /usr/local/lib/libnetfilter_conntrack.so.1...done. Loaded symbols for /usr/local/lib/libnetfilter_conntrack.so.1 Reading symbols from /usr/local/lib/libnfnetlink.so.0...done. Loaded symbols for /usr/local/lib/libnfnetlink.so.0 Reading symbols from /lib64/libc.so.6...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/libdl.so.2...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Core was generated by `/usr/local/sbin/conntrack -L conntrack'. Program terminated with signal 11, Segmentation fault. #0 0x00007fff1ce83f34 in ?? () (gdb) bt #0 0x00007fff1ce83f34 in ?? () #1 0x00007fff1ce83f3c in ?? () #2 0x00007fff1ce83f44 in ?? () #3 0x0000000000000000 in ?? () (gdb) secomat2:~ # file /usr/local/sbin/conntrack /usr/local/sbin/conntrack: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), not stripped what am I missing to get a backtrace you can work with? Both libnfnetlink and conntrack-tools are built with CFLAGS="-g -O0". Thanks, Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault (was: Re: null-pointer deref in ulogd2) 2009-06-24 10:59 ` conntrack segfault (was: Re: null-pointer deref in ulogd2) Bernhard Schmidt @ 2009-06-24 11:17 ` Krzysztof Oledzki 2009-06-24 11:57 ` Jan Engelhardt 0 siblings, 1 reply; 14+ messages in thread From: Krzysztof Oledzki @ 2009-06-24 11:17 UTC (permalink / raw) To: Bernhard Schmidt; +Cc: netfilter-devel [-- Attachment #1: Type: TEXT/PLAIN, Size: 692 bytes --] On Wed, 24 Jun 2009, Bernhard Schmidt wrote: > On Wed, Jun 24, 2009 at 12:39:58AM +0200, Bernhard Schmidt wrote: > > Hi, > >> Oh, and we're dumping conntrack -L every minute. Works fine during the >> day with 30k connections, but starts to frequently segfault with 60k >> connections in the evening. Also trying to get a coredump now. > > sorry, this is slightly off-topic, but I can't decode the core dump :-( > > Jun 24 12:03:01 secomat2 kernel: conntrack[14117]: segfault at > 7fff1ce83f34 ip 00007fff1ce83f34 sp 00007fff1ce82f20 error 15 I think you should rather try using valgrind. It is very hard to trace memory corruption problem with gdb. Best regards, Krzysztof Olędzki ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault (was: Re: null-pointer deref in ulogd2) 2009-06-24 11:17 ` Krzysztof Oledzki @ 2009-06-24 11:57 ` Jan Engelhardt 2009-06-24 12:56 ` conntrack segfault Bernhard Schmidt 0 siblings, 1 reply; 14+ messages in thread From: Jan Engelhardt @ 2009-06-24 11:57 UTC (permalink / raw) To: Krzysztof Oledzki; +Cc: Bernhard Schmidt, netfilter-devel On Wednesday 2009-06-24 13:17, Krzysztof Oledzki wrote: > On Wed, 24 Jun 2009, Bernhard Schmidt wrote: >> On Wed, Jun 24, 2009 at 12:39:58AM +0200, Bernhard Schmidt wrote: >> >> Hi, >> >>> Oh, and we're dumping conntrack -L every minute. Works fine during the >>> day with 30k connections, but starts to frequently segfault with 60k >>> connections in the evening. Also trying to get a coredump now. >> >> sorry, this is slightly off-topic, but I can't decode the core dump :-( >> >> Jun 24 12:03:01 secomat2 kernel: conntrack[14117]: segfault at >> 7fff1ce83f34 ip 00007fff1ce83f34 sp 00007fff1ce82f20 error 15 > > I think you should rather try using valgrind. It is very hard to trace memory > corruption problem with gdb. A number of libc functions do not seem to always keep the stack pointers around, or when you are in a signal handler, so gdb is confused until these functions are exited. I have seen such even with programs that otherwise behave normally and which merely have been attached to with gdb. The solution there would be to set a breakpoint at a well-known function and let it continue, but in case of segfaults that barely works. Here, use valgrind to determine the faulty spot, then maybe run gdb on it (no attach, but direct run) and set a breakpoint before the spot is hit to examine the variables. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 11:57 ` Jan Engelhardt @ 2009-06-24 12:56 ` Bernhard Schmidt 2009-06-24 17:58 ` Pablo Neira Ayuso 0 siblings, 1 reply; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-24 12:56 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Krzysztof Oledzki, netfilter-devel Hi, >>>> Oh, and we're dumping conntrack -L every minute. Works fine during the >>>> day with 30k connections, but starts to frequently segfault with 60k >>>> connections in the evening. Also trying to get a coredump now. >>> sorry, this is slightly off-topic, but I can't decode the core dump :-( >>> >>> Jun 24 12:03:01 secomat2 kernel: conntrack[14117]: segfault at >>> 7fff1ce83f34 ip 00007fff1ce83f34 sp 00007fff1ce82f20 error 15 >> I think you should rather try using valgrind. It is very hard to trace memory >> corruption problem with gdb. > > A number of libc functions do not seem to always keep the stack > pointers around, or when you are in a signal handler, > so gdb is confused until these functions are exited. > > I have seen such even with programs that otherwise behave normally > and which merely have been attached to with gdb. The solution there > would be to set a breakpoint at a well-known function and let it > continue, but in case of segfaults that barely works. Here, use > valgrind to determine the faulty spot, then maybe run gdb on it (no > attach, but direct run) and set a breakpoint before the spot is > hit to examine the variables. The problem is, we currently run conntrack -L every minute. It segfaults about 20 times a day, usually during the period with the highest number of connections. Unless I can always run conntrack in valgrind/gdb automatically and get a usable dump when it fails I have a hard time to get any information from it. Do you happen to have any advice of the parameters to give to valgrind/gdb to get the information required automatically? Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 12:56 ` conntrack segfault Bernhard Schmidt @ 2009-06-24 17:58 ` Pablo Neira Ayuso 2009-06-24 20:05 ` Bernhard Schmidt 2009-06-24 22:18 ` Bernhard Schmidt 0 siblings, 2 replies; 14+ messages in thread From: Pablo Neira Ayuso @ 2009-06-24 17:58 UTC (permalink / raw) To: Bernhard Schmidt; +Cc: Jan Engelhardt, Krzysztof Oledzki, netfilter-devel Bernhard Schmidt wrote: > Hi, > >>>>> Oh, and we're dumping conntrack -L every minute. Works fine during the >>>>> day with 30k connections, but starts to frequently segfault with 60k >>>>> connections in the evening. Also trying to get a coredump now. >>>> sorry, this is slightly off-topic, but I can't decode the core dump :-( >>>> >>>> Jun 24 12:03:01 secomat2 kernel: conntrack[14117]: segfault at >>>> 7fff1ce83f34 ip 00007fff1ce83f34 sp 00007fff1ce82f20 error 15 >>> I think you should rather try using valgrind. It is very hard to >>> trace memory >>> corruption problem with gdb. >> >> A number of libc functions do not seem to always keep the stack >> pointers around, or when you are in a signal handler, >> so gdb is confused until these functions are exited. >> >> I have seen such even with programs that otherwise behave normally >> and which merely have been attached to with gdb. The solution there >> would be to set a breakpoint at a well-known function and let it >> continue, but in case of segfaults that barely works. Here, use >> valgrind to determine the faulty spot, then maybe run gdb on it (no >> attach, but direct run) and set a breakpoint before the spot is >> hit to examine the variables. > > The problem is, we currently run conntrack -L every minute. It segfaults > about 20 times a day, usually during the period with the highest number > of connections. Unless I can always run conntrack in valgrind/gdb > automatically and get a usable dump when it fails I have a hard time to > get any information from it. Are you using latest version? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 17:58 ` Pablo Neira Ayuso @ 2009-06-24 20:05 ` Bernhard Schmidt 2009-06-24 22:18 ` Bernhard Schmidt 1 sibling, 0 replies; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-24 20:05 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: netfilter-devel On Wed, Jun 24, 2009 at 07:58:10PM +0200, Pablo Neira Ayuso wrote: > > The problem is, we currently run conntrack -L every minute. It segfaults > > about 20 times a day, usually during the period with the highest number > > of connections. Unless I can always run conntrack in valgrind/gdb > > automatically and get a usable dump when it fails I have a hard time to > > get any information from it. > > Are you using latest version? Yeah, pretty much. libnfnetlink 0.0.41 libnetfilter_log 0.0.16 libnetfilter_conntrack 0.0.99 conntrack-tools 0.9.12 So the only thing that isn't up to date anymore (since two days) is libnfnetlink. I can't see anything important in the changelog though. The base system is a SLES 11 with vanilla kernel 2.6.27.23, so that's a bit aged as well. Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 17:58 ` Pablo Neira Ayuso 2009-06-24 20:05 ` Bernhard Schmidt @ 2009-06-24 22:18 ` Bernhard Schmidt 2009-07-02 16:30 ` Pablo Neira Ayuso 2009-07-06 10:29 ` Krzysztof Oledzki 1 sibling, 2 replies; 14+ messages in thread From: Bernhard Schmidt @ 2009-06-24 22:18 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: Jan Engelhardt, Krzysztof Oledzki, netfilter-devel Hi, not sure whether this helps, but after what felt like 500 attempts of running conntrack -L in valgrind I just captured one crash. secomat2:~ # valgrind -v conntrack -L > /dev/null ==24699== Memcheck, a memory error detector. ==24699== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al. ==24699== Using LibVEX rev 1854, a library for dynamic binary translation. ==24699== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP. ==24699== Using valgrind-3.3.1, a dynamic binary instrumentation framework. ==24699== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al. ==24699== --24699-- Command line --24699-- conntrack --24699-- -L --24699-- Startup, with flags: --24699-- -v --24699-- Contents of /proc/version: --24699-- Linux version 2.6.27.23-0.1-default (geeko@buildhost) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP 2009-05-26 17:02:05 -0400 --24699-- Arch and hwcaps: AMD64, amd64-sse2 --24699-- Page sizes: currently 4096, max supported 4096 --24699-- Valgrind library directory: /usr/lib64/valgrind --24699-- Reading syms from /usr/local/sbin/conntrack (0x400000) --24699-- Reading syms from /lib64/ld-2.9.so (0x4000000) --24699-- object doesn't have a symbol table --24699-- Reading syms from /usr/lib64/valgrind/amd64-linux/memcheck (0x38000000) --24699-- object doesn't have a symbol table --24699-- object doesn't have a dynamic symbol table --24699-- Reading suppressions file: /usr/lib64/valgrind/default.supp --24699-- Reading syms from /usr/lib64/valgrind/amd64-linux/vgpreload_core.so (0x4A1F000) --24699-- object doesn't have a symbol table --24699-- Reading syms from /usr/lib64/valgrind/amd64-linux/vgpreload_memcheck.so (0x4C21000) --24699-- object doesn't have a symbol table --24699-- Reading syms from /usr/local/lib/libnetfilter_conntrack.so.1.2.0 (0x4E2A000) --24699-- Reading syms from /usr/local/lib/libnfnetlink.so.0.2.0 (0x5043000) --24699-- Reading syms from /lib64/libc-2.9.so (0x524B000) --24699-- object doesn't have a symbol table --24699-- Reading syms from /lib64/libdl-2.9.so (0x55A4000) --24699-- object doesn't have a symbol table --24699-- REDIR: 0x52c7c30 (rindex) redirected to 0x4c25a00 (rindex) --24699-- REDIR: 0x52c93f0 (memset) redirected to 0x4c26ca0 (memset) --24699-- REDIR: 0x52c67b0 (strcmp) redirected to 0x4c26100 (strcmp) --24699-- REDIR: 0x52c7540 (strlen) redirected to 0x4c25e20 (strlen) --24699-- REDIR: 0x52c6600 (index) redirected to 0x4c25b20 (index) --24699-- REDIR: 0x52c37f0 (malloc) redirected to 0x4c255e0 (malloc) --24699-- REDIR: 0xffffffffff600400 (???) redirected to 0x3802d13d (???) --24699-- REDIR: 0x52c3430 (calloc) redirected to 0x4c233b0 (calloc) --24699-- REDIR: 0x52cad30 (memcpy) redirected to 0x4c26270 (memcpy) --24699-- REDIR: 0x52cd840 (strchrnul) redirected to 0x4c26d70 (strchrnul) --24699-- REDIR: 0x52c9f20 (mempcpy) redirected to 0x4c26dd0 (mempcpy) --24699-- REDIR: 0x52c0c00 (free) redirected to 0x4c242e0 (free) --24699-- REDIR: 0x52c7b60 (strncpy) redirected to 0x4c25f50 (strncpy) vex amd64->IR: unhandled instruction bytes: 0x6 0xDF 0xA1 0xBF 0x8 0x0 ==24699== Invalid read of size 1 ==24699== at 0x7FEFFD3E4: ??? ==24699== by 0x7FEFFD3EB: ??? ==24699== by 0x7FEFFD3F3: ??? ==24699== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==24699== ==24699== Process terminating with default action of signal 11 (SIGSEGV) ==24699== Access not within mapped region at address 0x0 ==24699== at 0x7FEFFD3E4: ??? ==24699== by 0x7FEFFD3EB: ??? ==24699== by 0x7FEFFD3F3: ??? ==24699== ==24699== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1) ==24699== ==24699== 1 errors in context 1 of 1: ==24699== Invalid read of size 1 ==24699== at 0x7FEFFD3E4: ??? ==24699== by 0x7FEFFD3EB: ??? ==24699== by 0x7FEFFD3F3: ??? ==24699== Address 0x0 is not stack'd, malloc'd or (recently) free'd --24699-- --24699-- supp: 2 dl-hack3-cond-1 ==24699== ==24699== IN SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1) ==24699== ==24699== malloc/free: in use at exit: 984 bytes in 5 blocks. ==24699== malloc/free: 8,293 allocs, 8,288 frees, 2,586,840 bytes allocated. ==24699== ==24699== searching for pointers to 5 not-freed blocks. ==24699== checked 100,824 bytes. ==24699== ==24699== LEAK SUMMARY: ==24699== definitely lost: 0 bytes in 0 blocks. ==24699== possibly lost: 0 bytes in 0 blocks. ==24699== still reachable: 984 bytes in 5 blocks. ==24699== suppressed: 0 bytes in 0 blocks. ==24699== Rerun with --leak-check=full to see details of leaked memory. --24699-- memcheck: sanity checks: 300 cheap, 12 expensive --24699-- memcheck: auxmaps: 0 auxmap entries (0k, 0M) in use --24699-- memcheck: auxmaps_L1: 0 searches, 0 cmps, ratio 0:10 --24699-- memcheck: auxmaps_L2: 0 searches, 0 nodes --24699-- memcheck: SMs: n_issued = 64 (1024k, 1M) --24699-- memcheck: SMs: n_deissued = 0 (0k, 0M) --24699-- memcheck: SMs: max_noaccess = 524287 (8388592k, 8191M) --24699-- memcheck: SMs: max_undefined = 0 (0k, 0M) --24699-- memcheck: SMs: max_defined = 210 (3360k, 3M) --24699-- memcheck: SMs: max_non_DSM = 64 (1024k, 1M) --24699-- memcheck: max sec V bit nodes: 0 (0k, 0M) --24699-- memcheck: set_sec_vbits8 calls: 0 (new: 0, updates: 0) --24699-- memcheck: max shadow mem size: 5168k, 5M --24699-- translate: fast SP updates identified: 1,595 ( 85.4%) --24699-- translate: generic_known SP updates identified: 184 ( 9.8%) --24699-- translate: generic_unknown SP updates identified: 88 ( 4.7%) --24699-- tt/tc: 21,344 tt lookups requiring 21,427 probes --24699-- tt/tc: 21,344 fast-cache updates, 2 flushes --24699-- transtab: new 2,351 (54,368 -> 830,549; ratio 152:10) [1 scs] --24699-- transtab: dumped 0 (0 -> ??) --24699-- transtab: discarded 0 (0 -> ??) --24699-- scheduler: 30,068,731 jumps (bb entries). --24699-- scheduler: 300/37,001 major/minor sched events. --24699-- sanity: 301 cheap, 12 expensive checks. --24699-- exectx: 769 lists, 9 contexts (avg 0 per list) --24699-- exectx: 16,584 searches, 16,575 full compares (999 per 1000) --24699-- exectx: 0 cmp2, 1 cmp4, 0 cmpAll --24699-- errormgr: 3 supplist searches, 69 comparisons during search --24699-- errormgr: 3 errlist searches, 3 comparisons during search Segmentation fault Does this help any more? Bernhard ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 22:18 ` Bernhard Schmidt @ 2009-07-02 16:30 ` Pablo Neira Ayuso 2009-07-06 10:29 ` Krzysztof Oledzki 1 sibling, 0 replies; 14+ messages in thread From: Pablo Neira Ayuso @ 2009-07-02 16:30 UTC (permalink / raw) To: Bernhard Schmidt; +Cc: Jan Engelhardt, Krzysztof Oledzki, netfilter-devel Sorry for the late reply. Bernhard Schmidt wrote: > vex amd64->IR: unhandled instruction bytes: 0x6 0xDF 0xA1 0xBF 0x8 0x0 > ==24699== Invalid read of size 1 > ==24699== at 0x7FEFFD3E4: ??? > ==24699== by 0x7FEFFD3EB: ??? > ==24699== by 0x7FEFFD3F3: ??? > ==24699== Address 0x0 is not stack'd, malloc'd or (recently) free'd > ==24699== > ==24699== Process terminating with default action of signal 11 (SIGSEGV) > ==24699== Access not within mapped region at address 0x0 > ==24699== at 0x7FEFFD3E4: ??? > ==24699== by 0x7FEFFD3EB: ??? > ==24699== by 0x7FEFFD3F3: ??? > ==24699== > ==24699== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1) > Segmentation fault > > Does this help any more? I still don't have a clear clue on what's wrong with this. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: conntrack segfault 2009-06-24 22:18 ` Bernhard Schmidt 2009-07-02 16:30 ` Pablo Neira Ayuso @ 2009-07-06 10:29 ` Krzysztof Oledzki 1 sibling, 0 replies; 14+ messages in thread From: Krzysztof Oledzki @ 2009-07-06 10:29 UTC (permalink / raw) To: Bernhard Schmidt; +Cc: Pablo Neira Ayuso, Jan Engelhardt, netfilter-devel [-- Attachment #1: Type: TEXT/PLAIN, Size: 4588 bytes --] On Thu, 25 Jun 2009, Bernhard Schmidt wrote: > Hi, Hi, > not sure whether this helps, but after what felt like 500 attempts of > running conntrack -L in valgrind I just captured one crash. > > secomat2:~ # valgrind -v conntrack -L > /dev/null > ==24699== Memcheck, a memory error detector. > ==24699== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et > al. > ==24699== Using LibVEX rev 1854, a library for dynamic binary > translation. > ==24699== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP. > ==24699== Using valgrind-3.3.1, a dynamic binary instrumentation > framework. > ==24699== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et > al. > ==24699== > --24699-- Command line > --24699-- conntrack > --24699-- -L > --24699-- Startup, with flags: > --24699-- -v > --24699-- Contents of /proc/version: > --24699-- Linux version 2.6.27.23-0.1-default (geeko@buildhost) (gcc > version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP > 2009-05-26 17:02:05 -0400 > --24699-- Arch and hwcaps: AMD64, amd64-sse2 > --24699-- Page sizes: currently 4096, max supported 4096 > --24699-- Valgrind library directory: /usr/lib64/valgrind > --24699-- Reading syms from /usr/local/sbin/conntrack (0x400000) > --24699-- Reading syms from /lib64/ld-2.9.so (0x4000000) > --24699-- object doesn't have a symbol table > --24699-- Reading syms from /usr/lib64/valgrind/amd64-linux/memcheck > (0x38000000) > --24699-- object doesn't have a symbol table > --24699-- object doesn't have a dynamic symbol table > --24699-- Reading suppressions file: /usr/lib64/valgrind/default.supp > --24699-- Reading syms from > /usr/lib64/valgrind/amd64-linux/vgpreload_core.so (0x4A1F000) > --24699-- object doesn't have a symbol table > --24699-- Reading syms from > /usr/lib64/valgrind/amd64-linux/vgpreload_memcheck.so (0x4C21000) > --24699-- object doesn't have a symbol table > --24699-- Reading syms from > /usr/local/lib/libnetfilter_conntrack.so.1.2.0 (0x4E2A000) > --24699-- Reading syms from /usr/local/lib/libnfnetlink.so.0.2.0 > (0x5043000) > --24699-- Reading syms from /lib64/libc-2.9.so (0x524B000) > --24699-- object doesn't have a symbol table > --24699-- Reading syms from /lib64/libdl-2.9.so (0x55A4000) > --24699-- object doesn't have a symbol table > --24699-- REDIR: 0x52c7c30 (rindex) redirected to 0x4c25a00 (rindex) > --24699-- REDIR: 0x52c93f0 (memset) redirected to 0x4c26ca0 (memset) > --24699-- REDIR: 0x52c67b0 (strcmp) redirected to 0x4c26100 (strcmp) > --24699-- REDIR: 0x52c7540 (strlen) redirected to 0x4c25e20 (strlen) > --24699-- REDIR: 0x52c6600 (index) redirected to 0x4c25b20 (index) > --24699-- REDIR: 0x52c37f0 (malloc) redirected to 0x4c255e0 (malloc) > --24699-- REDIR: 0xffffffffff600400 (???) redirected to 0x3802d13d (???) > --24699-- REDIR: 0x52c3430 (calloc) redirected to 0x4c233b0 (calloc) > --24699-- REDIR: 0x52cad30 (memcpy) redirected to 0x4c26270 (memcpy) > --24699-- REDIR: 0x52cd840 (strchrnul) redirected to 0x4c26d70 > (strchrnul) > --24699-- REDIR: 0x52c9f20 (mempcpy) redirected to 0x4c26dd0 (mempcpy) > --24699-- REDIR: 0x52c0c00 (free) redirected to 0x4c242e0 (free) > --24699-- REDIR: 0x52c7b60 (strncpy) redirected to 0x4c25f50 (strncpy) > vex amd64->IR: unhandled instruction bytes: 0x6 0xDF 0xA1 0xBF 0x8 0x0 > ==24699== Invalid read of size 1 > ==24699== at 0x7FEFFD3E4: ??? > ==24699== by 0x7FEFFD3EB: ??? > ==24699== by 0x7FEFFD3F3: ??? > ==24699== Address 0x0 is not stack'd, malloc'd or (recently) free'd > ==24699== > ==24699== Process terminating with default action of signal 11 (SIGSEGV) > ==24699== Access not within mapped region at address 0x0 > ==24699== at 0x7FEFFD3E4: ??? > ==24699== by 0x7FEFFD3EB: ??? > ==24699== by 0x7FEFFD3F3: ??? > ==24699== > ==24699== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1) > ==24699== > ==24699== 1 errors in context 1 of 1: > ==24699== Invalid read of size 1 > ==24699== at 0x7FEFFD3E4: ??? > ==24699== by 0x7FEFFD3EB: ??? > ==24699== by 0x7FEFFD3F3: ??? > ==24699== Address 0x0 is not stack'd, malloc'd or (recently) free'd OK. It looks like we hit a NULL poiter assigment here, however something is wrong. Valgrind should pointed a file and a line numer so we should be able to track and fix it. Instead we have the "unhandled instruction bytes" message. Could you try a more recent version? AFAIR 3.4.1 has much better support for your arch (amd64). Please make sure both your libraries and the conntrack binary are not stripped. Best regards, Krzysztof Olędzki ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-07-06 10:30 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-06-23 7:27 null-pointer deref in ulogd2 Bernhard Schmidt 2009-06-23 8:31 ` Bernhard Schmidt 2009-06-23 15:40 ` Pablo Neira Ayuso 2009-06-23 16:54 ` Bernhard Schmidt 2009-06-23 22:39 ` Bernhard Schmidt 2009-06-24 10:59 ` conntrack segfault (was: Re: null-pointer deref in ulogd2) Bernhard Schmidt 2009-06-24 11:17 ` Krzysztof Oledzki 2009-06-24 11:57 ` Jan Engelhardt 2009-06-24 12:56 ` conntrack segfault Bernhard Schmidt 2009-06-24 17:58 ` Pablo Neira Ayuso 2009-06-24 20:05 ` Bernhard Schmidt 2009-06-24 22:18 ` Bernhard Schmidt 2009-07-02 16:30 ` Pablo Neira Ayuso 2009-07-06 10:29 ` Krzysztof Oledzki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).