Conntrack Events Performance - Multipart Messages?

* Conntrack Events Performance - Multipart Messages?
@ 2008-07-16 16:42 Fabian Hugelshofer
  2008-07-17  9:16 ` Patrick McHardy
  2008-07-17 10:03 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-16 16:42 UTC (permalink / raw)
  To: netfilter-devel

Hi,

I am writing a network application for a genuine wireless router (266 MHz
IXP4XX). I am capturing packets with ULOG and need connection tracking. 
For performance reasons I planned to use connection tracking events 
(NEW/DESTROY) to avoid doing the same work twice.

In a high load test case I stress the router with UDP packets with 
random source ports (1000B payload, 1800pps). CPU usage is 100%, 10% of 
packets and 80% ctevents are dropped. If I disable ctevents, the CPU 
usage is just 24% and no packet drops occur.

My application is not very heavy and I expect most of the ctevent 
overhead to be caused by passing events from kernel to user space. I 
expect that performance could be increased by using multipart messages 
for ctevents like it is done in ULOG/NFLOG.

Do you share my opinion, that multipart messages would lead to 
significant performance improvements? (Actually, I doubt that I will be 
more efficient than performing connection tracking in user space)

Do you think introducing multipart messages for connection tracking
events is feasible without breaking existing applications? Maybe with a
default setting of 1 bundled event, which can be increased by a
function call?

Is someone intending to implement multipart messages for ctevents? ;-)

Any comments are appreciated.

Regards,

Fabian

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-16 16:42 Conntrack Events Performance - Multipart Messages? Fabian Hugelshofer
@ 2008-07-17  9:16 ` Patrick McHardy
  2008-07-17 10:03 ` Pablo Neira Ayuso
  1 sibling, 0 replies; 30+ messages in thread
From: Patrick McHardy @ 2008-07-17  9:16 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel

Fabian Hugelshofer wrote:
> Hi,
> 
> I am writing a network application for a genuine wireless router (266Mhz 
> IXP4XX). I am capturing packets with ULOG and need connection tracking. 
> For performance reasons I planned to use connection tracking events 
> (NEW/DESTROY) to avoid doing the same work twice.
> 
> In a high load test case I stress the router with UDP packets with 
> random source ports (1000B payload, 1800pps). CPU usage is 100%, 10% of 
> packets and 80% ctevents are dropped. If I disable ctevents, the CPU 
> usage is just 24% and no packet drops occur.
> 
> My application is not very heavy and I expect most of the ctevent 
> overhead to be caused by passing events from kernel to user space. I 
> expect that performance could be increased by using multipart messages 
> for ctevents like it is done in ULOG/NFLOG.
> 
> Do you share my opinion, that multipart messages would lead to 
> significant performance improvements? (Actually, I doubt that I will be 
> more efficient than performing connection tracking in user space)

Quite possible, but some profiles would be useful to determine
whether this is actually the bottleneck.

> Do you think introducing multipart messages for connection tracking 
> events is feasible without breaking existing applications? Maybe with a 
> default setting of 1 bundled events, which can be increased by a 
> function call?

That sounds sane.

> Is someone intending to implement multipart messages for ctevents? ;-)

I don't think so.

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-16 16:42 Conntrack Events Performance - Multipart Messages? Fabian Hugelshofer
  2008-07-17  9:16 ` Patrick McHardy
@ 2008-07-17 10:03 ` Pablo Neira Ayuso
  2008-07-17 14:34   ` Fabian Hugelshofer
  2008-07-18  2:11   ` Patrick McHardy
  1 sibling, 2 replies; 30+ messages in thread
From: Pablo Neira Ayuso @ 2008-07-17 10:03 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel

Fabian Hugelshofer wrote:
> I am writing a network application for a genuine wireless router (266Mhz
> IXP4XX). I am capturing packets with ULOG and need connection tracking.
> For performance reasons I planned to use connection tracking events
> (NEW/DESTROY) to avoid doing the same work twice.

Did you write your own application to handle ctevents and ULOG messages?
Are you using any library? What does your application do?

We now have Berkeley socket filtering (BSF) facilities for netlink; you
may use them to filter only the events that you need. I have a patch here
for libnetfilter_conntrack that introduces a high-level API to
autogenerate simple BSF code for filtering. As soon as I finish testing
it, I'll commit it.

Also, you may periodically dump the connection tracking table (polling),
but, of course, this depends on the nature of your application. Assuming
that your application is a logger, this is not a choice as you'll lose
information.

> In a high load test case I stress the router with UDP packets with
> random source ports (1000B payload, 1800pps). CPU usage is 100%, 10% of
> packets and 80% ctevents are dropped. If I disable ctevents, the CPU
> usage is just 24% and no packet drops occur.

I have a similar testbed here. You did not mention the threshold that
you're using in ULOG. If you provide more information on your
application I'll try to reproduce those numbers.

> My application is not very heavy and I expect most of the ctevent
> overhead to be caused by passing events from kernel to user space. I
> expect that performance could be increased by using multipart messages
> for ctevents like it is done in ULOG/NFLOG.
> 
> Do you share my opinion, that multipart messages would lead to
> significant performance improvements? (Actually, I doubt that I will be
> more efficient than performing connection tracking in user space)

Yes, I think that batching could help here.

> Do you think introducing multipart messages for connection tracking
> events is feasible without breaking existing applications? Maybe with a
> default setting of 1 bundled events, which can be increased by a
> function call?

AFAIK, libnfnetlink and other netlink-based libraries should handle the
multipart messages appropriately so that should not be a problem.
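
For illustration, a receiver that is prepared for batched events walks
each received datagram with the standard netlink macros instead of
assuming one message per recv(). This is a minimal sketch, assuming
Linux's <linux/netlink.h>; the helper name count_nlmsgs is hypothetical
and is not part of libnfnetlink.

```c
#include <sys/socket.h>
#include <linux/netlink.h>

/* Count the netlink messages packed into one received datagram.
 * buf must be suitably aligned for struct nlmsghdr; len is the byte
 * count returned by recv(). Hypothetical helper for illustration. */
int count_nlmsgs(void *buf, int len)
{
	struct nlmsghdr *nlh = buf;
	int count = 0;

	/* NLMSG_OK checks that a complete message fits into the remaining
	 * bytes; NLMSG_NEXT advances to the next one and decrements len. */
	while (NLMSG_OK(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE)
			break;	/* end-of-batch marker, not an event */
		count++;
		nlh = NLMSG_NEXT(nlh, len);
	}
	return count;
}
```

A reader written this way behaves the same whether the kernel sends one
event per datagram or several.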

> Is someone intending to implement multipart messages for ctevents? ;-)

The problem here is that batching should be a per-socket parameter. We
will not accept a patch that changes the behaviour for all the ctevent
users. And I don't see an obvious way to do this now.

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-17 10:03 ` Pablo Neira Ayuso
@ 2008-07-17 14:34   ` Fabian Hugelshofer
  2008-07-17 15:15     ` Fabian Hugelshofer
  2008-07-18 15:56     ` Fabian Hugelshofer
  2008-07-18  2:11   ` Patrick McHardy
  1 sibling, 2 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-17 14:34 UTC (permalink / raw)
  To: netfilter-devel

Pablo Neira Ayuso wrote:
> Fabian Hugelshofer wrote:
>> I am writing a network application for a genuine wireless router (266Mhz
>> IXP4XX). I am capturing packets with ULOG and need connection tracking.
>> For performance reasons I planned to use connection tracking events
>> (NEW/DESTROY) to avoid doing the same work twice.
> 
> Did you write your own application to handle ctevents and ULOG messages?
> Are you using any library? What does your application do?

I am using:
libnetfilter_conntrack 0.0.89
libnfnetlink 0.0.38
libipulog (from ulog2 r6382)

My application parses packets (IP and transport headers) and increments 
different counters per sending host. For this a hash table lookup is 
performed. Similar for ctevents.

> We now have the berkeley socket filtering facilities for netlink, you
> may use it to filter only the events that you need. I have a patch here
> for libnetfilter_conntrack that introduces a high-level API to
> autogenerate simple BSF code for filtering. As soon as I finish testing
> it, I'll commit it.

This is interesting and useful in various cases. Thanks for
implementing it. However, it does not help here, as I need all events.

> Also, you may periodically dump the connection tracking table (polling),
> but, of course, this depends on the nature of your application. Assuming
> that your application is a logger, this is not a choice as you'll lose
> information.

Polling is not possible, as I need the DESTROY events and the
information in them. Furthermore, I don't want to lose any events.

>> In a high load test case I stress the router with UDP packets with
>> random source ports (1000B payload, 1800pps). CPU usage is 100%, 10% of
>> packets and 80% ctevents are dropped. If I disable ctevents, the CPU
>> usage is just 24% and no packet drops occur.
> 
> I have a similar testbed here. You did not mention the threshold that
> you're using in ULOG. If you provide more information on your
> application I'll try to reproduce those numbers.

--ulog-cprange 96 --ulog-qthreshold 50

A dispatcher is calling back a ulog and a ctevent module as soon as 
their sockets are readable. The modules then read one message from the 
socket and process it before returning to the dispatcher. If both 
sockets are readable at the same time, both modules are called back.
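
The dispatcher described above can be sketched roughly as follows. This
is an illustration under assumptions, not the actual application code:
the names module, dispatch_once, read_one_message and MAX_MODULES are
hypothetical, and poll() is assumed as the readiness mechanism.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

#define MAX_MODULES 8

/* A module is a socket plus a callback that handles one message. */
typedef void (*module_cb)(int fd);

struct module {
	int fd;
	module_cb on_readable;
};

/* Number of messages handled so far (for the example callback). */
int messages_seen;

/* Example callback: read exactly one datagram, then return. */
void read_one_message(int fd)
{
	char buf[4096];

	if (recv(fd, buf, sizeof(buf), 0) >= 0)
		messages_seen++;
}

/* Wait up to timeout_ms for any module socket to become readable and
 * call back every ready module once. Returns the number of callbacks
 * made, or -1 on error. */
int dispatch_once(const struct module *mods, size_t n, int timeout_ms)
{
	struct pollfd pfds[MAX_MODULES];
	size_t i;
	int called = 0;

	if (n > MAX_MODULES)
		return -1;
	for (i = 0; i < n; i++) {
		pfds[i].fd = mods[i].fd;
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}
	if (poll(pfds, n, timeout_ms) < 0)
		return -1;
	for (i = 0; i < n; i++) {
		if (pfds[i].revents & POLLIN) {
			mods[i].on_readable(mods[i].fd);
			called++;
		}
	}
	return called;
}
```

With one message read per wakeup, every event costs at least one poll()
and one recv() system call, which is the per-message overhead in
question.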

To rule out side effects, I wrote a small test application which just
reads the ctevent socket and does nothing else. You will find it
attached to this email.

I started the application, sent 102'313 UDP packets with random source
ports and 1000B UDP payload in 57s (1795pps) and then waited until all
entries had been removed from the connection tracking table. The CPU
usage was measured with top every 10s while sending and then averaged
over 5 intervals. It was 56%, which seems quite high to me. Without any
applications running, routing this UDP traffic takes 11%.

The test application reports 113699 received events and 140 overflows.
The events are NEW and DESTROY events. I would expect roughly twice as
many events as packets (one NEW and one DESTROY per packet, since every
packet uses a new source port). Under this assumption 44% of the events
have been dropped.
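
The drop estimate can be reproduced with a short calculation. This is a
sketch only; the function name dropped_percent is hypothetical, and it
relies on the assumption stated above that each packet produces exactly
one NEW and one DESTROY event.

```c
/* Percentage of expected events that never reached user space.
 * Assumes one NEW and one DESTROY event per packet. */
double dropped_percent(long packets, long events_received)
{
	double expected = 2.0 * (double)packets;

	return 100.0 * (1.0 - (double)events_received / expected);
}
```

For the figures from this test, dropped_percent(102313, 113699) comes
out at about 44.4.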

As I wrote earlier, my real application is easily able to capture and
process packets from this traffic flow without any drops and at 24% CPU
usage, as long as I unload nf_conntrack_netlink or do not initialise my
ctevent module. If ctevents are used, it is reasonable that more events
are lost than with the test application, as no events are read while
a multipart ulog message is being processed. However, 44% drops with the
test app seem too high. I think that either ctevents are very expensive
in general or a big overhead is introduced by passing them one at a time.

>> Do you think introducing multipart messages for connection tracking
>> events is feasible without breaking existing applications? Maybe with a
>> default setting of 1 bundled events, which can be increased by a
>> function call?
> 
> AFAIK, libnfnetlink and other netlink-based libraries should handle the
> multipart messages appropriately so that should not be a problem.
> 
>> Is someone intending to implement multipart messages for ctevents? ;-)
> 
> The problem here is that batching should be a per-socket parameter. We
> will not accept a patch that changes the behaviour for all the ctevent
> users. And I don't see an obvious way to do this now.

If you do see a good way to do it, let me know. Depending on how much 
time I have I might consider implementing it.

Fabian

Example top for ctevtest:
Cpu(s):  0.1%us,  3.1%sy,  0.0%ni, 38.8%id,  0.0%wa,  1.4%hi, 56.6%si,  0.0%st
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3913 root      20   0   696  220  160 R 32.1  0.7   0:03.89 ctevtest
  3916 root      20   0  1068  548  412 R  0.8  1.8   0:00.22 top

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-17 14:34   ` Fabian Hugelshofer
@ 2008-07-17 15:15     ` Fabian Hugelshofer
  2008-07-18 15:56     ` Fabian Hugelshofer
  1 sibling, 0 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-17 15:15 UTC (permalink / raw)
  To: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 237 bytes --]

Fabian Hugelshofer wrote:
> To reduce any side effects I wrote a small test application which just 
> reads the ctevent socket and does nothing else. You find it attached to 
> this email.

I forgot to attach it; you will find it in this email...

[-- Attachment #2: ctevtest.c --]
[-- Type: text/x-csrc, Size: 1201 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>	/* recv() */

#include <libnfnetlink/libnfnetlink.h>
#include <libnetfilter_conntrack/libnetfilter_conntrack.h>


volatile sig_atomic_t terminate;


static void __sig_handler(int sig)
{
	switch (sig) {
		case SIGTERM:
		case SIGINT:
			terminate = 1;
			break;
	}
}

int main(int argc, char* argv[])
{
	struct nfct_handle *h;
	struct sigaction sigact;
	char buf[NFNL_BUFFSIZE] __attribute__ ((aligned));
	int len;
	int events = 0;
	int overflows = 0;

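	terminate = 0;
	/* Initialise every field that sigaction() reads. Without SA_RESTART,
	 * a signal makes recv() fail with EINTR so the loop can terminate. */
	sigact.sa_handler = &__sig_handler;
	sigemptyset(&sigact.sa_mask);
	sigact.sa_flags = 0;
	sigaction(SIGINT, &sigact, NULL);
	sigaction(SIGTERM, &sigact, NULL);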

	h = nfct_open(NFNL_SUBSYS_CTNETLINK,
			NF_NETLINK_CONNTRACK_NEW | NF_NETLINK_CONNTRACK_DESTROY);
	if (h == NULL) {
		perror("opening ctnetlink failed");
		exit(EXIT_FAILURE);
	}
	
	while (!terminate) {
		len = recv(nfct_fd(h), buf, sizeof(buf), 0);
		if (len < 0) {
			if (errno == ENOBUFS) {
				overflows++;
			} else if (errno != EINTR) {
				perror("recv failed");
				nfct_close(h);
				exit(EXIT_FAILURE);
			}
		} else {
			events++;
		}
	}

	printf("%d events received (%d overflows)\n", events, overflows);

	nfct_close(h);

	exit(EXIT_SUCCESS);
}

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-17 14:34   ` Fabian Hugelshofer
  2008-07-17 15:15     ` Fabian Hugelshofer
@ 2008-07-18 15:56     ` Fabian Hugelshofer
  1 sibling, 0 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-18 15:56 UTC (permalink / raw)
  To: netfilter-devel

I accidentally sent my last message to Pablo only. He replied
without noticing that the list was missing. You will find his reply
included in this message. (And I got it wrong with this message again,
sorry about that.)

Pablo Neira Ayuso wrote:
> Fabian Hugelshofer wrote:
>> I started the application, sent 102'313 UDP packets with random source
>> ports and 1000B UDP payload in 57s (1795pps) and then waited until all
>> entries have been removed from the connection table. The CPU usage was
>> measured with top every 10s while sending and then averaged over 5
>> intervals. It was 56%, which seems quite high to me. Without any
>> applications running it is 11% to route this UDP traffic.
>>
>> The test application reports 113699 received events and 140 overflows.
>> The events are NEW and DESTROY events. Generally I would roughly expect
>> the double number of events than packets. Under this assumption 44% of
>> events have been dropped.
> 
> I guess that your device has little memory so the default socket buffer
> must be pretty small. I suggest you to increase the socket buffer size
> via nfnl_rcvbufsiz(), that will delay the ENOBUFS. I'd like to see the
> results with my suggestion.

The system has only 32MB of RAM. The default socket buffer size is
110592 bytes which applies to ctevtest. My real application increases
the socket buffer to 430080 bytes.

> Of course, this suggestion is not directly related with the message
> batching that you're proposing that can be useful to reduce CPU
> consumption - if someone wants to use ctevents for logging purposes
> which is what you want.

I increased the socket buffer size of ctevtest to 2MB. With this setting
no more overruns occur and all events can be logged (same test as
before). The CPU usage is now 84%.
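
For reference, enlarging a socket's receive buffer comes down to a
socket option. The following is a minimal sketch that uses plain
setsockopt() instead of nfnl_rcvbufsiz(), whose internals are not shown
in this thread. set_rcvbuf is a hypothetical helper name, and fd would
be the descriptor returned by nfct_fd(h) as in ctevtest.

```c
#include <sys/socket.h>

/* Request a receive buffer of `bytes` on socket fd and return the size
 * the kernel reports afterwards, or -1 on error. On Linux the kernel
 * caps the request at net.core.rmem_max and doubles the stored value to
 * leave room for bookkeeping overhead. */
int set_rcvbuf(int fd, int bytes)
{
	int granted = 0;
	socklen_t len = sizeof(granted);

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
		return -1;
	return granted;
}
```

Checking the returned value is worthwhile, because a request above the
system limit is reduced silently rather than rejected.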

I have not retested my real application yet, but the CPU usage will hit
100% again and the bigger buffer size will no longer help to prevent
losses.

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-17 10:03 ` Pablo Neira Ayuso
  2008-07-17 14:34   ` Fabian Hugelshofer
@ 2008-07-18  2:11   ` Patrick McHardy
  2008-07-21 15:51     ` Fabian Hugelshofer
  1 sibling, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-18  2:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Fabian Hugelshofer, netfilter-devel

Pablo Neira Ayuso wrote:
> Fabian Hugelshofer wrote:
>> Is someone intending to implement multipart messages for ctevents? ;-)
>
> The problem here is that batching should be a per-socket parameter. We
> will not accept a patch that changes the behaviour for all the ctevent
> users. And I don't see an obvious way to do this now

That's a good point, there's no pretty solution to this. Profiles
would still be nice to have though :)


* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-18  2:11   ` Patrick McHardy
@ 2008-07-21 15:51     ` Fabian Hugelshofer
  2008-07-21 15:59       ` Patrick McHardy
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-21 15:51 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Pablo Neira Ayuso, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2826 bytes --]

Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> The problem here is that batching should be a per-socket parameter. We
>> will not accept a patch that changes the behaviour for all the ctevent
>> users. And I don't see an obvious way to do this now
> 
> Thats a good point, there's no pretty solution to this. Profiles
> would still be nice to have though :)

It took me some time to set up profiling on the router. I am using 
oprofile. nf_conntrack and nfnetlink are built into the kernel. Most of 
the time is probably spent in nfnetlink.

If you need data other than what is provided here, just let me know. I'm
not very familiar with oprofile, so providing the needed arguments would
be helpful.

opreport:
CPU: ARM/XScale PMU2, speed 0 MHz (estimated)
Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CYCLES:100000|
   samples|      %|
------------------
    156310 75.3672 vmlinux
     25733 12.4075 ath_pci
     14674  7.0753 wlan
      5383  2.5955 nf_conntrack_netlink
      2926  1.4108 oprofiled
      1206  0.5815 ath_rate_minstrel
       246  0.1186 libuClibc-0.9.29.so
       234  0.1128 iptable_raw
       232  0.1119 ld-uClibc-0.9.29.so
       145  0.0699 ctevtest
       128  0.0617 busybox
        99  0.0477 libnfnetlink.so.0.2.0
        56  0.0270 libnetfilter_conntrack.so.1.2.0
        25  0.0121 arp_tables
         1 4.8e-04 arptable_filter

opreport --symbols (first 20, full file attached):
CPU: ARM/XScale PMU2, speed 0 MHz (estimated)
Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
20018     9.6520  vmlinux                  memcpy
19493     9.3988  ath_pci.ko               ath_sysctl_register
6676      3.2189  vmlinux                  __nf_conntrack_find
6098      2.9402  vmlinux                  ipt_do_table
5858      2.8245  wlan.ko                  ieee80211_input
5383      2.5955  nf_conntrack_netlink.ko  .text
4567      2.2020  vmlinux                  __memzero
4225      2.0371  vmlinux                  __kmalloc
4091      1.9725  vmlinux                  csum_partial
3469      1.6726  vmlinux                  nf_nat_setup_info
3468      1.6721  vmlinux                  netlink_broadcast
3376      1.6278  vmlinux                  __hash_conntrack
3304      1.5931  vmlinux                  nla_put
2992      1.4426  vmlinux                  __nla_reserve
2933      1.4142  vmlinux                  __nf_conntrack_confirm
2926      1.4108  oprofiled                /jffs/usr/bin/oprofiled
2847      1.3727  ath_pci.ko               ath_intr
2698      1.3009  vmlinux                  kfree
2655      1.2801  vmlinux                  __nla_put
2563      1.2358  vmlinux                  nf_conntrack_in


[-- Attachment #2: oprep_symb.txt --]
[-- Type: text/plain, Size: 27139 bytes --]

CPU: ARM/XScale PMU2, speed 0 MHz (estimated)
Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
20018     9.6520  vmlinux                  memcpy
19493     9.3988  ath_pci.ko               ath_sysctl_register
6676      3.2189  vmlinux                  __nf_conntrack_find
6098      2.9402  vmlinux                  ipt_do_table
5858      2.8245  wlan.ko                  ieee80211_input
5383      2.5955  nf_conntrack_netlink.ko  .text
4567      2.2020  vmlinux                  __memzero
4225      2.0371  vmlinux                  __kmalloc
4091      1.9725  vmlinux                  csum_partial
3469      1.6726  vmlinux                  nf_nat_setup_info
3468      1.6721  vmlinux                  netlink_broadcast
3376      1.6278  vmlinux                  __hash_conntrack
3304      1.5931  vmlinux                  nla_put
2992      1.4426  vmlinux                  __nla_reserve
2933      1.4142  vmlinux                  __nf_conntrack_confirm
2926      1.4108  oprofiled                /jffs/usr/bin/oprofiled
2847      1.3727  ath_pci.ko               ath_intr
2698      1.3009  vmlinux                  kfree
2655      1.2801  vmlinux                  __nla_put
2563      1.2358  vmlinux                  nf_conntrack_in
2561      1.2348  vmlinux                  nf_conntrack_alloc
2331      1.1239  ath_pci.ko               ath_suspend
2122      1.0232  vmlinux                  netif_receive_skb
1955      0.9426  vmlinux                  nf_iterate
1929      0.9301  vmlinux                  local_bh_enable
1916      0.9238  vmlinux                  __copy_to_user
1810      0.8727  vmlinux                  handle_IRQ_event
1798      0.8669  vmlinux                  dev_queue_xmit
1781      0.8587  wlan.ko                  ieee80211_find_txnode
1751      0.8443  vmlinux                  __alloc_skb
1657      0.7989  vmlinux                  ip_rcv
1633      0.7874  vmlinux                  __copy_skb_header
1619      0.7806  wlan.ko                  ieee80211_hardstart
1566      0.7551  vmlinux                  pskb_expand_head
1505      0.7257  vmlinux                  ip_forward
1360      0.6557  vmlinux                  skb_release_data
1273      0.6138  wlan.ko                  ieee80211_saveath
1250      0.6027  vmlinux                  default_idle
1222      0.5892  vmlinux                  nf_nat_fn
1206      0.5815  ath_rate_minstrel.ko     .text
1197      0.5772  vmlinux                  xscale_dma_inv_range
1182      0.5699  vmlinux                  kmem_cache_alloc
1130      0.5448  vmlinux                  skb_release_all
1064      0.5130  vmlinux                  ip_finish_output
1062      0.5121  ath_pci.ko               ath_sysctl_unregister
1039      0.5010  vmlinux                  netlink_recvmsg
1038      0.5005  vmlinux                  ip_route_input
1031      0.4971  vmlinux                  dma_map_single
1008      0.4860  vmlinux                  local_bh_disable
990       0.4773  vmlinux                  skb_copy
982       0.4735  vmlinux                  nf_ct_invert_tuple
967       0.4663  vmlinux                  nf_hook_slow
936       0.4513  vmlinux                  ip_rcv_finish
914       0.4407  vmlinux                  kmem_cache_free
901       0.4344  vmlinux                  udp_error
865       0.4171  vmlinux                  __wake_up
864       0.4166  vmlinux                  __mod_timer
845       0.4074  vmlinux                  dev_hard_start_xmit
843       0.4065  vmlinux                  death_by_timeout
802       0.3867  vmlinux                  __nf_ct_refresh_acct
764       0.3684  vmlinux                  memset
733       0.3534  vmlinux                  memcmp
732       0.3529  vmlinux                  __nf_ct_l4proto_find
730       0.3520  vmlinux                  pfifo_fast_enqueue
692       0.3337  vmlinux                  destroy_conntrack
688       0.3317  vmlinux                  ip_rt_send_redirect
679       0.3274  wlan.ko                  ieee80211_encap
678       0.3269  vmlinux                  dma_unmap_single
647       0.3120  wlan.ko                  ieee80211_unref_node
643       0.3100  vmlinux                  __kfree_skb
632       0.3047  vmlinux                  nlmsg_notify
601       0.2898  vmlinux                  ip_output
600       0.2893  vmlinux                  dma_cache_maint
597       0.2879  vmlinux                  kfree_skb
593       0.2859  vmlinux                  __udivsi3
590       0.2845  vmlinux                  ipv4_get_l4proto
586       0.2825  vmlinux                  skb_checksum
582       0.2806  vmlinux                  netlink_has_listeners
576       0.2777  vmlinux                  __aeabi_idiv
568       0.2739  vmlinux                  notifier_call_chain
534       0.2575  wlan.ko                  ieee80211_dev_alloc_skb
518       0.2498  vmlinux                  nf_ct_port_tuple_to_nlattr
518       0.2498  vmlinux                  nf_nat_packet
514       0.2478  wlan.ko                  ieee80211_skb_track
508       0.2449  vmlinux                  nf_ct_deliver_cached_events
499       0.2406  vmlinux                  nf_nat_rule_find
490       0.2363  vmlinux                  nf_ip_checksum
481       0.2319  vmlinux                  __nf_ct_ext_add
479       0.2310  vmlinux                  nf_nat_cleanup_conntrack
471       0.2271  vmlinux                  rt_hash_code
469       0.2261  wlan.ko                  ieee80211_cancel_scan
464       0.2237  vmlinux                  nf_ct_invert_tuplepr
456       0.2199  vmlinux                  nf_ct_get_tuple
455       0.2194  vmlinux                  __nf_ct_helper_find
439       0.2117  vmlinux                  ipv4_tuple_to_nlattr
434       0.2093  vmlinux                  nf_ct_l3proto_find_get
430       0.2073  vmlinux                  alloc_null_binding
424       0.2044  vmlinux                  del_timer
419       0.2020  vmlinux                  __nf_ct_ext_destroy
413       0.1991  vmlinux                  sock_def_readable
403       0.1943  vmlinux                  udp_pkt_to_tuple
401       0.1933  vmlinux                  dma_sync_single_for_cpu
396       0.1909  vmlinux                  ipv4_pkt_to_tuple
386       0.1861  vmlinux                  copy_skb_header
384       0.1852  vmlinux                  ipv4_invert_tuple
377       0.1818  vmlinux                  __nf_ct_event_cache_init
365       0.1760  vmlinux                  nfnetlink_send
363       0.1750  vmlinux                  __skb_checksum_complete_head
363       0.1750  vmlinux                  nf_ct_l4proto_find_get
356       0.1717  vmlinux                  module_put
354       0.1707  vmlinux                  skb_copy_bits
343       0.1654  vmlinux                  cpu_idle
342       0.1649  vmlinux                  sock_recvmsg
333       0.1606  vmlinux                  nf_nat_in
332       0.1601  wlan.ko                  ieee80211_parent_queue_xmit
330       0.1591  vmlinux                  ipv4_conntrack_defrag
323       0.1557  vmlinux                  ipt_route_hook
320       0.1543  vmlinux                  nf_ct_remove_expectations
319       0.1538  vmlinux                  __nf_ct_expect_find
314       0.1514  vmlinux                  sys_recvfrom
309       0.1490  vmlinux                  udp_invert_tuple
299       0.1442  vmlinux                  atomic_notifier_call_chain
295       0.1422  wlan.ko                  ieee80211_ref_node
293       0.1413  vmlinux                  ipv4_confirm
290       0.1398  vmlinux                  pfifo_fast_dequeue
284       0.1369  vmlinux                  ipv4_conntrack_help
280       0.1350  vmlinux                  __nf_conntrack_hash_insert
274       0.1321  wlan.ko                  ieee80211_setup_rates
273       0.1316  vmlinux                  nf_conntrack_find_get
273       0.1316  vmlinux                  nf_nat_used_tuple
271       0.1307  vmlinux                  nf_ct_l3proto_put
267       0.1287  vmlinux                  nf_conntrack_tuple_taken
267       0.1287  vmlinux                  nfnetlink_has_listeners
263       0.1268  vmlinux                  udp_packet
261       0.1258  vmlinux                  dev_kfree_skb_any
259       0.1249  vmlinux                  nf_nat_out
247       0.1191  vmlinux                  xscale_dma_clean_range
246       0.1186  libuClibc-0.9.29.so      /rom/lib/libuClibc-0.9.29.so
246       0.1186  wlan.ko                  ieee80211_dev_kfree_skb
244       0.1176  vmlinux                  __umodsi3
241       0.1162  vmlinux                  dma_needs_bounce
240       0.1157  vmlinux                  __do_softirq
240       0.1157  vmlinux                  lock_timer_base
234       0.1128  iptable_raw.ko           .text
233       0.1123  vmlinux                  nf_ct_find_expectation
232       0.1119  ld-uClibc-0.9.29.so      /rom/lib/ld-uClibc-0.9.29.so
232       0.1119  vmlinux                  cpu_xscale_switch_mm
230       0.1109  vmlinux                  skb_queue_tail
228       0.1099  vmlinux                  ip_forward_finish
228       0.1099  vmlinux                  nf_ct_l4proto_put
226       0.1090  vmlinux                  net_rx_action
209       0.1008  vmlinux                  add_event_entry
209       0.1008  vmlinux                  skb_dequeue
208       0.1003  vmlinux                  init_timer
200       0.0964  vmlinux                  vector_swi
191       0.0921  vmlinux                  __csum_ipv6_magic
188       0.0906  vmlinux                  memcpy_toiovec
188       0.0906  vmlinux                  skb_recv_datagram
178       0.0858  vmlinux                  sync_buffer
166       0.0800  vmlinux                  skb_copy_datagram_iovec
163       0.0786  vmlinux                  nf_conntrack_destroy
158       0.0762  vmlinux                  __atomic_notifier_call_chain
158       0.0762  vmlinux                  ipv4_conntrack_in
151       0.0728  vmlinux                  nf_conntrack_free
148       0.0714  vmlinux                  schedule
144       0.0694  vmlinux                  ip_sabotage_in
142       0.0685  vmlinux                  xscale_flush_user_cache_range
138       0.0665  vmlinux                  ipt_hook
135       0.0651  vmlinux                  __qdisc_run
134       0.0646  vmlinux                  nf_nat_adjust
128       0.0617  busybox                  /rom/bin/busybox
126       0.0608  wlan.ko                  ieee80211_iterate_dev_nodes
124       0.0598  vmlinux                  __skb_checksum_complete
123       0.0593  vmlinux                  sock_rfree
121       0.0583  ctevtest                 main
108       0.0521  vmlinux                  fget_light
101       0.0487  vmlinux                  add_sample_entry
99        0.0477  libnfnetlink.so.0.2.0    /rom/usr/lib/libnfnetlink.so.0.2.0
99        0.0477  vmlinux                  mc_copy_user_page
98        0.0473  vmlinux                  helper_hash
98        0.0473  vmlinux                  update_mmu_cache
97        0.0468  vmlinux                  sys_recv
94        0.0453  vmlinux                  neigh_resolve_output
87        0.0419  vmlinux                  net_tx_action
83        0.0400  vmlinux                  __napi_schedule
71        0.0342  vmlinux                  tasklet_action
70        0.0338  vmlinux                  xscale_mc_clear_user_page
64        0.0309  vmlinux                  run_timer_softirq
57        0.0275  vmlinux                  netlink_overrun
56        0.0270  libnetfilter_conntrack.so.1.2.0 /jffs/usr/lib/libnetfilter_conntrack.so.1.2.0
53        0.0256  vmlinux                  skb_free_datagram
40        0.0193  vmlinux                  __flush_whole_cache
40        0.0193  vmlinux                  udp_new
39        0.0188  vmlinux                  get_page_from_freelist
37        0.0178  vmlinux                  eth_header
34        0.0164  vmlinux                  tick_nohz_stop_sched_tick
32        0.0154  vmlinux                  tick_nohz_restart_sched_tick
29        0.0140  vmlinux                  handle_mm_fault
27        0.0130  vmlinux                  __link_path_walk
27        0.0130  vmlinux                  ret_fast_syscall
26        0.0125  vmlinux                  __tasklet_schedule
25        0.0121  arp_tables.ko            arpt_do_table
24        0.0116  ctevtest                 .plt
24        0.0116  vmlinux                  find_lock_page
24        0.0116  vmlinux                  unmap_vmas
23        0.0111  vmlinux                  hrtimer_run_queues
22        0.0106  vmlinux                  __switch_to
22        0.0106  vmlinux                  find_vma
22        0.0106  vmlinux                  pfifo_fast_requeue
21        0.0101  vmlinux                  __queue_work
21        0.0101  vmlinux                  do_page_fault
18        0.0087  vmlinux                  eth_poll
17        0.0082  vmlinux                  __dabt_usr
16        0.0077  vmlinux                  __do_fault
15        0.0072  vmlinux                  arp_process
14        0.0068  vmlinux                  filemap_fault
12        0.0058  vmlinux                  cache_reap
11        0.0053  vmlinux                  put_page
10        0.0048  vmlinux                  __rcu_process_callbacks
9         0.0043  vmlinux                  copy_process
9         0.0043  vmlinux                  do_alignment
9         0.0043  vmlinux                  drain_array
9         0.0043  vmlinux                  dsp_do
9         0.0043  vmlinux                  fib_semantic_match
9         0.0043  vmlinux                  free_hot_cold_page
9         0.0043  wlan.ko                  ieee80211_beacon_update
8         0.0039  vmlinux                  __d_lookup
8         0.0039  vmlinux                  vma_adjust
7         0.0034  vmlinux                  copy_page_range
7         0.0034  vmlinux                  fib_validate_source
7         0.0034  vmlinux                  zone_watermark_ok
6         0.0029  vmlinux                  __wake_up_bit
6         0.0029  vmlinux                  anon_vma_unlink
6         0.0029  vmlinux                  cpu_xscale_set_pte_ext
6         0.0029  vmlinux                  do_DataAbort
6         0.0029  vmlinux                  do_alignment_ldrstr
6         0.0029  vmlinux                  load_elf_binary
6         0.0029  vmlinux                  mdio_read
6         0.0029  vmlinux                  neigh_lookup
6         0.0029  vmlinux                  xscale_flush_kern_dcache_page
6         0.0029  wlan.ko                  ieee80211_skb_untrack
5         0.0024  vmlinux                  __clear_user
5         0.0024  vmlinux                  __page_set_anon_rmap
5         0.0024  vmlinux                  arp_hash
5         0.0024  vmlinux                  arp_rcv
5         0.0024  vmlinux                  do_exit
5         0.0024  vmlinux                  do_lookup
5         0.0024  vmlinux                  do_wp_page
5         0.0024  vmlinux                  dput
5         0.0024  vmlinux                  unlock_page
4         0.0019  vmlinux                  __up_read
4         0.0019  vmlinux                  find_get_page
4         0.0019  vmlinux                  find_vma_prev
4         0.0019  vmlinux                  finish_task_switch
4         0.0019  vmlinux                  fn_hash_lookup
4         0.0019  vmlinux                  fput
4         0.0019  vmlinux                  free_pgtables
4         0.0019  vmlinux                  neigh_update
4         0.0019  vmlinux                  page_add_file_rmap
4         0.0019  vmlinux                  qmgr_irq1
4         0.0019  vmlinux                  up_read
4         0.0019  vmlinux                  vma_link
4         0.0019  wlan.ko                  ieee80211_recv_mgmt
3         0.0014  vmlinux                  __copy_from_user
3         0.0014  vmlinux                  __dentry_open
3         0.0014  vmlinux                  __down_read_trylock
3         0.0014  vmlinux                  __pabt_usr
3         0.0014  vmlinux                  __pagevec_lru_add_active
3         0.0014  vmlinux                  __udp4_lib_rcv
3         0.0014  vmlinux                  __up_write
3         0.0014  vmlinux                  __vm_enough_memory
3         0.0014  vmlinux                  __vma_link
3         0.0014  vmlinux                  anon_vma_link
3         0.0014  vmlinux                  anon_vma_prepare
3         0.0014  vmlinux                  cpu_xscale_dcache_clean_area
3         0.0014  vmlinux                  do_mmap_pgoff
3         0.0014  vmlinux                  do_munmap
3         0.0014  vmlinux                  do_sync_read
3         0.0014  vmlinux                  down_read_trylock
3         0.0014  vmlinux                  eth_type_trans
3         0.0014  vmlinux                  file_read_actor
3         0.0014  vmlinux                  filp_close
3         0.0014  vmlinux                  find_mergeable_anon_vma
3         0.0014  vmlinux                  finish_wait
3         0.0014  vmlinux                  generic_fillattr
3         0.0014  vmlinux                  lru_add_drain
3         0.0014  vmlinux                  mark_page_accessed
3         0.0014  vmlinux                  mmap_region
3         0.0014  vmlinux                  page_remove_rmap
3         0.0014  vmlinux                  release_mm
3         0.0014  vmlinux                  ret_to_user
3         0.0014  vmlinux                  touch_atime
3         0.0014  vmlinux                  v4wbi_flush_user_tlb_range
3         0.0014  vmlinux                  vm_normal_page
3         0.0014  wlan.ko                  ieee80211_fix_rate
2        9.6e-04  vmlinux                  __alloc_pages
2        9.6e-04  vmlinux                  __dabt_svc
2        9.6e-04  vmlinux                  __down_read
2        9.6e-04  vmlinux                  __free_pages
2        9.6e-04  vmlinux                  __rb_rotate_left
2        9.6e-04  vmlinux                  __remove_shared_vm_struct
2        9.6e-04  vmlinux                  __user_walk_fd
2        9.6e-04  vmlinux                  _atomic_dec_and_lock
2        9.6e-04  vmlinux                  acct_collect
2        9.6e-04  vmlinux                  add_to_page_cache
2        9.6e-04  vmlinux                  arch_get_unmapped_area
2        9.6e-04  vmlinux                  can_vma_merge_after
2        9.6e-04  vmlinux                  cap_vm_enough_memory
2        9.6e-04  vmlinux                  check_mini_fo_file
2        9.6e-04  vmlinux                  cond_resched
2        9.6e-04  vmlinux                  current_fs_time
2        9.6e-04  vmlinux                  delayed_work_timer_fn
2        9.6e-04  vmlinux                  dentry_open
2        9.6e-04  vmlinux                  dnotify_parent
2        9.6e-04  vmlinux                  do_path_lookup
2        9.6e-04  vmlinux                  down_read
2        9.6e-04  vmlinux                  eth_rx_irq
2        9.6e-04  vmlinux                  fget
2        9.6e-04  vmlinux                  file_free_rcu
2        9.6e-04  vmlinux                  file_ra_state_init
2        9.6e-04  vmlinux                  find_vma_prepare
2        9.6e-04  vmlinux                  free_page_and_swap_cache
2        9.6e-04  vmlinux                  free_pgd_slow
2        9.6e-04  vmlinux                  get_pgd_slow
2        9.6e-04  vmlinux                  getname
2        9.6e-04  vmlinux                  getnstimeofday
2        9.6e-04  vmlinux                  kthread_should_stop
2        9.6e-04  vmlinux                  locks_remove_posix
2        9.6e-04  vmlinux                  may_expand_vm
2        9.6e-04  vmlinux                  mini_fo_d_delete
2        9.6e-04  vmlinux                  mini_fo_d_revalidate
2        9.6e-04  vmlinux                  mini_fo_flush
2        9.6e-04  vmlinux                  mini_fo_read
2        9.6e-04  vmlinux                  mini_fo_readlink
2        9.6e-04  vmlinux                  mntput_no_expire
2        9.6e-04  vmlinux                  move_page_tables
2        9.6e-04  vmlinux                  mprotect_fixup
2        9.6e-04  vmlinux                  mutex_lock
2        9.6e-04  vmlinux                  page_getlink
2        9.6e-04  vmlinux                  page_waitqueue
2        9.6e-04  vmlinux                  permission
2        9.6e-04  vmlinux                  prepare_to_wait_exclusive
2        9.6e-04  vmlinux                  proc_flush_task
2        9.6e-04  vmlinux                  ramfs_get_inode
2        9.6e-04  vmlinux                  rb_insert_color
2        9.6e-04  vmlinux                  read_cache_page
2        9.6e-04  vmlinux                  ret_from_exception
2        9.6e-04  vmlinux                  run_workqueue
2        9.6e-04  vmlinux                  split_vma
2        9.6e-04  vmlinux                  sys_close
2        9.6e-04  vmlinux                  sys_read
2        9.6e-04  vmlinux                  timespec_trunc
2        9.6e-04  vmlinux                  unlink_file_vma
2        9.6e-04  vmlinux                  vfs_read
2        9.6e-04  vmlinux                  vm_stat_account
2        9.6e-04  vmlinux                  worker_thread
2        9.6e-04  vmlinux                  wq_sync_buffer
2        9.6e-04  vmlinux                  write_chan
2        9.6e-04  vmlinux                  xscale_mc_copy_user_page
2        9.6e-04  wlan.ko                  ieee80211_find_rxnode
1        4.8e-04  arptable_filter.ko       .text
1        4.8e-04  vmlinux                  __anon_vma_link
1        4.8e-04  vmlinux                  __down_write
1        4.8e-04  vmlinux                  __down_write_nested
1        4.8e-04  vmlinux                  __fput
1        4.8e-04  vmlinux                  __get_free_pages
1        4.8e-04  vmlinux                  __get_user_4
1        4.8e-04  vmlinux                  __lookup_mnt
1        4.8e-04  vmlinux                  __mark_inode_dirty
1        4.8e-04  vmlinux                  __netdev_alloc_skb
1        4.8e-04  vmlinux                  __strncpy_from_user
1        4.8e-04  vmlinux                  __strnlen_user
1        4.8e-04  vmlinux                  __wake_up_sync
1        4.8e-04  vmlinux                  _clear_bit_be
1        4.8e-04  vmlinux                  acct_stack_growth
1        4.8e-04  vmlinux                  bit_waitqueue
1        4.8e-04  vmlinux                  cap_bprm_secureexec
1        4.8e-04  vmlinux                  compute_creds
1        4.8e-04  vmlinux                  copy_files
1        4.8e-04  vmlinux                  copy_namespaces
1        4.8e-04  vmlinux                  copy_strings
1        4.8e-04  vmlinux                  cp_new_stat
1        4.8e-04  vmlinux                  cp_new_stat64
1        4.8e-04  vmlinux                  current_kernel_time
1        4.8e-04  vmlinux                  d_alloc
1        4.8e-04  vmlinux                  dnotify_flush
1        4.8e-04  vmlinux                  do_generic_mapping_read
1        4.8e-04  vmlinux                  do_translation_fault
1        4.8e-04  vmlinux                  dup_fd
1        4.8e-04  vmlinux                  dupfd
1        4.8e-04  vmlinux                  expand_files
1        4.8e-04  vmlinux                  free_hot_page
1        4.8e-04  vmlinux                  generic_file_mmap
1        4.8e-04  vmlinux                  generic_readlink
1        4.8e-04  vmlinux                  get_empty_filp
1        4.8e-04  vmlinux                  get_signal_to_deliver
1        4.8e-04  vmlinux                  get_task_mm
1        4.8e-04  vmlinux                  get_unused_fd_flags
1        4.8e-04  vmlinux                  get_user_pages
1        4.8e-04  vmlinux                  half_md4_transform
1        4.8e-04  vmlinux                  hrtimer_start
1        4.8e-04  vmlinux                  ip_local_deliver
1        4.8e-04  vmlinux                  iput
1        4.8e-04  vmlinux                  ixp4xx_get_cycles
1        4.8e-04  vmlinux                  ktime_get
1        4.8e-04  vmlinux                  link_path_walk
1        4.8e-04  vmlinux                  lookup_mnt
1        4.8e-04  vmlinux                  lru_cache_add_active
1        4.8e-04  vmlinux                  max_sane_readahead
1        4.8e-04  vmlinux                  may_open
1        4.8e-04  vmlinux                  mini_fo_d_compare
1        4.8e-04  vmlinux                  mini_fo_follow_link
1        4.8e-04  vmlinux                  mini_fo_getattr
1        4.8e-04  vmlinux                  mini_fo_open
1        4.8e-04  vmlinux                  mini_fo_put_link
1        4.8e-04  vmlinux                  mm_init
1        4.8e-04  vmlinux                  mutex_unlock
1        4.8e-04  vmlinux                  new_inode
1        4.8e-04  vmlinux                  notify_change
1        4.8e-04  vmlinux                  open_exec
1        4.8e-04  vmlinux                  open_namei
1        4.8e-04  vmlinux                  page_follow_link_light
1        4.8e-04  vmlinux                  path_walk
1        4.8e-04  vmlinux                  pdflush_operation
1        4.8e-04  vmlinux                  pipe_read
1        4.8e-04  vmlinux                  pipe_read_fasync
1        4.8e-04  vmlinux                  pipe_read_release
1        4.8e-04  vmlinux                  pipe_write_fasync
1        4.8e-04  vmlinux                  prepare_to_wait
1        4.8e-04  vmlinux                  proc_pident_lookup
1        4.8e-04  vmlinux                  process_task_mortuary
1        4.8e-04  vmlinux                  put_pid
1        4.8e-04  vmlinux                  qmgr_disable_irq
1        4.8e-04  vmlinux                  qmgr_enable_irq
1        4.8e-04  vmlinux                  queue_delayed_work_on
1        4.8e-04  vmlinux                  rb_erase
1        4.8e-04  vmlinux                  release_pages
1        4.8e-04  vmlinux                  release_task
1        4.8e-04  vmlinux                  remove_vma
1        4.8e-04  vmlinux                  schedule_delayed_work
1        4.8e-04  vmlinux                  seq_escape
1        4.8e-04  vmlinux                  serial_in
1        4.8e-04  vmlinux                  strlen
1        4.8e-04  vmlinux                  sys_brk
1        4.8e-04  vmlinux                  sys_fcntl64
1        4.8e-04  vmlinux                  sys_ioctl
1        4.8e-04  vmlinux                  sys_mprotect
1        4.8e-04  vmlinux                  sys_munmap
1        4.8e-04  vmlinux                  sysctl_head_next
1        4.8e-04  vmlinux                  tty_ioctl
1        4.8e-04  vmlinux                  uart_start
1        4.8e-04  vmlinux                  unmap_region
1        4.8e-04  vmlinux                  vfs_write
1        4.8e-04  vmlinux                  vma_merge
1        4.8e-04  vmlinux                  vsnprintf
1        4.8e-04  wlan.ko                  ieee80211_chan2ieee
1        4.8e-04  wlan.ko                  ieee80211_ibss_merge
1        4.8e-04  wlan.ko                  ieee80211_iterate_nodes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-21 15:51     ` Fabian Hugelshofer
@ 2008-07-21 15:59       ` Patrick McHardy
  2008-07-21 17:49         ` Fabian Hugelshofer
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-21 15:59 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: Pablo Neira Ayuso, netfilter-devel

Fabian Hugelshofer wrote:
> It took me some time to set up profiling on the router. I am using 
> oprofile. nf_conntrack and nfnetlink are built into the kernel. Most of 
> the time is probably spent in nfnetlink.
> 
> If you need other data than the one provided here, just let me know. I'm 
> not very familiar with oprofile, so providing the needed arguments would 
> be helpful.

Thanks.

> opreport --symbols (first 20, full file attached):
> CPU: ARM/XScale PMU2, speed 0 MHz (estimated)
> Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 
> 0x00 (No unit mask) count 100000
> samples  %        app name                 symbol name
> 20018     9.6520  vmlinux                  memcpy

Callgraph information would be useful since it's unclear whether
this is the memcpy triggered by netlink message trimming in
af_netlink.c or something different. Unfortunately according
to the documentation this is only supported on x86. I think
selecting the netfilter options as modules should provide
slightly more detail though.

> 19493     9.3988  ath_pci.ko               ath_sysctl_register

This looks odd. I couldn't find this function in the current
kernel tree, which version are you using?

> 6676      3.2189  vmlinux                  __nf_conntrack_find
> 6098      2.9402  vmlinux                  ipt_do_table
> 5858      2.8245  wlan.ko                  ieee80211_input
> 5383      2.5955  nf_conntrack_netlink.ko  .text
> 4567      2.2020  vmlinux                  __memzero
> 4225      2.0371  vmlinux                  __kmalloc
> 4091      1.9725  vmlinux                  csum_partial

You can disable conntrack checksumming by executing:

echo 0 >/proc/sys/net/netfilter/nf_conntrack_checksum

> 3469      1.6726  vmlinux                  nf_nat_setup_info
> 3468      1.6721  vmlinux                  netlink_broadcast
> 3376      1.6278  vmlinux                  __hash_conntrack
> 3304      1.5931  vmlinux                  nla_put
> 2992      1.4426  vmlinux                  __nla_reserve
> 2933      1.4142  vmlinux                  __nf_conntrack_confirm
> 2926      1.4108  oprofiled                /jffs/usr/bin/oprofiled
> 2847      1.3727  ath_pci.ko               ath_intr
> 2698      1.3009  vmlinux                  kfree
> 2655      1.2801  vmlinux                  __nla_put
> 2563      1.2358  vmlinux                  nf_conntrack_in
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-21 15:59       ` Patrick McHardy
@ 2008-07-21 17:49         ` Fabian Hugelshofer
  2008-07-23 14:32           ` Fabian Hugelshofer
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-21 17:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Pablo Neira Ayuso, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2100 bytes --]

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
>> opreport --symbols (first 20, full file attached):
>> CPU: ARM/XScale PMU2, speed 0 MHz (estimated)
>> Counted CPU_CYCLES events (clock cycles counter) with a unit mask of 
>> 0x00 (No unit mask) count 100000
>> samples  %        app name                 symbol name
>> 20018     9.6520  vmlinux                  memcpy
> 
> Callgraph information would be useful since it's unclear whether
> this is the memcpy triggered by netlink message trimming in
> af_netlink.c or something different. Unfortunately according
> to the documentation this is only supported on x86. I think
> selecting the netfilter options as modules should provide
> slightly more detail though.

Callgraphs also work on ARM. I had to enable callgraph recording at 
capture time, which I had not done before. I collected a depth of 5 
levels. You will find the result attached to this email. For help on 
interpreting it, read:
http://oprofile.sourceforge.net/doc/opreport.html#opreport-callgraph
It basically shows the same list as before (unindented) with callers on 
top and callees below.

memcpy is mostly invoked by skb_copy and netlink_broadcast (af_netlink). 
netlink_broadcast is expensive on its own and calls pskb_expand_head 
which is expensive as well. Using multipart messages would reduce the 
need to call netlink_broadcast.
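
A minimal user-space sketch of the batching idea, assuming a fixed-size 
hypothetical payload (struct fake_event) and a hypothetical struct batch 
helper; neither exists in ctnetlink. It only shows how several netlink 
messages can share one buffer, each flagged NLM_F_MULTI and the run 
terminated by NLMSG_DONE, so that a single send could carry several 
events:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define BATCH_CAP 256	/* hypothetical batch buffer size */

/* Hypothetical fixed-size event; real ctnetlink events carry attributes. */
struct fake_event {
	unsigned int id;
};

struct batch {
	union {
		struct nlmsghdr align;	/* forces netlink header alignment */
		char bytes[BATCH_CAP];
	} buf;
	size_t used;
};

static void batch_init(struct batch *b)
{
	memset(b, 0, sizeof(*b));
}

/* Append one header plus optional payload; returns -1 if it does not fit. */
static int batch_put(struct batch *b, unsigned short type,
		     const void *data, size_t len)
{
	size_t need = NLMSG_SPACE(len);
	struct nlmsghdr *nlh;

	assert(b->used <= BATCH_CAP);
	if (need > BATCH_CAP - b->used)
		return -1;

	nlh = (struct nlmsghdr *)(b->buf.bytes + b->used);
	nlh->nlmsg_len = NLMSG_LENGTH(len);
	nlh->nlmsg_type = type;
	nlh->nlmsg_flags = NLM_F_MULTI;	/* part of a multipart message */
	nlh->nlmsg_seq = 0;
	nlh->nlmsg_pid = 0;
	if (len)
		memcpy(NLMSG_DATA(nlh), data, len);
	b->used += need;
	return 0;
}

/* Walk the buffer the way a receiver would; count parts before NLMSG_DONE. */
static unsigned int batch_count(struct batch *b)
{
	struct nlmsghdr *nlh = &b->buf.align;
	int len = (int)b->used;
	unsigned int n = 0;

	while (NLMSG_OK(nlh, len) && nlh->nlmsg_type != NLMSG_DONE) {
		n++;
		nlh = NLMSG_NEXT(nlh, len);
	}
	return n;
}
```

A caller would batch_put() one NLMSG_MIN_TYPE message per event, close 
the batch with batch_put(&b, NLMSG_DONE, NULL, 0), and hand the whole 
buffer to one send call. Whether this helps in the kernel depends on how 
much of the cost is per broadcast rather than per byte.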

>> 19493     9.3988  ath_pci.ko               ath_sysctl_register
> 
> This looks odd. I couldn't find this function in the current
> kernel tree, which version are you using?

I am using 2.6.24.2. ath_pci is from the Madwifi drivers. The test setup 
is all wireless.

>> 4091      1.9725  vmlinux                  csum_partial
> 
> You can disable conntrack checksumming by executing:
> 
> echo 0 >/proc/sys/net/netfilter/nf_conntrack_checksum

I don't think that this has a significant impact. As I wrote, it is 
possible to capture all packets (same traffic) with ULOG, parse the 
headers, do lookups and the counting without problems. Checksumming is 
enabled in that case as well. The problems only start when I enable 
capturing of connection events.

[-- Attachment #2: oprep_symb_cg5.txt.tar.gz --]
[-- Type: application/x-gzip, Size: 32505 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-21 17:49         ` Fabian Hugelshofer
@ 2008-07-23 14:32           ` Fabian Hugelshofer
  2008-07-23 14:38             ` Patrick McHardy
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-23 14:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Patrick McHardy, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> Callgraph information would be useful since it's unclear whether
>> this is the memcpy triggered by netlink message trimming in
>> af_netlink.c or something different. Unfortunately according
>> to the documentation this is only supported on x86. I think
>> selecting the netfilter options as modules should provide
>> slightly more detail though.
[...]
> 
> memcpy is mostly invoked by skb_copy and netlink_broadcast (af_netlink). 
> netlink_broadcast is expensive on its own and calls pskb_expand_head 
> which is expensive as well. Using multipart messages would reduce the 
> need to call netlink_broadcast.

I profiled again with nfnetlink and nf_conntrack compiled as modules:
    103599 61.1842 vmlinux
     24481 14.4582 ath_pci
     19232 11.3582 nf_conntrack
     10435  6.1628 wlan
      3588  2.1190 nf_conntrack_netlink
      2869  1.6944 oprofiled
      1886  1.1138 nf_conntrack_ipv4
      1447  0.8546 ath_rate_minstrel
       627  0.3703 nfnetlink
       237  0.1400 ld-uClibc-0.9.29.so
       233  0.1376 libuClibc-0.9.29.so
       183  0.1081 iptable_raw
       174  0.1028 ctevtest
       147  0.0868 busybox
        85  0.0502 libnfnetlink.so.0.2.0
        60  0.0354 libnetfilter_conntrack.so.1.2.0
        38  0.0224 arp_tables
         2  0.0012 arptable_filter

Again most of the time is spent in the kernel. Memory and skb operations 
are accounted there. I suspect that they cause the most overhead.

Do you plan to dig deeper into optimising the non-optimal parts? I 
consider myself not to have enough understanding to do it myself.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 14:32           ` Fabian Hugelshofer
@ 2008-07-23 14:38             ` Patrick McHardy
  2008-07-23 16:12               ` Fabian Hugelshofer
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 14:38 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

[-- Attachment #1: Type: text/plain, Size: 1884 bytes --]

Fabian Hugelshofer wrote:
> Fabian Hugelshofer wrote:
>> Patrick McHardy wrote:
>>> Callgraph information would be useful since it's unclear whether
>>> this is the memcpy triggered by netlink message trimming in
>>> af_netlink.c or something different. Unfortunately according
>>> to the documentation this is only supported on x86. I think
>>> selecting the netfilter options as modules should provide
>>> slightly more detail though.
> [...]
>>
>> memcpy is mostly invoked by skb_copy and netlink_broadcast 
>> (af_netlink). netlink_broadcast is expensive on its own and calls 
>> pskb_expand_head which is expensive as well. Using multipart messages 
>> would reduce the need to call netlink_broadcast.
> 
> I profiled again with nfnetlink and nf_conntrack compiled as modules:
>    103599 61.1842 vmlinux
>     24481 14.4582 ath_pci
>     19232 11.3582 nf_conntrack
>     10435  6.1628 wlan
>      3588  2.1190 nf_conntrack_netlink
>      2869  1.6944 oprofiled
>      1886  1.1138 nf_conntrack_ipv4
>      1447  0.8546 ath_rate_minstrel
>       627  0.3703 nfnetlink
>       237  0.1400 ld-uClibc-0.9.29.so
>       233  0.1376 libuClibc-0.9.29.so
>       183  0.1081 iptable_raw
>       174  0.1028 ctevtest
>       147  0.0868 busybox
>        85  0.0502 libnfnetlink.so.0.2.0
>        60  0.0354 libnetfilter_conntrack.so.1.2.0
>        38  0.0224 arp_tables
>         2  0.0012 arptable_filter
> 
> Again most of the time is spent in the kernel. Memory and skb operations 
> are accounted there. I suspect that they cause the most overhead.
> 
> Do you plan to dig deeper into optimising the non-optimal parts? I 
> consider myself not to have enough understanding to do it myself.

The first thing to try would be to use sane allocation sizes
for the event messages. This patch doesn't implement it properly
(uses probing), but should be enough to test whether it helps.
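
The probing approach (start from a small size, double it whenever the 
message does not fit, remember the size that worked) could be sketched in 
user space roughly as follows. build_event() is a hypothetical stand-in 
for the message construction, and the EVENT_SIZE_MAX bound is an addition 
of this sketch, not something taken from the attached patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define EVENT_SIZE_MAX 4096u	/* hypothetical upper bound for probing */

/* Learned allocation size, playing the role of a static size variable. */
static size_t size_hint = 128;

/* Hypothetical stand-in for message construction: fails if cap is too small. */
static int build_event(char *buf, size_t cap, const char *payload, size_t len)
{
	if (len > cap)
		return -1;
	memcpy(buf, payload, len);
	return 0;
}

/*
 * Allocate a buffer for one event, doubling the size on failure.
 * Returns a malloc'd buffer on success, NULL if the allocation fails
 * or the payload does not fit within EVENT_SIZE_MAX.
 */
static char *alloc_event(const char *payload, size_t len)
{
	size_t size = size_hint;

	assert(size > 0 && size <= EVENT_SIZE_MAX);
	for (;;) {
		char *buf = malloc(size);

		if (!buf)
			return NULL;
		if (build_event(buf, size, payload, len) == 0) {
			size_hint = size;	/* remember what worked */
			return buf;
		}
		free(buf);
		if (size >= EVENT_SIZE_MAX)
			return NULL;	/* give up instead of looping forever */
		size <<= 1;
	}
}
```

The hint only ever grows, so after the first few events every allocation 
should succeed on the first attempt. A proper implementation would 
presumably compute the required size up front instead of probing.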





[-- Attachment #2: x --]
[-- Type: text/plain, Size: 991 bytes --]

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 105a616..0aa1b30 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -425,6 +425,7 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
 	unsigned int type;
 	sk_buff_data_t b;
 	unsigned int flags = 0, group;
+	static unsigned int size = 128;
 
 	/* ignore our fake conntrack entry */
 	if (ct == &nf_conntrack_untracked)
@@ -446,7 +447,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
 	if (!nfnetlink_has_listeners(group))
 		return NOTIFY_DONE;
 
-	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+retry:
+	skb = alloc_skb(size, GFP_ATOMIC);
 	if (!skb)
 		return NOTIFY_DONE;
 
@@ -525,7 +527,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
 nlmsg_failure:
 nla_put_failure:
 	kfree_skb(skb);
-	return NOTIFY_DONE;
+	size <<= 1;
+	goto retry;
 }
 #endif /* CONFIG_NF_CONNTRACK_EVENTS */
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 14:38             ` Patrick McHardy
@ 2008-07-23 16:12               ` Fabian Hugelshofer
  2008-07-23 17:01                 ` Patrick McHardy
  2008-07-25  8:44                 ` Fabian Hugelshofer
  0 siblings, 2 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-23 16:12 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, Pablo Neira Ayuso

[-- Attachment #1: Type: text/plain, Size: 2286 bytes --]

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
>> Again most of the time is spent in the kernel. Memory and skb 
>> operations are accounted there. I suspect that they cause the most 
>> overhead.
>>
>> Do you plan to dig deeper into optimising the non-optimal parts? I 
>> consider myself not to have enough understanding to do it myself.
> 
> The first thing to try would be to use sane allocation sizes
> for the event messages. This patch doesn't implement it properly
> (uses probing), but should be enough to test whether it helps.

Thanks a lot. This patch already decreased the CPU usage for ctevtest 
from 85% to 44%. Sweet...

I created a new callgraph profile which you find attached to this mail. 
Let's have a look at two parts:

First:
2055      2.7205    ctnetlink_conntrack_event
   2378     21.6201    nla_put
   2181     19.8291    nfnetlink_send
   2055     18.6835    ctnetlink_conntrack_event [self]
   1250     11.3647    __alloc_skb
   955       8.6826    ipv4_tuple_to_nlattr
   752       6.8370    nf_ct_port_tuple_to_nlattr
   321       2.9184    __memzero
   220       2.0002    nfnetlink_has_listeners
   177       1.6092    nf_ct_l4proto_find_get
   155       1.4092    __nla_put
   116       1.0546    nf_ct_l3proto_find_get
   82        0.7455    module_put
   70        0.6364    nf_ct_l4proto_put
   66        0.6001    nf_ct_l3proto_put
   60        0.5455    nlmsg_notify
   43        0.3909    netlink_has_listeners
   42        0.3819    __kmalloc
   37        0.3364    kmem_cache_alloc
   26        0.2364    __nf_ct_l4proto_find
   13        0.1182    __irq_svc

ctnetlink_conntrack_event is now one of the first functions listed. Do 
you see other ways of improving performance?

Second:
   33        2.4775    __nf_ct_ext_add
   63        4.7297    dev_hard_start_xmit
   65        4.8799    sock_recvmsg
   77        5.7808    netif_receive_skb
   92        6.9069    __nla_put
   96        7.2072    nf_conntrack_alloc
   199      14.9399    nf_conntrack_in
   246      18.4685    skb_copy
   427      32.0571    nf_ct_invert_tuplepr
1793      2.3737    __memzero
   1793     100.000    __memzero [self]

Is the zeroing of the inverted tuple in nf_ct_invert_tuple really 
required? As far as I can see all fields are set by the subsequent code.

[-- Attachment #2: opreport_cg_patch.tar.gz --]
[-- Type: application/x-gzip, Size: 34247 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 16:12               ` Fabian Hugelshofer
@ 2008-07-23 17:01                 ` Patrick McHardy
  2008-07-23 17:07                   ` Patrick McHardy
  2008-07-23 17:15                   ` Fabian Hugelshofer
  2008-07-25  8:44                 ` Fabian Hugelshofer
  1 sibling, 2 replies; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 17:01 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> Fabian Hugelshofer wrote:
>>> Again most of the time is spent in the kernel. Memory and skb 
>>> operations are accounted there. I suspect that they cause the most 
>>> overhead.
>>>
>>> Do you plan to dig deeper into optimising the non-optimal parts? I 
>>> consider myself not to have enough understanding to do it myself.
>>
>> The first thing to try would be to use sane allocation sizes
>> for the event messages. This patch doesn't implement it properly
>> (uses probing), but should be enough to test whether it helps.
> 
> Thanks a lot. This patch already decreased the CPU usage for ctevtest 
> from 85% to 44%. Sweet...

Nice. Now we just need to do it properly :)

> I created a new callgraph profile which you find attached to this mail. 
> Let's have a look at two parts:
> 
> First:
> 2055      2.7205    ctnetlink_conntrack_event
>   2378     21.6201    nla_put
>   2181     19.8291    nfnetlink_send
>   2055     18.6835    ctnetlink_conntrack_event [self]
>   1250     11.3647    __alloc_skb
>   955       8.6826    ipv4_tuple_to_nlattr
>   752       6.8370    nf_ct_port_tuple_to_nlattr
>   321       2.9184    __memzero
>   220       2.0002    nfnetlink_has_listeners
>   177       1.6092    nf_ct_l4proto_find_get
>   155       1.4092    __nla_put
>   116       1.0546    nf_ct_l3proto_find_get
>   82        0.7455    module_put
>   70        0.6364    nf_ct_l4proto_put
>   66        0.6001    nf_ct_l3proto_put
>   60        0.5455    nlmsg_notify
>   43        0.3909    netlink_has_listeners
>   42        0.3819    __kmalloc
>   37        0.3364    kmem_cache_alloc
>   26        0.2364    __nf_ct_l4proto_find
>   13        0.1182    __irq_svc
> 
> nf_conntrack_event is now one of the first functions listed. Do you see 
> other ways of improving performance?

For some members, doing in-place message construction instead of
copying the data might help, but I could only spot a few, and those
are rarely used.

The module reference stuff (module_put/nf_ct_*_find_get etc)
is clearly superfluous, this runs in packet processing context
and shouldn't use module references but RCU.


> Second:
>   33        2.4775    __nf_ct_ext_add
>   63        4.7297    dev_hard_start_xmit
>   65        4.8799    sock_recvmsg
>   77        5.7808    netif_receive_skb
>   92        6.9069    __nla_put
>   96        7.2072    nf_conntrack_alloc
>   199      14.9399    nf_conntrack_in
>   246      18.4685    skb_copy
>   427      32.0571    nf_ct_invert_tuplepr
> 1793      2.3737    __memzero
>   1793     100.000    __memzero [self]
> 
> Is the zeroing of the inverted tuple in nf_ct_invert_tuple really 
> required? As far as I can see all fields are set by the subsequent code.

It depends on the protocol family. For IPv6 it's completely
unnecessary, for IPv4 the last 12 bytes of each address need
to be zeroes. We could push this down to the protocols to
behave more optimally (actually something I started and didn't
finish some time ago).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:01                 ` Patrick McHardy
@ 2008-07-23 17:07                   ` Patrick McHardy
  2008-07-23 17:30                     ` Fabian Hugelshofer
  2008-07-23 17:15                   ` Fabian Hugelshofer
  1 sibling, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 17:07 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
>> Is the zeroing of the inverted tuple in nf_ct_invert_tuple really 
>> required? As far as I can see all fields are set by the subsequent code.
> 
> It depends on the protocol family. For IPv6 it's completely
> unnecessary, for IPv4 the last 12 bytes of each address need
> to be zeroes. We could push this down to the protocols to
> behave more optimally (actually something I started and didn't
> finish some time ago).

Actually that really is necessary because we have padding in the
tuple on at least ARM.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:07                   ` Patrick McHardy
@ 2008-07-23 17:30                     ` Fabian Hugelshofer
  2008-07-23 17:32                       ` Patrick McHardy
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-23 17:30 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, Pablo Neira Ayuso

Patrick McHardy wrote:
> Patrick McHardy wrote:
>> Fabian Hugelshofer wrote:
>>> Is the zeroing of the inverted tuple in nf_ct_invert_tuple really 
>>> required? As far as I can see all fields are set by the subsequent code.
>>
>> It depends on the protocol family. For IPv6 it's completely
>> unnecessary, for IPv4 the last 12 bytes of each address need
>> to be zeroes. We could push this down to the protocols to
>> behave more optimally (actually something I started and didn't
>> finish some time ago).
> 
> Actually that really is necessary because we have padding in the
> tuple on at least ARM.

As you write the remaining 12 bytes for IPv4 can easily be handled in 
the l3protocol. The original tuple is already initialised properly and 
one would just have to replace things like
tuple->src.u3.ip = orig->dst.u3.ip;
with
tuple->src.u3.all = orig->dst.u3.all;
in nf_conntrack_l3proto_ipv4.c.
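
To make the difference between the two copies concrete, here is a
minimal userspace sketch, not kernel code. The union `addr` is a
hypothetical stand-in for the address union in the tuple, assuming it
is 16 bytes wide with the IPv4 address overlaying the first 4 bytes.
One caveat: if `all` is an array member (as I believe it is in the
kernel's union), a plain `.all = .all` assignment will not compile, so
the sketch copies the whole union instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the tuple's address union (an assumption):
 * 16 bytes wide, IPv4 address overlaying the first 4 bytes. */
union addr {
	uint32_t all[4];
	uint32_t ip;
};

static_assert(sizeof(union addr) == 16, "sketch assumes a 16-byte union");

/* Copies only the IPv4 part; the other 12 bytes of dst keep their
 * previous contents, whatever those were. */
static void copy_ip_only(union addr *dst, const union addr *src)
{
	dst->ip = src->ip;
}

/* Copies the whole union, so a zeroed tail in src stays zero in dst. */
static void copy_whole(union addr *dst, const union addr *src)
{
	*dst = *src;
}

/* Returns 1 if the 12 bytes following the IPv4 address are all zero. */
static int tail_is_zero(const union addr *a)
{
	static const uint32_t zero[3];

	return memcmp(&a->all[1], zero, sizeof(zero)) == 0;
}
```

With a properly initialised source, copy_whole() leaves the destination
tail zeroed without a separate memset, while copy_ip_only() depends on
the destination having been cleared beforehand.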

Why do you think the padding causes problems? For hashing e.g. src/dst 
.u3 and .u are referenced independently of potential padding. Where does 
access to the padding data occur?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:30                     ` Fabian Hugelshofer
@ 2008-07-23 17:32                       ` Patrick McHardy
  2008-07-23 17:38                         ` Fabian Hugelshofer
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 17:32 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> Patrick McHardy wrote:
>>> Fabian Hugelshofer wrote:
>>>> Is the zeroing of the inverted tuple in nf_ct_invert_tuple really 
>>>> required? As far as I can see all fields are set by the subsequent 
>>>> code.
>>>
>>> It depends on the protocol family. For IPv6 it's completely
>>> unnecessary, for IPv4 the last 12 bytes of each address need
>>> to be zeroes. We could push this down to the protocols to
>>> behave more optimally (actually something I started and didn't
>>> finish some time ago).
>>
>> Actually that really is necessary because we have padding in the
>> tuple on at least ARM.
> 
> As you write the remaining 12 bytes for IPv4 can easily be handled in 
> the l3protocol. The original tuple is already initialised properly and 
> one would just have to replace things like
> tuple->src.u3.ip = orig->dst.u3.ip;
> with
> tuple->src.u3.all = orig->dst.u3.all;
> in nf_conntrack_l3proto_ipv4.c.

Correct.

> Why do you think the padding causes problems? For hashing e.g. src/dst 
> .u3 and .u are referenced independently of potential padding. Where does 
> access to the padding data occur?

The hash function hashes the entire tuple in one go.
The padding needs to have deterministic content for
that.
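
As an illustration of why the padding matters, here is a small
userspace sketch, not the kernel's code. The struct `demo_tuple` and
the FNV-1a hash are assumptions chosen for the example; the kernel uses
its own tuple layout and hash function. The point is only that a hash
over the raw bytes of a struct also covers the padding bytes, so those
bytes must be set to a known value before the fields are filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuple-like struct. On common ABIs the compiler inserts
 * padding after 'proto' and after 'port'. */
struct demo_tuple {
	uint8_t  proto;
	uint32_t addr;
	uint16_t port;
};

/* FNV-1a over raw bytes, standing in for the real hash function. */
static uint32_t hash_bytes(const void *p, size_t len)
{
	const unsigned char *b = p;
	uint32_t h = 2166136261u;
	size_t i;

	assert(p != NULL);
	for (i = 0; i < len; i++) {
		h ^= b[i];
		h *= 16777619u;
	}
	return h;
}

/* Zero the whole object first so the padding is deterministic, then
 * set the fields. */
static void fill_zeroed(struct demo_tuple *t, uint8_t proto,
			uint32_t addr, uint16_t port)
{
	memset(t, 0, sizeof(*t));
	t->proto = proto;
	t->addr = addr;
	t->port = port;
}

/* Hash the entire struct in one go, padding included. */
static uint32_t tuple_hash(const struct demo_tuple *t)
{
	return hash_bytes(t, sizeof(*t));
}
```

Two tuples built with fill_zeroed() from the same field values hash
identically even if the memory they live in held different garbage
before. Without the memset, equal tuples could land in different hash
buckets.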

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:32                       ` Patrick McHardy
@ 2008-07-23 17:38                         ` Fabian Hugelshofer
  2008-07-23 17:40                           ` Patrick McHardy
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-23 17:38 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, Pablo Neira Ayuso

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
>> Why do you think the padding causes problems? For hashing e.g. src/dst 
>> .u3 and .u are referenced independently of potential padding. Where 
>> does access to the padding data occur?
> 
> The hash function hashes the entire tuple in one go.
> The padding needs to have deterministic content for
> that.

Ok, I see. This has been changed since 2.6.24. Makes sense...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:38                         ` Fabian Hugelshofer
@ 2008-07-23 17:40                           ` Patrick McHardy
  0 siblings, 0 replies; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 17:40 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> Fabian Hugelshofer wrote:
>>> Why do you think the padding causes problems? For hashing e.g. 
>>> src/dst .u3 and .u are referenced independently of potential padding. 
>>> Where does access to the padding data occur?
>>
>> The hash function hashes the entire tuple in one go.
>> The padding needs to have deterministic content for
>> that.
> 
> Ok, I see. This has been changed since 2.6.24. Makes sense...

Indeed. That part might also help speed things up for you a bit.
But that would also need this fix:

commit 443a70d50bdc212e1292778e264ce3d0a85b896f
Author: Philip Craig <philipc@snapgear.com>
Date:   Tue Apr 29 03:35:10 2008 -0700

     netfilter: nf_conntrack: padding breaks conntrack hash on ARM

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:01                 ` Patrick McHardy
  2008-07-23 17:07                   ` Patrick McHardy
@ 2008-07-23 17:15                   ` Fabian Hugelshofer
  2008-07-23 17:20                     ` Patrick McHardy
  1 sibling, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-23 17:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, Pablo Neira Ayuso

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
>> Patrick McHardy wrote:
>>> The first thing to try would be to use sane allocation sizes
>>> for the event messages. This patch doesn't implement it properly
>>> (uses probing), but should be enough to test whether it helps.
>>
>> Thanks a lot. This patch already decreased the CPU usage for ctevtest 
>> from 85% to 44%. Sweet...
> 
> Nice. Now we just need to do it properly :)

What do you mean by properly? Put some kind of cap? Or eliminate the 
guessing by allocating a reasonably small fixed amount?

>> nf_conntrack_event is now one of the first functions listed. Do you 
>> see other ways of improving performance?
> 
> For some members, doing in-place message construction instead of
> copying the data might help, but I could only spot a few, and those
> are rarely used.
> 
> The module reference stuff (module_put/nf_ct_*_find_get etc)
> is clearly superfluous, this runs in packet processing context
> and shouldn't use module references but RCU.

This goes too deep for me to be able to help you with it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:15                   ` Fabian Hugelshofer
@ 2008-07-23 17:20                     ` Patrick McHardy
  2008-07-24 13:21                       ` Fabian Hugelshofer
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick McHardy @ 2008-07-23 17:20 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: netfilter-devel, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> Fabian Hugelshofer wrote:
>>> Patrick McHardy wrote:
>>>> The first thing to try would be to use sane allocation sizes
>>>> for the event messages. This patch doesn't implement it properly
>>>> (uses probing), but should be enough to test whether it helps.
>>>
>>> Thanks a lot. This patch already decreased the CPU usage for ctevtest 
>>> from 85% to 44%. Sweet...
>>
>> Nice. Now we just need to do it properly :)
> 
> What do you mean by properly? Put some kind of cap? Or eliminate the 
> guessing by allocating a reasonably small fixed amount?

Yes, ideally it should use exact allocations, but that's probably
not worth it because it depends on too many factors. The next
best thing would be to allocate exactly the maximum that's needed.
The <<= 1 in my patch might still cause reallocations, some (or
all) of them could be avoided.
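
The trade-off between doubling and allocating the maximum up front can
be sketched in userspace as follows. This is an illustration only, not
the kernel's skb handling: `struct msgbuf`, msgbuf_put() and
build_event() are hypothetical names, and the attribute count and size
are made-up parameters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable message buffer that counts reallocations. */
struct msgbuf {
	unsigned char *data;
	size_t len, cap;
	unsigned int reallocs;
};

/* Append n bytes, doubling the capacity until the data fits. */
static int msgbuf_put(struct msgbuf *m, const void *attr, size_t n)
{
	if (m->len + n > m->cap) {
		size_t cap = m->cap;
		unsigned char *p;

		while (m->len + n > cap)
			cap <<= 1;
		p = realloc(m->data, cap);
		if (!p)
			return -1;
		m->data = p;
		m->cap = cap;
		m->reallocs++;
	}
	memcpy(m->data + m->len, attr, n);
	m->len += n;
	return 0;
}

/* Build one "event" of nattrs attributes of attr_size bytes each,
 * starting from initial_cap (must be > 0). Returns the number of
 * reallocations that were needed, or -1 on allocation failure. */
static int build_event(size_t initial_cap, size_t nattrs, size_t attr_size)
{
	unsigned char attr[64] = { 0 };
	struct msgbuf m = { NULL, 0, initial_cap, 0 };
	int ret = -1;
	size_t i;

	assert(initial_cap > 0 && attr_size <= sizeof(attr));
	m.data = malloc(initial_cap);
	if (!m.data)
		return -1;
	for (i = 0; i < nattrs; i++)
		if (msgbuf_put(&m, attr, attr_size) < 0)
			goto out;
	ret = (int)m.reallocs;
out:
	free(m.data);
	return ret;
}
```

Calling build_event() with an initial capacity equal to nattrs *
attr_size never reallocates, whereas a small starting size pays for
several grow-and-copy rounds per event.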

>>> nf_conntrack_event is now one of the first functions listed. Do you 
>>> see other ways of improving performance?
>>
>> For some members, doing in-place message construction instead of
>> copying the data might help, but I could only spot a few, and those
>> are rarely used.
>>
>> The module reference stuff (module_put/nf_ct_*_find_get etc)
>> is clearly superfluous, this runs in packet processing context
>> and shouldn't use module references but RCU.
> 
> This goes too deep for me to be able to help you with it.

It's not really hard: all the nf_ct_*_find_get functions on the event
sending path should be replaced by the corresponding __nf_ct_*_find
functions and the _put functions removed. In case dumping functions
are used by both the event handler and userspace triggered dumps,
the userspace path needs to call rcu_read_lock/rcu_read_unlock.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 17:20                     ` Patrick McHardy
@ 2008-07-24 13:21                       ` Fabian Hugelshofer
  2008-07-25  8:51                         ` Fabian Hugelshofer
  2008-07-25  9:32                         ` Pablo Neira Ayuso
  0 siblings, 2 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-24 13:21 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, Pablo Neira Ayuso

Patrick McHardy wrote:
> Fabian Hugelshofer wrote:
> > Patrick McHardy wrote:
> >> The module reference stuff (module_put/nf_ct_*_find_get etc)
> >> is clearly superfluous, this runs in packet processing context
> >> and shouldn't use module references but RCU.
> > 
> > This goes too deep for me to be able to help you with it.
> 
> It's not really hard: all the nf_ct_*_find_get functions on the event
> sending path should be replaced by the corresponding __nf_ct_*_find
> functions and the _put functions removed. In case dumping functions
> are used by both the event handler and userspace triggered dumps,
> the userspace path needs to call rcu_read_lock/rcu_read_unlock.

Ok, I did what you said. You'll find the patch below. Is this what you
meant? If yes, I'll give it a try...

I only touched the parts from the event notification chain. Should I
leave the other find_gets as they are?

Is it ok to call rcu_read_lock() multiple times in a row? In
ctnetlink_fill_info e.g. I lock and then ctnetlink_dump_helpinfo locks
again. Should I move the second lock out of dump_helpinfo?

Generally, I have the impression that moving the locks out of some dump
functions makes the whole thing less uniform. Is this acceptable?


diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 95a7967..b8650e5 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -104,16 +104,14 @@ ctnetlink_dump_tuples(struct sk_buff *skb,
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
 
-	l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
+	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
 	ret = ctnetlink_dump_tuples_ip(skb, tuple, l3proto);
-	nf_ct_l3proto_put(l3proto);
 
 	if (unlikely(ret < 0))
 		return ret;
 
-	l4proto = nf_ct_l4proto_find_get(tuple->src.l3num, tuple->dst.protonum);
+	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
 	ret = ctnetlink_dump_tuples_proto(skb, tuple, l4proto);
-	nf_ct_l4proto_put(l4proto);
 
 	return ret;
 }
@@ -150,9 +148,8 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 	struct nlattr *nest_proto;
 	int ret;
 
-	l4proto = nf_ct_l4proto_find_get(nf_ct_l3num(ct), nf_ct_protonum(ct));
+	l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
 	if (!l4proto->to_nlattr) {
-		nf_ct_l4proto_put(l4proto);
 		return 0;
 	}
 
@@ -162,14 +159,11 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 
 	ret = l4proto->to_nlattr(skb, nest_proto, ct);
 
-	nf_ct_l4proto_put(l4proto);
-
 	nla_nest_end(skb, nest_proto);
 
 	return ret;
 
 nla_put_failure:
-	nf_ct_l4proto_put(l4proto);
 	return -1;
 }
 
@@ -384,8 +378,11 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	nest_parms = nla_nest_start(skb, CTA_TUPLE_REPLY | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
-	if (ctnetlink_dump_tuples(skb, tuple(ct, IP_CT_DIR_REPLY)) < 0)
+	rcu_read_lock();
+	if (ctnetlink_dump_tuples(skb, tuple(ct, IP_CT_DIR_REPLY)) < 0) {
+		rcu_read_unlock();
 		goto nla_put_failure;
+	}
 	nla_nest_end(skb, nest_parms);
 
 	if (ctnetlink_dump_status(skb, ct) < 0 ||
@@ -399,8 +396,11 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	    ctnetlink_dump_id(skb, ct) < 0 ||
 	    ctnetlink_dump_use(skb, ct) < 0 ||
 	    ctnetlink_dump_master(skb, ct) < 0 ||
-	    ctnetlink_dump_nat_seq_adj(skb, ct) < 0)
+	    ctnetlink_dump_nat_seq_adj(skb, ct) < 0) {
+		rcu_read_unlock();
 		goto nla_put_failure;
+	}
+	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
 	return skb->len;
@@ -1288,8 +1288,12 @@ ctnetlink_exp_dump_tuple(struct sk_buff *skb,
 	nest_parms = nla_nest_start(skb, type | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
-	if (ctnetlink_dump_tuples(skb, tuple) < 0)
+	rcu_read_lock();
+	if (ctnetlink_dump_tuples(skb, tuple) < 0) {
+		rcu_read_unlock();
 		goto nla_put_failure;
+	}
+	rcu_read_unlock();
 	nla_nest_end(skb, nest_parms);
 
 	return 0;



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-24 13:21                       ` Fabian Hugelshofer
@ 2008-07-25  8:51                         ` Fabian Hugelshofer
  2008-07-25  9:32                         ` Pablo Neira Ayuso
  1 sibling, 0 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-25  8:51 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Patrick McHardy, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
>>> Patrick McHardy wrote:
>>>> The module reference stuff (module_put/nf_ct_*_find_get etc)
>>>> is clearly superfluous, this runs in packet processing context
>>>> and shouldn't use module references but RCU.
> 
> Ok, I did what you said. You'll find the patch below. Is this what you
> meant? If yes, I'll give it a try...


I applied the patch and ran the test. Removing the unnecessary module 
reference counting reduced the CPU usage by one percent (packet rates 
verified to be comparable this time).

It's up to you to decide if making the code less nice (IMHO) is worth 
saving 1% CPU usage under rare circumstances.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-24 13:21                       ` Fabian Hugelshofer
  2008-07-25  8:51                         ` Fabian Hugelshofer
@ 2008-07-25  9:32                         ` Pablo Neira Ayuso
  2008-07-25 11:15                           ` Pablo Neira Ayuso
  1 sibling, 1 reply; 30+ messages in thread
From: Pablo Neira Ayuso @ 2008-07-25  9:32 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: Patrick McHardy, netfilter-devel

Fabian Hugelshofer wrote:
> @@ -384,8 +378,11 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
>  	nest_parms = nla_nest_start(skb, CTA_TUPLE_REPLY | NLA_F_NESTED);
>  	if (!nest_parms)
>  		goto nla_put_failure;
> -	if (ctnetlink_dump_tuples(skb, tuple(ct, IP_CT_DIR_REPLY)) < 0)
> +	rcu_read_lock();
> +	if (ctnetlink_dump_tuples(skb, tuple(ct, IP_CT_DIR_REPLY)) < 0) {
> +		rcu_read_unlock();
>  		goto nla_put_failure;
                        ^^^
Would it look nicer if you add a new label 'nla_put_failure_unlock'?

nla_put_failure_unlock:
        rcu_read_unlock();
nlmsg_failure:
nla_put_failure:
        nlmsg_trim(skb, b);
        return -1;

Or much simpler, just call rcu_read_lock() before the first
nla_nest_start() and unlock on the failure path, so that this results
in a much smaller patch:

nlmsg_failure:
nla_put_failure:
        rcu_read_unlock(); <---
        nlmsg_trim(skb, b);
        return -1;

BTW, please, add a short description to your patches and the
'Signed-off-by' field.

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-25  9:32                         ` Pablo Neira Ayuso
@ 2008-07-25 11:15                           ` Pablo Neira Ayuso
  2008-07-27 17:23                             ` Fabian Hugelshofer
  2008-07-28 18:31                             ` Pablo Neira Ayuso
  0 siblings, 2 replies; 30+ messages in thread
From: Pablo Neira Ayuso @ 2008-07-25 11:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Fabian Hugelshofer, Patrick McHardy, netfilter-devel

Pablo Neira Ayuso wrote:
> Or much simpler, just call rcu_read_lock() before the first
> nla_nest_start() and unlock on the failure path, so that this results
> in a much smaller patch:
> 
> nlmsg_failure:
> nla_put_failure:
>         rcu_read_unlock(); <---
>         nlmsg_trim(skb, b);
>         return -1;

As said, if you do this in ctnetlink_conntrack_event, I think that you
can also remove the rcu_read_lock in ctnetlink_dump_helpinfo, as then
all dump functions will be invoked under rcu_read_lock.

In ctnetlink_get_conntrack, I think that ctnetlink_fill_info needs to be
protected with rcu_read_lock.

BTW, why do we need such a big read-side critical section in
ctnetlink_create_conntrack? I think that we only need it for the helper
assignment, right?

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-25 11:15                           ` Pablo Neira Ayuso
@ 2008-07-27 17:23                             ` Fabian Hugelshofer
  2008-07-28 18:31                             ` Pablo Neira Ayuso
  1 sibling, 0 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-27 17:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Patrick McHardy, netfilter-devel

Pablo Neira Ayuso wrote:
> Pablo Neira Ayuso wrote:
> > Or much simpler, just call rcu_read_lock() before the first
> > nla_nest_start() and unlock on the failure path, so that this results
> > in a much smaller patch:
> > 
> > nlmsg_failure:
> > nla_put_failure:
> >         rcu_read_unlock(); <---
> >         nlmsg_trim(skb, b);
> >         return -1;
> 
> As said, if you do this in ctnetlink_conntrack_event, I think that you
> can also remove the rcu_read_lock in ctnetlink_dump_helpinfo, as then
> all dump functions will be invoked under rcu_read_lock.

I modified the patch with your suggestions. Read locks have been removed
from all dump functions.

> In ctnetlink_get_conntrack, I think that ctnetlink_fill_info needs to be
> protected with rcu_read_lock.

ctnetlink_get_conntrack now contains a read lock to protect the dump
functions. Protecting it outside is therefore not necessary.


nf_ctnetlink: Remove read locks from dump functions to increase
performance in the event notification path

Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch>

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 95a7967..696649e 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -104,16 +104,14 @@ ctnetlink_dump_tuples(struct sk_buff *skb,
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
 
-	l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
+	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
 	ret = ctnetlink_dump_tuples_ip(skb, tuple, l3proto);
-	nf_ct_l3proto_put(l3proto);
 
 	if (unlikely(ret < 0))
 		return ret;
 
-	l4proto = nf_ct_l4proto_find_get(tuple->src.l3num, tuple->dst.protonum);
+	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
 	ret = ctnetlink_dump_tuples_proto(skb, tuple, l4proto);
-	nf_ct_l4proto_put(l4proto);
 
 	return ret;
 }
@@ -150,9 +148,8 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 	struct nlattr *nest_proto;
 	int ret;
 
-	l4proto = nf_ct_l4proto_find_get(nf_ct_l3num(ct), nf_ct_protonum(ct));
+	l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
 	if (!l4proto->to_nlattr) {
-		nf_ct_l4proto_put(l4proto);
 		return 0;
 	}
 
@@ -162,14 +159,11 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 
 	ret = l4proto->to_nlattr(skb, nest_proto, ct);
 
-	nf_ct_l4proto_put(l4proto);
-
 	nla_nest_end(skb, nest_proto);
 
 	return ret;
 
 nla_put_failure:
-	nf_ct_l4proto_put(l4proto);
 	return -1;
 }
 
@@ -183,7 +177,6 @@ ctnetlink_dump_helpinfo(struct sk_buff *skb, const struct nf_conn *ct)
 	if (!help)
 		return 0;
 
-	rcu_read_lock();
 	helper = rcu_dereference(help->helper);
 	if (!helper)
 		goto out;
@@ -198,11 +191,9 @@ ctnetlink_dump_helpinfo(struct sk_buff *skb, const struct nf_conn *ct)
 
 	nla_nest_end(skb, nest_helper);
 out:
-	rcu_read_unlock();
 	return 0;
 
 nla_put_failure:
-	rcu_read_unlock();
 	return -1;
 }
 
@@ -374,6 +365,8 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	nfmsg->version      = NFNETLINK_V0;
 	nfmsg->res_id	    = 0;
 
+	rcu_read_lock();
+
 	nest_parms = nla_nest_start(skb, CTA_TUPLE_ORIG | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
@@ -402,11 +395,14 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	    ctnetlink_dump_nat_seq_adj(skb, ct) < 0)
 		goto nla_put_failure;
 
+	rcu_read_unlock();
+
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
 	return skb->len;
 
 nlmsg_failure:
 nla_put_failure:
+	rcu_read_unlock();
 	nlmsg_trim(skb, b);
 	return -1;
 }
@@ -1285,16 +1281,19 @@ ctnetlink_exp_dump_tuple(struct sk_buff *skb,
 {
 	struct nlattr *nest_parms;
 
+	rcu_read_lock();
 	nest_parms = nla_nest_start(skb, type | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
 	if (ctnetlink_dump_tuples(skb, tuple) < 0)
 		goto nla_put_failure;
 	nla_nest_end(skb, nest_parms);
+	rcu_read_unlock();
 
 	return 0;
 
 nla_put_failure:
+	rcu_read_unlock();
 	return -1;
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-25 11:15                           ` Pablo Neira Ayuso
  2008-07-27 17:23                             ` Fabian Hugelshofer
@ 2008-07-28 18:31                             ` Pablo Neira Ayuso
  2008-07-28 23:12                               ` Fabian Hugelshofer
  1 sibling, 1 reply; 30+ messages in thread
From: Pablo Neira Ayuso @ 2008-07-28 18:31 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: Patrick McHardy, netfilter-devel

Pablo Neira Ayuso wrote:
> Pablo Neira Ayuso wrote:
>> Or much simpler, just call rcu_read_lock() before the first
>> nla_nest_start() and unlock on the failure path, so that this results
>> in a much smaller patch:
>>
>> nlmsg_failure:
>> nla_put_failure:
>>         rcu_read_unlock(); <---
>>         nlmsg_trim(skb, b);
>>         return -1;

Sorry, this is wrong. It should be:

nla_put_failure:
         rcu_read_unlock();
nlmsg_failure:
         nlmsg_trim(skb, b);
         return -1;
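
The reason the label order matters can be shown with a small userspace
sketch. It is not the ctnetlink code: the lock is replaced by a mock
nesting counter, and fill_info() and `enum fail_at` are hypothetical
names. The assumption modelled here is that one failure path (the jump
to nlmsg_failure) can be taken before the read lock is held, while the
other (nla_put_failure) is only taken while it is held.

```c
#include <assert.h>

/* Mock read-side lock: just a nesting counter for the sketch. */
static int lock_depth;

static void mock_read_lock(void)   { lock_depth++; }
static void mock_read_unlock(void) { lock_depth--; }

/* Helper for callers: lock and unlock calls must have paired up. */
static void expect_balanced(void)
{
	assert(lock_depth == 0);
}

enum fail_at { FAIL_NONE, FAIL_BEFORE_LOCK, FAIL_AFTER_LOCK };

/* Mimics the control flow of a fill function with two failure labels. */
static int fill_info(enum fail_at where)
{
	if (where == FAIL_BEFORE_LOCK)
		goto nlmsg_failure;	/* header setup failed, lock not held */

	mock_read_lock();
	if (where == FAIL_AFTER_LOCK)
		goto nla_put_failure;	/* attribute failed, lock is held */
	mock_read_unlock();
	return 0;

nla_put_failure:
	mock_read_unlock();		/* only this path drops the lock */
nlmsg_failure:
	/* message trimming would go here */
	return -1;
}
```

With the labels in the other order, the early failure path would fall
through the unlock without ever having taken the lock, leaving the
counter unbalanced.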

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-28 18:31                             ` Pablo Neira Ayuso
@ 2008-07-28 23:12                               ` Fabian Hugelshofer
  2008-07-29 17:11                                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-28 23:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Patrick McHardy, netfilter-devel

On Mon, 2008-07-28 at 20:31 +0200, Pablo Neira Ayuso wrote:
> Pablo Neira Ayuso wrote:
> > Pablo Neira Ayuso wrote:
> >> Or much simpler, just call rcu_read_lock() before the first
> >> nla_nest_start() and unlock on the failure path, so that this results
> >> in a much smaller patch:
> >>
> >> nlmsg_failure:
> >> nla_put_failure:
> >>         rcu_read_unlock(); <---
> >>         nlmsg_trim(skb, b);
> >>         return -1;
> 
> Sorry, this is wrong. It should be:
> 
> nla_put_failure:
>          rcu_read_unlock();
> nlmsg_failure:
>          nlmsg_trim(skb, b);
>          return -1;

Very true indeed. Thanks for noticing. The jump to nlmsg_failure is kind
of hidden in the macro and the jump targets were in the wrong order.
You'll find the corrected version below.

nf_ctnetlink: Remove read locks from dump functions to increase
performance in the event notification path

Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch>

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 95a7967..06f69a2 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -104,16 +104,14 @@ ctnetlink_dump_tuples(struct sk_buff *skb,
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
 
-	l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
+	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
 	ret = ctnetlink_dump_tuples_ip(skb, tuple, l3proto);
-	nf_ct_l3proto_put(l3proto);
 
 	if (unlikely(ret < 0))
 		return ret;
 
-	l4proto = nf_ct_l4proto_find_get(tuple->src.l3num, tuple->dst.protonum);
+	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
 	ret = ctnetlink_dump_tuples_proto(skb, tuple, l4proto);
-	nf_ct_l4proto_put(l4proto);
 
 	return ret;
 }
@@ -150,9 +148,8 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 	struct nlattr *nest_proto;
 	int ret;
 
-	l4proto = nf_ct_l4proto_find_get(nf_ct_l3num(ct), nf_ct_protonum(ct));
+	l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
 	if (!l4proto->to_nlattr) {
-		nf_ct_l4proto_put(l4proto);
 		return 0;
 	}
 
@@ -162,14 +159,11 @@ ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
 
 	ret = l4proto->to_nlattr(skb, nest_proto, ct);
 
-	nf_ct_l4proto_put(l4proto);
-
 	nla_nest_end(skb, nest_proto);
 
 	return ret;
 
 nla_put_failure:
-	nf_ct_l4proto_put(l4proto);
 	return -1;
 }
 
@@ -183,7 +177,6 @@ ctnetlink_dump_helpinfo(struct sk_buff *skb, const struct nf_conn *ct)
 	if (!help)
 		return 0;
 
-	rcu_read_lock();
 	helper = rcu_dereference(help->helper);
 	if (!helper)
 		goto out;
@@ -198,11 +191,9 @@ ctnetlink_dump_helpinfo(struct sk_buff *skb, const struct nf_conn *ct)
 
 	nla_nest_end(skb, nest_helper);
 out:
-	rcu_read_unlock();
 	return 0;
 
 nla_put_failure:
-	rcu_read_unlock();
 	return -1;
 }
 
@@ -374,6 +365,8 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	nfmsg->version      = NFNETLINK_V0;
 	nfmsg->res_id	    = 0;
 
+	rcu_read_lock();
+
 	nest_parms = nla_nest_start(skb, CTA_TUPLE_ORIG | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
@@ -402,11 +395,14 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	    ctnetlink_dump_nat_seq_adj(skb, ct) < 0)
 		goto nla_put_failure;
 
+	rcu_read_unlock();
+
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
 	return skb->len;
 
-nlmsg_failure:
 nla_put_failure:
+	rcu_read_unlock();
+nlmsg_failure:
 	nlmsg_trim(skb, b);
 	return -1;
 }
@@ -1285,16 +1281,19 @@ ctnetlink_exp_dump_tuple(struct sk_buff *skb,
 {
 	struct nlattr *nest_parms;
 
+	rcu_read_lock();
 	nest_parms = nla_nest_start(skb, type | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
 	if (ctnetlink_dump_tuples(skb, tuple) < 0)
 		goto nla_put_failure;
 	nla_nest_end(skb, nest_parms);
+	rcu_read_unlock();
 
 	return 0;
 
 nla_put_failure:
+	rcu_read_unlock();
 	return -1;
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-28 23:12                               ` Fabian Hugelshofer
@ 2008-07-29 17:11                                 ` Pablo Neira Ayuso
  0 siblings, 0 replies; 30+ messages in thread
From: Pablo Neira Ayuso @ 2008-07-29 17:11 UTC (permalink / raw)
  To: Fabian Hugelshofer; +Cc: Patrick McHardy, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1056 bytes --]

Fabian Hugelshofer wrote:
> On Mon, 2008-07-28 at 20:31 +0200, Pablo Neira Ayuso wrote:
>> Pablo Neira Ayuso wrote:
>>> Pablo Neira Ayuso wrote:
>>>> Or much simpler, just call rcu_read_lock() before the first
>>>> nla_nest_start() and unlock on the failure path, so that this results
>>>> in a much smaller patch:
>>>>
>>>> nlmsg_failure:
>>>> nla_put_failure:
>>>>         rcu_read_unlock(); <---
>>>>         nlmsg_trim(skb, b);
>>>>         return -1;
>> Sorry, this is wrong. It should be:
>>
>> nla_put_failure:
>>          rcu_read_unlock();
>> nlmsg_failure:
>>          nlmsg_trim(skb, b);
>>          return -1;
> 
> Very true indeed. Thanks for noticing. The jump to nlmsg_failure is kind
> of hidden in the macro and the jump targets were in the wrong order.
> You'll find the corrected version below.
> 
> nf_ctnetlink: Remove read locks from dump functions to increase
> performance in the event notification path

I have six patches for ctnetlink here, one of them is based on your
patch. I hope to post them tomorrow for review.

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers

[-- Attachment #2: 00.patch --]
[-- Type: text/x-diff, Size: 4652 bytes --]

[PATCH] get rid of module refcounting in ctnetlink

This patch replaces the unnecessary module refcounting with
the read-side locks. With this patch, all the dump and fill_info
function are called under the RCU read lock.

Based on a patch from Fabian Hugelshofer.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Index: net-next-2.6.git/net/netfilter/nf_conntrack_netlink.c
===================================================================
--- net-next-2.6.git.orig/net/netfilter/nf_conntrack_netlink.c	2008-07-29 14:24:39.000000000 +0200
+++ net-next-2.6.git/net/netfilter/nf_conntrack_netlink.c	2008-07-29 14:24:41.000000000 +0200
@@ -103,16 +103,14 @@ ctnetlink_dump_tuples(struct sk_buff *sk
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
 
-	l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
+	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
 	ret = ctnetlink_dump_tuples_ip(skb, tuple, l3proto);
-	nf_ct_l3proto_put(l3proto);
 
 	if (unlikely(ret < 0))
 		return ret;
 
-	l4proto = nf_ct_l4proto_find_get(tuple->src.l3num, tuple->dst.protonum);
+	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
 	ret = ctnetlink_dump_tuples_proto(skb, tuple, l4proto);
-	nf_ct_l4proto_put(l4proto);
 
 	return ret;
 }
@@ -149,11 +147,9 @@ ctnetlink_dump_protoinfo(struct sk_buff 
 	struct nlattr *nest_proto;
 	int ret;
 
-	l4proto = nf_ct_l4proto_find_get(nf_ct_l3num(ct), nf_ct_protonum(ct));
-	if (!l4proto->to_nlattr) {
-		nf_ct_l4proto_put(l4proto);
+	l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
+	if (!l4proto->to_nlattr)
 		return 0;
-	}
 
 	nest_proto = nla_nest_start(skb, CTA_PROTOINFO | NLA_F_NESTED);
 	if (!nest_proto)
@@ -161,14 +157,11 @@ ctnetlink_dump_protoinfo(struct sk_buff 
 
 	ret = l4proto->to_nlattr(skb, nest_proto, ct);
 
-	nf_ct_l4proto_put(l4proto);
-
 	nla_nest_end(skb, nest_proto);
 
 	return ret;
 
 nla_put_failure:
-	nf_ct_l4proto_put(l4proto);
 	return -1;
 }
 
@@ -182,7 +175,6 @@ ctnetlink_dump_helpinfo(struct sk_buff *
 	if (!help)
 		return 0;
 
-	rcu_read_lock();
 	helper = rcu_dereference(help->helper);
 	if (!helper)
 		goto out;
@@ -197,11 +189,9 @@ ctnetlink_dump_helpinfo(struct sk_buff *
 
 	nla_nest_end(skb, nest_helper);
 out:
-	rcu_read_unlock();
 	return 0;
 
 nla_put_failure:
-	rcu_read_unlock();
 	return -1;
 }
 
@@ -458,6 +448,7 @@ static int ctnetlink_conntrack_event(str
 	nfmsg->version	= NFNETLINK_V0;
 	nfmsg->res_id	= 0;
 
+	rcu_read_lock();
 	nest_parms = nla_nest_start(skb, CTA_TUPLE_ORIG | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
@@ -519,13 +510,15 @@ static int ctnetlink_conntrack_event(str
 	    && ctnetlink_dump_mark(skb, ct) < 0)
 		goto nla_put_failure;
 #endif
+	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
 	nfnetlink_send(skb, 0, group, 0);
 	return NOTIFY_DONE;
 
-nlmsg_failure:
 nla_put_failure:
+	rcu_read_unlock();
+nlmsg_failure:
 	kfree_skb(skb);
 	return NOTIFY_DONE;
 }
@@ -863,8 +856,10 @@ ctnetlink_get_conntrack(struct sock *ctn
 		return -ENOMEM;
 	}
 
+	rcu_read_lock();
 	err = ctnetlink_fill_info(skb2, NETLINK_CB(skb).pid, nlh->nlmsg_seq,
 				  IPCTNL_MSG_CT_NEW, 1, ct);
+	rcu_read_unlock();
 	nf_ct_put(ct);
 	if (err <= 0)
 		goto free;
@@ -1316,16 +1311,14 @@ ctnetlink_exp_dump_mask(struct sk_buff *
 	if (!nest_parms)
 		goto nla_put_failure;
 
-	l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
+	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
 	ret = ctnetlink_dump_tuples_ip(skb, &m, l3proto);
-	nf_ct_l3proto_put(l3proto);
 
 	if (unlikely(ret < 0))
 		goto nla_put_failure;
 
-	l4proto = nf_ct_l4proto_find_get(tuple->src.l3num, tuple->dst.protonum);
+	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
 	ret = ctnetlink_dump_tuples_proto(skb, &m, l4proto);
-	nf_ct_l4proto_put(l4proto);
 	if (unlikely(ret < 0))
 		goto nla_put_failure;
 
@@ -1432,15 +1425,18 @@ static int ctnetlink_expect_event(struct
 	nfmsg->version	    = NFNETLINK_V0;
 	nfmsg->res_id	    = 0;
 
+	rcu_read_lock();
 	if (ctnetlink_exp_dump_expect(skb, exp) < 0)
 		goto nla_put_failure;
+	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
 	nfnetlink_send(skb, 0, NFNLGRP_CONNTRACK_EXP_NEW, 0);
 	return NOTIFY_DONE;
 
-nlmsg_failure:
 nla_put_failure:
+	rcu_read_unlock();
+nlmsg_failure:
 	kfree_skb(skb);
 	return NOTIFY_DONE;
 }
@@ -1543,9 +1539,11 @@ ctnetlink_get_expect(struct sock *ctnl, 
 	if (!skb2)
 		goto out;
 
+	rcu_read_lock();
 	err = ctnetlink_exp_fill_info(skb2, NETLINK_CB(skb).pid,
 				      nlh->nlmsg_seq, IPCTNL_MSG_EXP_NEW,
 				      1, exp);
+	rcu_read_unlock();
 	if (err <= 0)
 		goto free;
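
The pattern this patch applies can be modelled in user space as follows.
This is a rough sketch under stated assumptions, not the kernel
implementation: every name in it (mock_rcu_read_lock(), proto_find_get(),
proto_find_locked(), send_event_old(), send_event_new()) is hypothetical
and only counts operations. The old style pins the protocol on every
lookup; the new style takes the read lock once in the caller and does
plain lookups underneath it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for real locking and refcounting. */
static int lock_depth;		/* models RCU read-side nesting */
static int refcount_ops;	/* counts get/put operations */

static void mock_rcu_read_lock(void)   { lock_depth++; }
static void mock_rcu_read_unlock(void) { lock_depth--; }

struct mock_proto { const char *name; };
static struct mock_proto udp_proto = { "udp" };

/* Old style: each lookup takes a reference that must be dropped again. */
static struct mock_proto *proto_find_get(void)
{
	refcount_ops++;
	return &udp_proto;
}

static void proto_put(struct mock_proto *p)
{
	(void)p;
	refcount_ops++;
}

/* New style: plain lookup, valid only while the caller holds the lock. */
static struct mock_proto *proto_find_locked(void)
{
	assert(lock_depth > 0);
	return &udp_proto;
}

/* Old event path: one get/put pair per dumped tuple. */
static int send_event_old(int ntuples)
{
	int i;

	for (i = 0; i < ntuples; i++) {
		struct mock_proto *p = proto_find_get();
		proto_put(p);
	}
	return 0;
}

/* New event path: lock once, dump everything, unlock once. */
static int send_event_new(int ntuples)
{
	int i, ret = 0;

	mock_rcu_read_lock();
	for (i = 0; i < ntuples; i++) {
		if (proto_find_locked() == NULL) {
			ret = -1;
			break;
		}
	}
	mock_rcu_read_unlock();
	return ret;
}
```

In this model an event with two tuples costs four refcount operations in
the old path and none in the new one, at the price of requiring every
caller of the dump helpers to hold the read lock.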
 


* Re: Conntrack Events Performance - Multipart Messages?
  2008-07-23 16:12               ` Fabian Hugelshofer
  2008-07-23 17:01                 ` Patrick McHardy
@ 2008-07-25  8:44                 ` Fabian Hugelshofer
  1 sibling, 0 replies; 30+ messages in thread
From: Fabian Hugelshofer @ 2008-07-25  8:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Patrick McHardy, Pablo Neira Ayuso

Fabian Hugelshofer wrote:
> Patrick McHardy wrote:
>> The first thing to try would be to use sane allocation sizes
>> for the event messages. This patch doesn't implement it properly
>> (uses probing), but should be enough to test whether it helps.
> 
> Thanks a lot. This patch already decreased the CPU usage for ctevtest 
> from 85% to 44%. Sweet...

I just reran the test. The 85% CPU usage was with 1700pps and 44% with 
1500pps. The wireless channel does not provide the same performance all 
the time and I did not pay attention to the packet rate.

Without the allocation patch and 1500pps the CPU usage is 56%. But 10% 
less is still quite nice.



Thread overview: 30+ messages
2008-07-16 16:42 Conntrack Events Performance - Multipart Messages? Fabian Hugelshofer
2008-07-17  9:16 ` Patrick McHardy
2008-07-17 10:03 ` Pablo Neira Ayuso
2008-07-17 14:34   ` Fabian Hugelshofer
2008-07-17 15:15     ` Fabian Hugelshofer
2008-07-18 15:56     ` Fabian Hugelshofer
2008-07-18  2:11   ` Patrick McHardy
2008-07-21 15:51     ` Fabian Hugelshofer
2008-07-21 15:59       ` Patrick McHardy
2008-07-21 17:49         ` Fabian Hugelshofer
2008-07-23 14:32           ` Fabian Hugelshofer
2008-07-23 14:38             ` Patrick McHardy
2008-07-23 16:12               ` Fabian Hugelshofer
2008-07-23 17:01                 ` Patrick McHardy
2008-07-23 17:07                   ` Patrick McHardy
2008-07-23 17:30                     ` Fabian Hugelshofer
2008-07-23 17:32                       ` Patrick McHardy
2008-07-23 17:38                         ` Fabian Hugelshofer
2008-07-23 17:40                           ` Patrick McHardy
2008-07-23 17:15                   ` Fabian Hugelshofer
2008-07-23 17:20                     ` Patrick McHardy
2008-07-24 13:21                       ` Fabian Hugelshofer
2008-07-25  8:51                         ` Fabian Hugelshofer
2008-07-25  9:32                         ` Pablo Neira Ayuso
2008-07-25 11:15                           ` Pablo Neira Ayuso
2008-07-27 17:23                             ` Fabian Hugelshofer
2008-07-28 18:31                             ` Pablo Neira Ayuso
2008-07-28 23:12                               ` Fabian Hugelshofer
2008-07-29 17:11                                 ` Pablo Neira Ayuso
2008-07-25  8:44                 ` Fabian Hugelshofer
