PATCH: extra conntrack stats

All of lore.kernel.org
 help / color / mirror / Atom feed

* PATCH: extra conntrack stats
@ 2003-04-19 21:42 Rolf Fokkens
  2003-04-25  8:23 ` Patrick Schaaf
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Rolf Fokkens @ 2003-04-19 21:42 UTC (permalink / raw)
  To: netfilter-devel

Hi!

This patch adds a little statistical information to the connection tracking:
the number of packets and bytes that have gone throught each connection.

We've been using the connection tracking info so far to count concurrently
connected users on an Oracle environment. Now our customer would like us to
measure the size of print jobs and the nunmber of print jobs - this patch may
be an option for us. Of course one can also measure other kinds of traffic.

It might even be possible now to limit the amount of data that has gone 
throught a connection by implementing other targets.

Well, it's a small patch that may not do any harm and please some people.

Cheers,

Rolf

diff -ruN linux-2.4.20.orig/include/linux/netfilter_ipv4/ip_conntrack_tuple.h linux-2.4.20/include/linux/netfilter_ipv4/ip_conntrack_tuple.h
--- linux-2.4.20.orig/include/linux/netfilter_ipv4/ip_conntrack_tuple.h	Mon Feb 25 20:38:13 2002
+++ linux-2.4.20/include/linux/netfilter_ipv4/ip_conntrack_tuple.h	Fri Apr 18 20:26:48 2003
@@ -60,6 +60,8 @@
 		/* The protocol. */
 		u_int16_t protonum;
 	} dst;
+
+	unsigned long npackets, nbytes;
 };
 
 enum ip_conntrack_dir
diff -ruN linux-2.4.20.orig/net/ipv4/netfilter/ip_conntrack_core.c linux-2.4.20/net/ipv4/netfilter/ip_conntrack_core.c
--- linux-2.4.20.orig/net/ipv4/netfilter/ip_conntrack_core.c	Fri Nov 29 00:53:15 2002
+++ linux-2.4.20/net/ipv4/netfilter/ip_conntrack_core.c	Sat Apr 19 15:13:34 2003
@@ -140,6 +140,8 @@
 	else if (iph->ihl * 4 + 8 > len)
 		return 0;
 
+	tuple->npackets = 0;
+	tuple->nbytes = 0;
 	tuple->src.ip = iph->saddr;
 	tuple->dst.ip = iph->daddr;
 	tuple->dst.protonum = iph->protocol;
@@ -158,6 +160,8 @@
 	inverse->src.ip = orig->dst.ip;
 	inverse->dst.ip = orig->src.ip;
 	inverse->dst.protonum = orig->dst.protonum;
+	inverse->npackets = 0;
+	inverse->nbytes = 0;
 
 	return protocol->invert_tuple(inverse, orig);
 }
@@ -760,7 +764,10 @@
 		if (IS_ERR(h))
 			return (void *)h;
 	}
-
+	WRITE_LOCK(&ip_conntrack_lock);
+	h->tuple.npackets ++;
+	h->tuple.nbytes += skb->len;
+	WRITE_UNLOCK(&ip_conntrack_lock);
 	/* It exists; we have (non-exclusive) reference. */
 	if (DIRECTION(h) == IP_CT_DIR_REPLY) {
 		*ctinfo = IP_CT_ESTABLISHED + IP_CT_IS_REPLY;
diff -ruN linux-2.4.20.orig/net/ipv4/netfilter/ip_conntrack_standalone.c linux-2.4.20/net/ipv4/netfilter/ip_conntrack_standalone.c
--- linux-2.4.20.orig/net/ipv4/netfilter/ip_conntrack_standalone.c	Fri Nov 29 00:53:15 2002
+++ linux-2.4.20/net/ipv4/netfilter/ip_conntrack_standalone.c	Sat Apr 19 11:51:26 2003
@@ -48,8 +48,11 @@
 {
 	int len;
 
-	len = sprintf(buffer, "src=%u.%u.%u.%u dst=%u.%u.%u.%u ",
-		      NIPQUAD(tuple->src.ip), NIPQUAD(tuple->dst.ip));
+	len = sprintf(buffer, "src=%u.%u.%u.%u dst=%u.%u.%u.%u "
+			      "packets=%lu bytes=%lu ",
+		      NIPQUAD(tuple->src.ip), NIPQUAD(tuple->dst.ip),
+		      tuple->npackets, tuple->nbytes
+		      );
 
 	len += proto->print_tuple(buffer + len, tuple);
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-19 21:42 PATCH: extra conntrack stats Rolf Fokkens
@ 2003-04-25  8:23 ` Patrick Schaaf
  2003-04-27 12:48 ` Harald Welte
  2003-04-28  9:13 ` vecna
  2 siblings, 0 replies; 42+ messages in thread
From: Patrick Schaaf @ 2003-04-25  8:23 UTC (permalink / raw)
  To: Rolf Fokkens; +Cc: netfilter-devel

Hi Rolf,

first of all: thanks for doing that work!

> This patch adds a little statistical information to the connection tracking:
> the number of packets and bytes that have gone throught each connection.
> 
> We've been using the connection tracking info so far to count concurrently
> connected users on an Oracle environment. Now our customer would like us to
> measure the size of print jobs and the nunmber of print jobs - this patch may
> be an option for us. Of course one can also measure other kinds of traffic.
> 
> It might even be possible now to limit the amount of data that has gone 
> throught a connection by implementing other targets.

It could be useful if you add a suitable minimal match, for demonstration.

> Well, it's a small patch that may not do any harm and please some people.

I don't think the normal processing takes a WRITE_LOCK(&ip_conntrack_lock)
for each and every packet. Thus, your patch will have a performance
impact under high packet load.

Another pitfall is the 'unsigned long' nbytes: I'm sure I'm not the
only one who has TCP connections transporting more than 4GB, so that
counter will overflow in practise. It would be better to make it
64 bit from the start. 'unsigned long long' and '%Lu' work in the
kernel, as far as I know.

That aside: nice work, nice minimal addition. I urge the core team
to evaluate this (and earlier) approaches. The general idea to have
per-conntrack usage information, is very powerful.

Teaser: "log me the first five packets of each HTTP connection".

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-19 21:42 PATCH: extra conntrack stats Rolf Fokkens
  2003-04-25  8:23 ` Patrick Schaaf
@ 2003-04-27 12:48 ` Harald Welte
  2003-04-27 15:23   ` Rolf Fokkens
  2003-04-28  9:13 ` vecna
  2 siblings, 1 reply; 42+ messages in thread
From: Harald Welte @ 2003-04-27 12:48 UTC (permalink / raw)
  To: Rolf Fokkens; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1011 bytes --]

On Sat, Apr 19, 2003 at 11:42:01PM +0200, Rolf Fokkens wrote:
> Hi!
> 
> This patch adds a little statistical information to the connection tracking:
> the number of packets and bytes that have gone throught each connection.
> 
> Well, it's a small patch that may not do any harm and please some people.

It will.  You are adding a counter that is written to for every packet.
This causes tremendous cache line ping-pong on SMP systems.

This is inacceptable, the only way to implement this is via per-cpu
counters that are added at the time you read them out (like the iptables
per-cpu local counters)

> Cheers,
> Rolf

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-27 12:48 ` Harald Welte
@ 2003-04-27 15:23   ` Rolf Fokkens
  2003-04-27 20:42     ` Harald Welte
  2003-04-29 22:15     ` Jozsef Kadlecsik
  0 siblings, 2 replies; 42+ messages in thread
From: Rolf Fokkens @ 2003-04-27 15:23 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel

On Sunday 27 April 2003 02:48 pm, Harald Welte wrote:
> This is inacceptable, the only way to implement this is via per-cpu
> counters that are added at the time you read them out (like the iptables
> per-cpu local counters)

Oh boy! It's even worse! Even without this patch we have a concurrency 
problem: the timout! It's updated frequently I guess, and it's also protected 
by the same lock: ip_conntrack_lock!

Well, after all this panic and calming down something constructive:

Should a lock per contrack entry be considered? This increases the 
concurrency for different connections anyhow.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-27 15:23   ` Rolf Fokkens
@ 2003-04-27 20:42     ` Harald Welte
  2003-04-28  6:13       ` Patrick Schaaf
  2003-04-29 22:15     ` Jozsef Kadlecsik
  1 sibling, 1 reply; 42+ messages in thread
From: Harald Welte @ 2003-04-27 20:42 UTC (permalink / raw)
  To: Rolf Fokkens; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1329 bytes --]

On Sun, Apr 27, 2003 at 05:23:30PM +0200, Rolf Fokkens wrote:
> On Sunday 27 April 2003 02:48 pm, Harald Welte wrote:
> > This is inacceptable, the only way to implement this is via per-cpu
> > counters that are added at the time you read them out (like the iptables
> > per-cpu local counters)
> 
> Oh boy! It's even worse! Even without this patch we have a concurrency 
> problem: the timout! It's updated frequently I guess, and it's also protected 
> by the same lock: ip_conntrack_lock!

Oh yes, sorry.  This issue never was fixed, although I seem to remember
some patches implementing a solution where the timeout was not updated
for every packet.

> Should a lock per contrack entry be considered? This increases the 
> concurrency for different connections anyhow.

Rusty has written some patches about this quite some time ago,
unfortunately they seem to be discontinued.  IIRC Martin Josefsson
wanted to look into them again.

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-27 20:42     ` Harald Welte
@ 2003-04-28  6:13       ` Patrick Schaaf
  0 siblings, 0 replies; 42+ messages in thread
From: Patrick Schaaf @ 2003-04-28  6:13 UTC (permalink / raw)
  To: Harald Welte, Rolf Fokkens, netfilter-devel

On Sun, Apr 27, 2003 at 10:42:18PM +0200, Harald Welte wrote:
> On Sun, Apr 27, 2003 at 05:23:30PM +0200, Rolf Fokkens wrote:
> > On Sunday 27 April 2003 02:48 pm, Harald Welte wrote:
> > > This is inacceptable, the only way to implement this is via per-cpu
> > > counters that are added at the time you read them out (like the iptables
> > > per-cpu local counters)
> > 
> > Oh boy! It's even worse! Even without this patch we have a concurrency 
> > problem: the timout! It's updated frequently I guess, and it's also
> > protected by the same lock: ip_conntrack_lock!

Let's look at things non-polemically, please.

The first add_timer() you allude to, happens in __ip_conntrack_confirm().
Indeed, we take a WRITE_LOCK there. But, we do this not to protect
the add_timer(), but to protect the conntrack list insertion happening
there. Of course, we need a WRITE_LOCK for list insertion [*]. But, note
that this only happens for unconfirmed conntracks, i.e. about once per
individual connection, for the first packet of the connection.
Otherwise, __ip_conntrack_confirm() returns _before_ doing the WRITE_LOCK.

The more general issue with timers, is the regular update on further
packets. This happens in ip_ct_refresh(). Indeed, it happens under
conntrack WRITE_LOCK. It's probably this place that you remembered.

IF a new WRITE_LOCK reason appears, e.g. for the counter update
discussed here, I would strongly advise to do it in ip_ct_refresh(),
i.e. use the already taken WRITE_LOCK, don't take it twice. If you
take it twice in a row, chances are you just doubled the ping pong
frequncy under load, i.e. halved the packets-per-second worst case.

> > Should a lock per contrack entry be considered? This increases the 
> > concurrency for different connections anyhow.

That would help in that place. However, we'd still do a del_timer/add_timer,
and the kernel timer implementation will again do some locking.

Harald remembered:

> Oh yes, sorry.  This issue never was fixed, although I seem to remember
> some patches implementing a solution where the timeout was not updated
> for every packet.

Correct. I implemented (and lightly tested) such a solution last year,
here's the pointer to the archives:

http://lists.netfilter.org/pipermail/netfilter-devel/2002-August/008859.html

Note that it does NOT get rid of the WRITE_LOCK, because this solution
needs to store the new timeout target in conntrack, after calculating
it - which would be racy without the lock.

So, Rolf's suggestion of a per-conntrack lock would indeed help here
to further reduce pingpong.

As far as I remember, ip_ct_refresh() was the only WRITE_LOCK we
take per-packet, at least at the conntrack layer.

To recapitulate, for the packet/byte counter patch, one should
either use per-cpu counters, or put the counter update into
ip_ct_refresh(), somehow.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-19 21:42 PATCH: extra conntrack stats Rolf Fokkens
  2003-04-25  8:23 ` Patrick Schaaf
  2003-04-27 12:48 ` Harald Welte
@ 2003-04-28  9:13 ` vecna
  2003-04-28 13:47   ` Patrick Schaaf
  2 siblings, 1 reply; 42+ messages in thread
From: vecna @ 2003-04-28  9:13 UTC (permalink / raw)
  To: netfilter-devel


[-- Attachment #1.1: Type: text/plain, Size: 739 bytes --]

On Sat, Apr 19, 2003 at 11:42:01PM +0200, Rolf Fokkens wrote:
> Hi!
> 
> This patch adds a little statistical information to the connection tracking:
> the number of packets and bytes that have gone throught each connection.

in attach I've put a simple patch for make proc entry that report number of
connection tracked, if anyone is interesting ... I've used them because
cat /proc/net/ip_conntrack | wc -l is too low with a large number of 
connection tracked.

in addiction, for large tracking of connection, I think that the problem
is on the manage of linked list, I'm thinking to implement a system for 
apply merge sort over sortder array of connection hashs, did you think
that this should be a good idea ? 

bye

[-- Attachment #1.2: counter.diff --]
[-- Type: text/plain, Size: 3223 bytes --]

--- kernel-2.4.20/net/ipv4/netfilter/ip_conntrack_core.c	2002-11-29 02:53:15.000000000 +0300
+++ linux/net/ipv4/netfilter/ip_conntrack_core.c	2003-04-28 13:23:28.000000000 +0400
@@ -61,7 +61,8 @@
 LIST_HEAD(protocol_list);
 static LIST_HEAD(helpers);
 unsigned int ip_conntrack_htable_size = 0;
-static int ip_conntrack_max = 0;
+static unsigned ip_conntrack_max = 0;
+static unsigned ip_conntrack_cnt = 0;
 static atomic_t ip_conntrack_count = ATOMIC_INIT(0);
 struct list_head *ip_conntrack_hash;
 static kmem_cache_t *ip_conntrack_cachep;
@@ -640,8 +641,9 @@
 
 	hash = hash_conntrack(tuple);
 
-	if (ip_conntrack_max &&
-	    atomic_read(&ip_conntrack_count) >= ip_conntrack_max) {
+	ip_conntrack_cnt =atomic_read(&ip_conntrack_count);
+	
+	if (ip_conntrack_max && ip_conntrack_cnt >= ip_conntrack_max) {
 		/* Try dropping from random chain, or else from the
                    chain about to put into (in case they're trying to
                    bomb one hash chain). */
@@ -1348,8 +1350,12 @@
 #define NET_IP_CONNTRACK_MAX 2089
 #define NET_IP_CONNTRACK_MAX_NAME "ip_conntrack_max"
 
+#define NET_IP_CONNTRACK_CNT 2090
+#define NET_IP_CONNTRACK_CNT_NAME "ip_conntrack_cnt"
+
 #ifdef CONFIG_SYSCTL
-static struct ctl_table_header *ip_conntrack_sysctl_header;
+static struct ctl_table_header *ip_conntrack_sysctl_hdr_max;
+static struct ctl_table_header *ip_conntrack_sysctl_hdr_cnt;
 
 static ctl_table ip_conntrack_table[] = {
 	{ NET_IP_CONNTRACK_MAX, NET_IP_CONNTRACK_MAX_NAME, &ip_conntrack_max,
@@ -1366,6 +1372,23 @@
 	{CTL_NET, "net", NULL, 0, 0555, ip_conntrack_dir_table, 0, 0, 0, 0, 0},
 	{ 0 }
 };
+
+static ctl_table ip_conntrack_t_cnt[] = {
+	{ NET_IP_CONNTRACK_CNT, NET_IP_CONNTRACK_CNT_NAME, &ip_conntrack_cnt,
+	sizeof(ip_conntrack_cnt), 0644,  NULL, proc_dointvec },
+	{ 0 }
+};
+
+static ctl_table ip_conntrack_dir_t_cnt[] = {
+	{ NET_IPV4, "ipv4", NULL, 0, 0555, ip_conntrack_t_cnt, 0, 0, 0, 0, 0},
+	{ 0 }
+};
+
+static ctl_table ip_conntrack_root_t_cnt[] = {
+	{CTL_NET, "net", NULL, 0, 0555, ip_conntrack_dir_t_cnt, 0, 0, 0, 0, 0},
+	{ 0 }
+};
+
 #endif /*CONFIG_SYSCTL*/
 
 static int kill_all(const struct ip_conntrack *i, void *data)
@@ -1378,7 +1401,10 @@
 void ip_conntrack_cleanup(void)
 {
 #ifdef CONFIG_SYSCTL
-	unregister_sysctl_table(ip_conntrack_sysctl_header);
+	/* max of connection tracked */
+	unregister_sysctl_table(ip_conntrack_sysctl_hdr_max);
+	/* counter of connection tracked */
+	unregister_sysctl_table(ip_conntrack_sysctl_hdr_cnt);
 #endif
 	ip_ct_attach = NULL;
 	/* This makes sure all current packets have passed through
@@ -1462,9 +1488,14 @@
    the CONFIG_SYSCTL unless you don't want to detect errors.
    Grrr... --RR */
 #ifdef CONFIG_SYSCTL
-	ip_conntrack_sysctl_header
+	ip_conntrack_sysctl_hdr_max
 		= register_sysctl_table(ip_conntrack_root_table, 0);
-	if (ip_conntrack_sysctl_header == NULL) {
+	if (ip_conntrack_sysctl_hdr_max == NULL) {
+		goto err_free_ct_cachep;
+	}
+	ip_conntrack_sysctl_hdr_cnt
+		= register_sysctl_table(ip_conntrack_root_t_cnt, 0);
+	if (ip_conntrack_sysctl_hdr_cnt == NULL) {
 		goto err_free_ct_cachep;
 	}
 #endif /*CONFIG_SYSCTL*/

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-28  9:13 ` vecna
@ 2003-04-28 13:47   ` Patrick Schaaf
  2003-04-28 15:07     ` possible target SBALANCE ? vecna
  0 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-04-28 13:47 UTC (permalink / raw)
  To: vecna; +Cc: netfilter-devel

> in attach I've put a simple patch for make proc entry that report number of
> connection tracked, if anyone is interesting ... I've used them because
> cat /proc/net/ip_conntrack | wc -l is too low with a large number of 
> connection tracked.

I use 'grep conntrack /proc/slabinfo' for this task (first number column
is the number of conntracks currently allocated).

Regarding your patch, it introduces another new place for cache line
pingpong, happening with the same frequency as creation of new conntracks.
That's rather rare, so the new pingpong won't have as much impact as the
per-packet cases, but it's still there.

> in addiction, for large tracking of connection, I think that the problem
> is on the manage of linked list, I'm thinking to implement a system for 
> apply merge sort over sortder array of connection hashs, did you think
> that this should be a good idea ? 

That is not a good idea, it's not the way hashing is supposed to work.
There is no need for ordering anywhere, so no need to sort anything,
as a mergesort would do.

If you have overly long hash chains, use more hash buckets, and, if
neccessary, a better distributing hash function.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* possible target SBALANCE ?
  2003-04-28 13:47   ` Patrick Schaaf
@ 2003-04-28 15:07     ` vecna
  2003-04-29 14:48       ` Harald Welte
  0 siblings, 1 reply; 42+ messages in thread
From: vecna @ 2003-04-28 15:07 UTC (permalink / raw)
  To: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1953 bytes --]

(sorry for my bad english :)

On Mon, Apr 28, 2003 at 03:47:13PM +0200, Patrick Schaaf wrote:
> I use 'grep conntrack /proc/slabinfo' for this task (first number column
> is the number of conntracks currently allocated).
uu :) I ever listen about :) tnx :)

> Regarding your patch, it introduces another new place for cache line
> pingpong, happening with the same frequency as creation of new conntracks.
excuse me, what do you mean with pingpong ?

> That is not a good idea, it's not the way hashing is supposed to work.
> There is no need for ordering anywhere, so no need to sort anything,
> as a mergesort would do.
> If you have overly long hash chains, use more hash buckets, and, if
> neccessary, a better distributing hash function.
umh, I've always view as hard work the calling of __ip_conntrack_find,
that call LIST_FIND that one check at the wrong possibility all present
connection tracked ...

but, about the subject of the list, on the documentation of a switch 
for compactPCI arch
(http://www.znyx.com/products/software/openarchitect/default.htm)
them explain our system for make a traffic division more or less similar
to a load balancer, them suppose that statistically all internet connection
should be (more or less) equally distributed if is applied a matematic 
computation over session params (ip source, ip dest, source porc, dest port).

if we take a look at the last byte of ip addres and port, we could use
this number and associate them (with a simple LAST_BYTENUMBER % POSSIBLE_OUTPUT)
to one output interface, for make load balancing applied to session
without doing session tracking. (because on between direction
of sessions, ip source ip dest source port dest port, is alwayes the some,
also with inverted position).

I've some time to use for make a possible -j SBALANCE target for do
a forward on a list of output interface ... did you think that
should be nice ? 

bye,
~c

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: possible target SBALANCE ?
  2003-04-28 15:07     ` possible target SBALANCE ? vecna
@ 2003-04-29 14:48       ` Harald Welte
  2003-04-30 11:59         ` vecna
  0 siblings, 1 reply; 42+ messages in thread
From: Harald Welte @ 2003-04-29 14:48 UTC (permalink / raw)
  To: vecna; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2710 bytes --]

On Mon, Apr 28, 2003 at 05:07:27PM +0200, vecna@s0ftpj.org wrote:
> > Regarding your patch, it introduces another new place for cache line
> > pingpong, happening with the same frequency as creation of new conntracks.
> excuse me, what do you mean with pingpong ?

he is referring to cache line ping pong on SMP systems.  It means that
only one of the per-cpu caches can have writeable data.  as soon as you
write to a certain memory location (like incrementing a counter), all
cached values of this memory location on the other CPU's get
invalidated.  This basically results in a cache miss at every time you
are referencing this memory location, making the CPU idle to wait for
the data from memory to arrive.

> umh, I've always view as hard work the calling of __ip_conntrack_find,
> that call LIST_FIND that one check at the wrong possibility all present
> connection tracked ...

__ip_conntrack_find is an inline function and LIST_FIND a macro.  so
there are not really any function calls involved.

> them explain our system for make a traffic division more or less similar
> to a load balancer, them suppose that statistically all internet connection
> should be (more or less) equally distributed if is applied a matematic 
> computation over session params (ip source, ip dest, source porc, dest port).
> 
> if we take a look at the last byte of ip addres and port, we could use
> this number and associate them (with a simple LAST_BYTENUMBER %
> POSSIBLE_OUTPUT) to one output interface, for make load balancing
> applied to session without doing session tracking. (because on between
> direction of sessions, ip source ip dest source port dest port, is
> alwayes the some, also with inverted position).

this is not new.  However:
- you cannot have outgoing connections from your individual nodes, since
  you cannot make sure that the reply packets will end up in the correct
  hash bucket
- you cannot deal with complex protocols like FTP,IRC,H.323,SIP,PPTP,...
- your load balancing is static and depends on the equal distribution of
  (source) IP addresses

> I've some time to use for make a possible -j SBALANCE target for do
> a forward on a list of output interface ... did you think that
> should be nice ? 

I personally don't think this would be very helpful.
 
> ~c

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-27 15:23   ` Rolf Fokkens
  2003-04-27 20:42     ` Harald Welte
@ 2003-04-29 22:15     ` Jozsef Kadlecsik
  2003-04-29 22:38       ` Martin Josefsson
  1 sibling, 1 reply; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-04-29 22:15 UTC (permalink / raw)
  To: Rolf Fokkens; +Cc: Harald Welte, netfilter-devel

On Sun, 27 Apr 2003, Rolf Fokkens wrote:

> Well, after all this panic and calming down something constructive:
>
> Should a lock per contrack entry be considered? This increases the
> concurrency for different connections anyhow.

Yep, I'm working on such a patch.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-29 22:15     ` Jozsef Kadlecsik
@ 2003-04-29 22:38       ` Martin Josefsson
  2003-04-30 10:49         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 42+ messages in thread
From: Martin Josefsson @ 2003-04-29 22:38 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Rolf Fokkens, Harald Welte, Netfilter-devel

On Wed, 2003-04-30 at 00:15, Jozsef Kadlecsik wrote:
> On Sun, 27 Apr 2003, Rolf Fokkens wrote:
> 
> > Well, after all this panic and calming down something constructive:
> >
> > Should a lock per contrack entry be considered? This increases the
> > concurrency for different connections anyhow.
> 
> Yep, I'm working on such a patch.

Care to describe how you are going to build it?

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-29 22:38       ` Martin Josefsson
@ 2003-04-30 10:49         ` Jozsef Kadlecsik
  2003-04-30 11:19           ` Martin Josefsson
  2003-05-01  0:06           ` Rusty Russell
  0 siblings, 2 replies; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-04-30 10:49 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Rolf Fokkens, Harald Welte, Rusty Russell, Netfilter-devel

On 30 Apr 2003, Martin Josefsson wrote:

> On Wed, 2003-04-30 at 00:15, Jozsef Kadlecsik wrote:
> > On Sun, 27 Apr 2003, Rolf Fokkens wrote:
> >
> > > Well, after all this panic and calming down something constructive:
> > >
> > > Should a lock per contrack entry be considered? This increases the
> > > concurrency for different connections anyhow.
> >
> > Yep, I'm working on such a patch.
>
> Care to describe how you are going to build it?

As usual, it's Rusty's locking patch what triggered me to think over the
locking issues. In that patch Rusty actually eliminates ip_conntrack
lock:

- functionality structures (helpers etc.) are protected by the netproto
  lock instead
- expectations (lists and data) are protected by the ip_conntrack_expect
  lock
- and per-chain locking introduced in the hash table, protecting the
  conntrack lists and entries

I think that eliminating ip_conntrack lock might uncover a (potential)
bottleneck: the tcp_lock, the lock in conntrack for the predominant
protocol of the Internet.

By adding a data lock to the conntrack entries can solve that
(potential) problem. The data lock would make unnecessary the per
conntrack helper locks too (unfortunately the buffers introduced in the
zerocopy patch undo that).

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-30 10:49         ` Jozsef Kadlecsik
@ 2003-04-30 11:19           ` Martin Josefsson
  2003-04-30 23:05             ` Martin Josefsson
  2003-05-01  0:06           ` Rusty Russell
  1 sibling, 1 reply; 42+ messages in thread
From: Martin Josefsson @ 2003-04-30 11:19 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Rolf Fokkens, Harald Welte, Rusty Russell, Netfilter-devel

On Wed, 2003-04-30 at 12:49, Jozsef Kadlecsik wrote:

> > > > Should a lock per contrack entry be considered? This increases the
> > > > concurrency for different connections anyhow.
> > >
> > > Yep, I'm working on such a patch.
> >
> > Care to describe how you are going to build it?
> 
> As usual, it's Rusty's locking patch what triggered me to think over the
> locking issues. In that patch Rusty actually eliminates ip_conntrack
> lock:
> 
> - functionality structures (helpers etc.) are protected by the netproto
>   lock instead
> - expectations (lists and data) are protected by the ip_conntrack_expect
>   lock
> - and per-chain locking introduced in the hash table, protecting the
>   conntrack lists and entries

I was a little scared that you were going to try implementing per
conntrack locks :)

> I think that eliminating ip_conntrack lock might uncover a (potential)
> bottleneck: the tcp_lock, the lock in conntrack for the predominant
> protocol of the Internet.

Yes that will be a problem.

> By adding a data lock to the conntrack entries can solve that
> (potential) problem. The data lock would make unnecessary the per
> conntrack helper locks too (unfortunately the buffers introduced in the
> zerocopy patch undo that).

A per conntrack lock is probably the best for solving that bottleneck.

The conntrack helpers could linearize the packets (NAT code does that
for helpers anyway?) or copy them using skb_copy()?
The linearize will impact performance a little even when not using NAT.

We should try to go RCU later.

I'll write up a small list of things I think we should fix in conntrack.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: possible target SBALANCE ?
  2003-04-29 14:48       ` Harald Welte
@ 2003-04-30 11:59         ` vecna
  2003-04-30 13:02           ` Roberto Nibali
  0 siblings, 1 reply; 42+ messages in thread
From: vecna @ 2003-04-30 11:59 UTC (permalink / raw)
  To: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

> __ip_conntrack_find is an inline function and LIST_FIND a macro.  so
> there are not really any function calls involved.
ok, i don't speak about latency caused by continuos jump, but the
complexity of the algorithm is O(N) with N =number of connection,
with merge sort this should became O(Log(N)), and simple when is 
added to the list you needed to follow incremental order for the key to
search (the hash)

> this is not new.  However:
> - you cannot have outgoing connections from your individual nodes, since
>   you cannot make sure that the reply packets will end up in the correct
>   hash bucket
> - you cannot deal with complex protocols like FTP,IRC,H.323,SIP,PPTP,...
> - your load balancing is static and depends on the equal distribution of
>   (source) IP addresses
I've felt the necessity to balance a large size of traffic, and all
the other solutions (proposed in LVS) involves connection tracking (and 
LVS have the "bug" o "limitation" to follow session computing hash 
with source ip and destination port only, not with between couple for 
trace also the replies, and this could not be fine for split traffic for
IDS, because most needed for trace connection to read also the replis)

but if this is not ad interesting feature for netfilter (is a firewall :)
could be better if I ask on LVS mailing list ? 

bye,
vecna

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: possible target SBALANCE ?
  2003-04-30 11:59         ` vecna
@ 2003-04-30 13:02           ` Roberto Nibali
  0 siblings, 0 replies; 42+ messages in thread
From: Roberto Nibali @ 2003-04-30 13:02 UTC (permalink / raw)
  To: vecna; +Cc: Netfilter Developers

Ciao Vecna, :)

> I've felt the necessity to balance a large size of traffic, and all
> the other solutions (proposed in LVS) involves connection tracking (and 

But very simple connection tracking.

> LVS have the "bug" o "limitation" to follow session computing hash 
> with source ip and destination port only, not with between couple for 

Well, to be correct, it's a triplet of <proto, vaddr, vport> for basic hashing, 
but you can as well implement a new scheduler and register your own hashing, as 
it is done for example with the lblcr scheduler.

> trace also the replies, and this could not be fine for split traffic for
> IDS, because most needed for trace connection to read also the replis)

It's changeable, the hash could be extended if there is a real need for it. Just 
bring it on on the LVS list ;).

> but if this is not ad interesting feature for netfilter (is a firewall :)
> could be better if I ask on LVS mailing list ? 

Yes, we can discuss it there, if you prefer.

Cheers,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-30 11:19           ` Martin Josefsson
@ 2003-04-30 23:05             ` Martin Josefsson
  2003-05-01  4:05               ` Rusty Russell
  2003-05-01 11:26               ` Harald Welte
  0 siblings, 2 replies; 42+ messages in thread
From: Martin Josefsson @ 2003-04-30 23:05 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Rolf Fokkens, Harald Welte, Rusty Russell, Netfilter-devel

On Wed, 2003-04-30 at 13:19, Martin Josefsson wrote:

> > I think that eliminating ip_conntrack lock might uncover a (potential)
> > bottleneck: the tcp_lock, the lock in conntrack for the predominant
> > protocol of the Internet.
> 
> Yes that will be a problem.
> 
> > By adding a data lock to the conntrack entries can solve that
> > (potential) problem. The data lock would make unnecessary the per
> > conntrack helper locks too (unfortunately the buffers introduced in the
> > zerocopy patch undo that).
> 
> A per conntrack lock is probably the best for solving that bottleneck.

If we have per bucket spinlocks we don't have to have a tcp_lock if I
rememberthe code correctly.
If we go RCU we might need the per conntrack lock.

> I'll write up a small list of things I think we should fix in conntrack.

- Switch to using hlists for the hashtable
- Rearrange struct ip_conntrack to be more cachefriendly if possible
- Add prefetching in list-searching
- Turn protocol_list into an array
- Switch to a better hashfunction
- Remove pointless timer-updating
- Rework locking to be finer grained, start with per bucket spinlocks
  (goal: RCU?)
- Remove tcp_lock if using per bucket spinlocks, otherwise move it into
  the entries
- Remove pointless rehashing of tuples
- Rework overload support (early_drop)
- Avoid as many memory-writes as possible, no need to dirty cachelines if
  we don't have to
- Eliminate listhelp.h and lockhelp.h by request from hch
- Try to shrink struct ip_conntrack

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-30 10:49         ` Jozsef Kadlecsik
  2003-04-30 11:19           ` Martin Josefsson
@ 2003-05-01  0:06           ` Rusty Russell
  2003-05-01  5:48             ` Patrick Schaaf
                               ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Rusty Russell @ 2003-05-01  0:06 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Rolf Fokkens, Harald Welte, Martin Josefsson, Netfilter-devel,
	Patrick Schaaf

In message <Pine.LNX.4.33.0304301227530.3328-100000@blackhole.kfki.hu> you write:
> As usual, it's Rusty's locking patch what triggered me to think over the
> locking issues. In that patch Rusty actually eliminates ip_conntrack
> lock:

Ah, but I think I might have a better trick.  Patrick cc'd since he's
the hashing man.

Can we compromise the hash so that every packet hashes to the same
chain as its reply?

If we can do this without making our hash really suck, we only need one
entry per conntrack (and we then compare both ways).  This means a
smaller table, and hence better cache utilization, as well as simpler
locking.

Thoughts?

> I think that eliminating ip_conntrack lock might uncover a (potential)
> bottleneck: the tcp_lock, the lock in conntrack for the predominant
> protocol of the Internet.
> 
> By adding a data lock to the conntrack entries can solve that
> (potential) problem. The data lock would make unnecessary the per
> conntrack helper locks too (unfortunately the buffers introduced in the
> zerocopy patch undo that).

Yeah, per-entry locks make sense I think.  I can change the helpers to
do an skb_copy (suggested by Alexey) which will be lockless, or just
implement the damn non-linear search.

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-30 23:05             ` Martin Josefsson
@ 2003-05-01  4:05               ` Rusty Russell
  2003-05-01  6:05                 ` Patrick Schaaf
                                   ` (2 more replies)
  2003-05-01 11:26               ` Harald Welte
  1 sibling, 3 replies; 42+ messages in thread
From: Rusty Russell @ 2003-05-01  4:05 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Harald Welte, Netfilter-devel,
	Patrick Schaaf

In message <1051743903.8213.188.camel@tux.rsn.bth.se> you write:
> On Wed, 2003-04-30 at 13:19, Martin Josefsson wrote:
> - Switch to using hlists for the hashtable

OK.  Please implement hlist_for_each_entry(), though, a-la
list_for_each_entry.

> - Rearrange struct ip_conntrack to be more cachefriendly if possible

Leave 'til last, I think, since that structure will be changing
anyway.  Most important is that next ptr and tuple are at the start
(ie. only one cacheline per hash chain entry).

> - Add prefetching in list-searching

Can be done, but it's a micro-optimization and you'd want to check
carefully that recent gcc's don't do this anyway.

> - Turn protocol_list into an array

Or hardcode the three builtins and use the list for others.

> - Switch to a better hashfunction

Yes!  See comments below on secret hashing...

> - Remove pointless timer-updating

We could revisit this altogether: go for one timer which sweeps the
hash chains and an "expiry" date on each one.  You end up with some
icky deletion issues, but...

> - Rework locking to be finer grained, start with per bucket spinlocks
>   (goal: RCU?)

Well, might as well go straight to RCU for the infrastructure
locking.  

It's a little tricky.  RCU can be used for the read side, but since
writes are not uncommon (I always guesstimated 1 in 10), we still
probably want per-chain locks.  Hmm, actually, since that bloats the
hash, let's start with one lock and see how it goes.

The protection of the conntrack objects themselves should become a
lock per conntrack I think.  We currently use the timer lock as a form
of synchronization: if we have a conntrack.lock we should use that.

> - Remove tcp_lock if using per bucket spinlocks, otherwise move it into
>   the entries

Agreed: use conntrack.lock.

> - Remove pointless rehashing of tuples

Hmmm, does this happen?

> - Rework overload support (early_drop)

OK, you've probably seen my "hash with secret key" scheme.
Unfortunately, it relies on being able to grab the network brlock to
stop all activity while it redoes the hash.  But Dave is removing the
brlock, so this becomes more tricky (ie. readers have to be aware that
there could be two hash tables, ick).

It might still be worth it though (this same trick would allow us to
resize the hash through /proc).  Needs more thought.

> - Avoid as many memory-writes as possible, no need to dirty cachelines if
>   we don't have to

Well, yes, but it's usually secondary after correctness.

> - Eliminate listhelp.h and lockhelp.h by request from hch

Yeah, list.h is more sophisticated now, and we have lock debugging

> - Try to shrink struct ip_conntrack

Ignoring NAT for the moment, and using a 32-bit arch:

	struct nf_conntrack ct_general;
8 bytes: hard to shrink.

	struct ip_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
2 x 28 bytes: we can get rid of one.

	unsigned long status;
8 bytes. Needs to be ulong for bitops.

	struct timer_list timeout;
48 bytes: could be one ulong (time for expiry).

	struct list_head sibling_list;
8 bytes.  Hard to remove without getting tricky...

	unsigned int expecting;
4 bytes. Maybe could merge this with upper bits of status, maybe...

	struct ip_conntrack_expect *master;
4 bytes. Needed.

	struct ip_conntrack_helper *helper;
4 bytes.  Needed.

	struct nf_ct_info infos[IP_CT_NUMBER];
7 * 4 bytes.  We could eliminate this by adding an skb field to hold
the state.

	union ip_conntrack_proto proto;
8 bytes, will get bigger with tcp tracking.

	union ip_conntrack_help help;
16 bytes, could use flags in status for ftp's seq_aft_nl_set and cut
to 8 bytes.

So, we're about 192 bytes.  We could get to 80 bytes.

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  0:06           ` Rusty Russell
@ 2003-05-01  5:48             ` Patrick Schaaf
  2003-05-01 10:01               ` Martin Josefsson
  2003-05-01  9:06             ` Martin Josefsson
  2003-05-05  8:43             ` PATCH: extra conntrack stats Jozsef Kadlecsik
  2 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-05-01  5:48 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Harald Welte, Martin Josefsson,
	Netfilter-devel, Patrick Schaaf

> > As usual, it's Rusty's locking patch what triggered me to think over the
> > locking issues. In that patch Rusty actually eliminates ip_conntrack
> > lock:
> 
> Ah, but I think I might have a better trick.  Patrick cc'd since he's
> the hashing man.
> 
> Can we compromise the hash so that every packet hashes to the same
> chain as its reply?
> 
> If we can do this without making our hash really suck, we only need one
> entry per conntrack (and we then compare both ways).  This means a
> smaller table, and hence better cache utilization, as well as simpler
> locking.

Thanks for the Cc, I have indeed thought about such an approach.

Basically, it is possible for non-NAT conntracks. Those have
what I'd call "mirror" key material. Sort the keys, resulting
in A) a "flipped" bit, and B) first the smaller key, then the
larger, you have broken down both sides of the connection to
the same key (and the flipped bit, giving the direction);
look up the "sorted" key in the hash table, then choose the
real tuple to compare to based on the flipped bit.

Unfortunately, once NAT is set up for a conntrack, the keys
for both connections lose the mirror property. The approach
then becomes a small loss, compared to what we have now.

That means we have a tradeoff based on how we are deployed.
I stopped thinking about it at that point...

As an aside, and independant of the key sorting/flipping,
we could eliminate completely the info struct array we
now have in each conntrack. Interested? Then I'll write
how I envision that.

Finally, regarding the locking, I would like to mention
the memory conservative "sectored locks" approach.
Break the hash table into sectors, about N*NR_CPU
of them. Use one rwlock per sector.  Given an equally
distributing hash function, the probability that two
of our NR_CPU cpus will lock the same sector at the
same time, becomes small with growing N, fast.

This way, a lot less locks than with one-lock-per-chain
or one-lock-per-conntrack can be used, without compromising
concurrency too much.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  4:05               ` Rusty Russell
@ 2003-05-01  6:05                 ` Patrick Schaaf
  2003-05-01  6:46                   ` Rusty Russell
  2003-05-01  9:58                 ` Martin Josefsson
  2003-05-01 11:32                 ` Harald Welte
  2 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-05-01  6:05 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel, Patrick Schaaf

> 	struct nf_ct_info infos[IP_CT_NUMBER];
> 7 * 4 bytes.  We could eliminate this by adding an skb field to hold
> the state.

Given a guaranteed cacheline alignment of our 'struct ip_conntrack',
we could make the current pointer on the skbuffs a tagged pointer,
i.e. stuff our 3 info bits into the low 3 bits, without consuming
more room in the skbuffs. The skb_nf_forget() cleanup patch I made
last year, was a preparation for that. With that patch applied,
the ugliness of such a tagged pointer is not visible in the base
network stack (skb_nf_forget() makes the base stack callers
oblivious of what exactly is stored on an skbuff wrt conntracking).

See http://lists.netfilter.org/pipermail/netfilter-devel/2002-July/008703.html

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  6:05                 ` Patrick Schaaf
@ 2003-05-01  6:46                   ` Rusty Russell
  2003-05-01  7:04                     ` Patrick Schaaf
  0 siblings, 1 reply; 42+ messages in thread
From: Rusty Russell @ 2003-05-01  6:46 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel

In message <20030501060522.GB15143@oknodo.bof.de> you write:
> > 	struct nf_ct_info infos[IP_CT_NUMBER];
> > 7 * 4 bytes.  We could eliminate this by adding an skb field to hold
> > the state.
> 
> Given a guaranteed cacheline alignment of our 'struct ip_conntrack',
> we could make the current pointer on the skbuffs a tagged pointer,
> i.e. stuff our 3 info bits into the low 3 bits, without consuming
> more room in the skbuffs.

That was my original implementation.  It was fairly ugly, but worse, I
can imagine an arch where it would break.  None would at the moment,
though, since kmalloc cacheline aligns (although we'd have to be a
little careful to explicitly align a conntrack if we were to use a
dummy one for a "notrack" target.

> See http://lists.netfilter.org/pipermail/netfilter-devel/2002-July/008703.html

That looks like a good patch anyway...

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  6:46                   ` Rusty Russell
@ 2003-05-01  7:04                     ` Patrick Schaaf
  2003-05-01  7:38                       ` Rusty Russell
  0 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-05-01  7:04 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Patrick Schaaf, Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens,
	Harald Welte, Netfilter-devel

On Thu, May 01, 2003 at 04:46:44PM +1000, Rusty Russell wrote:
> In message <20030501060522.GB15143@oknodo.bof.de> you write:
> > > 	struct nf_ct_info infos[IP_CT_NUMBER];
> > > 7 * 4 bytes.  We could eliminate this by adding an skb field to hold
> > > the state.
> > 
> > Given a guaranteed cacheline alignment of our 'struct ip_conntrack',
> > we could make the current pointer on the skbuffs a tagged pointer,
> > i.e. stuff our 3 info bits into the low 3 bits, without consuming
> > more room in the skbuffs.
> 
> That was my original implementation.  It was fairly ugly, but worse, I
> can imagine an arch where it would break.  None would at the moment,
> though, since kmalloc cacheline aligns

This boils down to the question whether SLAB_HWCACHE_ALIGN is a guarantee,
or a hint. I would have expected it to be a guarantee.

Argh. Are you worried about architectures with 8 byte or less cache lines?

Well, use another field in the skbuff, then. I would recycle nfcache,
but that's just because nobody ever explained to me what it would be
used for... :)

> (although we'd have to be a
> little careful to explicitly align a conntrack if we were to use a
> dummy one for a "notrack" target.

Unless you are after static initializability, what keeps us from
allocating such dummies in the same way as we allocate the rest?

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  7:04                     ` Patrick Schaaf
@ 2003-05-01  7:38                       ` Rusty Russell
  0 siblings, 0 replies; 42+ messages in thread
From: Rusty Russell @ 2003-05-01  7:38 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel

In message <20030501070408.GD15143@oknodo.bof.de> you write:
> Argh. Are you worried about architectures with 8 byte or less cache lines?

Yes, an embedded chip might value space above all else.  But noone
seems to, so it's probably OK.

> Well, use another field in the skbuff, then. I would recycle nfcache,
> but that's just because nobody ever explained to me what it would be
> used for... :)

It was so you can optimize fastroute: you can cache decisions made by
netfilter hooks.  But it was never actually used.  It could be dropped
in 2.7.

> > (although we'd have to be a
> > little careful to explicitly align a conntrack if we were to use a
> > dummy one for a "notrack" target.
> 
> Unless you are after static initializability, what keeps us from
> allocating such dummies in the same way as we allocate the rest?

As I said, you'd have to be careful.  The logical way would be to
simply declare a static variable, which will still work as long as you
align it.

Cheers!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  0:06           ` Rusty Russell
  2003-05-01  5:48             ` Patrick Schaaf
@ 2003-05-01  9:06             ` Martin Josefsson
  2003-05-02  5:31               ` Rusty Russell
  2003-05-05  8:43             ` PATCH: extra conntrack stats Jozsef Kadlecsik
  2 siblings, 1 reply; 42+ messages in thread
From: Martin Josefsson @ 2003-05-01  9:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Harald Welte, Netfilter-devel,
	Patrick Schaaf

On Thu, 2003-05-01 at 02:06, Rusty Russell wrote:
> In message <Pine.LNX.4.33.0304301227530.3328-100000@blackhole.kfki.hu> you write:
> > As usual, it's Rusty's locking patch what triggered me to think over the
> > locking issues. In that patch Rusty actually eliminates ip_conntrack
> > lock:
> 
> Ah, but I think I might have a better trick.  Patrick cc'd since he's
> the hashing man.
> 
> Can we compromise the hash so that every packet hashes to the same
> chain as its reply?
> 
> If we can do this without making our hash really suck, we only need one
> entry per conntrack (and we then compare both ways).  This means a
> smaller table, and hence better cache utilization, as well as simpler
> locking.
> 
> Thoughts?

As Patrick already pointed out, it gets ugly wrt. NAT, the reply tuple
can be quite diffrent.

> > I think that eliminating ip_conntrack lock might uncover a (potential)
> > bottleneck: the tcp_lock, the lock in conntrack for the predominant
> > protocol of the Internet.
> > 
> > By adding a data lock to the conntrack entries can solve that
> > (potential) problem. The data lock would make unnecessary the per
> > conntrack helper locks too (unfortunately the buffers introduced in the
> > zerocopy patch undo that).
> 
> Yeah, per-entry locks make sense I think.  I can change the helpers to
> do an skb_copy (suggested by Alexey) which will be lockless, or just
> implement the damn non-linear search.

The question is... how do you free such an entry with the guarantee that
there are no waiters? It's probably possible but I can't see a clean
nice way to do it.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  4:05               ` Rusty Russell
  2003-05-01  6:05                 ` Patrick Schaaf
@ 2003-05-01  9:58                 ` Martin Josefsson
  2003-05-01 11:32                 ` Harald Welte
  2 siblings, 0 replies; 42+ messages in thread
From: Martin Josefsson @ 2003-05-01  9:58 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Harald Welte, Netfilter-devel,
	Patrick Schaaf

On Thu, 2003-05-01 at 06:05, Rusty Russell wrote:
> In message <1051743903.8213.188.camel@tux.rsn.bth.se> you write:
> > On Wed, 2003-04-30 at 13:19, Martin Josefsson wrote:
> > - Switch to using hlists for the hashtable
> 
> OK.  Please implement hlist_for_each_entry(), though, a-la
> list_for_each_entry.

That's the plan.

> > - Rearrange struct ip_conntrack to be more cachefriendly if possible
> 
> Leave 'til last, I think, since that structure will be changing
> anyway.  Most important is that next ptr and tuple are at the start
> (ie. only one cacheline per hash chain entry).

I'll try to do some measurements of this so we know how much we are
talking about.

> > - Turn protocol_list into an array
> 
> Or hardcode the three builtins and use the list for others.

It would probably be faster with just the array, especially for the not
builtin case. The upside of the hardcoding is that it's contained in the
instructioncache.

> > - Switch to a better hashfunction
> 
> Yes!  See comments below on secret hashing...

Jozsef continued the work on this and has come to some conclusions I
think.

> > - Remove pointless timer-updating
> 
> We could revisit this altogether: go for one timer which sweeps the
> hash chains and an "expiry" date on each one.  You end up with some
> icky deletion issues, but...

Patrick made a patch sometime ago that still had the per entry timer but
only updated it with a new timeout once it expired.
The timer-handling code in kernel seems to have rather low overhead even
with lots of timers, it's just that it disables interrupts twice for
each packet, and the fact that the updates are pointless (often updated
to the exact same jiffy as it had before).

> > - Rework locking to be finer grained, start with per bucket spinlocks
> >   (goal: RCU?)
> 
> Well, might as well go straight to RCU for the infrastructure
> locking.  
> 
> It's a little tricky.  RCU can be used for the read side, but since
> writes are not uncommon (I always guesstimated 1 in 10), we still
> probably want per-chain locks.  Hmm, actually, since that bloats the
> hash, let's start with one lock and see how it goes.

Noo, please don't, with one lock for inserts/deletes it could still be
awful during a fairly large DDoS on an SMP machine.
If per-chain locks are too much bloat (many many locks which large
hashtables) I'd say sectored locks like Patrick suggested, but not one
global.

> The protection of the conntrack objects themselves should become a
> lock per conntrack I think.  We currently use the timer lock as a form
> of synchronization: if we have a conntrack.lock we should use that.
> 
> > - Remove tcp_lock if using per bucket spinlocks, otherwise move it into
> >   the entries
> 
> Agreed: use conntrack.lock.

We only read conntrack->proto.tcp twice and update it twice, both things
can be grouped together. I currently know too little about RCU, don't
know if it would be possible to skip the locking here.

> > - Remove pointless rehashing of tuples
> 
> Hmmm, does this happen?

Oh yes, currently we actually rehash the tuple for each entry searched
in the lists. There's a patch in optimization/ in p-o-m that fixes that.
But I think we have more rehashing going on and iirc Patrick made a
patch for that as well.

> > - Rework overload support (early_drop)
> 
> OK, you've probably seen my "hash with secret key" scheme.
> Unfortunately, it relies on being able to grab the network brlock to
> stop all activity while it redoes the hash.  But Dave is removing the
> brlock, so this becomes more tricky (ie. readers have to be aware that
> there could be two hash tables, ick).
> 
> It might still be worth it though (this same trick would allow us to
> resize the hash through /proc).  Needs more thought.

The only time rehashing is a positive thing is when you have one or a
few attacked chain or you manually want to resize it. Otherwise it will
probably just slow things down when the rehash happens. I like to just
give the hashtable a proper size from the beginning.

So we need a way to tell if someone is attacking a chain or if it's just
overload due to poor sizing.

> > - Avoid as many memory-writes as possible, no need to dirty cachelines if
> >   we don't have to
> 
> Well, yes, but it's usually secondary after correctness.

Of course.

> > - Try to shrink struct ip_conntrack
> 
> Ignoring NAT for the moment, and using a 32-bit arch:
> 
> 	struct nf_conntrack ct_general;
> 8 bytes: hard to shrink.
> 
> 	struct ip_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
> 2 x 28 bytes: we can get rid of one.

If we try to change the hashing yes.

> 	unsigned long status;
> 8 bytes. Needs to be ulong for bitops.

4 bytes

> So, we're about 192 bytes.  We could get to 80 bytes.

That'd be nice.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  5:48             ` Patrick Schaaf
@ 2003-05-01 10:01               ` Martin Josefsson
  0 siblings, 0 replies; 42+ messages in thread
From: Martin Josefsson @ 2003-05-01 10:01 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Rusty Russell, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel

On Thu, 2003-05-01 at 07:48, Patrick Schaaf wrote:

> As an aside, and independant of the key sorting/flipping,
> we could eliminate completely the info struct array we
> now have in each conntrack. Interested? Then I'll write
> how I envision that.

Please do.

> Finally, regarding the locking, I would like to mention
> the memory conservative "sectored locks" approach.
> Break the hash table into sectors, about N*NR_CPU
> of them. Use one rwlock per sector.  Given an equally
> distributing hash function, the probability that two
> of our NR_CPU cpus will lock the same sector at the
> same time, becomes small with growing N, fast.
> 
> This way, a lot less locks than with one-lock-per-chain
> or one-lock-per-conntrack can be used, without compromising
> concurrency too much.

I like this, if you have 512k buckets you don't want 512k locks if it's
a 2way SMP machine.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-04-30 23:05             ` Martin Josefsson
  2003-05-01  4:05               ` Rusty Russell
@ 2003-05-01 11:26               ` Harald Welte
  2003-05-02 12:18                 ` Jozsef Kadlecsik
  1 sibling, 1 reply; 42+ messages in thread
From: Harald Welte @ 2003-05-01 11:26 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Rusty Russell, Netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]

On Thu, May 01, 2003 at 01:05:03AM +0200, Martin Josefsson wrote:

> > I'll write up a small list of things I think we should fix in conntrack.
> 
> - Switch to using hlists for the hashtable
> - Rearrange struct ip_conntrack to be more cachefriendly if possible
> - Add prefetching in list-searching
> - Turn protocol_list into an array
> - Switch to a better hashfunction
> - Remove pointless timer-updating
> - Rework locking to be finer grained, start with per bucket spinlocks
>   (goal: RCU?)
> - Remove tcp_lock if using per bucket spinlocks, otherwise move it into
>   the entries
> - Remove pointless rehashing of tuples
> - Rework overload support (early_drop)
> - Avoid as many memory-writes as possible, no need to dirty cachelines if
>   we don't have to
> - Eliminate listhelp.h and lockhelp.h by request from hch
> - Try to shrink struct ip_conntrack

this list is too long for a change that is going to happen during 2.4.x
soon.

I would really like to see a simple patch changing the hash function
before.  This patch can be pushed to davem after a relatively short time
period of testing, since it shouldn't touch too much of the conntrack
core.


> /Martin

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  4:05               ` Rusty Russell
  2003-05-01  6:05                 ` Patrick Schaaf
  2003-05-01  9:58                 ` Martin Josefsson
@ 2003-05-01 11:32                 ` Harald Welte
  2 siblings, 0 replies; 42+ messages in thread
From: Harald Welte @ 2003-05-01 11:32 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Netfilter-devel,
	Patrick Schaaf

[-- Attachment #1: Type: text/plain, Size: 712 bytes --]

On Thu, May 01, 2003 at 02:05:52PM +1000, Rusty Russell wrote:

> Ignoring NAT for the moment, and using a 32-bit arch:
> So, we're about 192 bytes.  We could get to 80 bytes.

JFYI: including PPTP nat (which expands struct ip_conntrack_tuple),
sizeof(struct ip_conntrack) is about 340 bytes on a 32bit arch.

> Cheers,
> Rusty.

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  9:06             ` Martin Josefsson
@ 2003-05-02  5:31               ` Rusty Russell
  2003-05-02  7:06                 ` Patrick Schaaf
  0 siblings, 1 reply; 42+ messages in thread
From: Rusty Russell @ 2003-05-02  5:31 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Jozsef Kadlecsik, Rolf Fokkens, Harald Welte, Netfilter-devel,
	Patrick Schaaf

In message <1051780007.8215.200.camel@tux.rsn.bth.se> you write:
> On Thu, 2003-05-01 at 02:06, Rusty Russell wrote:
> > If we can do this without making our hash really suck, we only need one
> > entry per conntrack (and we then compare both ways).  This means a
> > smaller table, and hence better cache utilization, as well as simpler
> > locking.
> > 
> > Thoughts?
> 
> As Patrick already pointed out, it gets ugly wrt. NAT, the reply tuple
> can be quite diffrent.

Yes, you're both right, I'm stupid.  Scratch this "great idea".

> > Yeah, per-entry locks make sense I think.  I can change the helpers to
> > do an skb_copy (suggested by Alexey) which will be lockless, or just
> > implement the damn non-linear search.
> 
> The question is... how do you free such an entry with the guarantee that
> there are no waiters? It's probably possible but I can't see a clean
> nice way to do it.

Reference counts.  Currently, you delete from the hash, then dec
reference count (hash reference holds one reference count).  With RCU,
you'd do delete from hash, and call_rcu(drop_refcount) to do the
atomic_dec_and_test after you *know* noone else can reach it.

It's easy to confuse infrastructure protection (ie. protecting the
hash table chains and existence of the elements), with structure
protection (protecting the contents of each structure).  Currently we
use one big lock for both of them, but they're different.

Thanks,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02  5:31               ` Rusty Russell
@ 2003-05-02  7:06                 ` Patrick Schaaf
  2003-05-02  8:57                   ` Rusty Russell
  0 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-05-02  7:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel, Patrick Schaaf

On Fri, May 02, 2003 at 03:31:50PM +1000, Rusty Russell wrote:
> In message <1051780007.8215.200.camel@tux.rsn.bth.se> you write:
> > On Thu, 2003-05-01 at 02:06, Rusty Russell wrote:
> > > If we can do this without making our hash really suck, we only need one
> > > entry per conntrack (and we then compare both ways).  This means a
> > > smaller table, and hence better cache utilization, as well as simpler
> > > locking.
> > > 
> > > Thoughts?
> > 
> > As Patrick already pointed out, it gets ugly wrt. NAT, the reply tuple
> > can be quite diffrent.
> 
> Yes, you're both right, I'm stupid.  Scratch this "great idea".

Sorry, but you are not stupid. I do care for non-NAT workloads, and the
approach would help greatly, there.

In the normal ip_conntrack, one tuple is left, which is normally used
for both directions of a non-NAT connection. In addition, there's a
pointer, ordinarily 0, which can point to a NAT-reverse-tuple.

When NAT starts to manipulate headers, it modifies ip_conntrack anyway;
in that place, it can also allocate a suitable NAT-reverse-tuple, attach
that to the conntrack, and put it into the hashes.

Hmm, we could have a list of such reverse-tuples per conntrack,
and call them expectations...

In fact, this setup gives a nice extra feature: for NATted connections,
we now logically have four flows in our hashes (physically two, due
to the key mirror property). Two of those four flows are INVALID, and
this can easily be determined when initially looking for the conntrack
of an incoming skb. Not that I know the area well, but I feel that this
could make some NAT setup decisions a lot simpler.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02  7:06                 ` Patrick Schaaf
@ 2003-05-02  8:57                   ` Rusty Russell
  2003-05-02  9:54                     ` SNAT and IP ID Patrick Schaaf
  0 siblings, 1 reply; 42+ messages in thread
From: Rusty Russell @ 2003-05-02  8:57 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Martin Josefsson, Jozsef Kadlecsik, Rolf Fokkens, Harald Welte,
	Netfilter-devel

In message <20030502070651.GG15143@oknodo.bof.de> you write:
> On Fri, May 02, 2003 at 03:31:50PM +1000, Rusty Russell wrote:
> > Yes, you're both right, I'm stupid.  Scratch this "great idea".
> 
> Sorry, but you are not stupid.

Sorry, no, I am stupid.

> I do care for non-NAT workloads, and the
> approach would help greatly, there.
> 
> In the normal ip_conntrack, one tuple is left, which is normally used
> for both directions of a non-NAT connection. In addition, there's a
> pointer, ordinarily 0, which can point to a NAT-reverse-tuple.
> 
> When NAT starts to manipulate headers, it modifies ip_conntrack anyway;
> in that place, it can also allocate a suitable NAT-reverse-tuple, attach
> that to the conntrack, and put it into the hashes.

The initial reason I wanted to get rid of the duplicates is the
locking pain it produces.  If we go for RCU, then it's not a problem,
though.

We'll need to keep around the back-pointer inside the hash entry: if
we really only had one, we could use container_of().  We could cover
this up in some helper macro I guess...

> In fact, this setup gives a nice extra feature: for NATted connections,
> we now logically have four flows in our hashes (physically two, due
> to the key mirror property). Two of those four flows are INVALID, and
> this can easily be determined when initially looking for the conntrack
> of an incoming skb. Not that I know the area well, but I feel that this
> could make some NAT setup decisions a lot simpler.

Not sure this is correct.  But anyway, I think some nice helper
functions #ifdef CONFIG_IP_NF_NAT_NEEDED could make this workable: not
so sure about optimizing for the "we could NAT, but most connections
aren't NAT" case...

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* SNAT and IP ID
  2003-05-02  8:57                   ` Rusty Russell
@ 2003-05-02  9:54                     ` Patrick Schaaf
  2003-05-02 15:43                       ` Harald Welte
  0 siblings, 1 reply; 42+ messages in thread
From: Patrick Schaaf @ 2003-05-02  9:54 UTC (permalink / raw)
  To: netfilter-devel

Hi all,

I have a question that seems too simple, somehow. Anyway, out with it:

Given a basic SNAT application, with two hosts on the left,
an iptables box in the middle, and a server to the right.
By using SNAT, both hosts's source IP addresses are changed
to a single shared source IP address, on the way from the
iptables box, to the server.

Every packet sent by our two hosts, has a 16-bit IP ID field,
for use in fragmentation and reassembly. Together with source and
destination IP addresses, the ID field provides the key under which
the server will gather incoming fragments belonging to the same
IP packet.

Now, consider both hosts sending out an IP packet with identical ID
field. Will the IDs be made unique by the SNAT process? If not, what
happens when, after the SNAT, the packets get fragmented on their
way to the server? The server cannot sanely reassemble.

Has this been discussed here, before?  Googling [1] results in quite a
set of hits that mention the issue, it seems customary to ignore it?

best regards
  Patrick

[1] http://www.google.de/search?q=NAT+fragmentation+problems+IP+ID

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01 11:26               ` Harald Welte
@ 2003-05-02 12:18                 ` Jozsef Kadlecsik
  2003-05-02 12:30                   ` Martin Josefsson
  0 siblings, 1 reply; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-05-02 12:18 UTC (permalink / raw)
  To: Harald Welte
  Cc: Martin Josefsson, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On Thu, 1 May 2003, Harald Welte wrote:

> I would really like to see a simple patch changing the hash function
> before.  This patch can be pushed to davem after a relatively short time
> period of testing, since it shouldn't touch too much of the conntrack
> core.

Still sitting and chewing the problem which hash function to choose,
decided yet undecided...

I'm working on an updated raw patch (+ splitted up to internal logging +
raw patches) and an updated and further tuned tcp window tracking patch (+
splitted up to proc + tcp window tracking patches). It'll take probably
one more week and then I'll submit a patch on a better hash function for
conntrack.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02 12:18                 ` Jozsef Kadlecsik
@ 2003-05-02 12:30                   ` Martin Josefsson
  2003-05-02 21:51                     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 42+ messages in thread
From: Martin Josefsson @ 2003-05-02 12:30 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On Fri, 2003-05-02 at 14:18, Jozsef Kadlecsik wrote:

> I'm working on an updated raw patch (+ splitted up to internal logging +
> raw patches) and an updated and further tuned tcp window tracking patch (+
> splitted up to proc + tcp window tracking patches). It'll take probably
> one more week and then I'll submit a patch on a better hash function for
> conntrack.

While on the subject of the raw patch, please don't forget to add the
fix for the NFC_ clash I added to cvs a little while ago. The
userspace/raw.patch depends on pending/27_include-nfc-order.patch

Can I get a feature-request? :)
-j NOTRACK --only-search

that would be nice, search the hashtables but don't allocate new entries
if not found and if not found mark as UNTRACKED. It will do the same as
dropping --state NEW later but avoids the alloc and init.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: SNAT and IP ID
  2003-05-02  9:54                     ` SNAT and IP ID Patrick Schaaf
@ 2003-05-02 15:43                       ` Harald Welte
  0 siblings, 0 replies; 42+ messages in thread
From: Harald Welte @ 2003-05-02 15:43 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

On Fri, May 02, 2003 at 11:54:21AM +0200, Patrick Schaaf wrote:
> Hi all,
> 
> I have a question that seems too simple, somehow. Anyway, out with it:

one which has already been mentioned, not sure if on this list or on
netdev@oss.sgi.com a couple of months ago.;)

> Now, consider both hosts sending out an IP packet with identical ID
> field. Will the IDs be made unique by the SNAT process? If not, what
> happens when, after the SNAT, the packets get fragmented on their
> way to the server? The server cannot sanely reassemble.

yes, they will clash.   DaveM said something like IP ID's can already
clash on a single host at gbit speeds...  so it's not worth considering
this issue.

I said, that it can easily be fixed by creating a mangle table IPID
target which would reset the IPID on the NAT gateway.

This target (if used on nat boxes) would also prevent the
recently-published way to heuristically guess that NAT is used (because
of the different incrementing sequences IPID's that reveal multiple
hosts).

> best regards
>   Patrick
> 

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02 12:30                   ` Martin Josefsson
@ 2003-05-02 21:51                     ` Jozsef Kadlecsik
  2003-05-02 21:58                       ` Martin Josefsson
  0 siblings, 1 reply; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-05-02 21:51 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On 2 May 2003, Martin Josefsson wrote:

> On Fri, 2003-05-02 at 14:18, Jozsef Kadlecsik wrote:
>
> > I'm working on an updated raw patch (+ splitted up to internal logging +
> > raw patches) and an updated and further tuned tcp window tracking patch (+
> > splitted up to proc + tcp window tracking patches). It'll take probably
> > one more week and then I'll submit a patch on a better hash function for
> > conntrack.
>
> While on the subject of the raw patch, please don't forget to add the
> fix for the NFC_ clash I added to cvs a little while ago. The
> userspace/raw.patch depends on pending/27_include-nfc-order.patch

I noticed it and the new patch takes into account the clash.

> Can I get a feature-request? :)
> -j NOTRACK --only-search
>
> that would be nice, search the hashtables but don't allocate new entries
> if not found and if not found mark as UNTRACKED. It will do the same as
> dropping --state NEW later but avoids the alloc and init.

Maybe it's too late, but I don't understand you:

- the raw table has got no notion on conntrack
- conntrack skips the packet "marked" by the NOTRACK target of the raw
  table. No new hash table entries are generated, one dummy entry is used
  for every untracked packet.
- packets "marked" with NOTRACK has got the state UNTRACKED

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02 21:51                     ` Jozsef Kadlecsik
@ 2003-05-02 21:58                       ` Martin Josefsson
  2003-05-05  9:24                         ` Jozsef Kadlecsik
  2003-05-05 12:38                         ` Jozsef Kadlecsik
  0 siblings, 2 replies; 42+ messages in thread
From: Martin Josefsson @ 2003-05-02 21:58 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On Fri, 2003-05-02 at 23:51, Jozsef Kadlecsik wrote:

> > Can I get a feature-request? :)
> > -j NOTRACK --only-search
> >
> > that would be nice, search the hashtables but don't allocate new entries
> > if not found and if not found mark as UNTRACKED. It will do the same as
> > dropping --state NEW later but avoids the alloc and init.
> 
> Maybe it's too late, but I don't understand you:
> 
> - the raw table has got no notion on conntrack
> - conntrack skips the packet "marked" by the NOTRACK target of the raw
>   table. No new hash table entries are generated, one dummy entry is used
>   for every untracked packet.
> - packets "marked" with NOTRACK has got the state UNTRACKED

--only-search wouldn't skip conntrack completely. It still does the
search but if no entry is found we don't allocate a new, instead we bail
out and set

skb->nfct = &ip_conntrack_untracked.infos[IP_CT_NEW];

so it from now on looks just like a "regular" -j NOTRACK packet.

The purpose of this feature is to still allow outbound connections
during a DDoS but deny all new inbound with as little overhead as
possible.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-01  0:06           ` Rusty Russell
  2003-05-01  5:48             ` Patrick Schaaf
  2003-05-01  9:06             ` Martin Josefsson
@ 2003-05-05  8:43             ` Jozsef Kadlecsik
  2 siblings, 0 replies; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-05-05  8:43 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Rolf Fokkens, Harald Welte, Martin Josefsson, Netfilter-devel,
	Patrick Schaaf

On Thu, 1 May 2003, Rusty Russell wrote:

> > By adding a data lock to the conntrack entries can solve that
> > (potential) problem. The data lock would make unnecessary the per
> > conntrack helper locks too (unfortunately the buffers introduced in the
> > zerocopy patch undo that).
>
> Yeah, per-entry locks make sense I think.  I can change the helpers to
> do an skb_copy (suggested by Alexey) which will be lockless, or just
> implement the damn non-linear search.

I afraid the non-linear search is not general enough. We have to support
protocols which carry structures in the payload and in that case the
non-linear search would be a nightmare.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02 21:58                       ` Martin Josefsson
@ 2003-05-05  9:24                         ` Jozsef Kadlecsik
  2003-05-05 12:38                         ` Jozsef Kadlecsik
  1 sibling, 0 replies; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-05-05  9:24 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On 2 May 2003, Martin Josefsson wrote:

> On Fri, 2003-05-02 at 23:51, Jozsef Kadlecsik wrote:
>
> > > Can I get a feature-request? :)
> > > -j NOTRACK --only-search
>
> --only-search wouldn't skip conntrack completely. It still does the
> search but if no entry is found we don't allocate a new, instead we bail
> out and set
>
> skb->nfct = &ip_conntrack_untracked.infos[IP_CT_NEW];
>
> so it from now on looks just like a "regular" -j NOTRACK packet.
>
> The purpose of this feature is to still allow outbound connections
> during a DDoS but deny all new inbound with as little overhead as
> possible.

I see. I'll try to add it without introducing a new NFC flag.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-02 21:58                       ` Martin Josefsson
  2003-05-05  9:24                         ` Jozsef Kadlecsik
@ 2003-05-05 12:38                         ` Jozsef Kadlecsik
  2003-05-05 13:07                           ` Martin Josefsson
  1 sibling, 1 reply; 42+ messages in thread
From: Jozsef Kadlecsik @ 2003-05-05 12:38 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On 2 May 2003, Martin Josefsson wrote:

> On Fri, 2003-05-02 at 23:51, Jozsef Kadlecsik wrote:
>
> > > Can I get a feature-request? :)
> > > -j NOTRACK --only-search
>
> --only-search wouldn't skip conntrack completely. It still does the
> search but if no entry is found we don't allocate a new, instead we bail
> out and set
>
> skb->nfct = &ip_conntrack_untracked.infos[IP_CT_NEW];
>
> so it from now on looks just like a "regular" -j NOTRACK packet.
>
> The purpose of this feature is to still allow outbound connections
> during a DDoS but deny all new inbound with as little overhead as
> possible.

Oh, there is no need for a flag: that was a bug in the NOTRACK target.
It should have not marked packets belonging to existing connections.

Thank you for spotting this.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PATCH: extra conntrack stats
  2003-05-05 12:38                         ` Jozsef Kadlecsik
@ 2003-05-05 13:07                           ` Martin Josefsson
  0 siblings, 0 replies; 42+ messages in thread
From: Martin Josefsson @ 2003-05-05 13:07 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Rolf Fokkens, Rusty Russell, Patrick Schaaf,
	Netfilter-devel

On Mon, 2003-05-05 at 14:38, Jozsef Kadlecsik wrote:

> > --only-search wouldn't skip conntrack completely. It still does the
> > search but if no entry is found we don't allocate a new, instead we bail
> > out and set
> >
> > skb->nfct = &ip_conntrack_untracked.infos[IP_CT_NEW];
> >
> > so it from now on looks just like a "regular" -j NOTRACK packet.
> >
> > The purpose of this feature is to still allow outbound connections
> > during a DDoS but deny all new inbound with as little overhead as
> > possible.
> 
> Oh, there is no need for a flag: that was a bug in the NOTRACK target.
> It should have not marked packets belonging to existing connections.
> 
> Thank you for spotting this.

Hehe ok, I thought that was how you wanted it to work.

If the DDoS is _really_ massive you will use -j DROP and not -j NOTRACK
anyway so this change in behaviour would be really nice.
(I see >140kpps DDoS's sometimes)

I don't think there's that many people using NOTRACK so the change
probably won't break any existing configurations.

-- 
/Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2003-05-05 13:07 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-04-19 21:42 PATCH: extra conntrack stats Rolf Fokkens
2003-04-25  8:23 ` Patrick Schaaf
2003-04-27 12:48 ` Harald Welte
2003-04-27 15:23   ` Rolf Fokkens
2003-04-27 20:42     ` Harald Welte
2003-04-28  6:13       ` Patrick Schaaf
2003-04-29 22:15     ` Jozsef Kadlecsik
2003-04-29 22:38       ` Martin Josefsson
2003-04-30 10:49         ` Jozsef Kadlecsik
2003-04-30 11:19           ` Martin Josefsson
2003-04-30 23:05             ` Martin Josefsson
2003-05-01  4:05               ` Rusty Russell
2003-05-01  6:05                 ` Patrick Schaaf
2003-05-01  6:46                   ` Rusty Russell
2003-05-01  7:04                     ` Patrick Schaaf
2003-05-01  7:38                       ` Rusty Russell
2003-05-01  9:58                 ` Martin Josefsson
2003-05-01 11:32                 ` Harald Welte
2003-05-01 11:26               ` Harald Welte
2003-05-02 12:18                 ` Jozsef Kadlecsik
2003-05-02 12:30                   ` Martin Josefsson
2003-05-02 21:51                     ` Jozsef Kadlecsik
2003-05-02 21:58                       ` Martin Josefsson
2003-05-05  9:24                         ` Jozsef Kadlecsik
2003-05-05 12:38                         ` Jozsef Kadlecsik
2003-05-05 13:07                           ` Martin Josefsson
2003-05-01  0:06           ` Rusty Russell
2003-05-01  5:48             ` Patrick Schaaf
2003-05-01 10:01               ` Martin Josefsson
2003-05-01  9:06             ` Martin Josefsson
2003-05-02  5:31               ` Rusty Russell
2003-05-02  7:06                 ` Patrick Schaaf
2003-05-02  8:57                   ` Rusty Russell
2003-05-02  9:54                     ` SNAT and IP ID Patrick Schaaf
2003-05-02 15:43                       ` Harald Welte
2003-05-05  8:43             ` PATCH: extra conntrack stats Jozsef Kadlecsik
2003-04-28  9:13 ` vecna
2003-04-28 13:47   ` Patrick Schaaf
2003-04-28 15:07     ` possible target SBALANCE ? vecna
2003-04-29 14:48       ` Harald Welte
2003-04-30 11:59         ` vecna
2003-04-30 13:02           ` Roberto Nibali

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.