Re: ctnetlink questions

* Re: ctnetlink questions
       [not found] <20031019171851.GR21521@sunbeam.de.gnumonks.org>
@ 2003-10-19 19:36 ` Patrick McHardy
  2003-10-19 20:28   ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-19 19:36 UTC (permalink / raw)
  To: Harald Welte; +Cc: Netfilter Development Mailinglist

Harald Welte wrote:

>Hi Patrick!
>
>A couple of questions regarding your ctnetlink modifications:
>
>1) Why do we need this 'ordered list' ?  I can't remember the exact 
>   reason why it was added
>
The ordered list and the unique conntrack id were added for table dumping.
Without them, entries could be dumped multiple times or, even worse, a single
hash chain could be dumped over and over again if its contents exceeded the
size of a single skb.

>2) Why did you merge connmark and ctnetlink?  Was it just for
>   convenience? If yes, I'd appreciate having them separated again.
>
It's just a left-over .. I started hacking on ctnetlink to change connection
marks from userspace .. I'm going to remove it from the next patch.

Best regards,
Patrick

>
>Thanks.
>

* Re: ctnetlink questions
  2003-10-19 19:36 ` ctnetlink questions Patrick McHardy
@ 2003-10-19 20:28   ` Harald Welte
  2003-10-19 22:55     ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Harald Welte @ 2003-10-19 20:28 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Development Mailinglist

On Sun, Oct 19, 2003 at 09:36:45PM +0200, Patrick McHardy wrote:

> >1) Why do we need this 'ordered list' ?  I can't remember the exact 
> >  reason why it was added
> >
> The ordered list and the unique conntrack id were added for table
> dumping. Without them, entries could be dumped multiple times or, even
> worse, a single hash chain could be dumped over and over again if its
> contents exceeded the size of a single skb.

ah, yes.  I remember.  but that actually is a shortcoming of the netlink
api, isn't it?  The problem is that we cannot save an exact position in
the hashtable where we stopped dumping.   So in my original ctnetlink we
just dumped a whole bucket and saved the bucket number in cb->args[].  But
if we were saving bucket number + number of the connection in the bucket, we
could continue where we left off.

Of course, entries could be added before that number (or even in buckets
that we had already traversed) - but we don't guarantee an atomic
snapshot anyway.

Also, there's another problem:
Let's say we left at bucket 5, entry 12 - and while we are waiting for
the next netlink callback, entry 10 gets removed.  Then we would
continue at 12, which is in reality the old 13.  So we're missing one
conntrack.  
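
To make the failure concrete, here is a minimal sketch in plain userspace C
(not kernel code). The bucket is modelled as an array of ints, and
dump_from_offset() is a hypothetical helper that resumes purely by position,
the way saving bucket number + entry number in cb->args[] would.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy entries from bucket[start..] into out, at most
 * `room` of them, and return the position to resume from on the next call.
 * *emitted receives the number of entries copied.
 */
size_t dump_from_offset(const int *bucket, size_t len, size_t start,
                        int *out, size_t room, size_t *emitted)
{
    size_t pos = start;
    size_t n = 0;

    assert(bucket != NULL && out != NULL && emitted != NULL);
    while (pos < len && n < room)
        out[n++] = bucket[pos++];
    *emitted = n;
    return pos;
}
```

With a bucket holding 10, 11, 12, 13, 14 and room for three entries, the
first call returns position 3. If 11 is removed before the next call,
position 3 now holds 14, so 13 is never sent.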

The question is what to do.  I really don't like having yet another list
of conntracks (the ordered list) together with the unique id.  It is
questionable how big the impact on performance is (contention on unique
ID, bigger struct ip_conntrack, additional list_add's), but even if it
was 'cheap', I don't like the architecture.

Other approaches I can think of:

a) making a snapshot of the whole conntrack table.
Large memory usage - probably easy to get OOM :(  Also, the read lock on
ip_conntrack_lock would have to be held for a long time.

b) unique ID per hash bucket.  This means less contention, but we could
only save the bucket id in cb->args, start iterating from the beginning and
only send those entries whose ID is newer than the last one we already sent.

c) snapshot of the current bucket
As with the new hash function every bucket is supposed to be short, we
could also make a snapshot of the current bucket, and send our messages
from this snapshot copy.

what do you think?

> >2) Why did you merge connmark and ctnetlink?  Was it just for
> >  convenience? If yes, I'd appreciate having them separated again.
> >
> It's just a left-over .. I started hacking on ctnetlink to change
> connection marks from userspace .. I'm going to remove it from the
> next patch.

As you may have noticed, I already did that with 0.13 in current cvs.

> Patrick

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

* Re: ctnetlink questions
  2003-10-19 20:28   ` Harald Welte
@ 2003-10-19 22:55     ` Patrick McHardy
  2003-10-20  1:05       ` Henrik Nordstrom
  2003-10-20  6:58       ` Harald Welte
  0 siblings, 2 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-19 22:55 UTC (permalink / raw)
  To: Harald Welte; +Cc: Netfilter Development Mailinglist

Harald Welte wrote:

>>The ordered list and the unique conntrack id were added for table
>>dumping. Without them, entries could be dumped multiple times or, even
>>worse, a single hash chain could be dumped over and over again if its
>>contents exceeded the size of a single skb.
>
>ah, yes.  I remember.  but that actually is a shortcoming of the netlink
>api, isn't it?  The problem is that we cannot save an exact position in
>the hashtable where we stopped dumping.   So in my original ctnetlink we
>just dumped a whole bucket and saved the bucket number in cb->args[].  But
>if we were saving bucket number + number of the connection in the bucket, we
>could continue where we left off.
>
>Of course, entries could be added before that number (or even in buckets
>that we had already traversed) - but we don't guarantee an atomic
>snapshot anyway.
>
In my opinion, for any serious use we need to provide a mechanism for
userspace to be sure it is in sync with the kernel. We could just add a new
message type which contains the total number of entries. That way userspace
could check the number and, if it is unequal to the number of connections
known in userspace, dump the table again. Of course this is still racy, but
it would be better than nothing.

There was a thread on linux-net recently (subject: xfrm_user reliability)
which is related to this. Alexey mentioned that reliable transmissions from
kernel to userspace are impossible, so userspace needs a recovery mechanism
for dropped event messages (dump table and resync). If dump table is also
unreliable and doesn't even signal failure, userspace is screwed. The nice
thing with the unique ids is that it's better than an atomic snapshot: when
you're done reading you have the _current_ state, not the state when you
began reading.

>Also, there's another problem:
>Let's say we left at bucket 5, entry 12 - and while we are waiting for
>the next netlink callback, entry 10 gets removed.  Then we would
>continue at 12, which is in reality the old 13.  So we're missing one
>conntrack.  
>

With the unique id solution? No, the ids don't represent the list position.
What happens is that every conntrack with an id less than or equal to the
last one dumped is skipped. Since they are ordered by increasing id, we will
still continue at entry 13, only that it now has position 12 on the list.
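
A minimal userspace sketch of this skip-by-id resume, assuming a singly
linked list kept in increasing id order. struct ct_entry and dump_resume()
are hypothetical names for illustration, not the ctnetlink code; ids are
assumed to start at 1 so that 0 can mean "nothing dumped yet".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a conntrack on the id-ordered list. */
struct ct_entry {
    uint64_t id;              /* unique, increasing along the list */
    struct ct_entry *next;
};

/*
 * Emit up to `room` ids into out, skipping every entry whose id is less
 * than or equal to *last_id. *last_id is advanced to the last id emitted.
 * Returns the number of ids emitted.
 */
size_t dump_resume(const struct ct_entry *head, uint64_t *last_id,
                   uint64_t *out, size_t room)
{
    const struct ct_entry *e;
    size_t n = 0;

    assert(last_id != NULL && out != NULL);
    for (e = head; e != NULL && n < room; e = e->next) {
        if (e->id <= *last_id)
            continue;         /* already sent in an earlier pass */
        out[n++] = e->id;
        *last_id = e->id;
    }
    return n;
}
```

Because the resume point is an id rather than a list position, unlinking an
already-dumped entry between two calls does not shift what gets sent next.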

>The question is what to do.  I really don't like having yet another list
>of conntracks (the ordered list) together with the unique id.  It is
>questionable how big the impact on performance is (contention on unique
>ID, bigger struct ip_conntrack, additional list_add's), but even if it
>was 'cheap', I don't like the architecture.
>

I didn't worry too much about performance yet; in my opinion it was required
for being useful. As for the architecture, if it was only for table dumping
I'd agree with you, but there is another important use for the id. When we
want to manipulate/delete conntrack entries from userspace there is no way to
make sure that we will do things to the right connection, since the tuples
that are used for lookup could have been reused. This is especially true if a
tuple is used for lookup that has been changed by NAT.

>Other approaches I can think of:
>
>a) making a snapshot of the whole conntrack table.
>Large memory usage - probably easy to get OOM :(  Also, the read lock on
>ip_conntrack_lock would have to be held for a long time.
>
>b) unique ID per hash bucket.  This means less contention, but we could
>only save the bucket id in cb->args, start iterating from the beginning and
>only send those entries whose ID is newer than the last one we already sent.
>
>c) snapshot of the current bucket
>As with the new hash function every bucket is supposed to be short, we
>could also make a snapshot of the current bucket, and send our messages
>from this snapshot copy.
>
>what do you think?
>

I think we first need to agree on how important the problems I mentioned
above are. None of these solutions provides a reliable mechanism. Some
comments though:

a) problem is that there can be multiple parallel dumps, so we potentially
need many copies. I think memory usage is not acceptable.

b) I'm not sure if I understand correctly. This is basically what was done
before my changes, except that we would always continue at the next bucket id
and not just advance if the whole bucket has been successfully dumped?

c) same problem as a), except memory usage is not as bad. IMO it is basically
a workaround for limited socket buffers. If we don't need reliability I'd say
it's the user's job to make sure socket buffer limits are set to a reasonable
size.

So in conclusion, if we agree we need reliability, we probably need the
unique ids. If we agree we don't, I'd say we use solution b.

>As you may have noticed, I already did that with 0.13 in current cvs.
>
Yes, Krisztian pointed me to the code.

Two last things I noted while writing the mail:
- Table dumping is currently not restricted to root; this should probably be
  done for privacy reasons.
- Have you got objections against s/CTA_RPLY/CTA_REPLY/ ? IMO it makes typing
  and thinking more comfortable if you can actually pronounce what you are
  thinking about ;)

Best regards,
Patrick

* Re: ctnetlink questions
  2003-10-19 22:55     ` Patrick McHardy
@ 2003-10-20  1:05       ` Henrik Nordstrom
  2003-10-20  3:01         ` Patrick McHardy
                           ` (3 more replies)
  2003-10-20  6:58       ` Harald Welte
  1 sibling, 4 replies; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-20  1:05 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Patrick McHardy wrote:

> In my opinion for any serious use we need to provide a mechanism for
> userspace to be sure it is in sync with the kernel. We could just add a
> new message type which contains the total number of entries. That way
> userspace could check the number, if unequal to number of connections
> known in userspace dump table, repeat. Of course this is still racy but
> it would be better than nothing.

Agreed, partially.

My opinions:

It is important that userspace does not miss entries which were in the 
kernel when dumping started and still exist in the kernel when the dump 
finished.

It is also important userspace can have some kind of semi-static 
reference to a conntrack to be able to manipulate that conntrack without 
risking hitting another conntrack.

It is OK for me if it is unspecified what happens with entries which 
either were created or destroyed while the dump was in progress.

With these criteria in mind I propose a hybrid of your approaches:

a) Assign a globally unique ID to each conntrack, in such a manner that IDs 
are not reused for a significant amount of time. This provides a stable 
point of reference to a connection with low risk of false collisions if 
the original connection was destroyed while userspace still thought it was 
there. 

b) When dumping the conntrack entries, dump one bucket at a time. 
If the bucket is too large to fit in the current response packet 
then sort the bucket entries on ID and keep track of which bucket+ID 
was last dumped. On the next netlink packet restart at the same bucket and 
skip the entries with an ID lower than those already dumped for that 
bucket.

This requires a read lock per hash bucket while dumping that bucket, and
some small (usually) amount of memory to keep the temporary sorted index
of bucket entries unless the bucket is permanently resorted in which case
it may be possible to solve with no memory allocation (but then requires 
the bucket to be write locked while resorting which is probably worse).
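
A rough userspace sketch of step b) with a temporary sorted index. It
assumes the bucket can be viewed as an array of ids; dump_bucket() and
cmp_id() are hypothetical names, locking is left out, and ids are assumed to
start at 1 so that 0 can mean "nothing dumped yet".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for uint64_t ids, ascending. */
static int cmp_id(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Dump one bucket. `ids` holds the bucket's ids in chain order (unsorted).
 * Entries with an id <= *last_id were sent earlier and are skipped.
 * Returns the number of ids written to out, or (size_t)-1 if the temporary
 * index cannot be allocated.
 */
size_t dump_bucket(const uint64_t *ids, size_t n, uint64_t *last_id,
                   uint64_t *out, size_t room)
{
    uint64_t *idx;
    size_t i, emitted = 0;

    assert(last_id != NULL && out != NULL);
    if (n == 0)
        return 0;
    assert(ids != NULL);

    idx = malloc(n * sizeof(*idx));   /* temporary sorted index */
    if (idx == NULL)
        return (size_t)-1;
    memcpy(idx, ids, n * sizeof(*idx));
    qsort(idx, n, sizeof(*idx), cmp_id);

    for (i = 0; i < n && emitted < room; i++) {
        if (idx[i] <= *last_id)
            continue;
        out[emitted++] = idx[i];
        *last_id = idx[i];
    }
    free(idx);                        /* nothing is kept between calls */
    return emitted;
}
```

The index is rebuilt and freed on every call, so the only state carried
between netlink packets is the bucket number and the last id sent.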

Regarding the conntrack ID. For me it is acceptable if as much as 64 bits
is reserved for the conntrack ID. This gives sufficient namespace to

a) Provide truly unique IDs suitable for long-term reference without any 
risk of collisions.

b) Allow the namespace to be built in such a manner that there never
will be any risk of congestion in finding the next available ID. For 
example by using CPU#+counter.

Regards
Henrik

* Re: ctnetlink questions
  2003-10-20  1:05       ` Henrik Nordstrom
@ 2003-10-20  3:01         ` Patrick McHardy
  2003-10-20  3:09           ` Patrick McHardy
                             ` (2 more replies)
  2003-10-20  7:04         ` Harald Welte
                           ` (2 subsequent siblings)
  3 siblings, 3 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20  3:01 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>Agreed, partially.
>
>My opinions:
>
>It is important that userspace does not miss entries which were in the 
>kernel when dumping started and still exist in the kernel when the dump 
>finished.
>
>It is also important userspace can have some kind of semi-static 
>reference to a conntrack to be able to manipulate that conntrack without 
>risking hitting another conntrack.
>
>It is OK for me if it is unspecified what happens with entries which 
>either were created or destroyed while the dump was in progress.
>

I totally agree.

>With these criteria in mind I propose a hybrid of your approaches:
>
>a) Assign a globally unique ID to each conntrack, in such a manner that IDs 
>are not reused for a significant amount of time. This provides a stable 
>point of reference to a connection with low risk of false collisions if 
>the original connection was destroyed while userspace still thought it was 
>there. 
>
>b) When dumping the conntrack entries, dump one bucket at a time. 
>If the bucket is too large to fit in the current response packet 
>then sort the bucket entries on ID and keep track of which bucket+ID 
>was last dumped. On the next netlink packet restart at the same bucket and 
>skip the entries with an ID lower than those already dumped for that 
>bucket.
>
>This requires a read lock per hash bucket while dumping that bucket, and
>some small (usually) amount of memory to keep the temporary sorted index
>of bucket entries unless the bucket is permanently resorted in which case
>it may be possible to solve with no memory allocation (but then requires 
>the bucket to be write locked while resorting which is probably worse).
>

Sounds like a nice solution. I favour the permanent resorting for these
reasons:
- all temporary memory allocations should be released before ctnetlink_dump
  is left, not in ctnetlink_done, since we don't know if and when the read
  will continue. this means sorting multiple times is required.

- we can use some sorting algorithm which benefits from pre-sorted input.
  this would give better average performance. IIRC new conntracks are added
  at the head of the chains, so if we sort and walk backwards through the
  chains we only have to resort after an id counter wrap. Sorting is also
  pretty easy in that case: move all entries at the head of the list whose id
  is smaller than the last one's to the tail while preserving order, stop at
  the first one that's bigger. This also means we only need the write lock in
  a very very rare case.

>Regarding the conntrack ID. For me it is acceptable if as much as 64 bits
>is reserved for the conntrack ID. This gives sufficient namespace to
>
>a) Provide truly unique IDs suitable for long-term reference without any 
>risk of collisions.
>

I agree, we should use 64 bit.

>b) Allows for the namespace to be built in such manner that there never
>will be any risk for congestion in finding the next available ID. For 
>example by using CPU#+counter.
>

Also a good idea. Thanks Henrik for your valuable input. Harald, what do you
think of this approach ?

Best regards,
Patrick (hoping mozilla will have mercy with his formatting this time)

* Re: ctnetlink questions
  2003-10-20  3:01         ` Patrick McHardy
@ 2003-10-20  3:09           ` Patrick McHardy
  2003-10-20  6:34           ` Henrik Nordstrom
  2003-10-20  7:15           ` Harald Welte
  2 siblings, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20  3:09 UTC (permalink / raw)
  Cc: Henrik Nordstrom, Harald Welte, Netfilter Development Mailinglist

Patrick McHardy wrote:

> - we can use some sorting algorithm which benefits from pre-sorted input.
>   this would give better average performance. IIRC new conntracks are added
>   at the head of the chains, so if we sort and walk backwards through the
>   chains we only have to resort after an id counter wrap. Sorting is also
>   pretty easy in that case: move all entries at the head of the list whose
>   id is smaller than the last one's to the tail while preserving order, stop
>   at the first one that's bigger. This also means we only need the write
>   lock in a very very rare case.


One small addition: this is not completely correct. We need to resort more
often, but never before the first counter wrap.

Best regards,
patrick

* Re: ctnetlink questions
  2003-10-20  3:01         ` Patrick McHardy
  2003-10-20  3:09           ` Patrick McHardy
@ 2003-10-20  6:34           ` Henrik Nordstrom
  2003-10-20 17:53             ` Patrick McHardy
  2003-10-20  7:15           ` Harald Welte
  2 siblings, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-20  6:34 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Patrick McHardy wrote:

> Sounds like a nice solution. I favour the permanent resorting for these 
> reasons:

> - all temporary memory allocations should be released before
> ctnetlink_dump is left, not in ctnetlink_done since we don't know if and
> when the read will continue. this means sorting multiple times is
> required.

Sorting multiple times is indeed needed if not doing a permanent resort,
as you do not want to keep the bucket locked for a long period. Even if 
you knew the read would continue you can not save the temporary sorted 
list.

> - we can use some sorting algorithm which benefits from pre-sorted
> input. this would give better average performance. IIRC new conntracks
> are added at the head of the chains, so if we sort and walk backwards
> through the chains we only have to resort after an id counter wrap.

Could work. In such case the bucket should at most times be sorted
naturally with no need to resort. There are a few theoretical races
where entries may be inserted in another order (more so on SMP), but these
are hopefully relatively rare.

Regards
Henrik

* Re: ctnetlink questions
  2003-10-20  6:34           ` Henrik Nordstrom
@ 2003-10-20 17:53             ` Patrick McHardy
  0 siblings, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 17:53 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>>- we can use some sorting algorithm which benefits from pre-sorted
>>input. this would give better average performance. IIRC new conntracks
>>are added at the head of the chains, so if we sort and walk backwards
>>through the chains we only have to resort after an id counter wrap.
>
>Could work. In such case the bucket should at most times be sorted
>naturally with no need to resort. There are a few theoretical races
>where entries may be inserted in another order (more so on SMP), but these
>are hopefully relatively rare.
>

Entries are always inserted with the list locked, so I can't see these
cases. I also calculated that at 10^7 connections/s a 64 bit counter won't
wrap for 58494 years. So we don't need resorting at all.
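
The arithmetic can be checked with a few lines of C. This is only a worked
calculation, assuming 365-day years and integer division; wrap_seconds() and
wrap_years() are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECONDS_PER_YEAR (UINT64_C(365) * 24 * 60 * 60)

/* Seconds until a counter with maximum value id_max wraps at a given rate. */
uint64_t wrap_seconds(uint64_t id_max, uint64_t ids_per_second)
{
    assert(ids_per_second > 0);
    return id_max / ids_per_second;
}

/* The same interval expressed in whole 365-day years. */
uint64_t wrap_years(uint64_t id_max, uint64_t ids_per_second)
{
    return wrap_seconds(id_max, ids_per_second) / SECONDS_PER_YEAR;
}
```

wrap_years(UINT64_MAX, 10000000) gives 58494, matching the figure above. For
comparison, wrap_seconds(UINT32_MAX, 10000000) gives 429, so a 32 bit counter
at the same rate would wrap in about seven minutes.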

Best regards,
Patrick

* Re: ctnetlink questions
  2003-10-20  3:01         ` Patrick McHardy
  2003-10-20  3:09           ` Patrick McHardy
  2003-10-20  6:34           ` Henrik Nordstrom
@ 2003-10-20  7:15           ` Harald Welte
  2003-10-20  9:37             ` Henrik Nordstrom
  2003-10-20 18:17             ` Patrick McHardy
  2 siblings, 2 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20  7:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 05:01:17AM +0200, Patrick McHardy wrote:

> >This requires a read lock per hash bucket while dumping that bucket, and
> >some small (usually) amount of memory to keep the temporary sorted index
> >of bucket entries unless the bucket is permanently resorted in which case
> >it may be possible to solve with no memory allocation (but then requires 
> >the bucket to be write locked while resorting which is probably worse).
> 
> Sounds like a nice solution. I favour the permanent resorting for these 
> reasons:
> - all temporary memory allocations should be released before
> ctnetlink_dump is left, not in ctnetlink_done since we don't know if
> and when the read will continue. this means sorting multiple times is
> required.

again, this seems a shortcoming of the netlink infrastructure.  This
would be much better, if we'd actually have a reference to the socket of
the userspace process (since table dumps could be unicast anyway...).

As for the permanent sorting:  I fear that is going to be an intrusive
change.  Either ctnetlink locks the bucket on its own [and does the
sorting locally], or we have some ugly special-case functions in the
ip_conntrack core.  Also:  There is no primitive for upgrading a reader
lock into a writer lock.  Since we don't know if the skb is large enough
beforehand, we would need to take a write lock in any case (since we
cannot tell if we need to reorder or not).

> - we can use some sorting algorithm which benefits from pre-sorted 
> input. this would give better average performance. IIRC new conntracks
> are added at the head of the chains, so if we sort and walk backwards
> through the chains we only have to resort after an id counter wrap.

this works with a 64bit counter, but doesn't with my proposed generation
counter.  the generation counter has two advantages:
- no global counter, just the counter field in every conntrack
- less size increase of struct ip_conntrack. 

Oh well, yes.  We have to think about the slab cache returning objects
to the 'real' memory allocator.  Then our generation counter would
become useless.  Don't know if it is reliable enough to initialize the
counter with a random 32 bits in that case.  How high is the probability
of re-using the recently-used counter in that case?

> I agree, we should use 64 bit.

I feel like I have to cry.  As soon as we add any kind of counter (be
it 32 or 64 bits), we would definitely have to make it a compile time
option.  My vision of ctnetlink was something like a match extension:
You can always compile it as a module, and it wouldn't hurt performance
as long as you don't load it.

> Also a good idea. Thanks Henrik for your valuable input. Harald, what
> do you think of this approach ?

yes, if we really want to be reliable I don't see a better way.

> Best regards,
> Patrick (hoping mozilla will have mercy with his formatting this time)

not really :(

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/

* Re: ctnetlink questions
  2003-10-20  7:15           ` Harald Welte
@ 2003-10-20  9:37             ` Henrik Nordstrom
  2003-10-20 18:43               ` Patrick McHardy
  2003-10-20 18:17             ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-20  9:37 UTC (permalink / raw)
  To: Harald Welte; +Cc: Patrick McHardy, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Harald Welte wrote:

> again, this seems a shortcoming of the netlink infrastructure.  This
> would be much better, if we'd actually have a reference to the socket of
> the userspace process (since table dumps could be unicast anyway...).

Whether it is a shortcoming of netlink or a shortcoming of how to reliably
dump the contents of large hash buckets can be argued, but I tend to agree
with you here.

The dump operation should be connection oriented with the userspace 
application, not purely datagram based. The kernel should know for certain 
if the userspace application terminates a dump operation mid-air.

Regards
Henrik

* Re: ctnetlink questions
  2003-10-20  9:37             ` Henrik Nordstrom
@ 2003-10-20 18:43               ` Patrick McHardy
  2003-10-20 18:37                 ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 18:43 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>On Mon, 20 Oct 2003, Harald Welte wrote:
>
>>again, this seems a shortcoming of the netlink infrastructure.  This
>>would be much better, if we'd actually have a reference to the socket of
>>the userspace process (since table dumps could be unicast anyway...).
>
>Whether it is a shortcoming of netlink or a shortcoming of how to reliably
>dump the contents of large hash buckets can be argued, but I tend to agree
>with you here.
>
>The dump operation should be connection oriented with the userspace 
>application, not purely datagram based. The kernel should know for certain 
>if the userspace application terminates a dump operation mid-air.
>

Actually the kernel knows. If the socket is closed cb->done() is called.
However the kernel can not know if userspace still keeps the socket open
but doesn't read anymore. Connection oriented sockets don't help with this.

Best regards,
Patrick

* Re: ctnetlink questions
  2003-10-20 18:43               ` Patrick McHardy
@ 2003-10-20 18:37                 ` Harald Welte
  2003-10-20 19:17                   ` Patrick McHardy
  2003-10-20 19:41                   ` Balazs Scheidler
  0 siblings, 2 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 18:37 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:

> >The dump operation should be connection oriented with the userspace 
> >application, not purely datagram based. The kernel should know for certain 
> >if the userspace application terminates a dump operation mid-air.
> >
> 
> Actually the kernel knows. If the socket is closed cb->done() is called.
> However the kernel can not know if userspace still keeps the socket open
> but doesn't read anymore. Connection oriented sockets don't help with this.

yes, but if the application is broken, that's not our problem.  If the
API and the behaviour are documented, I don't see any problems with this.
If an app wants to intentionally allocate many objects in kernel space,
there's nothing we can do.  Also, since we are sending to userspace, we
could actually allocate this as virtual memory, right?

> Best regards,
> Patrick

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/

* Re: ctnetlink questions
  2003-10-20 18:37                 ` Harald Welte
@ 2003-10-20 19:17                   ` Patrick McHardy
  2003-10-20 19:41                   ` Balazs Scheidler
  1 sibling, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 19:17 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:
>
>>>The dump operation should be connection oriented with the userspace 
>>>application, not purely datagram based. The kernel should know for certain 
>>>if the userspace application terminates a dump operation mid-air.
>>
>>Actually the kernel knows. If the socket is closed cb->done() is called.
>>However the kernel can not know if userspace still keeps the socket open
>>but doesn't read anymore. Connection oriented sockets don't help with this.
>
>yes, but if the application is broken, that's not our problem.  If the
>API and the behaviour are documented, I don't see any problems with this.
>If an app wants to intentionally allocate many objects in kernel space,
>there's nothing we can do.  Also, since we are sending to userspace, we
>could actually allocate this as virtual memory, right?
>

Yes, I just mentioned it to clarify the problem with allocating memory and
freeing it in ctnetlink_done() and to point out that connection oriented
sockets don't help. I'm not sure what you mean by "allocate as virtual
memory"; do you mean accounting the memory to the process which started
the dump? I'm not sure what the conventions for accounting kernel memory
are, but at least that would provide a way to bound the amount of memory
that can be used in case we decide to use a solution which requires dynamic
allocations.

Best regards,
Patrick

* Re: ctnetlink questions
  2003-10-20 18:37                 ` Harald Welte
  2003-10-20 19:17                   ` Patrick McHardy
@ 2003-10-20 19:41                   ` Balazs Scheidler
  2003-10-20 20:20                     ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Balazs Scheidler @ 2003-10-20 19:41 UTC (permalink / raw)
  To: Harald Welte, Patrick McHardy, Henrik Nordstrom,
	Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:37:42PM +0200, Harald Welte wrote:
> On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:
> 
> > >The dump operation should be connetion oriented with the userspace 
> > >application, not purely datagram based. The kernel should know for certain 
> > >if the userspace application terminates a dump operation mid-air.
> > >
> > 
> > Actually the kernel knows. If the socket is closed cb->done() is called.
> > However the kernel can not know if userspace still keeps the socket open
> > but doesn't read anymore. Connection oriented sockets don't help with this.
> 
> yes, but if the application is broken, that's not our problem.  If the
> API and the behaviour is documented, I don't see any problems with this.
> If an app wants to intentionally allocate many objects in kernel space,
> there's nothing we can do.  Also, since we are sending to userspace, we
> could actually allocate this as virtual memory, right?

Sorry to jump into the conversation, but I just remembered something I've read
about relayfs:

quoting from the announcement:

"relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large amounts of data from kernel space
to user space.  Full details can be found in Documentation/filesystems/
relayfs.txt.  The current version can always be found at
http://www.opersys.com/relayfs."

This _might_ be of interest when complete tables are to be dumped, though
the ctnetlink based interface is better when the userspace app provides more
specific queries.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1


* Re: ctnetlink questions
  2003-10-20 19:41                   ` Balazs Scheidler
@ 2003-10-20 20:20                     ` Patrick McHardy
  2003-10-20 22:59                       ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 20:20 UTC (permalink / raw)
  To: Balazs Scheidler
  Cc: Harald Welte, Henrik Nordstrom, Netfilter Development Mailinglist

Hi Balazs,

Balazs Scheidler wrote:

>Sorry to jump into the conversation, but I just remembered something I've read
>about relayfs:
>

Suggestions are always welcome.

>quoting from the announcement:
>
>"relayfs is a filesystem designed to provide an efficient mechanism for
>tools and facilities to relay large amounts of data from kernel space
>to user space.  Full details can be found in Documentation/filesystems/
>relayfs.txt.  The current version can always be found at
>http://www.opersys.com/relayfs."
>
>This _might_ be of interest when complete tables are to be dumped, though
>the ctnetlink based interface is better when the userspace app provides more
>specific queries.
>

I believe we should stick to a single consistent interface for user comfort.

Best regards,
Patrick


* Re: ctnetlink questions
  2003-10-20 20:20                     ` Patrick McHardy
@ 2003-10-20 22:59                       ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 22:59 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Balazs Scheidler, Henrik Nordstrom,
	Netfilter Development Mailinglist


On Mon, Oct 20, 2003 at 10:20:06PM +0200, Patrick McHardy wrote:
> >This _might_ be of interest when complete tables are to be dumped, though
> >the ctnetlink based interface is better when the userspace app provides 
> >more
> >specific queries.
> >
> 
> I believe we should stick to a single consistent interface for user comfort.

also, if we stick with netlink, we can easily adapt to netlink2... that
gives us the ability to even remotely dump/modify conntrack tables.

> Best regards,
> Patrick

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: ctnetlink questions
  2003-10-20  7:15           ` Harald Welte
  2003-10-20  9:37             ` Henrik Nordstrom
@ 2003-10-20 18:17             ` Patrick McHardy
  2003-10-20 18:39               ` Harald Welte
  2003-10-20 18:52               ` Harald Welte
  1 sibling, 2 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 18:17 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>On Mon, Oct 20, 2003 at 05:01:17AM +0200, Patrick McHardy wrote:
>
>>- we can use some sorting algorithm which benefits from pre-sorted 
>>input. this would give better average performance. IIRC new conntracks
>>are added at the head of the chains, so if we sort and walk backwards
>>through the chains we only have to resort after an id counter wrap.
>>    
>>
>
>this works with a 64bit counter, but doesn't with my proposed generation
>counter.  the generation counter has two advantages:
>- no global counter, just the counter field in every conntrack
>- less size increase of struct ip_conntrack. 
>
>Oh well, yes.  We have to think about the slab cache returning objects
>to the 'real' memory allocator.  Then our generation counter would
>become useless.  Don't know if it is reliable enough to initialize the
>counter with a random 32bits in that case.  How high is the  probability
>of re-using the recently-used counter in that case?
>

Actually, with 64 bits the wrap-around time is so large that we would never
have to re-sort.
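
To illustrate the ordering argument quoted above, here is a minimal Python
model. It assumes (as the quoted text recalls, "IIRC") that new conntracks
are inserted at the head of a chain and that ids come from a monotonically
increasing counter. The names build_chain and walk_backwards are made up for
this sketch and are not kernel functions.

```python
from collections import deque
from itertools import count


def build_chain(ids, n):
    """Insert n new entries at the head of a chain, as the thread assumes."""
    chain = deque()
    for _ in range(n):
        chain.appendleft(next(ids))
    return chain


def walk_backwards(chain):
    """Walk the chain from tail to head."""
    return list(reversed(chain))


ids = count(1)                  # stand-in for the global id counter
chain = build_chain(ids, 5)
print(list(chain))              # head first: the newest entry leads
print(walk_backwards(chain))    # tail first: ascending ids, no sort needed
```

As long as the counter never wraps, the backwards walk is already in id
order, which is why a 64 bit counter would make re-sorting unnecessary in
this model.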

>>I agree, we should use 64 bit.
>>    
>>
>
>I feel like I have to cry.  As soon as we add any kind of counter (be
>it 32 or 64 bits), we would definitely have to make it a compile time
>option.  My vision of ctnetlink was something like a match extension:
>You can always compile it as a module, and it wouldn't hurt performance
>as long as you don't load it.
>  
>

I understand your objections. It is really not my intent to enlarge
struct ip_conntrack without the need to do so. However, I'd rather save
some memory by, e.g., making helper memory dynamic than by saving these
couple of bytes and making some functionality of ctnetlink somewhat
useless. Compiling as a module without a performance penalty won't work
anyway as soon as you enable event notifications.

So do I understand correctly that we're at the point where we agree we need
to add something to struct ip_conntrack, and either use a linearly
increasing global counter or a generation count which increases as soon as
we return something to the slab?

Advantages/Disadvantages of global counter:
- don't need any sorting with 64bit, natural sorting is fine
- uniquely identifies conntracks without risk of collisions
- possible contention on global counter

Advantages/Disadvantages of generation count:
- No global counter
- probably expensive sorting needed over time
- no unique identity, can not be used from userspace

I still favour the global counter, but I'm fine as long as dumping works;
a unique identity for userspace is less important. I'd say it's up to you
to decide.

Best regards,
Patrick


* Re: ctnetlink questions
  2003-10-20 18:17             ` Patrick McHardy
@ 2003-10-20 18:39               ` Harald Welte
  2003-10-20 19:21                 ` Patrick McHardy
  2003-10-21 16:47                 ` Patrick McHardy
  2003-10-20 18:52               ` Harald Welte
  1 sibling, 2 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 18:39 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist


On Mon, Oct 20, 2003 at 08:17:37PM +0200, Patrick McHardy wrote:

> I still favour the global counter but I'm fine as long as dumping works, 
> a unique identity for userspace is less important. I'd say it's up to
> you to decide.

Mh, well let's go for the 64bit, as there seems to be no other choice.
But we should start working on the variable-sized conntracks within
short time afterwards.

> Best regards,
> Patrick


* Re: ctnetlink questions
  2003-10-20 18:39               ` Harald Welte
@ 2003-10-20 19:21                 ` Patrick McHardy
  2003-10-21 16:47                 ` Patrick McHardy
  1 sibling, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 19:21 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>On Mon, Oct 20, 2003 at 08:17:37PM +0200, Patrick McHardy wrote:
>
>  
>
>>I still favour the global counter but I'm fine as long as dumping works, 
>>a unique identity for userspace is less important. I'd say it's up to
>>you to decide.
>>    
>>
>
>Mh, well let's go for the 64bit, as there seems to be no other choice.
>But we should start working on the variable-sized conntracks within
>short time afterwards.
>

I actually already started on it some time ago, but it had to take a back
seat to more interesting things ;) Maybe we can place the global counter
next to some other value that is modified anyway at conntrack creation, so
they will be in the same cache line. ip_conntrack_count comes to mind, but
unfortunately it's atomic_t (volatile).

Best regards,
Patrick


* Re: ctnetlink questions
  2003-10-20 18:39               ` Harald Welte
  2003-10-20 19:21                 ` Patrick McHardy
@ 2003-10-21 16:47                 ` Patrick McHardy
  2003-10-21 19:54                   ` Henrik Nordstrom
  1 sibling, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-21 16:47 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>Mh, well let's go for the 64bit, as there seems to be no other choice.
>But we should start working on the variable-sized conntracks within
>short time afterwards.
>

Ok, there's another problem: for fast lookups by id (we don't want to
search the entire hash) we need to encode the hash chain of a tuple
in the id. We basically have two choices now for the remaining bits:

a) keep using a global counter which reduces namespace to
   2^(64-lg(hashsize))

b) use a per-bucket counter which keeps 64 bit namespace and eliminates
   potential contention on the counter but requires as much memory as the
   hash buckets themselves.

The problem is guessing how big the hash might get. With a hash of 2^20
buckets and 1 million connections/s the ids wrap after 0.5 years with
possibility a. Even if that connection rate is unrealistically high, I
assume a hash size of 2^20 and bigger is realistic now or might be soon,
so the chance of seeing reused ids is real. Possibility b is of course
not acceptable due to memory usage.

My proposed solution is to reserve 16 bits for the chain id and to
compensate for the remaining bits of the bucket number by keeping
2^max(number_of_bits-16, 0) counters, where number_of_bits is
lg(hashsize). This always gives us 48 bits for the id (if hash
distribution is good); with the numbers above that is ~9 years without
a wraparound while keeping 16 counters. For a hash size <= 2^16 we
still only have one counter, but if we really experience contention we
can now easily increase it.
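
The numbers in this proposal can be checked with a few lines of Python.
This is only back-of-the-envelope arithmetic under the assumptions stated
above (2^20 buckets, a steady 1 million new connections per second);
wrap_years and num_counters are names invented for this sketch.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # calendar year, leap days ignored


def wrap_years(id_bits, new_conns_per_sec):
    """Years until an id space of the given width is used up at a steady rate."""
    return 2 ** id_bits / new_conns_per_sec / SECONDS_PER_YEAR


def num_counters(hash_bits, chain_id_bits=16):
    """Counters kept when only chain_id_bits of the bucket number go in the id."""
    return 2 ** max(hash_bits - chain_id_bits, 0)


HASH_BITS = 20      # 2^20 buckets
RATE = 1_000_000    # new connections per second

# Possibility (a): the bucket number takes 20 of the 64 bits.
print(f"global counter, 44 bits left: {wrap_years(64 - HASH_BITS, RATE):.2f} years")
# Proposal: 16 bits of chain id, an effective 48 bits for the counter part.
print(f"proposal, 48 bits:            {wrap_years(48, RATE):.2f} years")
print(f"counters needed:              {num_counters(HASH_BITS)}")
```

The 48 bit figure relies on the load being spread evenly over the counters,
which is the "if hash distribution is good" condition in the text.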

Does that sound ok ? Feel free to shut me up by giving some more
realistic numbers ;)

Best regards,
Patrick


* Re: ctnetlink questions
  2003-10-21 16:47                 ` Patrick McHardy
@ 2003-10-21 19:54                   ` Henrik Nordstrom
  2003-10-21 20:00                     ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-21 19:54 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Tue, 21 Oct 2003, Patrick McHardy wrote:

> Ok there's another problem, for fast lookups by id (we don't want to
> search the entire hash) we need to encode the hash chain of a tuple 
> in the id. We basically have two choices now for the remaining bits:

You can always make the userspace ID larger by including the bucket
number. There is no need to extend the ID stored in the conntrack for
this as it is already known.
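
A small Python model of this suggestion, for illustration only. It assumes
the conntrack stores a 64 bit id and that the hash table can be modelled as
a list of chains; to_user_id, from_user_id and lookup are hypothetical names,
not part of ctnetlink.

```python
STORED_ID_BITS = 64  # assumed width of the id kept in each conntrack


def to_user_id(bucket, stored_id):
    """Prefix the stored id with the bucket number; nothing extra is stored."""
    return (bucket << STORED_ID_BITS) | stored_id


def from_user_id(user_id):
    """Split a userspace id back into (bucket, stored id)."""
    return user_id >> STORED_ID_BITS, user_id & ((1 << STORED_ID_BITS) - 1)


def lookup(table, user_id):
    """Search only the chain named by the id instead of the whole hash."""
    bucket, stored_id = from_user_id(user_id)
    return stored_id in table[bucket]


table = [[], [], [], [42, 7]]   # four buckets; bucket 3 holds ids 42 and 7
uid = to_user_id(3, 42)
print(from_user_id(uid), lookup(table, uid))
```

The kernel side already knows which bucket an entry lives in when it dumps
it, so the wider id only exists in the message sent to userspace.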

Regards
Henrik


* Re: ctnetlink questions
  2003-10-21 19:54                   ` Henrik Nordstrom
@ 2003-10-21 20:00                     ` Patrick McHardy
  0 siblings, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-21 20:00 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>On Tue, 21 Oct 2003, Patrick McHardy wrote:
>
>  
>
>>Ok there's another problem, for fast lookups by id (we don't want to
>>search the entire hash) we need to encode the hash chain of a tuple 
>>in the id. We basically have two choices now for the remaining bits:
>>    
>>
>
>You can always make the userspace ID larger by including the bucket
>number. There is no need to extend the ID stored in the conntrack for
>this as it is already known.
>
>Regards
>Henrik
>  
>

Ok I have to admit I didn't think of this obvious solution.

Thanks again,
Patrick


* Re: ctnetlink questions
  2003-10-20 18:17             ` Patrick McHardy
  2003-10-20 18:39               ` Harald Welte
@ 2003-10-20 18:52               ` Harald Welte
  2003-10-20 19:52                 ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Harald Welte @ 2003-10-20 18:52 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist


Actually, another point is what to do with expectations.  

It's not as problematic as with conntracks, but in general it's the
same.  Let's say:

- conntrack helper creates expectation for tuple xyz
- userspace gets a list of unconfirmed expects
- expectation for tuple xyz is confirmed
- conntrack helper creates a new expectation for tuple xyz
- userspace wants to remove expectation by referring to tuple xyz.

At least in this case, that might actually be what the user wants -
since there is not much difference between the two expectations, other
than time passing in between.

If the helper is automatically re-adding the expectation upon confirmation,
then the user can race with incoming connections in order to 'break the
circle' ;) 

I think there is not too much point in removing expects anyway.  The
real need is for adding and modifying expectations, in case there is a
userspace conntrack/nat helper.

What do you think?

btw: In the failover code, I have another problem with regard to
expects:  A sibling conntrack has a pointer to the master expect, not
the master conntrack.  But I somehow need to replicate that pointer.
Without Krisztian's idmap (that I've already ripped out), I'm now passing
the master conntrack's tuple and the expectation's tuple.  This works
while doing normal sync:  There can always be only one unconfirmed
expectation for every tuple.

However, when doing an initial sync (or a full resync), I cannot
replicate the whole tree of master and siblings - because I first
replicate the conntracks and then later have to fill in the [confirmed]
expectations as glue in between.  

The only idea I have is to use the 'seq' number.  However, this again
only works for TCP.

Any other options?

(in the future, I think we will at least optionally have per-connection
byte and packet counters.  Then the 'seq' field could be initialized to
the byte counter at the time the expectation was raised.  This would
solve the udp case [and other udp races which Patrick can tell us
nightmares of]. However, this is once again enlarging conntrack - but
only if somebody wants to use 'connbytes' match or create netflow-style
connection logs.  But we could require that compile option in case
somebody wants to enable failover)


* Re: ctnetlink questions
  2003-10-20 18:52               ` Harald Welte
@ 2003-10-20 19:52                 ` Patrick McHardy
  2003-10-20 23:09                   ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 19:52 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>Actually, another point is what to do with expectations.  
>
>It's not as problematic as with conntracks, but in general it's the
>same.  Let's say:
>
>- conntrack helper creates expectation for tuple xyz
>- userspace gets a list of unconfirmed expects
>- expectation for tuple xyz is confirmed
>- conntrack helper creates a new expectation for tuple xyz
>- userspace wants to remove expectation by referring to tuple xyz.
>
>At least in this case, that might actually be what the user wants -
>since there is not much difference between the two expectations, other
>than time passing in between.
>
>If the helper is automatically re-adding the expectation upon confirmation,
>then the user can race with incoming connections in order to 'break the
>circle' ;) 
>
>I think there is not too much point in removing expects anyway.  The
>real need is for adding and modifying expectations, in case there is a
>userspace conntrack/nat helper.
>
>What do you think?
>

I agree. Removing is not very important. Modifying also requires a way to
identify the expectation. Currently (as you might have noticed) the
expectations also include an id. The namespace could probably be smaller
than for conntracks but I need to think about this some more after catching
breakfast.

BTW: I thought a bit about userspace helpers; they need to be synchronous
like they are now so we don't get races where an expectation arrives before
it's registered. So they should probably receive their packets through
ip_queue. How realistic do you think it is to move ftp/irc/amanda... to
userspace? All of them operate on low-traffic protocols, but if they send
packets to userspace through netlink sockets, operation can easily be
interrupted by sending lots of traffic that will fill up the socket buffer.

>btw: In the failover code, I have another problem with regard to
>expects:  A sibling conntrack has a pointer to the master expect, not
>the master conntrack.  But I somehow need to replicate that pointer.
>Without Krisztian's idmap (that I've already ripped out), I'm now passing
>the master conntrack's tuple and the expectation's tuple.  This works
>while doing normal sync:  There can always be only one unconfirmed
>expectation for every tuple.
>
>However, when doing an initial sync (or a full resync), I cannot
>replicate the whole tree of master and siblings - because I first
>replicate the conntracks and then later have to fill in the [confirmed]
>expectations as glue in between.  
>
>The only idea I have is to use the 'seq' number.  However, this again
>only works for TCP.
>
>Any other options?
>

I have not studied the code intensively, but can't you just sync the tables
in order of the hierarchy:

conntrack
conntrack, master-expect, sibling-conntrack, sibling-conntrack, ...
conntrack
...
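
A rough Python sketch of that ordering, only to make the idea concrete. The
Conntrack and Expect classes and the sync_order generator are hypothetical
stand-ins; they do not reflect the layout of the real structures or of the
failover code.

```python
from dataclasses import dataclass, field


@dataclass
class Expect:
    name: str
    siblings: list = field(default_factory=list)  # conntracks it gave rise to


@dataclass
class Conntrack:
    name: str
    expects: list = field(default_factory=list)   # expectations it is master of


def sync_order(conntracks, kind="conntrack"):
    """Yield records so that every master precedes its expects and siblings."""
    for ct in conntracks:
        yield kind, ct.name
        for exp in ct.expects:
            yield "master-expect", exp.name
            yield from sync_order(exp.siblings, "sibling-conntrack")


ftp = Conntrack("ftp-control", [Expect("ftp-data", [Conntrack("data-1"),
                                                    Conntrack("data-2")])])
for record in sync_order([Conntrack("dns"), ftp, Conntrack("ssh")]):
    print(record)
```

With this order the receiver always has the master and the expectation in
hand before a sibling arrives, so no separate identifier is needed to glue
them together.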

Best regards,
Patrick

>(in the future, I think we will at least optionally have per-connection
>byte and packet counters.  Then the 'seq' field could be initialized to
>the byte counter at the time the expectation was raised.  This would
>solve the udp case [and other udp races which Patrick can tell us
>nightmares of]. However, this is once again enlarging conntrack - but
>only if somebody wants to use 'connbytes' match or create netflow-style
>connection logs.  But we could require that compile option in case
>somebody wants to enable failover)
>


* Re: ctnetlink questions
  2003-10-20 19:52                 ` Patrick McHardy
@ 2003-10-20 23:09                   ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 23:09 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist


On Mon, Oct 20, 2003 at 09:52:05PM +0200, Patrick McHardy wrote:

> I agree. Removing is not very important. Modifying also requires a way to
> identify the expectation. Currently (as you might have noticed) the
> expectations also include an id. The namespace could probably be smaller
> than for conntracks but I need to think about this some more after catching
> breakfast.

yes, I've noted that they also have id's.  However, as you will have
noticed by now, I feel very reluctant to add id's to our structures ;)

> BTW: I thought a bit about userspace helpers; they need to be synchronous
> like they are now so we don't get races where an expectation arrives before
> it's registered. So they should probably receive their packets through
> ip_queue. 

for local helpers this is true.  But think about even more complex
setups, like the envisioned SIP proxy (that might even run on a totally
different machine than your packet filter).  They are inherently racy -
and there's nothing we can do about that.

> How realistic do you think it is to move ftp/irc/amanda... to
> userspace? All of them operate on low-traffic protocols, but if they send
> packets to userspace through netlink sockets, operation can easily be
> interrupted by sending lots of traffic that will fill up the socket buffer.

yes, that is a problem.  another problem is that only one
userspace process can be attached to the queue.  And no, we don't want to
use ipqmpd - that is just a hack and doesn't scale at all.

btw: I already have a patch of a l3 independent queue implementation.
The only problem is that it has to change the packet format - and thus
will introduce incompatibility :(

> I have not studied the code intensively, but can't you just sync the tables
> in order of the hierarchy:
> 
> conntrack
> conntrack, master-expect, sibling-conntrack, sibling-conntrack, ...
> conntrack
> ...

well, this means I cannot use the existing iterator functions, and
locking might become complex once we have per-bucket locks.

Also, I don't like the idea of hiding too much information in the
packet/message order.  Yes, ordering is guaranteed by the protocol - but
even then...

I'll have to think about that... but now I'm off for some sleep. Maybe
tomorrow ;)

> Best regards,
> Patrick


* Re: ctnetlink questions
  2003-10-20  1:05       ` Henrik Nordstrom
  2003-10-20  3:01         ` Patrick McHardy
@ 2003-10-20  7:04         ` Harald Welte
  2003-10-20  7:17         ` Jozsef Kadlecsik
  2003-10-20 11:11         ` Jozsef Kadlecsik
  3 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20  7:04 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Patrick McHardy, Netfilter Development Mailinglist


On Mon, Oct 20, 2003 at 03:05:41AM +0200, Henrik Nordstrom wrote:

> It is important that userspace does not miss entries which were in the 
> kernel when dumping started and still exist in the kernel when the dump 
> finished.

finally agreed.

> It is also important userspace can have some kind of semi-static 
> reference to a conntrack to be able to manipulate that conntrack without 
> risking hitting another conntrack.

also agreed.

> It is OK for me if it is unspecified what happens with entries which 
> either was created or destroyed while the dump was in progress.

ack.

> With these criteria in mind I propose a hybrid of your approaches
> 
> a) Assign a globally unique ID to each conntrack, in such manner that IDs 
> are not reused for a significant amount of time. This to provide a stable 
> point of reference to a connection with low risk of false collisions if 
> the original connection was destroyed while userspace still thought it was 
> there. 

In reality, we could use a pointer together with a generation-counter.
That generation counter could be incremented as soon as we return the
structure to the slab cache.  This way we could live with a 32bit
generation counter + pointer/address. 

> b) When dumping the conntrack entries, dump one bucket at a time. 
> If the bucket is too large to fit in the current response packet 
> then sort the bucket entries on ID and keep track of which bucket+ID 
> was last dumped. On next netlink packet restart at the same bucket and 
> skip the entries with an ID lower than those already dumped for that 
> bucket.

I'm going to comment on this in the next mail.
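
For reference, the bucket-at-a-time dump described in (b) can be modelled in
a few lines of Python. This is a toy under stated assumptions: the hash
table is a list of chains of non-negative integer ids, one "netlink packet"
holds at most `budget` entries, and dump_chunk / dump_all are names invented
here rather than anything from the ctnetlink patch.

```python
def dump_chunk(table, cursor, budget):
    """Return up to `budget` ids and the cursor to resume from (None when done).

    `cursor` is (bucket, last id already dumped from that bucket).
    """
    bucket, last_id = cursor if cursor is not None else (0, -1)
    out = []
    while bucket < len(table):
        # Sort the chain on id and skip what earlier chunks already covered.
        pending = sorted(i for i in table[bucket] if i > last_id)
        for conn_id in pending:
            if len(out) == budget:
                return out, (bucket, last_id)
            out.append(conn_id)
            last_id = conn_id
        bucket, last_id = bucket + 1, -1
    return out, None


def dump_all(table, budget):
    """Drive dump_chunk until the whole table has been walked."""
    seen, cursor = [], None
    while True:
        chunk, cursor = dump_chunk(table, cursor, budget)
        seen.extend(chunk)
        if cursor is None:
            return seen


print(dump_all([[5, 3, 9], [], [7, 1]], budget=2))
```

Because the cursor is a bucket plus an id rather than a position in the
chain, entries added or removed between two chunks cannot cause an entry to
be dumped twice, and an entry that exists for the whole dump is not skipped.
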
> Regarding the conntrack ID. For me it is acceptable if as much as 64 bits
> is reserved for the conntrack ID. This gives sufficient namespace to

for me, not a single bit is acceptable.  the size of ip_conntrack is
already way too heavy.  the l3 generic conntrack should have support for
different-sized conntracks, e.g. saving the ct/nat helper part for all
connections but the ones that actually have a helper, etc.

> a) Provide truly unique IDs suitable for long-term reference without any 
> risk of collisions.

a generation counter would make that guarantee.

> b) Allows for the namespace to be built in such manner that there never
> will be any risk for congestion in finding the next available ID. For 
> example by using CPU#+counter.

generation counters also fulfill that requirement. However, they are not
ordered.
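
To make the pointer plus generation counter idea concrete, here is a toy
Python model. SlotCache is a made-up stand-in for a slab cache whose objects
keep their generation field across free/alloc cycles; a slot index plays the
role of the pointer. It ignores the case where the cache hands memory back
to the underlying allocator.

```python
class SlotCache:
    """Toy object cache: each slot keeps a generation count across reuse."""

    def __init__(self, nslots):
        self.generation = [0] * nslots
        self.in_use = [False] * nslots

    def alloc(self):
        slot = self.in_use.index(False)     # raises ValueError when full
        self.in_use[slot] = True
        return slot, self.generation[slot]  # handle = (address, generation)

    def free(self, slot):
        self.in_use[slot] = False
        self.generation[slot] += 1          # bump when returned to the cache

    def is_valid(self, handle):
        slot, gen = handle
        return self.in_use[slot] and self.generation[slot] == gen


cache = SlotCache(2)
old = cache.alloc()
cache.free(old[0])
new = cache.alloc()                         # same slot, next generation
print(old, new, cache.is_valid(old), cache.is_valid(new))
```

A stale handle is detected because its generation no longer matches, and no
global counter is touched. The handles carry no creation order, though,
which is the "not ordered" drawback noted above.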

> Regards
> Henrik


* Re: ctnetlink questions
  2003-10-20  1:05       ` Henrik Nordstrom
  2003-10-20  3:01         ` Patrick McHardy
  2003-10-20  7:04         ` Harald Welte
@ 2003-10-20  7:17         ` Jozsef Kadlecsik
  2003-10-20  9:29           ` Henrik Nordstrom
  2003-10-20 14:48           ` Harald Welte
  2003-10-20 11:11         ` Jozsef Kadlecsik
  3 siblings, 2 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2003-10-20  7:17 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Netfilter Development Mailinglist

Hi,

On Mon, 20 Oct 2003, Henrik Nordstrom wrote:

> a) Assign a globally unique ID to each conntrack, in such manner that IDs
> are not reused for a significant amount of time. This to provide a stable
> point of reference to a connection with low risk of false collisions if
> the original connection was destroyed while userspace still thought it was
> there.

I still don't see why we can't simply use the tuple as unique id, as
Harald suggested. That's truly unique and does not require additional
fields in the ip_conntrack structure.

> This requires a read lock per hash bucket while dumping that bucket, and
> some small (usually) amount of memory to keep the temporary sorted index
> of bucket entries unless the bucket is permanently resorted in which case
> it may be possible to solve with no memory allocation (but then requires
> the bucket to be write locked while resorting which is probably worse).

At the developer workshop I presented my per bucket locking patch, with
some performance comparison graphs. It's time to sync the patch and
release it...

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: ctnetlink questions
  2003-10-20  7:17         ` Jozsef Kadlecsik
@ 2003-10-20  9:29           ` Henrik Nordstrom
  2004-02-06 18:52             ` Harald Welte
  2003-10-20 14:48           ` Harald Welte
  1 sibling, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-20  9:29 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Jozsef Kadlecsik wrote:

> I still don't see why we can't simply use the tuple as unique id, as
> Harald suggested. That's truly unique and does not require additional
> fields in the ip_conntrack structure.

Because it is not long-term unique. With the tuple approach the
administrator risks hitting another connection if the originally intended
connection has already been destroyed and replaced by a new connection
with the same address details.
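
The risk can be shown with a small Python model. Everything here is
hypothetical: Table is a plain dictionary keyed by tuple, ids come from a
simple counter, and the method names are invented for the example.

```python
from itertools import count


class Table:
    """Toy conntrack table: tuple -> id of the connection using that tuple."""

    def __init__(self):
        self._ids = count(1)
        self.by_tuple = {}

    def create(self, tup):
        self.by_tuple[tup] = next(self._ids)
        return self.by_tuple[tup]

    def delete_by_tuple(self, tup):
        return self.by_tuple.pop(tup, None) is not None

    def delete_by_id(self, conn_id):
        for tup, cid in self.by_tuple.items():
            if cid == conn_id:
                del self.by_tuple[tup]
                return True
        return False                      # stale id: nothing is touched


t = Table()
tup = ("tcp", "10.0.0.1", 1024, "10.0.0.2", 80)
old_id = t.create(tup)      # the administrator notes this connection
t.delete_by_tuple(tup)      # it goes away on its own ...
new_id = t.create(tup)      # ... and a new one reuses the same tuple

print(t.delete_by_id(old_id))    # the stale id refers to nothing
print(t.delete_by_tuple(tup))    # the stale tuple hits the new connection
```

Whether hitting the new connection is a problem or exactly what the
administrator wants is the point under discussion; the model only shows
that the two kinds of reference behave differently.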

But yes, if we use the full conntrack tuple (both directions) then the
uniqueness is probably good enough for all practical purposes except when
there are evil clients in the mix, but on the other hand it becomes a "little"
cumbersome to work with if you want the administrator to ever enter which
connection he refers to manually, even more so if you consider that the
details of a tuple varies greatly per protocol.

A 64-bit integer can be copy-pasted, and is relatively easy to manage in
textual form. A full conntrack tuple (at minimum "protocol, source IP,
dest IP, reply source IP, reply destination IP, source port, destination
port, reply source port, reply destination port", but preferably a binary
"struct ip_conntrack_tuple tuple[2]") is obviously not as easy to manage.

> At the developer workshop I presented my per bucket locking patch, with
> some performance comparison graphs. It's time to sync the patch and
> release it...

Would be great ;-)

Regards
Henrik


* Re: ctnetlink questions
  2003-10-20  9:29           ` Henrik Nordstrom
@ 2004-02-06 18:52             ` Harald Welte
  2004-02-09 10:33               ` Pablo Neira
  2004-02-10 12:39               ` Patrick McHardy
  0 siblings, 2 replies; 40+ messages in thread
From: Harald Welte @ 2004-02-06 18:52 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Jozsef Kadlecsik, Netfilter Development Mailinglist


Hi!

I have to follow up on this old discussion, since I want to get
ctnetlink into a submission-ready state.

On Mon, Oct 20, 2003 at 11:29:46AM +0200, Henrik Nordstrom wrote:
> On Mon, 20 Oct 2003, Jozsef Kadlecsik wrote:
> 
> > I still don't see why we can't simply use the tuple as unique id, as
> > Harald suggested. That's truly unique and does not require additional
> > fields in the ip_conntrack structure.
> 
> Because it is not long-term unique. With the tuple approach the
> administrator risks hitting another connection if the originally intended
> connection has already been destroyed and replaced by a new connection
> with the same address details.

well, but if the tuple is again the same tuple, chances are high the
administrator actually wants to remove that new connection as much as
the previous one.  In fact, apart from a short difference in time, they
_are_ pretty much the same connection.

So from my point of view, the tuple is still sufficient.  Tuple can be
used by userspace to identify a connection, tuple is used for
replication messages in ct_sync.

We can also guarantee, that all entries that
	- existed before the dump started
	- and still exist when the dump ended
are actually dumped.

We don't make any guarantees about connections that either started
within that timeframe, or have been terminated within that timeframe.

I would really like to see the ordered list and id disappear.

> Regards
> Henrik

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

* Re: ctnetlink questions
  2004-02-06 18:52             ` Harald Welte
@ 2004-02-09 10:33               ` Pablo Neira
  2004-02-10 12:39               ` Patrick McHardy
  1 sibling, 0 replies; 40+ messages in thread
From: Pablo Neira @ 2004-02-09 10:33 UTC (permalink / raw)
  To: Harald Welte, netfilter-devel

Hi!

I've been working on an API to add and update conntrack entries since fall
2003; it's my final project at the university. It's still experimental and
far from Harald's and Jozsef's work, given their experience in this area.
Anyway, I promise to post that patch.

Harald Welte wrote:

>>Because it is not long-term unique. With the tuple approach the
>>administrator risks hitting another connection if the originally intended
>>connection has already been destroyed and replaced by a new connection
>>with the same address details.
>>    
>>

Well, I can't see any problem. Anyway, we could perform some checks to
avoid something like this:

- when a create-entry message arrives, we could check whether there's a
connection with the same address details by using
ip_conntrack_find_get(...) and, if one is found, update the ip_conntrack
structure with the new info.

But by means of the id we could have two conntrack structures with the
same address info. I think that this is duplicated info; actually I think
it's better to consider the last info received about a connection to be up
to date and forget the state of the old one.

>well, but if the tuple is again the same tuple, chances are high the
>administrator actually wants to remove that new connection as much as
>the previous one.  In fact, apart from a short difference in time, they
>_are_ pretty much the same connection.
>  
>
>So from my point of view, the tuple is still sufficient.  Tuple can be
>used by userspace to identify a connection, tuple is used for
>replication messages in ct_sync.
>
>We can also guarantee, that all entries that
>	- existed before the dump started
>	- and still exist when the dump ended
>are actually dumped.
>
>We don't make any guarantees about connections that either started
>within that timeframe, or have been terminated within that timeframe.
>
>I would really like to see the ordered list and id disappear.
>  
>
I agree with Harald. Actually, I think that adding a new id and the
ordered list will complicate the current structure; I would prefer
redesigning the current structure of the conntrack table over adding
those fields.

best regards,
Pablo

* Re: ctnetlink questions
  2004-02-06 18:52             ` Harald Welte
  2004-02-09 10:33               ` Pablo Neira
@ 2004-02-10 12:39               ` Patrick McHardy
  2004-02-14 20:03                 ` Harald Welte
  1 sibling, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2004-02-10 12:39 UTC (permalink / raw)
  To: Harald Welte
  Cc: Henrik Nordstrom, Jozsef Kadlecsik,
	Netfilter Development Mailinglist

Harald Welte wrote:
> On Mon, Oct 20, 2003 at 11:29:46AM +0200, Henrik Nordstrom wrote:
>>Because it is not long-term unique. With the tuple approach the
>>administrator risks hitting another connection if the originally intended
>>connection has already been destroyed and replaced by a new connection
>>with the same address details.
> 
> 
> well, but if the tuple is again the same tuple, chances are high the
> administrator actually wants to remove that new connection as much as
> the previous one.  In fact, apart from a short difference in time, they
> _are_ pretty much the same connection.
> 
> So from my point of view, the tuple is still sufficient.  Tuple can be
> used by userspace to identify a connection, tuple is used for
> replication messages in ct_sync.
> 
> We can also guarantee, that all entries that
> 	- existed before the dump started
> 	- and still exist when the dump ended
> are actually dumped.
> 
> We don't make any guarantees about connections that either started
> within that timeframe, or have been terminated within that timeframe.
> 
> I would really like to see the ordered list and id disappear.

I can make my peace with not having a unique identity for each conntrack
over time, but the other use for IDs was to continue an interrupted
dump at the right place. How can we solve this? The problematic case
is when a single hash chain doesn't fit into an skb. We need to remember
the last entry dumped somehow, and be able to continue at the next entry
not yet dumped, even when the last one dumped is gone by the time the dump
continues.
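
To make the ID-based resume concrete, here is a minimal userspace sketch
in C. It assumes each entry carries an id taken from a monotonically
increasing counter (starting at 1) and that the list is kept in ascending
id order. The names struct ct_entry and dump_resume, and the cursor
argument, are hypothetical and only for illustration; this is not code
from the ctnetlink patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a conntrack entry. */
struct ct_entry {
	uint64_t id;            /* from a monotonically increasing counter */
	struct ct_entry *next;  /* list is kept in ascending id order */
};

/*
 * Copy up to `budget` ids into `out`, starting at the first entry whose
 * id is greater than *cursor. *cursor is advanced to the last id dumped.
 * Returns the number of entries written in this pass.
 */
size_t dump_resume(const struct ct_entry *head, uint64_t *cursor,
		   uint64_t *out, size_t budget)
{
	size_t n = 0;

	assert(cursor != NULL && out != NULL);
	for (const struct ct_entry *e = head; e && n < budget; e = e->next) {
		if (e->id <= *cursor)
			continue;       /* already sent in an earlier pass */
		out[n++] = e->id;
		*cursor = e->id;
	}
	return n;
}
```

Because the cursor stores an id rather than a pointer or a list position,
the walk picks up at the next larger id even if the entry that was dumped
last has been freed in the meantime.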

Best regards,
Patrick

* Re: ctnetlink questions
  2004-02-10 12:39               ` Patrick McHardy
@ 2004-02-14 20:03                 ` Harald Welte
  2004-02-15 10:01                   ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Harald Welte @ 2004-02-14 20:03 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Henrik Nordstrom, Jozsef Kadlecsik,
	Netfilter Development Mailinglist

On Tue, Feb 10, 2004 at 01:39:01PM +0100, Patrick McHardy wrote:
> I can make my peace with not having a unique identity for each conntrack
> over time, 

Thanks :)

> but the other use for IDs was to continue an interrupted
> dump at the right place, how can we solve this ? The problematic case
> is when a single hash-chain doesn't fit into an skb. We need to remember
> the last one dumped somehow, and be able to continue at the next one
> not dumped even when the last one dumped is gone when the dump
> continues.

We'd have to ensure that a single hash chain is not longer than what we
could put into one skb.  This can be done by limiting the maximum number
of entries in a bucket (and then rehash).  Also, we should increase the
default number of hash buckets to reduce the probability that this might
happen.
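
A rough sketch of the check this implies, in C: given the per-bucket chain
lengths, decide whether any chain has outgrown what a single skb can
carry. The function names, the flat array of chain lengths, and the
payload and entry sizes are assumptions made for illustration; the actual
rehash (new secret, re-insertion of all entries) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest number of dumped entries that one message buffer can carry. */
size_t entries_per_skb(size_t skb_payload, size_t entry_size)
{
	assert(entry_size > 0);
	return skb_payload / entry_size;
}

/* True if some bucket holds more entries than fit into a single skb. */
bool needs_rehash(const size_t *chain_len, size_t nbuckets,
		  size_t skb_payload, size_t entry_size)
{
	size_t limit = entries_per_skb(skb_payload, entry_size);

	for (size_t i = 0; i < nbuckets; i++)
		if (chain_len[i] > limit)
			return true;
	return false;
}
```

With a limit like this enforced, a dump could always send a bucket as one
unit and would only need to remember the bucket index between callbacks.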

Also, Jozsef proposed a flip/flop bit mechanism that would solve that
problem.  What do you say to his proposal?

> Patrick

* Re: ctnetlink questions
  2004-02-14 20:03                 ` Harald Welte
@ 2004-02-15 10:01                   ` Patrick McHardy
  2004-02-17 21:37                     ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2004-02-15 10:01 UTC (permalink / raw)
  To: Harald Welte
  Cc: Henrik Nordstrom, Jozsef Kadlecsik,
	Netfilter Development Mailinglist

Harald Welte wrote:
> On Tue, Feb 10, 2004 at 01:39:01PM +0100, Patrick McHardy wrote:
>
>>but the other use for IDs was to continue an interrupted
>>dump at the right place, how can we solve this ? The problematic case
>>is when a single hash-chain doesn't fit into an skb. We need to remember
>>the last one dumped somehow, and be able to continue at the next one
>>not dumped even when the last one dumped is gone when the dump
>>continues.
> 
> 
> We'd have to ensure that a single hash chain is not longer than what we
> could put into one skb.  This can be done by limiting the maximum number
> of entries in a bucket (and then rehash).  Also, we should increase the
> default number of hash buckets to reduce the probability that this might
> happen.

I like the idea. So assuming that long hash chains are a result of "bad
luck" with the jenkins hash, we would just change the secret, rehash,
and repeat if some chains are still too long? At what point do you
propose rehashing: at the moment the chain length exceeds some threshold
(or thereafter, deferred to occur out of packet processing context), or
when dumping over netlink?

> 
> Also, Jozsef proposed a flip/flop bit mechanism that would solve that
> problem.  What do you say to his proposal?
> 

Can I find his proposal somewhere ?

> 
>>Patrick
> 
> 

* Re: ctnetlink questions
  2004-02-15 10:01                   ` Patrick McHardy
@ 2004-02-17 21:37                     ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2004-02-17 21:37 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Henrik Nordstrom, Jozsef Kadlecsik,
	Netfilter Development Mailinglist

On Sun, Feb 15, 2004 at 11:01:10AM +0100, Patrick McHardy wrote:

> I like the idea. So assuming that long hash chains are a result of "bad
> luck" with the jenkins hash, we would just change the secret, rehash,
> and repeat if some chains are still too long ? At what point do you
> propose rehashing, at the moment the chain length exceeds some threshold
> (or thereafter, deferred to occur out of packet processing context), or
> when dumping over netlink ?

Mh. I am not really sure what might be the best solution.  It should
definitely not happen within softirq context, though.
 
> >Also, Jozsef proposed a flip/flop bit mechanism that would solve that
> >problem.  What do you say to his proposal?
> 
> Can I find his proposal somewhere ?

Message-Id: <Pine.LNX.4.33.0310201018510.12485-100000@blackhole.kfki.hu>

http://lists.netfilter.org/pipermail/netfilter-devel/2003-October/012821.html

* Re: ctnetlink questions
  2003-10-20  7:17         ` Jozsef Kadlecsik
  2003-10-20  9:29           ` Henrik Nordstrom
@ 2003-10-20 14:48           ` Harald Welte
  2003-10-20 18:53             ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Harald Welte @ 2003-10-20 14:48 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 09:17:36AM +0200, Jozsef Kadlecsik wrote:
> Hi,
> 
> On Mon, 20 Oct 2003, Henrik Nordstrom wrote:
> 
> > a) Assign a globally unique ID to each conntrack, in such manner that IDs
> > is not reused for a significant amount of time. This to provide a stable
> > point of reference to a connection with low risk of false collisions if
> > the original connection was destroyed while userspace still thought it was
> > there.
> 
> I still don't see why can't we simply use the tuple as unique id, as
> Harald suggested. That's truly unique and does not require additional
> fields in the ip_conntrack structure.

I think you are mixing up two separate issues:

1) uniquely representing ip_conntrack during state replication between
master and slave.  Here the tuple is sufficient, since all state changes
will be processed in-order.  Since the tuple is always unique in the
hashtable, there is no mistake of updating/deleting the wrong one.

2) uniquely identifying an ip_conntrack from userspace.
When userspace first dumps and then deletes by tuple, the tuple might
already have been reused.  This is what most of the discussion was
about, where a 64-bit counter or the generation counter+address had been
suggested as possible solutions.

> On the developer workshop I presented my per bucket locking patch, with
> some performance comparison graphs. It's time to sync the patch and
> release it...

yes... can we first have the final raw and tcp-window-tracking patch?
It's probably already too late to get them in 2.6.x anyway... but let's
try.

> Best regards,
> Jozsef

* Re: ctnetlink questions
  2003-10-20 14:48           ` Harald Welte
@ 2003-10-20 18:53             ` Patrick McHardy
  2003-10-20 22:57               ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 18:53 UTC (permalink / raw)
  To: Harald Welte
  Cc: Jozsef Kadlecsik, Henrik Nordstrom,
	Netfilter Development Mailinglist

Harald Welte wrote:

>1) uniquely representing ip_conntrack during state replication between
>master and slave.  Here the tuple is sufficient, since all state changes
>will be processed in-order.  Since the tuple is always unique in the
>hashtable, there is no mistake of updating/deleting the wrong one.
>
>2) uniquely identifying an ip_conntrack from userspace.
>When userspace first dumps and then deletes by tuple, the tuple might
>already have been reused.  This is what most of the discussion was
>about, where a 64bit counter or the generation counter+address had been
>suggested as possible solutions.
>  
>

Seems now I didn't understand. How can we use the ids generated by a
generation counter in userspace? In any busy system they will be
invalidated too fast ..

Best regards,
Patrick

* Re: ctnetlink questions
  2003-10-20 18:53             ` Patrick McHardy
@ 2003-10-20 22:57               ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 22:57 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jozsef Kadlecsik, Henrik Nordstrom,
	Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:53:08PM +0200, Patrick McHardy wrote:
> Harald Welte wrote:
> 
> >1) uniquely representing ip_conntrack during state replication between
> >master and slave.  Here the tuple is sufficient, since all state changes
> >will be processed in-order.  Since the tuple is always unique in the
> >hashtable, there is no mistake of updating/deleting the wrong one.
> >
> >2) uniquely identifying an ip_conntrack from userspace.
> >When userspace first dumps and then deletes by tuple, the tuple might
> >already have been reused.  This is what most of the discussion was
> >about, where a 64bit counter or the generation counter+address had been
> >suggested as possible solutions.
> > 
> >
> 
> Seems now I didn't understand. How can we use the ids generated by a 
> generation counter in userspace ? In any busy system they will be
> invalidated too fast ..

no.  The idea is that every particular conntrack (that is, allocated
chunk of memory) has its own generation counter.  That counter is
part of struct ip_conntrack and incremented every time we return this
chunk of memory to the slab cache.
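
A small userspace sketch in C of how such a handle could behave. It is
only an illustration under assumptions: struct ct_slot stands in for the
memory chunk holding a struct ip_conntrack, and ct_handle, ct_slot_alloc,
ct_slot_free and ct_handle_valid are hypothetical names. It also assumes
the counter is never reset when the chunk is reused, which in a real slab
cache would mean initialising it only once per object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the chunk of memory backing one conntrack. */
struct ct_slot {
	uint32_t generation;  /* must survive free/alloc of this chunk */
	bool in_use;
};

/* What userspace would hold: which chunk, and which "life" of it. */
struct ct_handle {
	const struct ct_slot *addr;
	uint32_t generation;
};

void ct_slot_alloc(struct ct_slot *s)
{
	assert(!s->in_use);
	s->in_use = true;
}

/* Returning the chunk bumps the counter, invalidating older handles. */
void ct_slot_free(struct ct_slot *s)
{
	assert(s->in_use);
	s->in_use = false;
	s->generation++;
}

struct ct_handle ct_handle_get(const struct ct_slot *s)
{
	struct ct_handle h = { s, s->generation };
	return h;
}

/* A handle is valid only for the generation it was taken from. */
bool ct_handle_valid(struct ct_handle h)
{
	return h.addr->in_use && h.addr->generation == h.generation;
}
```

A handle taken before the chunk was freed and reused no longer validates,
so a delete issued with a stale handle can be rejected instead of hitting
the new connection.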

> Best regards,
> Patrick

* Re: ctnetlink questions
  2003-10-20  1:05       ` Henrik Nordstrom
                           ` (2 preceding siblings ...)
  2003-10-20  7:17         ` Jozsef Kadlecsik
@ 2003-10-20 11:11         ` Jozsef Kadlecsik
  3 siblings, 0 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2003-10-20 11:11 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Henrik Nordstrom wrote:

> It is important that userspace does not miss entries which were in the
> kernel when dumping started and still exist in the kernel when the dump
> finished.
>
> It is also important userspace can have some kind of semi-static
> reference to a conntrack to be able to manipulate that conntrack without
> risking hitting another conntrack.
>
> It is OK for me if it is unspecified what happens with entries which
> were either created or destroyed while the dump was in progress.

This is an excellent summary for the requirements of the dump
functionality in ctnetlink.

However, I think Harald has a point about shrinking instead of blowing
up the ip_conntrack structure.

What about introducing new, flip-flop conntrack status bits?

	/* Dump state A */
	IPS_DUMP_A_BIT = 4,
	IPS_DUMP_A = (1 << IPS_DUMP_A_BIT),

	/* Dump state B */
	IPS_DUMP_B_BIT = 5,
	IPS_DUMP_B = (1 << IPS_DUMP_B_BIT),

The general dump state is stored in ip_conntrack_dump_status. New
conntrack entries are created with their status set to the value of
ip_conntrack_dump_status.

When a dump is requested, ip_conntrack_dump_status is set to the other
value. Then the ip_conntrack hash is scanned, and all entries with the
previous status bit are dumped and then their bit is flipped to the
current value of ip_conntrack_dump_status.

New entries are created with the new ip_conntrack_dump_status value;
consequently those are not dumped but updated to the slaves using the
normal procedure.

It means of course that there can be only one dump at a time, i.e. until
the whole ip_conntrack hash has been fully processed, the system must not
allow changing the value of ip_conntrack_dump_status again.

A quick idea, may be bogus.
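
The flip/flop scheme described above can be sketched in userspace C as
follows. This is an illustration under assumptions, not kernel code: a
flat array stands in for the hash table, and struct ct_table,
ct_init_entry, dump_begin and dump_some are hypothetical names. Only the
two bit values follow the constants proposed in this message.

```c
#include <assert.h>
#include <stddef.h>

enum {
	DUMP_A = 1 << 4,  /* same bit positions as proposed above */
	DUMP_B = 1 << 5,
};

struct ct_entry {
	unsigned status;
	int id;
};

struct ct_table {
	struct ct_entry *entries;  /* flat array instead of a hash table */
	size_t len;
	unsigned dump_status;      /* plays the role of ip_conntrack_dump_status */
};

/* New entries inherit the current global dump state. */
void ct_init_entry(const struct ct_table *t, struct ct_entry *e, int id)
{
	e->id = id;
	e->status = t->dump_status;
}

/* Starting a dump flips the global state; old entries now carry the stale bit. */
void dump_begin(struct ct_table *t)
{
	t->dump_status = (t->dump_status == DUMP_A) ? DUMP_B : DUMP_A;
}

/* Dump up to `budget` stale entries and flip each one as it is sent. */
size_t dump_some(struct ct_table *t, int *out, size_t budget)
{
	size_t n = 0;

	assert(t != NULL && out != NULL);
	for (size_t i = 0; i < t->len && n < budget; i++) {
		struct ct_entry *e = &t->entries[i];

		if (e->status & t->dump_status)
			continue;  /* already dumped, or created after the flip */
		out[n++] = e->id;
		e->status = (e->status & ~(unsigned)(DUMP_A | DUMP_B)) | t->dump_status;
	}
	return n;
}
```

No cursor is needed between passes: each call simply rescans for entries
that still carry the old bit, which is also why a second dump must not
start before the first one has finished.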

> A 64-bit integer can be copy-pasted, and is relatively easy to manage in
> textual form. A full conntrack tuple (at minimum "protocol, source IP,
> dest IP, reply source IP, reply destination IP, source port, destination
> port, reply source port, reply destination port", but preferably a binary
> "struct ip_conntrack_tuple tuple[2]") is obviously not as easy to
> manage.

We'll have a nifty GUI, so we won't need to type anything: just
click'n'shoot. ;-)

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: ctnetlink questions
  2003-10-19 22:55     ` Patrick McHardy
  2003-10-20  1:05       ` Henrik Nordstrom
@ 2003-10-20  6:58       ` Harald Welte
  1 sibling, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20  6:58 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 12:55:03AM +0200, Patrick McHardy wrote:

> Nice thing with the unique ids is that it's better than an atomic 
> snapshot, when you're done reading you have the _current_ state, not
> the state when you began reading.

well, I don't consider this as particularly important.  As long as it is
documented...

> >Also, there's another problem:
> >Let's say we left at bucket 5, entry 12 - and while we are waiting for
> >the next netlink callback, entry 10 gets removed.  Then we would
> >continue at 12, which is in reality the old 13.  So we're missing one
> >conntrack.  
> >
> 
> With the unique id solution ? No, the id's don't represent the
> list-position

No, this problem would occur with the old (and my proposed) solution
that doesn't require an ID.

> I didn't worry too much about performance yet, in my opinion it was
> required for being useful. For the architecture, if it was only for
> table dumping I'd agree with you, but there is another important use
> for the id. When we want to manipulate/delete conntrack entries from
> userspace there is no way to make sure that we will do things to the
> right connection since the tuples that are used for lookup could have
> been reused.

Mh.  I am wondering if we can make that guarantee without adding the ID
field.  We really should be in the mindset of making ip_conntrack
smaller, not blowing it up.

> >Other approaches I can think of:
> >
> >a) making a snapshot of the whole conntrack table.
> >Large memory usage - probably easy to get OOM :(  Also, read lock on
> >ip_conntrack_lock would have to be grabbed long
> >
> >b) unique ID per hash bucket.  This means less contention, but we could
> >only save bucket id in cb->args, start iterating from the beginning and
> >only send whose ID is newer than the last one we already sent.
> >
> >c) snapshot of the current bucket
> >As with the new hash function every bucket is supposed to be short, we
> >could also make a snapshot of the current bucket, and send our messages
> >from this snapshot copy.
> >
> >what do you think?
> >
> 
> I think we first need to agree on how important the problems I mentioned 
> above are. All these solutions don't provide reliable mechanisms. Some
> comments though:

Yes.  I am aware of the non-reliability.  For me it is more important to
not interfere with the current connection tracking design, leaving
ctnetlink an addon that doesn't require deep hooks into the conntrack
implementation, and that can live without dozens of #ifdef's.

Let's say that I'm looking upon all possible solutions under that
precondition.

> a) problem is that there can be multiple parallel dumps so we 
> potentially need many copies.
> I think memory usage is not acceptable.

we can just allow one dump at a time and make everybody else either
wait or try again.

> b) I'm not sure if i understand correctly, this is basically what has 
> been done before my changes except that we would always continue at
> the next bucket id and not just advance if the whole bucket has
> successfully dumped ?

before, the code did dump the same bucket again if it didn't fit in the
skb last time.  My proposed approach would have a unique ct_id inside a
single bucket list.  This way we can sort-of live without the ordered
list (minus the 12/13 issue pointed out above) but don't dump the same
bucket over and over again.

> c) same problem as a, except memory usage is not as bad. IMO it is
> basically a workaround for limited socket buffers to circumvent the
> limits. If we don't need reliability I'd say it's the users job to
> make sure socket buffer limits are set to a reasonable size.

mh.

> So in conclusion if we agree we need reliability, we probably need the
> unique ids. If we agree we don't, I'd say we use solution b.

I'm going to comment on the ID's in my next reply.

> Two last things I noted during writing the mail:
> - Table dumping is currently not restricted to root, this should
>   probably be done for privacy reasons.

I'm a bit undecided.  netstat -a, -r are always allowed for every user,
too.  But then, those tables don't indicate forwarded connections... ok,
let's have it require CAP_NET_ADMIN.

> - Have you got objections against s/CTA_RPLY/CTA_REPLY/ ? IMO It makes 
>   typing and thinking more comfortable if you can actually pronounce
>   what you are thinking about ;)

question is:  are all NFA/CTA constants four letters? Then the current
way is actually more consistent.  But feel free to change that, since it
might become a TLV in the future anyway ;)

> Best regards,
> Patrick

* ctnetlink questions
@ 2003-10-19 14:54 Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-19 14:54 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Gandalf!

A couple of questions regarding your ctnetlink modifications:

1) Why do we need this 'ordered list' ?  I can't remember the exact 
   reason why it was added

2) Why did you merge connmark and ctnetlink?  Was it just for
   convenience? If yes, I'd appreciate having them separated again.

Thanks.
