* Re: ctnetlink questions
       [not found] <20031019171851.GR21521@sunbeam.de.gnumonks.org>
@ 2003-10-19 19:36 ` Patrick McHardy
  2003-10-19 20:28   ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-19 19:36 UTC
To: Harald Welte; +Cc: Netfilter Development Mailinglist

Harald Welte wrote:

>Hi Patrick!
>
>A couple of questions regarding your ctnetlink modifications:
>
>1) Why do we need this 'ordered list'? I can't remember the exact
>   reason why it was added.

The ordered list and the unique conntrack id were added for table
dumping. Without them, entries could be dumped multiple times or, even
worse, a single hash chain could be dumped over and over again if its
contents exceeded the size of a single skb.

>2) Why did you merge connmark and ctnetlink? Was it just for
>   convenience? If yes, I'd appreciate having them separated again.

It's just a left-over. I started hacking on ctnetlink to change
connection marks from userspace. I'm going to remove it from the next
patch.

Best regards,
Patrick

>Thanks.
* Re: ctnetlink questions
From: Harald Welte @ 2003-10-19 20:28 UTC
To: Patrick McHardy; +Cc: Netfilter Development Mailinglist

On Sun, Oct 19, 2003 at 09:36:45PM +0200, Patrick McHardy wrote:

> >1) Why do we need this 'ordered list'? I can't remember the exact
> >   reason why it was added.
>
> The ordered list and the unique conntrack id were added for table
> dumping. Without them, entries could be dumped multiple times or, even
> worse, a single hash chain could be dumped over and over again if its
> contents exceeded the size of a single skb.

Ah, yes. I remember. But that actually is a shortcoming of the netlink
api, isn't it? The problem is that we cannot save an exact position in
the hashtable where we stopped dumping. So in my original ctnetlink we
just dumped a whole bucket and saved the bucket number in cb->args[]. But
if we were saving bucket number + number of connection in bucket, we
could continue where we left off.

Of course, entries could be added before that number (or even in buckets
that we had already traversed) - but we don't guarantee an atomic
snapshot anyway.

Also, there's another problem:
Let's say we left at bucket 5, entry 12 - and while we are waiting for
the next netlink callback, entry 10 gets removed. Then we would
continue at 12, which is in reality the old 13. So we're missing one
conntrack.

The question is what to do. I really don't like having yet another list
of conntracks (the ordered list) together with the unique id. It is
questionable how big the impact on performance is (contention on unique
ID, bigger struct ip_conntrack, additional list_add's), but even if it
was 'cheap', I don't like the architecture.

Other approaches I can think of:

a) making a snapshot of the whole conntrack table.
Large memory usage - probably easy to get OOM :( Also, the read lock on
ip_conntrack_lock would have to be held for a long time.

b) unique ID per hash bucket. This means less contention, but we could
only save the bucket id in cb->args, start iterating from the beginning
of the bucket and only send those entries whose ID is newer than the
last one we already sent.

c) snapshot of the current bucket
As with the new hash function every bucket is supposed to be short, we
could also make a snapshot of the current bucket, and send our messages
from this snapshot copy.

What do you think?

> >2) Why did you merge connmark and ctnetlink? Was it just for
> >   convenience? If yes, I'd appreciate having them separated again.
>
> It's just a left-over. I started hacking on ctnetlink to change
> connection marks from userspace. I'm going to remove it from the next
> patch.

As you may have noticed, I already did that with 0.13 in current cvs.

> Patrick

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie
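The "bucket 5, entry 12" problem described above can be reproduced in a few lines. This is only a userspace sketch in Python, not ctnetlink code: the table is a list of lists, and `dump_with_offset_cursor` is a hypothetical helper that resumes from a saved (bucket, offset) pair the way cb->args[] would have to.

```python
def dump_with_offset_cursor(table, cursor, batch_size):
    """Dump up to batch_size entries, resuming from (bucket, offset).

    Hypothetical model of a dump callback: `table` is a list of hash
    chains (plain lists), `cursor` is what would be saved between calls.
    """
    bucket, offset = cursor
    out = []
    while bucket < len(table) and len(out) < batch_size:
        chain = table[bucket]
        while offset < len(chain) and len(out) < batch_size:
            out.append(chain[offset])
            offset += 1
        if offset >= len(chain):
            # Chain fully dumped, move on to the next bucket.
            bucket, offset = bucket + 1, 0
    return out, (bucket, offset)


table = [["a", "b", "c", "d"], ["e"]]

# First callback: the buffer only has room for two entries.
first, cursor = dump_with_offset_cursor(table, (0, 0), 2)

# An already-dumped entry goes away before the next callback.
table[0].remove("a")

# Second callback resumes at offset 2, which now points at "d".
second, cursor = dump_with_offset_cursor(table, cursor, 10)

print(first, second)
```

After the removal the chain is shifted by one, so "c" is never sent even though it existed for the whole duration of the dump.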
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-19 22:55 UTC
To: Harald Welte; +Cc: Netfilter Development Mailinglist

Harald Welte wrote:

>>The ordered list and the unique conntrack id were added for table
>>dumping. Without them, entries could be dumped multiple times or, even
>>worse, a single hash chain could be dumped over and over again if its
>>contents exceeded the size of a single skb.
>
>Ah, yes. I remember. But that actually is a shortcoming of the netlink
>api, isn't it? The problem is that we cannot save an exact position in
>the hashtable where we stopped dumping. So in my original ctnetlink we
>just dumped a whole bucket and saved the bucket number in cb->args[]. But
>if we were saving bucket number + number of connection in bucket, we
>could continue where we left off.
>
>Of course, entries could be added before that number (or even in buckets
>that we had already traversed) - but we don't guarantee an atomic
>snapshot anyway.

In my opinion, for any serious use we need to provide a mechanism for
userspace to be sure it is in sync with the kernel. We could just add a
new message type which contains the total number of entries. That way
userspace could check the number and, if it is unequal to the number of
connections known in userspace, dump the table again. Of course this is
still racy, but it would be better than nothing.

There was a thread on linux-net recently (subject: xfrm_user
reliability) which is related to this. Alexey mentioned that reliable
transmissions from kernel to userspace are impossible, so userspace
needs a recovery mechanism for dropped event messages (dump table and
resync). If dump table is also unreliable and doesn't even signal
failure, userspace is screwed.

The nice thing with the unique ids is that it's better than an atomic
snapshot: when you're done reading you have the _current_ state, not
the state when you began reading.

>Also, there's another problem:
>Let's say we left at bucket 5, entry 12 - and while we are waiting for
>the next netlink callback, entry 10 gets removed. Then we would
>continue at 12, which is in reality the old 13. So we're missing one
>conntrack.

With the unique id solution? No, the ids don't represent the list
position. What happens is that every conntrack with an id less than or
equal to the last one dumped is skipped. Since they are ordered by
increasing id we will still continue at entry 13, only that it now has
position 12 on the list.

>The question is what to do. I really don't like having yet another list
>of conntracks (the ordered list) together with the unique id. It is
>questionable how big the impact on performance is (contention on unique
>ID, bigger struct ip_conntrack, additional list_add's), but even if it
>was 'cheap', I don't like the architecture.

I didn't worry too much about performance yet; in my opinion it was
required for being useful. As for the architecture, if it was only for
table dumping I'd agree with you, but there is another important use
for the id. When we want to manipulate/delete conntrack entries from
userspace, there is no way to make sure that we will do things to the
right connection, since the tuples that are used for lookup could have
been reused. This is especially true if a tuple is used for lookup that
has been changed by nat.

>Other approaches I can think of:
>
>a) making a snapshot of the whole conntrack table.
>Large memory usage - probably easy to get OOM :( Also, the read lock on
>ip_conntrack_lock would have to be held for a long time.
>
>b) unique ID per hash bucket. This means less contention, but we could
>only save the bucket id in cb->args, start iterating from the beginning
>of the bucket and only send those entries whose ID is newer than the
>last one we already sent.
>
>c) snapshot of the current bucket
>As with the new hash function every bucket is supposed to be short, we
>could also make a snapshot of the current bucket, and send our messages
>from this snapshot copy.
>
>What do you think?

I think we first need to agree on how important the problems I
mentioned above are. None of these solutions provides a reliable
mechanism. Some comments though:

a) The problem is that there can be multiple parallel dumps, so we
potentially need many copies. I think the memory usage is not
acceptable.

b) I'm not sure if I understand correctly. This is basically what has
been done before my changes, except that we would always continue at
the next bucket id and not just advance if the whole bucket has been
successfully dumped?

c) Same problem as a), except memory usage is not as bad. IMO it is
basically a workaround for limited socket buffers to circumvent the
limits. If we don't need reliability I'd say it's the user's job to
make sure socket buffer limits are set to a reasonable size.

So in conclusion, if we agree we need reliability, we probably need the
unique ids. If we agree we don't, I'd say we use solution b.

>As you may have noticed, I already did that with 0.13 in current cvs.

Yes, Krisztian pointed me to the code.

Two last things I noted while writing this mail:
- Table dumping is currently not restricted to root. This should
  probably be done for privacy reasons.
- Have you got objections against s/CTA_RPLY/CTA_REPLY/? IMO it makes
  typing and thinking more comfortable if you can actually pronounce
  what you are thinking about ;)

Best regards,
Patrick
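The skip rule for the ordered list ("every conntrack with an id less than or equal to the last one dumped is skipped") can be sketched as follows. This is a Python model under stated assumptions, not the patch itself: the ordered list is a plain list of (id, name) pairs kept in increasing id order, and `dump_by_id` is a hypothetical name.

```python
def dump_by_id(ordered, last_id, batch_size):
    """Return the next batch from an id-ordered list and the new cursor.

    `ordered` models the ordered conntrack list: (id, name) pairs in
    increasing id order. Only the last dumped id is remembered between
    calls, never a list position.
    """
    batch = [(i, n) for i, n in ordered if i > last_id][:batch_size]
    new_last = batch[-1][0] if batch else last_id
    return batch, new_last


ordered = [(10, "a"), (11, "b"), (12, "c"), (13, "d")]

# First callback: room for two entries, cursor becomes id 11.
first, last_id = dump_by_id(ordered, 0, 2)

# Between callbacks one dumped entry is removed and a new one is created.
ordered.remove((10, "a"))
ordered.append((14, "e"))

# Second callback: positions have shifted, but nothing is lost or repeated.
second, last_id = dump_by_id(ordered, last_id, 10)

print(first, second)
```

Because the cursor is an id rather than a position, the removal of "a" does not shift the resume point, and the entry created during the dump is picked up as well.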
* Re: ctnetlink questions
From: Henrik Nordstrom @ 2003-10-20 1:05 UTC
To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Patrick McHardy wrote:

> In my opinion, for any serious use we need to provide a mechanism for
> userspace to be sure it is in sync with the kernel. We could just add a
> new message type which contains the total number of entries. That way
> userspace could check the number and, if it is unequal to the number of
> connections known in userspace, dump the table again. Of course this is
> still racy, but it would be better than nothing.

Agreed, partially.

My opinions:

It is important that userspace does not miss entries which were in the
kernel when dumping started and still exist in the kernel when the dump
finishes.

It is also important that userspace can have some kind of semi-static
reference to a conntrack, to be able to manipulate that conntrack
without risking hitting another conntrack.

It is OK for me if it is unspecified what happens with entries which
were either created or destroyed while the dump was in progress.

With these criteria in mind I propose a hybrid of your approaches:

a) Assign a globally unique ID to each conntrack, in such a manner that
IDs are not reused for a significant amount of time. This is to provide
a stable point of reference to a connection with low risk of false
collisions if the original connection was destroyed while userspace
still thought it was there.

b) When dumping the conntrack entries, dump one bucket at a time.
If the bucket is too large to fit in the current response packet,
then sort the bucket entries on ID and keep track of which bucket+ID
was last dumped. On the next netlink packet restart at the same bucket
and skip the entries with an ID lower than those already dumped for
that bucket.

This requires a read lock per hash bucket while dumping that bucket, and
some small (usually) amount of memory to keep the temporary sorted index
of bucket entries, unless the bucket is permanently resorted, in which
case it may be possible to solve with no memory allocation (but then
requires the bucket to be write locked while resorting, which is
probably worse).

Regarding the conntrack ID: for me it is acceptable if as much as 64
bits is reserved for the conntrack ID. This gives sufficient namespace
to

a) Provide truly unique IDs suitable for long-term reference without any
risk of collisions.

b) Allow the namespace to be built in such a manner that there never
will be any risk of congestion in finding the next available ID. For
example by using CPU#+counter.

Regards
Henrik
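A rough model of proposal b) above, for readers who want to see the bucket+ID cursor in action. It is a Python sketch with hypothetical names (`dump_buckets`, `room`); locking is left out entirely, buckets are unsorted lists of (id, name) pairs, ids are assumed non-negative, and -1 stands for "nothing dumped from this bucket yet".

```python
def dump_buckets(table, cursor, room):
    """Dump up to `room` entries, bucket by bucket.

    `cursor` is (bucket, last_id). Each bucket is sorted on id into a
    temporary list, and entries not newer than last_id are skipped.
    """
    bucket, last_id = cursor
    out = []
    while bucket < len(table):
        # Temporary sorted index of the entries still to be sent.
        pending = sorted(e for e in table[bucket] if e[0] > last_id)
        take = pending[:room - len(out)]
        out.extend(take)
        if len(take) < len(pending):
            # Response packet is full: remember bucket + last id sent.
            return out, (bucket, take[-1][0] if take else last_id)
        bucket, last_id = bucket + 1, -1
    return out, (bucket, -1)


table = [[(7, "a"), (3, "b"), (9, "c")], [(4, "d")]]

# First packet has room for two entries of bucket 0.
first, cursor = dump_buckets(table, (0, -1), 2)

# Bucket 0 changes while userspace is reading.
table[0].remove((3, "b"))
table[0].append((12, "e"))

# Second packet restarts at bucket 0 and skips ids up to 7.
second, cursor = dump_buckets(table, cursor, 10)

print(first, second)
```

The sort happens again on every restart of a partially dumped bucket, which is the cost the temporary-index variant pays for not keeping the chain permanently sorted.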
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-20 3:01 UTC
To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>Agreed, partially.
>
>My opinions:
>
>It is important that userspace does not miss entries which were in the
>kernel when dumping started and still exist in the kernel when the dump
>finishes.
>
>It is also important that userspace can have some kind of semi-static
>reference to a conntrack, to be able to manipulate that conntrack
>without risking hitting another conntrack.
>
>It is OK for me if it is unspecified what happens with entries which
>were either created or destroyed while the dump was in progress.

I totally agree.

>With these criteria in mind I propose a hybrid of your approaches:
>
>a) Assign a globally unique ID to each conntrack, in such a manner that
>IDs are not reused for a significant amount of time. This is to provide
>a stable point of reference to a connection with low risk of false
>collisions if the original connection was destroyed while userspace
>still thought it was there.
>
>b) When dumping the conntrack entries, dump one bucket at a time.
>If the bucket is too large to fit in the current response packet,
>then sort the bucket entries on ID and keep track of which bucket+ID
>was last dumped. On the next netlink packet restart at the same bucket
>and skip the entries with an ID lower than those already dumped for
>that bucket.
>
>This requires a read lock per hash bucket while dumping that bucket, and
>some small (usually) amount of memory to keep the temporary sorted index
>of bucket entries, unless the bucket is permanently resorted, in which
>case it may be possible to solve with no memory allocation (but then
>requires the bucket to be write locked while resorting, which is
>probably worse).

Sounds like a nice solution. I favour the permanent resorting for these
reasons:

- All temporary memory allocations should be released before
  ctnetlink_dump is left, not in ctnetlink_done, since we don't know if
  and when the read will continue. This means sorting multiple times is
  required.

- We can use some sorting algorithm which benefits from pre-sorted
  input. This would give better average performance. IIRC new
  conntracks are added at the head of the chains, so if we sort and
  walk backwards through the chains we only have to resort after an id
  counter wrap. Sorting is also pretty easy in that case: move all
  entries at the head of the list whose id is smaller than the last
  one's to the tail while preserving order, and stop at the first one
  that's bigger. This also means we only need the write lock in a very,
  very rare case.

>Regarding the conntrack ID: for me it is acceptable if as much as 64
>bits is reserved for the conntrack ID. This gives sufficient namespace
>to
>
>a) Provide truly unique IDs suitable for long-term reference without any
>risk of collisions.

I agree, we should use 64 bit.

>b) Allow the namespace to be built in such a manner that there never
>will be any risk of congestion in finding the next available ID. For
>example by using CPU#+counter.

Also a good idea. Thanks Henrik for your valuable input. Harald, what
do you think of this approach?

Best regards,
Patrick (hoping mozilla will have mercy with his formatting this time)
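One possible reading of the "CPU#+counter" idea is to put the CPU number in the top bits of the 64 bit id and a per-CPU counter in the rest, so that no shared counter is touched on allocation. The sketch below is a Python illustration only; the 8/56 bit split, `CPU_BITS` and `PerCpuIdAllocator` are assumptions made for the example and do not come from the thread.

```python
CPU_BITS = 8                      # assumed split, not from the thread
COUNTER_BITS = 64 - CPU_BITS


class PerCpuIdAllocator:
    """Hypothetical allocator: one private counter per CPU."""

    def __init__(self, ncpus):
        if ncpus > (1 << CPU_BITS):
            raise ValueError("too many CPUs for the assumed bit split")
        self.counters = [0] * ncpus

    def next_id(self, cpu):
        counter = self.counters[cpu]
        # Each CPU only ever touches its own counter.
        self.counters[cpu] = (counter + 1) % (1 << COUNTER_BITS)
        return (cpu << COUNTER_BITS) | counter


alloc = PerCpuIdAllocator(ncpus=2)
id_a = alloc.next_id(0)
id_b = alloc.next_id(1)
id_c = alloc.next_id(0)
print(hex(id_a), hex(id_b), hex(id_c))
```

One consequence worth keeping in mind: ids built this way are unique but not ordered by creation time across CPUs (here `id_c` is created after `id_b` yet compares smaller), so a scheme that relies on chains being naturally sorted by insertion order would presumably need a per-dump sort or a different id layout.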
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-20 3:09 UTC
Cc: Henrik Nordstrom, Harald Welte, Netfilter Development Mailinglist

Patrick McHardy wrote:

> - We can use some sorting algorithm which benefits from pre-sorted
>   input. This would give better average performance. IIRC new
>   conntracks are added at the head of the chains, so if we sort and
>   walk backwards through the chains we only have to resort after an id
>   counter wrap. Sorting is also pretty easy in that case: move all
>   entries at the head of the list whose id is smaller than the last
>   one's to the tail while preserving order, and stop at the first one
>   that's bigger. This also means we only need the write lock in a very,
>   very rare case.

One small addition: this is not completely correct. We need to resort
more often, but never before the first counter wrap.

Best regards,
Patrick
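The head-to-tail move quoted above can be modelled on a single chain. This Python sketch uses a `deque` of bare ids with the head at index 0; `fix_after_wrap` is a hypothetical name, and, as the correction above says, this step alone only covers the state right after the first counter wrap.

```python
from collections import deque


def fix_after_wrap(chain):
    """Move post-wrap entries from the head of the chain to its tail.

    `chain` holds ids with the head at index 0; new entries are added at
    the head, so walking backwards normally yields increasing ids.
    """
    if not chain:
        return chain
    tail_id = chain[-1]
    moved = []
    # Entries at the head with an id smaller than the tail's were
    # created after the counter wrapped.
    while chain[0] < tail_id:
        moved.append(chain.popleft())
    # Re-append in the same order, so the newest entry ends up last.
    chain.extend(moved)
    return chain


# Ids 7, 8, 9 were inserted before the wrap; 0, 1, 2 after it.
chain = deque([2, 1, 0, 9, 8, 7])
fix_after_wrap(chain)
backwards = list(reversed(chain))
print(backwards)
```

The loop always terminates at the latest when it reaches the original tail, because that entry's id equals `tail_id`.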
* Re: ctnetlink questions
From: Henrik Nordstrom @ 2003-10-20 6:34 UTC
To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Patrick McHardy wrote:

> Sounds like a nice solution. I favour the permanent resorting for these
> reasons:
> - All temporary memory allocations should be released before
>   ctnetlink_dump is left, not in ctnetlink_done, since we don't know if
>   and when the read will continue. This means sorting multiple times is
>   required.

Sorting multiple times is indeed needed if not doing a permanent resort,
as you do not want to keep the bucket locked for a long period. Even if
you knew the read would continue, you cannot save the temporary sorted
list.

> - We can use some sorting algorithm which benefits from pre-sorted
>   input. This would give better average performance. IIRC new
>   conntracks are added at the head of the chains, so if we sort and
>   walk backwards through the chains we only have to resort after an id
>   counter wrap.

Could work. In that case the bucket should at most times be sorted
naturally, with no need to resort. There are a few theoretical races
where entries may be inserted in another order (more so on SMP), but
these are hopefully relatively rare.

Regards
Henrik
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-20 17:53 UTC
To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>>- We can use some sorting algorithm which benefits from pre-sorted
>>  input. This would give better average performance. IIRC new
>>  conntracks are added at the head of the chains, so if we sort and
>>  walk backwards through the chains we only have to resort after an id
>>  counter wrap.
>
>Could work. In that case the bucket should at most times be sorted
>naturally, with no need to resort. There are a few theoretical races
>where entries may be inserted in another order (more so on SMP), but
>these are hopefully relatively rare.

Entries are always inserted with the list locked, so I can't see these
cases. I also calculated: at 10^7 connections/s a 64 bit counter won't
wrap for 58494 years. So we don't need resorting at all.

Best regards,
Patrick
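The wrap estimate is easy to re-derive. A short Python check, assuming 365-day years (which is what reproduces the figure above) and integer arithmetic throughout; the 32 bit line is added here only for comparison and is not a number from the thread.

```python
RATE = 10**7                          # new connections per second
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumption: 365-day years

# Time until a 64 bit id counter wraps at that rate.
seconds_64 = 2**64 // RATE
years_64 = seconds_64 // SECONDS_PER_YEAR

# For comparison: a 32 bit counter at the same rate.
seconds_32 = 2**32 // RATE

print(years_64, "years for 64 bit;", seconds_32, "seconds for 32 bit")
```

At the same rate a 32 bit counter would wrap in a matter of minutes, which is why the width of the id decides whether resorting after a wrap has to be handled at all.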
* Re: ctnetlink questions
From: Harald Welte @ 2003-10-20 7:15 UTC
To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 05:01:17AM +0200, Patrick McHardy wrote:

> >This requires a read lock per hash bucket while dumping that bucket, and
> >some small (usually) amount of memory to keep the temporary sorted index
> >of bucket entries, unless the bucket is permanently resorted, in which
> >case it may be possible to solve with no memory allocation (but then
> >requires the bucket to be write locked while resorting, which is
> >probably worse).
>
> Sounds like a nice solution. I favour the permanent resorting for these
> reasons:
> - All temporary memory allocations should be released before
>   ctnetlink_dump is left, not in ctnetlink_done, since we don't know if
>   and when the read will continue. This means sorting multiple times is
>   required.

Again, this seems a shortcoming of the netlink infrastructure. This
would be much better if we actually had a reference to the socket of
the userspace process (since table dumps could be unicast anyway...).

As for the permanent sorting: I fear that is going to be an intrusive
change. Either ctnetlink locks the bucket on its own [and does the
sorting locally], or we have some ugly special-case functions in the
ip_conntrack core.

Also: there is no primitive for upgrading a reader lock into a writer
lock. Since we don't know beforehand if the skb is large enough, we
would need to take a write lock in any case (since we cannot tell if we
need to reorder or not).

> - We can use some sorting algorithm which benefits from pre-sorted
>   input. This would give better average performance. IIRC new
>   conntracks are added at the head of the chains, so if we sort and
>   walk backwards through the chains we only have to resort after an id
>   counter wrap.

This works with a 64bit counter, but doesn't with my proposed generation
counter. The generation counter has two advantages:
- no global counter, just the counter field in every conntrack
- less size increase of struct ip_conntrack.

Oh well, yes. We have to think about the slab cache returning objects to
the 'real' memory allocator. Then our generation counter would become
useless. I don't know if it is reliable enough to initialize the counter
with a random 32 bits in that case. How high is the probability of
re-using the recently-used counter in that case?

> I agree, we should use 64 bit.

I feel like I have to cry. As soon as we add any kind of counter (be it
32 or 64 bits), we would definitely have to make it a compile time
option. My vision of ctnetlink was something like a match extension:
you can always compile it as a module, and it wouldn't hurt performance
as long as you don't load it.

> Also a good idea. Thanks Henrik for your valuable input. Harald, what
> do you think of this approach?

Yes, if we really want to be reliable I don't see a better way.

> Best regards,
> Patrick (hoping mozilla will have mercy with his formatting this time)

not really :(

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions
From: Henrik Nordstrom @ 2003-10-20 9:37 UTC
To: Harald Welte; +Cc: Patrick McHardy, Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Harald Welte wrote:

> Again, this seems a shortcoming of the netlink infrastructure. This
> would be much better if we actually had a reference to the socket of
> the userspace process (since table dumps could be unicast anyway...).

Whether it is a shortcoming of netlink or a shortcoming of how to
reliably dump the content of large hash buckets can be argued, but I
tend to agree with you here.

The dump operation should be connection oriented with the userspace
application, not purely datagram based. The kernel should know for
certain if the userspace application terminates a dump operation
mid-air.

Regards
Henrik
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-20 18:43 UTC
To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:

>On Mon, 20 Oct 2003, Harald Welte wrote:
>
>>Again, this seems a shortcoming of the netlink infrastructure. This
>>would be much better if we actually had a reference to the socket of
>>the userspace process (since table dumps could be unicast anyway...).
>
>Whether it is a shortcoming of netlink or a shortcoming of how to
>reliably dump the content of large hash buckets can be argued, but I
>tend to agree with you here.
>
>The dump operation should be connection oriented with the userspace
>application, not purely datagram based. The kernel should know for
>certain if the userspace application terminates a dump operation
>mid-air.

Actually the kernel knows. If the socket is closed, cb->done() is
called. However, the kernel cannot know if userspace still keeps the
socket open but doesn't read anymore. Connection oriented sockets don't
help with this.

Best regards,
Patrick
* Re: ctnetlink questions
From: Harald Welte @ 2003-10-20 18:37 UTC
To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:

> >The dump operation should be connection oriented with the userspace
> >application, not purely datagram based. The kernel should know for
> >certain if the userspace application terminates a dump operation
> >mid-air.
>
> Actually the kernel knows. If the socket is closed, cb->done() is
> called. However, the kernel cannot know if userspace still keeps the
> socket open but doesn't read anymore. Connection oriented sockets don't
> help with this.

Yes, but if the application is broken, that's not our problem. If the
API and the behaviour are documented, I don't see any problems with
this. If an app wants to intentionally allocate many objects in kernel
space, there's nothing we can do. Also, since we are sending to
userspace, we could actually allocate this as virtual memory, right?

> Best regards,
> Patrick

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions
From: Patrick McHardy @ 2003-10-20 19:17 UTC
To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:

>On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:
>
>>>The dump operation should be connection oriented with the userspace
>>>application, not purely datagram based. The kernel should know for
>>>certain if the userspace application terminates a dump operation
>>>mid-air.
>>
>>Actually the kernel knows. If the socket is closed, cb->done() is
>>called. However, the kernel cannot know if userspace still keeps the
>>socket open but doesn't read anymore. Connection oriented sockets don't
>>help with this.
>
>Yes, but if the application is broken, that's not our problem. If the
>API and the behaviour are documented, I don't see any problems with
>this. If an app wants to intentionally allocate many objects in kernel
>space, there's nothing we can do. Also, since we are sending to
>userspace, we could actually allocate this as virtual memory, right?

Yes, I just mentioned it to clarify the problem with allocating memory
and freeing it in ctnetlink_done(), and to point out that connection
oriented sockets don't help. I'm not sure what you mean by "allocate as
virtual memory". Do you mean accounting the memory to the process which
started the dump? I'm not sure what the conventions for accounting
kernel memory are, but at least that would provide a way to bound the
amount of memory that can be used, in case we decide to use a solution
which requires dynamic allocations.

Best regards,
Patrick
* Re: ctnetlink questions
From: Balazs Scheidler @ 2003-10-20 19:41 UTC
To: Harald Welte, Patrick McHardy, Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:37:42PM +0200, Harald Welte wrote:

> On Mon, Oct 20, 2003 at 08:43:02PM +0200, Patrick McHardy wrote:
>
> > >The dump operation should be connection oriented with the userspace
> > >application, not purely datagram based. The kernel should know for
> > >certain if the userspace application terminates a dump operation
> > >mid-air.
> >
> > Actually the kernel knows. If the socket is closed, cb->done() is
> > called. However, the kernel cannot know if userspace still keeps the
> > socket open but doesn't read anymore. Connection oriented sockets
> > don't help with this.
>
> Yes, but if the application is broken, that's not our problem. If the
> API and the behaviour are documented, I don't see any problems with
> this. If an app wants to intentionally allocate many objects in kernel
> space, there's nothing we can do. Also, since we are sending to
> userspace, we could actually allocate this as virtual memory, right?

Sorry to jump into the conversation, but something I've read about
relayfs just occurred to me.

Quoting from the announcement:

"relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large amounts of data from kernel space
to user space. Full details can be found in Documentation/filesystems/
relayfs.txt. The current version can always be found at
http://www.opersys.com/relayfs."

This _might_ be of interest when complete tables are to be dumped, though
the ctnetlink based interface is better when the userspace app provides
more specific queries.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
* Re: ctnetlink questions
  2003-10-20 19:41 ` Balazs Scheidler
@ 2003-10-20 20:20 ` Patrick McHardy
  2003-10-20 22:59 ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 20:20 UTC
To: Balazs Scheidler
Cc: Harald Welte, Henrik Nordstrom, Netfilter Development Mailinglist

Hi Balazs,

Balazs Scheidler wrote:
>Sorry to jump into the conversation, it just occurred to me that I've read
>something about relayfs.

Suggestions are always welcome.

>Quoting from the announcement:
>
>"relayfs is a filesystem designed to provide an efficient mechanism for
>tools and facilities to relay large amounts of data from kernel space
>to user space. Full details can be found in Documentation/filesystems/
>relayfs.txt. The current version can always be found at
>http://www.opersys.com/relayfs."
>
>This _might_ be of interest when complete tables are to be dumped, though
>the ctnetlink based interface is better when the userspace app provides more
>specific queries.

I believe we should stick to a single consistent interface for user comfort.

Best regards,
Patrick
* Re: ctnetlink questions
  2003-10-20 20:20 ` Patrick McHardy
@ 2003-10-20 22:59 ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 22:59 UTC
To: Patrick McHardy
Cc: Balazs Scheidler, Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 10:20:06PM +0200, Patrick McHardy wrote:
> >This _might_ be of interest when complete tables are to be dumped, though
> >the ctnetlink based interface is better when the userspace app provides
> >more specific queries.
>
> I believe we should stick to a single consistent interface for user comfort.

also, if we stick with netlink, we can easily adapt to netlink2... that
gives us the ability to even remotely dump/modify conntrack tables.

> Best regards,
> Patrick

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie
* Re: ctnetlink questions
  2003-10-20  7:15 ` Harald Welte
  2003-10-20  9:37 ` Henrik Nordstrom
@ 2003-10-20 18:17 ` Patrick McHardy
  2003-10-20 18:39 ` Harald Welte
  2003-10-20 18:52 ` Harald Welte
  1 sibling, 2 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 18:17 UTC
To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:
>On Mon, Oct 20, 2003 at 05:01:17AM +0200, Patrick McHardy wrote:
>
>>- we can use some sorting algorithm which benefits from pre-sorted
>>input. this would give better average performance. IIRC new conntracks
>>are added at the head of the chains, so if we sort and walk backwards
>>through the chains we only have to re-sort after an id counter wrap.
>
>this works with a 64bit counter, but doesn't with my proposed generation
>counter. the generation counter has two advantages:
>- no global counter, just the counter field in every conntrack
>- less size increase of struct ip_conntrack.
>
>Oh well, yes. We have to think about the slab cache returning objects
>to the 'real' memory allocator. Then our generation counter would
>become useless. Don't know if it is reliable enough to initialize the
>counter with a random 32bits in that case. How high is the probability
>of re-using the recently-used counter in that case?

Actually with 64 bit the wrap-around time is so large we would never have
to re-sort.

>>I agree, we should use 64 bit.
>
>I feel like I have to cry. As soon as we add any kind of counter (be
>it 32 or 64 bits), we would definitely have to make it a compile time
>option. My vision of ctnetlink was something like a match extension:
>You can always compile it as a module, and it wouldn't hurt performance
>as long as you don't load it.

I understand your objections. It is really not my intent to enlarge
struct ip_conntrack without the need to do so. However, I'd rather save
some memory by, e.g., making helper memory dynamic than by saving these
couple of bytes and making some functionality of ctnetlink somewhat
useless. Compiling as a module without performance penalty won't work
anyway as soon as you enable event notifications.

So do I understand correctly that we're at the point where we agree we
need to add something to struct ip_conntrack, and either use a linearly
increasing global counter or a generation count which increases as soon
as we return something to the slab?

Advantages/Disadvantages of global counter:
- don't need any sorting with 64bit, natural sorting is fine
- uniquely identifies conntracks without risk of collisions
- possible contention on global counter

Advantages/Disadvantages of generation count:
- no global counter
- probably expensive sorting needed over time
- no unique identity, can not be used from userspace

I still favour the global counter but I'm fine as long as dumping works,
a unique identity for userspace is less important. I'd say it's up to
you to decide.

Best regards,
Patrick
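The generation-count idea discussed above (an object address plus a counter that is bumped whenever the object goes back to the slab) can be modelled with a toy pool. This is only a sketch under assumptions: `SlotPool`, its slot indices standing in for pointers, and the handle layout are hypothetical and are not taken from the ctnetlink patches. It also ignores the caveat raised in the thread that the counter is lost once the slab cache hands memory back to the real allocator.

```python
class SlotPool:
    """Toy model: a slot index stands in for the conntrack pointer."""

    def __init__(self, size):
        self.generation = [0] * size      # survives reuse of the slot
        self.in_use = [False] * size
        self.free = list(range(size))

    def alloc(self):
        """Hand out a (slot, generation) handle for a free slot."""
        slot = self.free.pop()
        self.in_use[slot] = True
        return (slot, self.generation[slot])

    def is_valid(self, handle):
        """True only if the handle refers to the current occupant."""
        slot, gen = handle
        return self.in_use[slot] and self.generation[slot] == gen

    def release(self, handle):
        """Return the slot to the pool and bump its generation."""
        if not self.is_valid(handle):
            raise ValueError("stale or unknown handle")
        slot, _ = handle
        self.in_use[slot] = False
        self.generation[slot] += 1
        self.free.append(slot)


pool = SlotPool(1)
old = pool.alloc()
pool.release(old)
new = pool.alloc()            # same slot, next generation
print(old, new, pool.is_valid(old), pool.is_valid(new))
```

A handle taken before the slot was recycled no longer validates, which is the collision protection the generation count is meant to provide. As noted in the thread, such handles carry no ordering, so they do not help with sorting a chain for dumping.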
* Re: ctnetlink questions
  2003-10-20 18:17 ` Patrick McHardy
@ 2003-10-20 18:39 ` Harald Welte
  2003-10-20 19:21 ` Patrick McHardy
  2003-10-21 16:47 ` Patrick McHardy
  2003-10-20 18:52 ` Harald Welte
  1 sibling, 2 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 18:39 UTC
To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 08:17:37PM +0200, Patrick McHardy wrote:
> I still favour the global counter but I'm fine as long as dumping works,
> a unique identity for userspace is less important. I'd say it's up to
> you to decide.

Mh, well let's go for the 64bit, as there seems to be no other choice.
But we should start working on the variable-sized conntracks within
short time afterwards.

> Best regards,
> Patrick

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions
  2003-10-20 18:39 ` Harald Welte
@ 2003-10-20 19:21 ` Patrick McHardy
  2003-10-21 16:47 ` Patrick McHardy
  1 sibling, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 19:21 UTC
To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:
>On Mon, Oct 20, 2003 at 08:17:37PM +0200, Patrick McHardy wrote:
>
>>I still favour the global counter but I'm fine as long as dumping works,
>>a unique identity for userspace is less important. I'd say it's up to
>>you to decide.
>
>Mh, well let's go for the 64bit, as there seems to be no other choice.
>But we should start working on the variable-sized conntracks within
>short time afterwards.

I actually already started it some time ago but it had to step back for
more interesting things ;)

Maybe we can place the global counter next to some other value that is
modified anyway at conntrack creation, so they will be in the same cache
line. ip_conntrack_count comes to mind but unfortunately it's atomic_t
(volatile).

Best regards,
Patrick
* Re: ctnetlink questions
  2003-10-20 18:39 ` Harald Welte
  2003-10-20 19:21 ` Patrick McHardy
@ 2003-10-21 16:47 ` Patrick McHardy
  2003-10-21 19:54 ` Henrik Nordstrom
  1 sibling, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-21 16:47 UTC
To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:
>Mh, well let's go for the 64bit, as there seems to be no other choice.
>But we should start working on the variable-sized conntracks within
>short time afterwards.

Ok there's another problem: for fast lookups by id (we don't want to
search the entire hash) we need to encode the hash chain of a tuple
in the id. We basically have two choices now for the remaining bits:

a) keep using a global counter, which reduces the namespace to
   2^(64-lg(hashsize))
b) use a per-bucket counter, which keeps the 64 bit namespace and
   eliminates potential contention on the counter but requires as much
   memory as the hash buckets themselves.

The problem is guessing how big the hash might get. With a hash of 2^20
buckets and 1 million connections/s the ids wrap after 0.5 years with
possibility a. Even if the connection rate may be unrealistically high,
I assume a hash size of 2^20 and bigger is realistic now or might be
soon, so the chance of seeing reused ids is real. Possibility b is of
course not acceptable due to memory usage.

My proposed solution is to reserve 16bit for the chain id and to
compensate for the remaining used bits by keeping
2^max(number_of_bits-16, 0) counters. This always gives us 48bit for the
id (if hash distribution is good); with the numbers above that is ~9
years without a wraparound while keeping 16 counters. For a hash size
<= 2^16 we still only have one counter, but if we really experience
contention we now can easily increase it.

Does that sound ok? Feel free to shut me up by giving some more
realistic numbers ;)

Best regards,
Patrick
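The wraparound figures in the message above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, a year is taken as 365 days, and the rate of 1 million new connections per second is the deliberately pessimistic figure from the message.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def wrap_years(counter_bits, conns_per_sec):
    """Years until a counter of counter_bits bits wraps at the given rate."""
    return (2 ** counter_bits) / conns_per_sec / SECONDS_PER_YEAR


def num_counters(hash_bits, chain_bits=16):
    """Counters kept under the proposal: 2^max(hash_bits - chain_bits, 0)."""
    return 2 ** max(hash_bits - chain_bits, 0)


hash_bits = 20            # 2^20 buckets
rate = 1_000_000          # new connections per second

# Possibility a: lg(hashsize) bits of the 64 go to the bucket number.
years_a = wrap_years(64 - hash_bits, rate)

# Proposal: 16 bits for the chain id, 48 bits left for the counter.
years_proposed = wrap_years(48, rate)

print(f"possibility a: {years_a:.2f} years")
print(f"proposal:      {years_proposed:.2f} years")
print(f"counters:      {num_counters(hash_bits)}")
```

With these inputs the first figure comes out slightly above half a year and the second slightly below nine years, matching the estimates in the message, and a 2^20 bucket hash needs 16 counters.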
* Re: ctnetlink questions
  2003-10-21 16:47 ` Patrick McHardy
@ 2003-10-21 19:54 ` Henrik Nordstrom
  2003-10-21 20:00 ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-21 19:54 UTC
To: Patrick McHardy; +Cc: Harald Welte, Netfilter Development Mailinglist

On Tue, 21 Oct 2003, Patrick McHardy wrote:

> Ok there's another problem, for fast lookups by id (we don't want to
> search the entire hash) we need to encode the hash chain of a tuple
> in the id. We basically have two choices now for the remaining bits:

You can always make the userspace ID larger by including the bucket
number. There is no need to extend the ID stored in the conntrack for
this as it is already known.

Regards
Henrik
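Henrik's suggestion can be sketched as follows: the conntrack keeps only its counter value, and the bucket number is prepended when the ID is handed to userspace. The names, the dictionary-based entries and the 64 bit width of the stored ID are assumptions made for illustration and do not describe the actual ctnetlink message format.

```python
ID_BITS = 64                      # assumed width of the ID stored per conntrack
ID_MASK = (1 << ID_BITS) - 1


def make_user_id(bucket, ct_id):
    """Compose the ID shown to userspace from bucket number and stored ID."""
    return (bucket << ID_BITS) | ct_id


def split_user_id(user_id):
    """Recover (bucket, stored ID) from a userspace ID."""
    return user_id >> ID_BITS, user_id & ID_MASK


def lookup(table, user_id):
    """Walk only the one chain named by the userspace ID."""
    bucket, ct_id = split_user_id(user_id)
    if bucket >= len(table):
        return None
    for entry in table[bucket]:
        if entry["id"] == ct_id:
            return entry
    return None


# Toy hash table: a list of chains, each chain a list of entries.
table = [
    [{"id": 7, "tuple": "10.0.0.1:1024 -> 10.0.0.2:80"}],
    [{"id": 3, "tuple": "10.0.0.1:1025 -> 10.0.0.2:22"},
     {"id": 9, "tuple": "10.0.0.3:4000 -> 10.0.0.2:25"}],
]

uid = make_user_id(1, 9)
print(split_user_id(uid), lookup(table, uid)["tuple"])
```

The lookup touches a single chain, so the stored ID does not have to give up any bits to the bucket number.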
* Re: ctnetlink questions
  2003-10-21 19:54 ` Henrik Nordstrom
@ 2003-10-21 20:00 ` Patrick McHardy
  0 siblings, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2003-10-21 20:00 UTC
To: Henrik Nordstrom; +Cc: Harald Welte, Netfilter Development Mailinglist

Henrik Nordstrom wrote:
>On Tue, 21 Oct 2003, Patrick McHardy wrote:
>
>>Ok there's another problem, for fast lookups by id (we don't want to
>>search the entire hash) we need to encode the hash chain of a tuple
>>in the id. We basically have two choices now for the remaining bits:
>
>You can always make the userspace ID larger by including the bucket
>number. There is no need to extend the ID stored in the conntrack for
>this as it is already known.
>
>Regards
>Henrik

Ok I have to admit I didn't think of this obvious solution.

Thanks again,
Patrick
* Re: ctnetlink questions
  2003-10-20 18:17 ` Patrick McHardy
  2003-10-20 18:39 ` Harald Welte
@ 2003-10-20 18:52 ` Harald Welte
  2003-10-20 19:52 ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Harald Welte @ 2003-10-20 18:52 UTC
To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Actually, another point is what to do with expectations.

It's not as problematic as with conntracks, but in general it's the
same. Let's say:

- conntrack helper creates expectation for tuple xyz
- userspace gets a list of unconfirmed expects
- expectation for tuple xyz is confirmed
- conntrack helper creates a new expectation for tuple xyz
- userspace wants to remove expectation by referring to tuple xyz.

At least in this case, that might actually be what the user wants -
since there is not much difference between the two expectations, other
than time passing in between.

If the helper is automatically re-adding the expectation upon
confirmation, then the user can race with incoming connections in order
to 'break the circle' ;)

I think there is not too much point in removing expects anyway. The
real need is for adding and modifying expectations, in case there is a
userspace conntrack/nat helper.

What do you think?

btw: In the failover code, I have another problem with regard to
expects: A sibling conntrack has a pointer to the master expect, not
the master conntrack. But I somehow need to replicate that pointer.
Without Krisztian's idmap (that I've already ripped out), I'm now passing
the master conntrack's tuple and the expectation's tuple. This works
while doing normal sync: There can always be only one unconfirmed
expectation for every tuple.

However, when doing an initial sync (or a full resync), I cannot
replicate the whole tree of master and siblings - because I first
replicate the conntracks and then later have to fill in the [confirmed]
expectations as glue in between.

The only idea I have is to use the 'seq' number. However, this again
only works for TCP.

Any other options?

(In the future, I think we will at least optionally have per-connection
byte and packet counters. Then the 'seq' field could be initialized to
the byte counter at the time the expectation was raised. This would
solve the udp case [and other udp races which Patrick can tell us
nightmares of]. However, this is once again enlarging conntrack - but
only if somebody wants to use the 'connbytes' match or create
netflow-style connection logs. But we could require that compile option
in case somebody wants to enable failover.)

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions
  2003-10-20 18:52 ` Harald Welte
@ 2003-10-20 19:52 ` Patrick McHardy
  2003-10-20 23:09 ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2003-10-20 19:52 UTC
To: Harald Welte; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

Harald Welte wrote:
>Actually, another point is what to do with expectations.
>
>It's not as problematic as with conntracks, but in general it's the
>same. Let's say:
>
>- conntrack helper creates expectation for tuple xyz
>- userspace gets a list of unconfirmed expects
>- expectation for tuple xyz is confirmed
>- conntrack helper creates a new expectation for tuple xyz
>- userspace wants to remove expectation by referring to tuple xyz.
>
>At least in this case, that might actually be what the user wants -
>since there is not much difference between the two expectations, other
>than time passing in between.
>
>If the helper is automatically re-adding the expectation upon
>confirmation, then the user can race with incoming connections in order
>to 'break the circle' ;)
>
>I think there is not too much point in removing expects anyway. The
>real need is for adding and modifying expectations, in case there is a
>userspace conntrack/nat helper.
>
>What do you think?

I agree. Removing is not very important. Modifying also requires a way to
identify the expectation. Currently (as you might have noticed) the
expectations also include an id. The namespace could probably be smaller
than for conntracks but I need to think about this some more after
catching breakfast.

BTW: I thought a bit about userspace helpers. They need to be synchronous
like they are now so we don't get races where an expectation arrives
before it's registered. So they should probably receive their packets
through ip_queue. How realistic do you think it is to move
ftp/irc/amanda... to userspace? All of them operate on low traffic
protocols, but if they send packets to userspace through netlink sockets,
operation can easily be interrupted by sending lots of traffic that will
fill up the socket buffer.

>btw: In the failover code, I have another problem with regard to
>expects: A sibling conntrack has a pointer to the master expect, not
>the master conntrack. But I somehow need to replicate that pointer.
>Without Krisztian's idmap (that I've already ripped out), I'm now passing
>the master conntrack's tuple and the expectation's tuple. This works
>while doing normal sync: There can always be only one unconfirmed
>expectation for every tuple.
>
>However, when doing an initial sync (or a full resync), I cannot
>replicate the whole tree of master and siblings - because I first
>replicate the conntracks and then later have to fill in the [confirmed]
>expectations as glue in between.
>
>The only idea I have is to use the 'seq' number. However, this again
>only works for TCP.
>
>Any other options?

I have not studied the code intensively, but can't you just sync the
tables in order of the hierarchy:

conntrack
conntrack, master-expect, sibling-conntrack, sibling-conntrack, ...
conntrack
...

Best regards,
Patrick

>(In the future, I think we will at least optionally have per-connection
>byte and packet counters. Then the 'seq' field could be initialized to
>the byte counter at the time the expectation was raised. This would
>solve the udp case [and other udp races which Patrick can tell us
>nightmares of]. However, this is once again enlarging conntrack - but
>only if somebody wants to use the 'connbytes' match or create
>netflow-style connection logs. But we could require that compile option
>in case somebody wants to enable failover.)
* Re: ctnetlink questions
  2003-10-20 19:52 ` Patrick McHardy
@ 2003-10-20 23:09 ` Harald Welte
  0 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 23:09 UTC
To: Patrick McHardy; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 09:52:05PM +0200, Patrick McHardy wrote:

> I agree. Removing is not very important. Modifying also requires a way to
> identify the expectation. Currently (as you might have noticed) the
> expectations also include an id. The namespace could probably be smaller
> than for conntracks but I need to think about this some more after
> catching breakfast.

yes, I've noted that they also have ids. However, as you will have
noticed by now, I feel very reluctant to add ids to our structures ;)

> BTW: I thought a bit about userspace helpers. They need to be synchronous
> like they are now so we don't get races where an expectation arrives
> before it's registered. So they should probably receive their packets
> through ip_queue.

for local helpers this is true. But think about even more complex
setups, like the envisioned SIP proxy (that might even run on a totally
different machine than your packet filter). They are inherently racy -
and there's nothing we can do about that.

> How realistic do you think it is to move ftp/irc/amanda... to
> userspace? All of them operate on low traffic protocols, but if they
> send packets to userspace through netlink sockets, operation can easily
> be interrupted by sending lots of traffic that will fill up the socket
> buffer.

yes, that is a problem. another problem is that there can only be one
userspace process attached to the queue. And no, we don't want to use
ipqmpd - that is just a hack and doesn't scale at all.

btw: I already have a patch of an l3 independent queue implementation.
The only problem is that it has to change the packet format - and thus
will introduce incompatibility :(

> I have not studied the code intensively, but can't you just sync the
> tables in order of the hierarchy:
>
> conntrack
> conntrack, master-expect, sibling-conntrack, sibling-conntrack, ...
> conntrack
> ...

well, this means I cannot use the existing iterator functions, and
locking might become complex once we have per-bucket locks. Also, I
don't like the idea of hiding too much information in the packet/message
order. Yes, ordering is guaranteed by the protocol - but even then...

I'll have to think about that... but now I'm off for some sleep. Maybe
tomorrow ;)

> Best regards,
> Patrick

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions
  2003-10-20  1:05 ` Henrik Nordstrom
  2003-10-20  3:01 ` Patrick McHardy
@ 2003-10-20  7:04 ` Harald Welte
  2003-10-20  7:17 ` Jozsef Kadlecsik
  2003-10-20 11:11 ` Jozsef Kadlecsik
  3 siblings, 0 replies; 40+ messages in thread
From: Harald Welte @ 2003-10-20 7:04 UTC
To: Henrik Nordstrom; +Cc: Patrick McHardy, Netfilter Development Mailinglist

On Mon, Oct 20, 2003 at 03:05:41AM +0200, Henrik Nordstrom wrote:

> It is important that userspace does not miss entries which were in the
> kernel when dumping started and still exist in the kernel when the dump
> finished.

finally agreed.

> It is also important userspace can have some kind of semi-static
> reference to a conntrack to be able to manipulate that conntrack without
> risking hitting another conntrack.

also agreed.

> It is OK for me if it is unspecified what happens with entries which
> either were created or destroyed while the dump was in progress.

ack.

> With these criteria in mind I propose a hybrid of your approaches
>
> a) Assign a globally unique ID to each conntrack, in such manner that IDs
> are not reused for a significant amount of time. This to provide a stable
> point of reference to a connection with low risk of false collisions if
> the original connection was destroyed while userspace still thought it was
> there.

In reality, we could use a pointer together with a generation counter.
That generation counter could be incremented as soon as we return the
structure to the slab cache.

This way we could live with a 32bit generation counter +
pointer/address.

> b) When dumping the conntrack entries, dump one bucket at a time.
> If the bucket is too large to fit in the current response packet
> then sort the bucket entries on ID and keep track of which bucket+ID
> was last dumped. On next netlink packet restart at the same bucket and
> skip the entries with an ID lower than those already dumped for that
> bucket.

I'm going to comment on this in the next mail.

> Regarding the conntrack ID. For me it is acceptable if as much as 64 bits
> is reserved for the conntrack ID. This gives sufficient namespace to

for me, not a single bit is acceptable. the size of ip_conntrack is
already way too heavy. the l3 generic conntrack should have support for
different-sized conntracks, e.g. saving the ct/nat helper part for all
connections but the ones that actually have a helper, etc.

> a) Provide truly unique IDs suitable for long-term reference without any
> risk of collisions.

a generation counter would make that guarantee.

> b) Allows for the namespace to be built in such manner that there never
> will be any risk for congestion in finding the next available ID. For
> example by using CPU#+counter.

generation counters also fulfill that requirement. However, they are
not ordered.

> Regards
> Henrik

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
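Proposal b) quoted above (dump one bucket at a time, sort the chain on ID, remember bucket plus last dumped ID, and skip already dumped IDs on the next call) can be sketched like this. It is a toy model rather than kernel code: chains are plain lists of integer IDs, `max_entries` (which must be at least 1) stands in for what fits into one skb, the cursor stands in for the state kept in cb->args[], and the function name is hypothetical.

```python
def dump_chunk(table, cursor, max_entries):
    """Dump up to max_entries IDs, resuming from cursor.

    cursor is (bucket, last_id): the bucket being dumped and the highest
    ID already dumped from it (-1 if none). Returns (ids, next_cursor),
    where next_cursor is None once the whole table has been walked.
    """
    bucket, last_id = cursor
    out = []
    while bucket < len(table):
        # Sort the chain on ID and skip what was dumped earlier.
        pending = sorted(i for i in table[bucket] if i > last_id)
        for ct_id in pending:
            if len(out) == max_entries:
                return out, (bucket, last_id)   # "skb" is full, resume here
            out.append(ct_id)
            last_id = ct_id
        bucket, last_id = bucket + 1, -1        # chain finished, next bucket
    return out, None


table = [[5, 1, 9, 3], [2, 7]]

first, cur = dump_chunk(table, (0, -1), 2)
table[0].remove(1)            # an already dumped entry is destroyed meanwhile
second, cur = dump_chunk(table, cur, 2)
third, cur = dump_chunk(table, cur, 2)
print(first, second, third, cur)
```

Because the cursor names an ID rather than a position in the chain, destroying an already dumped entry between two calls does not shift the resume point. The cost is the per-conntrack ID and the sort of every chain that does not fit into one response, which is the overhead objected to in this thread.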
* Re: ctnetlink questions
  2003-10-20  1:05 ` Henrik Nordstrom
  2003-10-20  3:01 ` Patrick McHardy
  2003-10-20  7:04 ` Harald Welte
@ 2003-10-20  7:17 ` Jozsef Kadlecsik
  2003-10-20  9:29 ` Henrik Nordstrom
  2003-10-20 14:48 ` Harald Welte
  2003-10-20 11:11 ` Jozsef Kadlecsik
  3 siblings, 2 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2003-10-20 7:17 UTC
To: Henrik Nordstrom; +Cc: Netfilter Development Mailinglist

Hi,

On Mon, 20 Oct 2003, Henrik Nordstrom wrote:

> a) Assign a globally unique ID to each conntrack, in such manner that IDs
> are not reused for a significant amount of time. This to provide a stable
> point of reference to a connection with low risk of false collisions if
> the original connection was destroyed while userspace still thought it was
> there.

I still don't see why we can't simply use the tuple as unique id, as
Harald suggested. That's truly unique and does not require additional
fields in the ip_conntrack structure.

> This requires a read lock per hash bucket while dumping that bucket, and
> some small (usually) amount of memory to keep the temporary sorted index
> of bucket entries unless the bucket is permanently resorted in which case
> it may be possible to solve with no memory allocation (but then requires
> the bucket to be write locked while resorting which is probably worse).

At the developer workshop I presented my per bucket locking patch, with
some performance comparison graphs. It's time to sync the patch and
release it...

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary
* Re: ctnetlink questions
  2003-10-20  7:17 ` Jozsef Kadlecsik
@ 2003-10-20  9:29 ` Henrik Nordstrom
  2004-02-06 18:52 ` Harald Welte
  2003-10-20 14:48 ` Harald Welte
  1 sibling, 1 reply; 40+ messages in thread
From: Henrik Nordstrom @ 2003-10-20 9:29 UTC
To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

On Mon, 20 Oct 2003, Jozsef Kadlecsik wrote:

> I still don't see why we can't simply use the tuple as unique id, as
> Harald suggested. That's truly unique and does not require additional
> fields in the ip_conntrack structure.

Because it is not long-term unique. With the tuple approach the
administrator risks hitting another connection if the originally intended
connection has already been destroyed and replaced by a new connection
with the same address details.

But yes, if we use the full conntrack tuple (both directions) then the
uniqueness is probably good enough for all practical purposes except when
there are evil clients in the mix. On the other hand it becomes a
"little" cumbersome to work with if you want the administrator to ever
enter manually which connection he refers to, even more so if you
consider that the details of a tuple vary greatly per protocol.

A 64-bit integer can be copy-pasted, and is relatively easy to manage in
textual form. A full conntrack tuple (at minimum "protocol, source IP,
dest IP, reply source IP, reply destination IP, source port, destination
port, reply source port, reply destination port", but preferably a binary
"struct ip_conntrack_tuple tuple[2]") is obviously not as easy to manage.

> At the developer workshop I presented my per bucket locking patch, with
> some performance comparison graphs. It's time to sync the patch and
> release it...

Would be great ;-)

Regards
Henrik
* Re: ctnetlink questions
  2003-10-20  9:29 ` Henrik Nordstrom
@ 2004-02-06 18:52 ` Harald Welte
  2004-02-09 10:33 ` Pablo Neira
  2004-02-10 12:39 ` Patrick McHardy
  0 siblings, 2 replies; 40+ messages in thread
From: Harald Welte @ 2004-02-06 18:52 UTC
To: Henrik Nordstrom; +Cc: Jozsef Kadlecsik, Netfilter Development Mailinglist

Hi!

I have to follow up on this old discussion, since I want to get
ctnetlink into a submission-ready state.

On Mon, Oct 20, 2003 at 11:29:46AM +0200, Henrik Nordstrom wrote:
> On Mon, 20 Oct 2003, Jozsef Kadlecsik wrote:
>
> > I still don't see why can't we simply use the tuple as unique id, as
> > Harald suggested. That's truly unique and does not require additional
> > fields in the ip_conntrack structure.
>
> Because it is not long-term unique. With the tuple approach the
> administrator risks hitting another connection if the originally intended
> connection has already been destroyed and replaced by a new connection
> with the same address details.

well, but if the tuple is again the same tuple, chances are high the
administrator actually wants to remove that new connection as much as
the previous one. In fact, apart from a short difference in time, they
_are_ pretty much the same connection.

So from my point of view, the tuple is still sufficient. Tuple can be
used by userspace to identify a connection, tuple is used for
replication messages in ct_sync.

We can also guarantee that all entries that
 - existed before the dump started
 - and still exist when the dump ended
are actually dumped.

We don't make any guarantees about connections that either started
within that timeframe, or have been terminated within that timeframe.

I would really like to see the ordered list and id disappear.

> Regards
> Henrik

--
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
* Re: ctnetlink questions 2004-02-06 18:52 ` Harald Welte @ 2004-02-09 10:33 ` Pablo Neira 2004-02-10 12:39 ` Patrick McHardy 1 sibling, 0 replies; 40+ messages in thread From: Pablo Neira @ 2004-02-09 10:33 UTC (permalink / raw) To: Harald Welte, netfilter-devel Hi! I've been working on an API to add, update conntrack entries since fall 2003, it's my final proyect at the university. It's still experimental and it's far from Harald and Jozsef work because of their experience in that matter. Anyway I promise to post that patch. Harald Welte wrote: >>Because it is not long-term unique. With the tuple approach the >>administrator risks hitting another connection if the originally intended >>connection has already been destroyed and replaced by a new connection >>with the same address details. >> >> well, I can't see any problem, anyway we could perform some checkings to avoid something like this: - when a create entry message arrives we could check if there's a connection with the same address details by using ip_conntrack_find_get(...) and if it's found, update the ip_conntrack structure with the new info. but by the means of the id we could have two conntrack structures with the same address info. I think that this duplicated info, actually I think that it's better considering that last info received about a connection is up to date and forget the state of the old one. >well, but if the tuple is again the same tuple, chances are high the >administrator actually wants to remove that new connection as much as >the previous one. In fact, apart from a short difference in time, they >_are_ pretty much the same connection. > > >So from my point of view, the tuple is still sufficient. Tuple can be >used by userspace to identify a connection, tuple is used for >replication messages in ct_sync. > >We can also guarantee, that all entries that > - existed before the dump started > - and still exist when the dump ended >are actually dumped. 
> >We don't make any guarantees about connections that either started >within that timeframe, or have been terminated within that timeframe. > >I would really like to see the ordered list and id disappear. > > I agree with Harald. Actually, I think that adding a new id and the ordered list will complicate the current structure; I would prefer redesigning the current structure of the conntrack table to adding those fields. Best regards, Pablo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2004-02-06 18:52 ` Harald Welte 2004-02-09 10:33 ` Pablo Neira @ 2004-02-10 12:39 ` Patrick McHardy 2004-02-14 20:03 ` Harald Welte 1 sibling, 1 reply; 40+ messages in thread From: Patrick McHardy @ 2004-02-10 12:39 UTC (permalink / raw) To: Harald Welte Cc: Henrik Nordstrom, Jozsef Kadlecsik, Netfilter Development Mailinglist Harald Welte wrote: > On Mon, Oct 20, 2003 at 11:29:46AM +0200, Henrik Nordstrom wrote: >>Because it is not long-term unique. With the tuple approach the >>administrator risks hitting another connection if the originally intended >>connection has already been destroyed and replaced by a new connection >>with the same address details. > > > well, but if the tuple is again the same tuple, chances are high the > administrator actually wants to remove that new connection as much as > the previous one. In fact, apart from a short difference in time, they > _are_ pretty much the same connection. > > So from my point of view, the tuple is still sufficient. Tuple can be > used by userspace to identify a connection, tuple is used for > replication messages in ct_sync. > > We can also guarantee, that all entries that > - existed before the dump started > - and still exist when the dump ended > are actually dumped. > > We don't make any guarantees about connections that either started > within that timeframe, or have been terminated within that timeframe. > > I would really like to see the ordered list and id disappear. I can make my peace with not having a unique identity for each conntrack over time, but the other use for IDs was to continue an interrupted dump at the right place, how can we solve this ? The problematic case is when a single hash-chain doesn't fit into an skb. We need to remember the last one dumped somehow, and be able to continue at the next one not dumped even when the last one dumped is gone when the dump continues. Best regards, Patrick ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2004-02-10 12:39 ` Patrick McHardy @ 2004-02-14 20:03 ` Harald Welte 2004-02-15 10:01 ` Patrick McHardy 0 siblings, 1 reply; 40+ messages in thread From: Harald Welte @ 2004-02-14 20:03 UTC (permalink / raw) To: Patrick McHardy Cc: Henrik Nordstrom, Jozsef Kadlecsik, Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 1341 bytes --] On Tue, Feb 10, 2004 at 01:39:01PM +0100, Patrick McHardy wrote: > I can make my peace with not having a unique identity for each conntrack > over time, Thanks :) > but the other use for IDs was to continue an interrupted > dump at the right place, how can we solve this ? The problematic case > is when a single hash-chain doesn't fit into an skb. We need to remember > the last one dumped somehow, and be able to continue at the next one > not dumped even when the last one dumped is gone when the dump > continues. We'd have to ensure that a single hash chain is not longer than what we could put into one skb. This can be done by limiting the maximum number of entries in a bucket (and then rehash). Also, we should increase the default number of hash buckets to reduce the probability that this might happen. Also, Jozsef proposed a flip/flop bit mechanism that would solve that problem. What do you say to his proposal? > Patrick -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2004-02-14 20:03 ` Harald Welte @ 2004-02-15 10:01 ` Patrick McHardy 2004-02-17 21:37 ` Harald Welte 0 siblings, 1 reply; 40+ messages in thread From: Patrick McHardy @ 2004-02-15 10:01 UTC (permalink / raw) To: Harald Welte Cc: Henrik Nordstrom, Jozsef Kadlecsik, Netfilter Development Mailinglist Harald Welte wrote: > On Tue, Feb 10, 2004 at 01:39:01PM +0100, Patrick McHardy wrote: > >>but the other use for IDs was to continue an interrupted >>dump at the right place, how can we solve this ? The problematic case >>is when a single hash-chain doesn't fit into an skb. We need to remember >>the last one dumped somehow, and be able to continue at the next one >>not dumped even when the last one dumped is gone when the dump >>continues. > > > We'd have to ensure that a single hash chain is not longer than what we > could put into one skb. This can be done by limiting the maximum number > of entries in a bucket (and then rehash). Also, we should increase the > default number of hash buckets to reduce the probability that this might > happen. I like the idea. So assuming that long hash chains are a result of "bad luck" with the jenkins hash, we would just change the secret, rehash, and repeat if some chains are still too long? At what point do you propose rehashing: at the moment the chain length exceeds some threshold (or thereafter, deferred to occur out of packet processing context), or when dumping over netlink? > > Also, Jozsef proposed a flip/flop bit mechanism that would solve that > problem. What do you say to his proposal? > Can I find his proposal somewhere? > >>Patrick > > ^ permalink raw reply [flat|nested] 40+ messages in thread
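The "change the secret, rehash, and repeat" loop discussed in this message could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_hash`, `longest_chain`, `pick_secret` and `MAX_BUCKETS` are hypothetical names, the multiplicative hash merely stands in for the Jenkins hash, and keys are plain integers rather than conntrack tuples. It does not answer the open question of when the rehash should run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BUCKETS 1024 /* hypothetical upper bound for this sketch */

/* Toy keyed hash; a stand-in for the real Jenkins hash, not the kernel code. */
static uint32_t toy_hash(uint32_t key, uint32_t secret, uint32_t nbuckets)
{
    uint32_t h = (key ^ secret) * 2654435761u;

    h ^= h >> 16;
    return h % nbuckets;
}

/* Length of the longest chain if all keys were hashed with this secret. */
static size_t longest_chain(const uint32_t *keys, size_t n,
                            uint32_t secret, uint32_t nbuckets)
{
    size_t counts[MAX_BUCKETS] = {0};
    size_t longest = 0;

    assert(nbuckets > 0 && nbuckets <= MAX_BUCKETS);
    for (size_t i = 0; i < n; i++) {
        size_t c = ++counts[toy_hash(keys[i], secret, nbuckets)];

        if (c > longest)
            longest = c;
    }
    return longest;
}

/*
 * Try candidate secrets until no chain is longer than limit (the number of
 * entries assumed to fit into one skb). Returns 1 and stores the secret on
 * success, 0 if max_tries candidates were all rejected.
 */
static int pick_secret(const uint32_t *keys, size_t n, uint32_t nbuckets,
                       size_t limit, uint32_t first_secret,
                       unsigned max_tries, uint32_t *secret_out)
{
    uint32_t secret = first_secret;

    for (unsigned t = 0; t < max_tries; t++) {
        if (longest_chain(keys, n, secret, nbuckets) <= limit) {
            *secret_out = secret;
            return 1;
        }
        /* Derive the next candidate; a real system would use random bytes. */
        secret = secret * 1664525u + 1013904223u;
    }
    return 0;
}
```

A caller would fall back to something else (more buckets, for instance) when `pick_secret` returns 0, since a bounded number of retries cannot guarantee short chains.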
* Re: ctnetlink questions 2004-02-15 10:01 ` Patrick McHardy @ 2004-02-17 21:37 ` Harald Welte 0 siblings, 0 replies; 40+ messages in thread From: Harald Welte @ 2004-02-17 21:37 UTC (permalink / raw) To: Patrick McHardy Cc: Henrik Nordstrom, Jozsef Kadlecsik, Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 1292 bytes --] On Sun, Feb 15, 2004 at 11:01:10AM +0100, Patrick McHardy wrote: > I like the idea. So assuming that long hash chains are a result of "bad > luck" with the jenkins hash, we would just change the secret, rehash, > and repeat if some chains are still too long ? At what point do you > propose rehashing, at the moment the chain length exceeds some threshold > (or thereafter, defered to occur out of packet processing context), or > when dumping over netlink ? Mh. I am not really sure what might be the best solution. It should definitely not happen within softirq context, though. > >Also, Jozsef proposed a flip/flop bit mechanism that would solve that > >problem. What do you say to his proposal? > > Can I find his proposal somewhere ? Message-Id: <Pine.LNX.4.33.0310201018510.12485-100000@blackhole.kfki.hu> http://lists.netfilter.org/pipermail/netfilter-devel/2003-October/012821.html -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2003-10-20 7:17 ` Jozsef Kadlecsik 2003-10-20 9:29 ` Henrik Nordstrom @ 2003-10-20 14:48 ` Harald Welte 2003-10-20 18:53 ` Patrick McHardy 1 sibling, 1 reply; 40+ messages in thread From: Harald Welte @ 2003-10-20 14:48 UTC (permalink / raw) To: Jozsef Kadlecsik; +Cc: Henrik Nordstrom, Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 2006 bytes --] On Mon, Oct 20, 2003 at 09:17:36AM +0200, Jozsef Kadlecsik wrote: > Hi, > > On Mon, 20 Oct 2003, Henrik Nordstrom wrote: > > > a) Assign a globally unique ID to each conntrack, in such manner that IDs > > are not reused for a significant amount of time. This to provide a stable > > point of reference to a connection with low risk of false collisions if > > the original connection was destroyed while userspace still thought it was > > there. > > I still don't see why we can't simply use the tuple as unique id, as > Harald suggested. That's truly unique and does not require additional > fields in the ip_conntrack structure. I think you are mixing up two separate issues: 1) uniquely representing ip_conntrack during state replication between master and slave. Here the tuple is sufficient, since all state changes will be processed in-order. Since the tuple is always unique in the hashtable, there is no mistake of updating/deleting the wrong one. 2) uniquely identifying an ip_conntrack from userspace. When userspace first dumps and then deletes by tuple, the tuple might already have been reused. This is what most of the discussion was about, where a 64bit counter or the generation counter+address had been suggested as possible solutions. > On the developer workshop I presented my per bucket locking patch, with > some performance comparison graphs. It's time to sync the patch and > release it... yes... can we first have the final raw and tcp-window-tracking patch? It's probably already too late to get them in 2.6.x anyway... but let's try. 
> Best regards, > Jozsef -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2003-10-20 14:48 ` Harald Welte @ 2003-10-20 18:53 ` Patrick McHardy 2003-10-20 22:57 ` Harald Welte 0 siblings, 1 reply; 40+ messages in thread From: Patrick McHardy @ 2003-10-20 18:53 UTC (permalink / raw) To: Harald Welte Cc: Jozsef Kadlecsik, Henrik Nordstrom, Netfilter Development Mailinglist Harald Welte wrote: >1) uniquely representing ip_conntrack during state replication between >master and slave. Here the tuple is sufficient, since all state changes >will be processed in-order. Since the tuple is always unique in the >hashtable, there is no mistake of updating/deleting the wrong one. > >2) uniquely identifying an ip_conntrack from userspace. >When userspace first dumps and then deletes by tuple, the tuple might >already have been reused. This is what most of the discussion was >about, where a 64bit counter or the generation counter+address had been >suggested as possible solutions. > > Seems now I didn't understand. How can we use the ids generated by a generation counter in userspace? In any busy system they will be invalidated too fast .. Best regards, Patrick ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: ctnetlink questions 2003-10-20 18:53 ` Patrick McHardy @ 2003-10-20 22:57 ` Harald Welte 0 siblings, 0 replies; 40+ messages in thread From: Harald Welte @ 2003-10-20 22:57 UTC (permalink / raw) To: Patrick McHardy Cc: Jozsef Kadlecsik, Henrik Nordstrom, Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 1546 bytes --] On Mon, Oct 20, 2003 at 08:53:08PM +0200, Patrick McHardy wrote: > Harald Welte wrote: > > >1) uniquely representing ip_conntrack during state replication between > >master and slave. Here the tuple is sufficient, since all state changes > >will be processed in-order. Since the tuple is always unique in the > >hashtable, there is no mistake of updating/deleting the wrong one. > > > >2) uniquely identifying an ip_conntrack from userspace. > >When userspace first dumps and then deletes by tuple, the tuple might > >already have been reused. This is what most of the discussion was > >about, where a 64bit counter or the generation counter+address had been > >suggested as possible solutions. > > > > > > Seems now I didn't understand. How can we use the ids generated by a > generation counter in userspace ? In any busy system they will be > invalidated too fast .. no. The idea is that every particular conntrack (that is, allocated chunk of memory) has its own generation counter. That counter is part of struct ip_conntrack and incremented every time we return this chunk of memory to the slab cache. > Best regards, > Patrick -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
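The "generation counter + address" idea described in this message might look roughly like the sketch below. It is an assumption-laden illustration, not the proposed kernel change: a small static array (`pool`) stands in for the slab cache, a slot index stands in for the memory address, and all names (`ct_slot`, `ct_handle`, `ct_alloc`, `ct_free`, `ct_handle_valid`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE 4 /* tiny stand-in for the slab cache */

/* One reusable chunk of memory; the counter survives free/alloc cycles. */
struct ct_slot {
    uint32_t generation;
    int in_use;
};

/* What userspace would hold: which chunk, and which incarnation of it. */
struct ct_handle {
    uint32_t slot;
    uint32_t generation;
};

static struct ct_slot pool[POOL_SIZE];

/* Take the first free slot and hand out a handle for this incarnation. */
static int ct_alloc(struct ct_handle *h)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            h->slot = i;
            h->generation = pool[i].generation;
            return 1;
        }
    }
    return 0; /* pool exhausted */
}

/* Returning the chunk bumps its generation, invalidating old handles. */
static void ct_free(uint32_t slot)
{
    assert(slot < POOL_SIZE && pool[slot].in_use);
    pool[slot].in_use = 0;
    pool[slot].generation++;
}

/* A handle is valid only while the same incarnation is still live. */
static int ct_handle_valid(const struct ct_handle *h)
{
    return h->slot < POOL_SIZE &&
           pool[h->slot].in_use &&
           pool[h->slot].generation == h->generation;
}
```

With this scheme a handle taken during a dump stops matching as soon as its chunk is freed, even if a new connection with the same tuple is later placed in the same chunk. The counter only advances per chunk, which is the point of the reply above: it is not a global counter that a busy system churns through.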
* Re: ctnetlink questions 2003-10-20 1:05 ` Henrik Nordstrom ` (2 preceding siblings ...) 2003-10-20 7:17 ` Jozsef Kadlecsik @ 2003-10-20 11:11 ` Jozsef Kadlecsik 3 siblings, 0 replies; 40+ messages in thread From: Jozsef Kadlecsik @ 2003-10-20 11:11 UTC (permalink / raw) To: Henrik Nordstrom; +Cc: Netfilter Development Mailinglist On Mon, 20 Oct 2003, Henrik Nordstrom wrote: > It is important that userspace does not miss entries which were in the > kernel when dumping started and still exist in the kernel when the dump > finished. > > It is also important userspace can have some kind of semi-static > reference to a conntrack to be able to manipulate that conntrack without > risking hitting another conntrack. > > It is OK for me if it is unspecified what happens with entries which > either were created or destroyed while the dump was in progress. This is an excellent summary of the requirements of the dump functionality in ctnetlink. However, I think Harald has a point about shrinking instead of blowing up the ip_conntrack structure. What about introducing new, flip-flop conntrack status bits? /* Dump state A */ IPS_DUMP_A_BIT = 4, IPS_DUMP_A = (1 << IPS_DUMP_A_BIT), /* Dump state B */ IPS_DUMP_B_BIT = 5, IPS_DUMP_B = (1 << IPS_DUMP_B_BIT), The general dump state is stored in ip_conntrack_dump_status. New conntrack entries are created with their status set to the value of ip_conntrack_dump_status. When a dump is requested, ip_conntrack_dump_status is set to the other value. Then the ip_conntrack hash is scanned, and all entries with the previous status bit are dumped and then have their bit switched to the current value of ip_conntrack_dump_status. New entries are created with the new ip_conntrack_dump_status value; consequently those are not dumped but updated to the slaves using the normal procedure. It means of course that there could be only one dump at a time, i.e. 
until the whole ip_conntrack hash has been fully processed, the system must not allow changing the value of ip_conntrack_dump_status again. A quick idea, may be bogus. > A 64-bit integer can be copy-pasted, and is relatively easy to manage in > textual form. A full conntrack tuple (at minimum "protocol, source IP, > dest IP, reply source IP, reply destination IP, source port, destination > port, reply source port, reply destination port", but preferably a binary > "struct ip_conntrack_tuple tuple[2]") is obviously not as easy to > manage. We'll have a nifty GUI, so we won't need to type anything: just click'n'shoot. ;-) Best regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt Address : KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary ^ permalink raw reply [flat|nested] 40+ messages in thread
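The flip-flop status idea from this message can be illustrated with a small sketch. Everything here is hypothetical and simplified: a flat array (`table`) replaces the conntrack hash, a single `dump_bit` per entry replaces the two IPS_DUMP_A/IPS_DUMP_B status bits, `dump_state` plays the role of ip_conntrack_dump_status, and the `max` argument of `dump_step` models how many entries fit into one skb.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

struct entry {
    int used;
    int dump_bit; /* which dump state this entry was last marked with */
    int id;       /* payload stand-in for the conntrack tuple */
};

static struct entry table[TABLE_SIZE];
static int dump_state;  /* global state; flips between 0 and 1 per dump */
static int dump_active; /* only one dump may run at a time */

/* New entries inherit the current state, so a running dump skips them. */
static int entry_add(int id)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used) {
            table[i].used = 1;
            table[i].dump_bit = dump_state;
            table[i].id = id;
            return i;
        }
    }
    return -1;
}

static void entry_del(int slot)
{
    assert(slot >= 0 && slot < TABLE_SIZE);
    table[slot].used = 0;
}

/* Start a dump by flipping the state; refuse while another dump runs. */
static int dump_begin(void)
{
    if (dump_active)
        return 0;
    dump_state ^= 1;
    dump_active = 1;
    return 1;
}

/*
 * Emit up to max entries that still carry the previous state and mark them
 * with the current one. The scan always restarts at the beginning, so no
 * position has to be remembered between calls. The dump ends when a full
 * scan finds nothing left to emit beyond what fit into this call.
 */
static size_t dump_step(int *out, size_t max)
{
    size_t n = 0;

    assert(dump_active);
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used || table[i].dump_bit == dump_state)
            continue;
        if (n == max)
            return n; /* "skb" full, more entries pending */
        out[n++] = table[i].id;
        table[i].dump_bit = dump_state;
    }
    dump_active = 0;
    return n;
}
```

Because the marker lives in the entry itself, removing an already-dumped entry between two `dump_step` calls cannot cause a pending one to be skipped. The cost, as the message notes, is that a second dump has to wait until the first one has finished.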
* Re: ctnetlink questions 2003-10-19 22:55 ` Patrick McHardy 2003-10-20 1:05 ` Henrik Nordstrom @ 2003-10-20 6:58 ` Harald Welte 1 sibling, 0 replies; 40+ messages in thread From: Harald Welte @ 2003-10-20 6:58 UTC (permalink / raw) To: Patrick McHardy; +Cc: Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 5079 bytes --] On Mon, Oct 20, 2003 at 12:55:03AM +0200, Patrick McHardy wrote: > Nice thing with the unique ids is that it's better than an atomic > snapshot, when you're done reading you have the _current_ state, not > the state when you began reading. well, I don't consider this as particularly important. As long as it is documented... > >Also, there's another problem: > >Let's say we left at bucket 5, entry 12 - and while we are waiting for > >the next netlink callback, entry 10 gets removed. Then we would > >continue at 12, which is in reality the old 13. So we're missing one > >conntrack. > > > > With the unique id solution ? No, the id's don't represent the > list-position No, this problem would occur with the old (and my proposed) solution that doesn't require an ID. > I didn't worry too much about performance yet, in my opinion it was > required for being useful. For the architecture, if it was only for > table dumping I'd agree with you, but there is another important use > for the id. When we want to manipulate/delete conntrack entries from > userspace there is no way to make sure that we will do things to the > right connection since the tuples that are used for lookup could have > been reused. Mh. I am wondering if we can make that guarantee without adding the ID field. We really should be in the mindset of making ip_conntrack smaller, not blowing it up. > >Other approaches I can think of: > > > >a) making a snapshot of the whole conntrack table. > >Large memory usage - probably easy to get OOM :( Also, read lock on > >ip_conntrack_lock would have to be grabbed long > > > >b) unique ID per hash bucket. 
This means less contention, but we could > >only save bucket id in cb->args, start iterating from the beginning and > >only send whose ID is newer than the last one we already sent. > > > >c) snapshot of the current bucket > >As with the new hash function every bucket is supposed to be short, we > >could also make a snapshot of the current bucket, and send our messages > >from this snapshot copy. > > > >what do you think? > > > > I think we first need to agree on how important the problems I mentioned > above are. All these solutions don't provide reliable mechanisms. Some > comments though: Yes. I am aware of the non-reliability. For me it is more important to not interfere with the current connection tracking design, leaving ctnetlink an addon that doesn't require deep hooks into the conntrack implementation, and that can live without dozens of #ifdef's. Let's say that I'm looking upon all possible solutions under that precondition. > a) problem is that there can be multiple parallel dumps so we > potentially need many copies. > I think memory usage is not acceptable. we can just allow one dump at a time and make everybody else either wait or try again. > b) I'm not sure if i understand correctly, this is basically what has > been done before my changes except that we would always continue at > the next bucket id and not just advance if the whole bucket has > successfully dumped ? before, the code did dump the same bucket again if it didn't fit in the skb last time. My proposed approach would have a unique ct_id inside a single bucket list. This way we can sort-of live without the ordered list (minus the 12/13 issue pointed out above) but don't dump the same bucket over and over again. > c) same problem as a, except memory usage is not as bad. IMO it is a > basically a workaround for limited socket buffers to circumvent the > limits. If we don't need reliability I'd say it's the users job to > make sure socket buffer limits are set to a reasonable size. mh. 
> So in conclusion if we agree we need reliability, we probably need the > unique ids. If we agree we don't, I'd say we use solution b. I'm going to comment on the ID's in my next reply. > Two last things I noted during writing the mail: > - Table dumping is currently not restricted to root, this should > probably be done for privacy reasons. I'm a bit undecided. netstat -a, -r are always allowed for every user, too. But then, those tables don't indicate forwarded connections... ok, let's have it require CAP_NET_ADMIN. > - Have you got objections against s/CTA_RPLY/CTA_REPLY/ ? IMO It makes > typing and thinking more comfortable if you can actually pronounce > what you are thinking about ;) question is: are all NFA/CTA constants four letters? then the current way is actually more consistent. But feel free to change that, since it might become a TLV in the future anyway ;) > Best regards, > Patrick -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
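Approach b) from this message (a unique ID per hash bucket, with the bucket number and the last ID sent kept in cb->args) could be sketched as below. The sketch makes one assumption that the thread does not state: new entries are appended at the tail of a chain, so IDs grow along the chain. Under that assumption the "id > last sent" filter also copes with removals. All names (`ct`, `bucket`, `dump_pos`, `bucket_add`, `bucket_del`, `dump_fill`) are hypothetical, and fixed-size arrays replace the kernel's linked lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS   2
#define CHAIN_MAX 8

struct ct {
    uint32_t id; /* unique only within its bucket */
    int tuple;   /* payload stand-in for the conntrack tuple */
};

struct bucket {
    struct ct chain[CHAIN_MAX];
    size_t len;
    uint32_t next_id; /* per-bucket counter, so no global contention */
};

/* Resume position; stands in for the values kept in cb->args[]. */
struct dump_pos {
    size_t bucket;
    uint32_t last_id; /* 0 means nothing sent from this bucket yet */
};

static struct bucket buckets[BUCKETS];

/* Append at the tail (assumption of this sketch); IDs start at 1. */
static int bucket_add(size_t b, int tuple)
{
    struct bucket *bk = &buckets[b];

    if (bk->len == CHAIN_MAX)
        return 0;
    bk->chain[bk->len].id = ++bk->next_id;
    bk->chain[bk->len].tuple = tuple;
    bk->len++;
    return 1;
}

/* Remove the entry at index idx, keeping the remaining order. */
static void bucket_del(size_t b, size_t idx)
{
    struct bucket *bk = &buckets[b];

    assert(idx < bk->len);
    for (size_t i = idx + 1; i < bk->len; i++)
        bk->chain[i - 1] = bk->chain[i];
    bk->len--;
}

/*
 * Fill out[] with up to max tuples, continuing after pos. Each bucket is
 * walked from its beginning and entries with id <= last_id are skipped.
 */
static size_t dump_fill(struct dump_pos *pos, int *out, size_t max)
{
    size_t n = 0;

    while (pos->bucket < BUCKETS) {
        const struct bucket *bk = &buckets[pos->bucket];

        for (size_t i = 0; i < bk->len; i++) {
            if (bk->chain[i].id <= pos->last_id)
                continue;
            if (n == max)
                return n; /* "skb" full; resume here next time */
            out[n++] = bk->chain[i].tuple;
            pos->last_id = bk->chain[i].id;
        }
        pos->bucket++;
        pos->last_id = 0;
    }
    return n;
}
```

If the chain order did not follow the IDs (for example with insertion at the head), this filter would no longer be sufficient, which may be part of the residual "12/13" concern raised in the message.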
* ctnetlink questions @ 2003-10-19 14:54 Harald Welte 0 siblings, 0 replies; 40+ messages in thread From: Harald Welte @ 2003-10-19 14:54 UTC (permalink / raw) To: Martin Josefsson; +Cc: Netfilter Development Mailinglist [-- Attachment #1: Type: text/plain, Size: 697 bytes --] Hi Gandalf! A couple of questions regarding your ctnetlink modifications: 1) Why do we need this 'ordered list' ? I can't remember the exact reason why it was added 2) Why did you merge connmark and ctnetlink? Was it just for convenience? If yes, I'd appreciate to have them separated again. Thanks. -- - Harald Welte <laforge@netfilter.org> http://www.netfilter.org/ ============================================================================ "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2004-02-17 21:37 UTC | newest]
Thread overview: 40+ messages
[not found] <20031019171851.GR21521@sunbeam.de.gnumonks.org>
2003-10-19 19:36 ` ctnetlink questions Patrick McHardy
2003-10-19 20:28 ` Harald Welte
2003-10-19 22:55 ` Patrick McHardy
2003-10-20 1:05 ` Henrik Nordstrom
2003-10-20 3:01 ` Patrick McHardy
2003-10-20 3:09 ` Patrick McHardy
2003-10-20 6:34 ` Henrik Nordstrom
2003-10-20 17:53 ` Patrick McHardy
2003-10-20 7:15 ` Harald Welte
2003-10-20 9:37 ` Henrik Nordstrom
2003-10-20 18:43 ` Patrick McHardy
2003-10-20 18:37 ` Harald Welte
2003-10-20 19:17 ` Patrick McHardy
2003-10-20 19:41 ` Balazs Scheidler
2003-10-20 20:20 ` Patrick McHardy
2003-10-20 22:59 ` Harald Welte
2003-10-20 18:17 ` Patrick McHardy
2003-10-20 18:39 ` Harald Welte
2003-10-20 19:21 ` Patrick McHardy
2003-10-21 16:47 ` Patrick McHardy
2003-10-21 19:54 ` Henrik Nordstrom
2003-10-21 20:00 ` Patrick McHardy
2003-10-20 18:52 ` Harald Welte
2003-10-20 19:52 ` Patrick McHardy
2003-10-20 23:09 ` Harald Welte
2003-10-20 7:04 ` Harald Welte
2003-10-20 7:17 ` Jozsef Kadlecsik
2003-10-20 9:29 ` Henrik Nordstrom
2004-02-06 18:52 ` Harald Welte
2004-02-09 10:33 ` Pablo Neira
2004-02-10 12:39 ` Patrick McHardy
2004-02-14 20:03 ` Harald Welte
2004-02-15 10:01 ` Patrick McHardy
2004-02-17 21:37 ` Harald Welte
2003-10-20 14:48 ` Harald Welte
2003-10-20 18:53 ` Patrick McHardy
2003-10-20 22:57 ` Harald Welte
2003-10-20 11:11 ` Jozsef Kadlecsik
2003-10-20 6:58 ` Harald Welte
2003-10-19 14:54 Harald Welte