Re: ip_conntrack performance issues

* Re: ip_conntrack performance issues - also semantic issues
       [not found] ` <15914.20503.476455.344137@isis.cs3-inc.com>
@ 2003-01-19  7:51   ` Patrick Schaaf
       [not found]     ` <15914.25470.189261.168220@isis.cs3-inc.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick Schaaf @ 2003-01-19  7:51 UTC (permalink / raw)
  To: Don Cohen; +Cc: Martin Josefsson, netfilter-devel

Hi Don & all,

On Sat, Jan 18, 2003 at 11:13:27PM -0800, Don Cohen wrote:
> 
> [This is not to the list, but feel free to put it or replies to 
> parts of it there if you think they're of general interest.]

I'll do so. Your point about the post-dequeueing hook warrants
thinking about by the masses :)

> I have one big complaint with conntrack, which is related to
> performance but also semantics.

I don't think it's semantics per se; to me, you are talking about an
(important) implementation detail. Fixing it will _hopefully_ not
require new semantics (as understood by the end user).

> The semantic problem is that not all packets are forwarded.
> What we really want is two different conntrack hooks.
> The first as soon as the packet arrives classifies it in terms of
> what has been seen before.  This is used by filters, schedulers, etc.
> However that one does NOT update the conntrack data structure.

It is already the case that a NEW conntrack structure is put into the
_hashtable_ only after running through all the filters - as the last
thing in POSTROUTING. It happens right at the point where the packet
will then be ENqueued to the outgoing network device.

If I understand your complaint correctly, it is really two complaints in one:

1) that a NEW conntrack need not be allocated in full.
2) that the putting into hashes (which presupposes allocation in full),
   happens before ENqueueing the packet, and not after DEqueueing,
   so potential drops by egress shaping are not seen and handled.

Addressing point 1) would help overhead in the case of the filter
rules themselves handling a DoS attack (by dropping suitable packets).
Addressing point 2) would _additionally_ cover the CLS/SCHED policing.
2) does not make much sense if the overhead reduction of 1) has not
been already accomplished. 1) makes sense by itself, and can be
implemented without touching the base network stack.

Would you agree that the two points are related, but independent?

Fixing point 1), would need no change in semantics (but changes in
the internal APIs): for each packet which now gets a NEW conntrack,
instead, let the skbuff reference a shared, unspecific "THE NEW CONNTRACK".
Only when an individual conntrack is required (by NAT module calls on a
packet which has the shared, unspecific "THE NEW CONNTRACK"), will a
real conntrack structure be allocated, on demand. The same must happen
on the POSTROUTING conntrack hook, before the individual NEW connection's
conntrack is put into the hashes. "ALLOCATE ON DEMAND" is the general theme.
Of course, almost every place in iptables where now we assume we have an
individual conntrack, must learn to individualize "THE NEW CONNTRACK"
when encountered. Big code audit time. Always a good thing - Don, is
that a job for you, if people commit to taking the changes? :-)
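
To make the "ALLOCATE ON DEMAND" theme concrete, here is a minimal
userspace sketch in C. It is not the ip_conntrack code: the struct
layouts and the names the_new_conntrack, classify_new() and
individualize() are hypothetical, and malloc() merely stands in for
whatever allocator the kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct conntrack {
    int is_shared_new;          /* 1 only for the shared placeholder */
    unsigned int src, dst;      /* stand-in for the connection tuple */
};

struct packet {
    unsigned int src, dst;
    struct conntrack *ct;       /* plays the role of the skbuff reference */
};

/* The single shared, unspecific "THE NEW CONNTRACK". */
static struct conntrack the_new_conntrack = { 1, 0, 0 };

static unsigned long allocations;   /* counts real allocations */

/* PREROUTING-time classification of an unknown flow: no allocation. */
static void classify_new(struct packet *p)
{
    p->ct = &the_new_conntrack;
}

/*
 * Called wherever an individual conntrack is required (NAT, or the
 * POSTROUTING confirm step). Returns NULL if the allocation fails.
 */
static struct conntrack *individualize(struct packet *p)
{
    struct conntrack *ct;

    assert(p->ct != NULL);
    if (!p->ct->is_shared_new)
        return p->ct;           /* already individual, nothing to do */

    ct = malloc(sizeof(*ct));
    if (ct == NULL)
        return NULL;
    ct->is_shared_new = 0;
    ct->src = p->src;
    ct->dst = p->dst;
    p->ct = ct;                 /* the packet now owns a real conntrack */
    allocations++;
    return ct;
}
```

In this sketch, a packet that a filter rule drops between
classify_new() and the first individualize() call never costs an
allocation, which is the saving point 1) is after.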

Regarding point 2), there is a (temporal) semantic change involved.
With that approach, it takes potentially much longer until the
conntrack is created. The packet can sit in the output queue for
quite a long time, if the output interface is a slow one, and filled
to the brim. So, the question is, are there real world protocols
where several packets back to back go from A to B, before packets
flow back? Such protocols already have a window of opportunity
for SNAFU in the current scheme, but updating the hashes after
the output queue may aggravate the symptoms. (I have no protocol
in mind, just being paranoid...)

best regards
  Patrick

[parent not found: <15914.25470.189261.168220@isis.cs3-inc.com>]

* Re: ip_conntrack performance issues - also semantic issues
       [not found]     ` <15914.25470.189261.168220@isis.cs3-inc.com>
@ 2003-01-19  9:16       ` Patrick Schaaf
  2003-01-19  9:40         ` Martin Josefsson
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick Schaaf @ 2003-01-19  9:16 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

On Sun, Jan 19, 2003 at 12:36:14AM -0800, Don Cohen wrote:
>  > I'll do so. Your point about the post-dequeueing hook warrants
>  > thinking about by the masses :)
> The reason I didn't is that I already suggested this to the list.

I'm of the opinion that good points warrant occasional repeating,
until they have permeated enough skulls to be resolved once and for all.
Without the repetition, I wouldn't have thought about some of
the things I write now.

I'll again put this reply onto the list, see new point 3), below.

As you indicate you are not fully familiar with it, here's the story
about the "allocate early, hash late" approach which is in the current
code. It's really pretty simple. See net/ipv4/netfilter/ip_conntrack_core.c:

	- init_conntrack() allocates a new conntrack, triggered
	  by resolve_normal_ct(), which in turn is called by
	  ip_conntrack_in() - the PREROUTING hook function,
	  sitting there before all other functions (mangle/nat/filter).
	  There are no other call chains.
	  The new conntrack is NOT put into the hashes, but only
	  referenced by the skbuff (the packet under inspection).
	  The state match, NAT, etc. all use this skbuff reference!
	- __ip_conntrack_confirm() puts the conntrack of an skbuff
	  that passed through to POSTROUTING, into the hashes.
	  This function is wrapped in a 'static inline' in
	  include/linux/netfilter_ipv4/ip_conntrack_core.h,
	  called ip_conntrack_confirm(). For normal iptables
	  operation, this in turn is called from exactly
	  one place, ip_confirm() in net/ipv4/netfilter/ip_conntrack_standalone.c.
	  That ip_confirm(), in turn, is the last LOCAL_IN hook
	  function, and (indirectly through ip_refrag()) the last
	  POSTROUTING hook function.
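
The "allocate early, hash late" lifecycle described above can be
sketched in userspace C as follows. This is an illustration only:
hook_in(), hook_drop() and hook_confirm() are hypothetical names
loosely mirroring ip_conntrack_in(), a filter DROP, and
__ip_conntrack_confirm(); the integer key stands in for the tuple,
and calloc() for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HASH_SIZE 64            /* arbitrary size for the sketch */

struct conntrack {
    unsigned int key;           /* stand-in for the connection tuple */
    int confirmed;              /* 1 once the entry is in the hash */
    struct conntrack *next;     /* hash chain */
};

struct packet {
    unsigned int key;
    struct conntrack *ct;       /* plays the role of the skbuff reference */
};

static struct conntrack *hash[HASH_SIZE];
static unsigned long n_alloc, n_hashed;

static struct conntrack *hash_find(unsigned int key)
{
    struct conntrack *ct;

    for (ct = hash[key % HASH_SIZE]; ct != NULL; ct = ct->next)
        if (ct->key == key)
            return ct;
    return NULL;
}

/* PREROUTING: look up; if unknown, allocate but do NOT hash yet. */
static int hook_in(struct packet *p)
{
    p->ct = hash_find(p->key);
    if (p->ct != NULL)
        return 0;
    p->ct = calloc(1, sizeof(*p->ct));
    if (p->ct == NULL)
        return -1;
    p->ct->key = p->key;
    n_alloc++;
    return 0;
}

/* A filter rule dropped the packet: free a never-hashed conntrack. */
static void hook_drop(struct packet *p)
{
    if (p->ct != NULL && !p->ct->confirmed)
        free(p->ct);
    p->ct = NULL;
}

/* POSTROUTING: only now does the conntrack enter the hash. */
static void hook_confirm(struct packet *p)
{
    struct conntrack *ct = p->ct;

    assert(ct != NULL);
    if (ct->confirmed)
        return;
    ct->next = hash[ct->key % HASH_SIZE];
    hash[ct->key % HASH_SIZE] = ct;
    ct->confirmed = 1;
    n_hashed++;
}
```

Note that in this model a dropped packet still pays for one
allocation and one free; only the hash insertion is avoided.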

Point 2), i.e. your "put it after egress dequeue" proposal, would
mean delaying the confirm operation now done in ip_refrag(). A new
hook in the core network stack, called after egress dequeueing,
would again call ip_confirm(), as in the LOCAL_IN case.
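
A rough userspace model of that proposal, assuming a bounded egress
queue and a hypothetical post-dequeue hook. None of these names
(egress_enqueue(), egress_dequeue(), confirm()) exist in the network
stack; the sketch only shows the ordering of events.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LEN 4             /* arbitrary queue depth for the sketch */

struct packet {
    int confirmed;              /* 1 once its conntrack would be hashed */
};

static struct packet *queue[QUEUE_LEN];
static size_t q_head, q_count;
static unsigned long n_confirmed, n_dropped;

/* Stand-in for the confirm step (hash insertion, timer start). */
static void confirm(struct packet *p)
{
    assert(p != NULL);
    if (!p->confirmed) {
        p->confirmed = 1;
        n_confirmed++;
    }
}

/* Returns 0 if queued, -1 if the full queue forced a drop. */
static int egress_enqueue(struct packet *p)
{
    if (q_count == QUEUE_LEN) {
        n_dropped++;            /* dropped packets are never confirmed */
        return -1;
    }
    queue[(q_head + q_count) % QUEUE_LEN] = p;
    q_count++;
    return 0;
}

/* Dequeue one packet, then run the hypothetical post-dequeue hook. */
static struct packet *egress_dequeue(void)
{
    struct packet *p;

    if (q_count == 0)
        return NULL;
    p = queue[q_head];
    q_head = (q_head + 1) % QUEUE_LEN;
    q_count--;
    confirm(p);                 /* confirmation happens only here */
    return p;
}
```

The time a packet spends between egress_enqueue() and
egress_dequeue() is exactly the window during which its connection
is not yet visible in the hashes.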

Hmm, here's a point 3): in the LOCAL_IN case, the incoming packet may
be a UDP packet to a nonexistent local port.  This packet will be
dropped by the UDP bind hash lookup, but that comes AFTER the
LOCAL_IN hook, so there will be a conntrack remaining (with
30 seconds timeout, IIRC.)  So, the proposed after-egress-dequeue hook,
should also be called from the local delivery code, after real user level
interest has been checked.

>  > 1) that a NEW conntrack need not be allocated in full.
> I guess you're saying that the allocation takes a significant amount
> of time, entering into hash is another chunk, and the first part is
> done at prerouting.

I'm pretty sure that the allocation is the main cost, putting into
the hashes is not performance critical per se (if the lists are short). 

> In which case I agree allocation should be delayed.  But of course
> this means that various code has to be changed to expect new
> connections to be unallocated.

This may come at a cost for the normal case. In the normal case,
the allocation cost _has_ to be paid (or we wouldn't load
ip_conntrack), and the constant per-packet checking would
be pure overhead.

For filtered DoS packets not belonging to any connection, it would
certainly be a significant saving.

[BIG SNIP]

> I don't know of any, but for such protocols I guess the right thing
> would be to allocate enough at arrival of the first packet to
> recognize the second packet on its arrival, etc.

This is already done, in a sense. The really large NAT pieces are
kept outside the main ip_conntrack structure, allocated on demand.
As far as I know, the protocol helpers operate (or could operate)
the same way.

Keep the "equal for all" conntracking structure simple. BTW, there's
still 10-15% cruft there which could be removed.

> A borderline case - even tcp does allow, e.g., a syn followed by a
> repeat of the syn if not answered promptly.  But in that case both
> can be classified as new.  Only in the second phase it's important
> to recognize that the second one is a repeat of the first.

In the current ip_conntrack_confirm() implementation, when that
happens, the second packet is DROPped, and its conntrack freed
shortly thereafter. This happens whenever ip_conntrack_confirm()
finds one or the other direction of the connection in the hashes.

I think that's about the sanest thing one can do, realizing that
before that point, we potentially set up conflicting NAT information
for one and the same end-user TCP connection. Letting both progress
is a sure way to disaster. With the given approach, there's a chance
for everything to just work.
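
As an illustration of that confirm-time check, here is a small
userspace sketch. It assumes a flat array in place of the real hash
tables, and the names (struct tuple, tuple_taken(), confirm(), the
verdict enum) are made up for the example; it only models the rule
"DROP if either direction is already present".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONFIRMED 16        /* arbitrary capacity for the sketch */

enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

struct tuple {
    unsigned int src, dst;
    unsigned short sport, dport;
};

struct conntrack {
    struct tuple orig;          /* original direction */
    struct tuple reply;         /* reply direction (may reflect NAT) */
};

/* Flat array standing in for the conntrack hashes. */
static const struct conntrack *confirmed[MAX_CONFIRMED];
static size_t n_confirmed;

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src == b->src && a->dst == b->dst &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Is this tuple already claimed by any confirmed connection? */
static int tuple_taken(const struct tuple *t)
{
    size_t i;

    for (i = 0; i < n_confirmed; i++)
        if (tuple_equal(t, &confirmed[i]->orig) ||
            tuple_equal(t, &confirmed[i]->reply))
            return 1;
    return 0;
}

/* Confirm a conntrack, or DROP if either direction already exists. */
static enum verdict confirm(const struct conntrack *ct)
{
    assert(ct != NULL);
    if (tuple_taken(&ct->orig) || tuple_taken(&ct->reply))
        return VERDICT_DROP;
    if (n_confirmed == MAX_CONFIRMED)
        return VERDICT_DROP;    /* table full in this toy model */
    confirmed[n_confirmed++] = ct;
    return VERDICT_ACCEPT;
}
```

With a retransmitted SYN that got a second, unconfirmed conntrack
(possibly with a different NAT mapping in its reply tuple), the first
confirm() wins and the second returns VERDICT_DROP.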

best regards
  Patrick

* Re: ip_conntrack performance issues - also semantic issues
  2003-01-19  9:16       ` Patrick Schaaf
@ 2003-01-19  9:40         ` Martin Josefsson
  2003-01-19  9:55           ` Patrick Schaaf
  2003-01-31 11:35           ` Harald Welte
  0 siblings, 2 replies; 6+ messages in thread
From: Martin Josefsson @ 2003-01-19  9:40 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Don Cohen, Netfilter-devel

On Sun, 2003-01-19 at 10:16, Patrick Schaaf wrote:

> >  > 1) that a NEW conntrack need not be allocated in full.
> > I guess you're saying that the allocation takes a significant amount
> > of time, entering into hash is another chunk, and the first part is
> > done at prerouting.
> 
> I'm pretty sure that the allocation is the main cost, putting into
> the hashes is not performance critical per se (if the lists are short). 

My small unscientific tests (performed quite some time ago) say that the
allocation isn't the most costly operation performed.
The test was a simple SYN flood (random source IPs and ports) being
forwarded by a router. If I didn't load ip_conntrack at all everything
was fine and I had lots of cpu left. When I loaded ip_conntrack the cpu
usage went to 100%. Then I tried blocking the flood in filter/FORWARD
and that helped _a lot_ and in this case the allocation is still
happening for each packet. But of course blocking it in raw/PREROUTING
is even better (before the packets reach ip_conntrack).

My point is that until someone really profiles ip_conntrack I'm not sure
I'll believe that the allocation is the most costly operation.
We need good profiles for a lot of different situations.

-- 
/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

* Re: ip_conntrack performance issues - also semantic issues
  2003-01-19  9:40         ` Martin Josefsson
@ 2003-01-19  9:55           ` Patrick Schaaf
  2003-01-31 11:35           ` Harald Welte
  1 sibling, 0 replies; 6+ messages in thread
From: Patrick Schaaf @ 2003-01-19  9:55 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Patrick Schaaf, Don Cohen, Netfilter-devel

On Sun, Jan 19, 2003 at 10:40:47AM +0100, Martin Josefsson wrote:
> 
> > I'm pretty sure that the allocation is the main cost, putting into
> > the hashes is not performance critical per se (if the lists are short). 
> 
> My small unscientific tests (performed quite some time ago) say that the
> allocation isn't the most costly operation performed.
> The test was a simple SYN flood (random source IPs and ports) being
> forwarded by a router. If I didn't load ip_conntrack at all everything
> was fine and I had lots of cpu left. When I loaded ip_conntrack the cpu
> usage went to 100%. Then I tried blocking the flood in filter/FORWARD
> and that helped _a lot_ and in this case the allocation is still
> happening for each packet.

OK, thanks for that information. Still the cost won't be in the
operation of putting the conntrack into the hashes, but in the
consequences of that: each hashed conntrack contributes to future
lookup cost (you may have had significant hash bucket chain lengths).

Also, at the same point where the conntrack enters the hashes,
it is also add_timer()ed for the first time. That touches kernel
locks, IIRC, and may contribute significantly to the cost you saw.
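
The lookup-cost effect can be illustrated with a toy chained hash
table in userspace C. The table size, the modulo "hash function" and
the names insert() and lookup_steps() are assumptions for the sketch,
not the ip_conntrack implementation; timers and locking are not
modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BUCKETS 8               /* deliberately tiny to show the effect */

struct node {
    unsigned int key;           /* stand-in for a hashed conntrack tuple */
    struct node *next;
};

static struct node *table[BUCKETS];

/* Insert at the head of the chain. Returns 0, or -1 if out of memory. */
static int insert(unsigned int key)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return -1;
    n->key = key;
    n->next = table[key % BUCKETS];
    table[key % BUCKETS] = n;
    return 0;
}

/*
 * Look up a key and return how many chain entries were compared.
 * *found is set to 1 on a hit and 0 on a miss.
 */
static unsigned int lookup_steps(unsigned int key, int *found)
{
    const struct node *n;
    unsigned int steps = 0;

    assert(found != NULL);
    *found = 0;
    for (n = table[key % BUCKETS]; n != NULL; n = n->next) {
        steps++;
        if (n->key == key) {
            *found = 1;
            break;
        }
    }
    return steps;
}
```

A lookup that misses has to walk the whole chain, so every extra
entry hashed during a flood raises the cost of classifying the next
unknown packet that lands in the same bucket.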

Such a test, IFF it also performed significant egress DROPping
of the SYNs (somehow), should clearly show the benefit of Don's
proposal of delaying hash (and timer) update until egress dequeueing.

> My point is that until someone really profiles ip_conntrack I'm not sure
> I'll believe that the allocation is the most costly operation.
> We need good profiles for a lot of different situations.

One or two, for common situations, would already be a good start :)
Maybe I should update my TSC-based netfilter hook profiling patch,
as it provides a good, low-impact, and precise facility for that.
If somebody wants to give it a run & has problems extracting it
from the dated patch linked below [1], please mail me directly.
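
For readers unfamiliar with the idea, TSC-based timing of a hook
function can be sketched in userspace C as below. This is not the
patch from [1]; the names (struct hook_profile, profiled_call(),
sample_hook()) are hypothetical. It assumes GCC or Clang, reads the
TSC via __rdtsc() on x86, and falls back to the coarser standard
clock() elsewhere.

```c
#include <stdint.h>
#include <time.h>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>          /* __rdtsc() on GCC and Clang */
#endif

typedef unsigned int (*hook_fn)(void *pkt);

struct hook_profile {
    uint64_t calls;             /* number of profiled invocations */
    uint64_t ticks;             /* accumulated tick count */
};

/* Read a cheap tick counter: the TSC on x86, clock() otherwise. */
static uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Run one hook and account the ticks it took to the profile. */
static unsigned int profiled_call(struct hook_profile *prof,
                                  hook_fn fn, void *pkt)
{
    uint64_t start = read_ticks();
    unsigned int verdict = fn(pkt);

    prof->ticks += read_ticks() - start;
    prof->calls++;
    return verdict;
}

static uint64_t average_ticks(const struct hook_profile *prof)
{
    return prof->calls ? prof->ticks / prof->calls : 0;
}

/* Hypothetical hook: counts packets and returns 1 ("accept"). */
static unsigned int sample_hook(void *pkt)
{
    unsigned int *counter = pkt;

    (*counter)++;
    return 1;
}
```

One struct hook_profile per hook would give a per-hook average, which
is the kind of breakdown needed to tell allocation cost apart from
hash and timer cost.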

best regards
  Patrick

[1] http://lists.netfilter.org/pipermail/netfilter-devel/2002-August/008876.html

* Re: ip_conntrack performance issues - also semantic issues
  2003-01-19  9:40         ` Martin Josefsson
  2003-01-19  9:55           ` Patrick Schaaf
@ 2003-01-31 11:35           ` Harald Welte
  2003-01-31 22:58             ` Martin Josefsson
  1 sibling, 1 reply; 6+ messages in thread
From: Harald Welte @ 2003-01-31 11:35 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Patrick Schaaf, Don Cohen, Netfilter-devel

On Sun, Jan 19, 2003 at 10:40:47AM +0100, Martin Josefsson wrote:
> On Sun, 2003-01-19 at 10:16, Patrick Schaaf wrote:
> 
> > >  > 1) that a NEW conntrack need not be allocated in full.
> > > I guess you're saying that the allocation takes a significant amount
> > > of time, entering into hash is another chunk, and the first part is
> > > done at prerouting.
> > 
> > I'm pretty sure that the allocation is the main cost, putting into
> > the hashes is not performance critical per se (if the lists are short). 
> 
> My small unscientific tests (performed quite some time ago) say that the
> allocation isn't the most costly operation performed.

It would also surprise me if allocation cost was _that_ high. Remember,
we are already using the slab cache for this.

> The test was a simple SYN flood (random source IPs and ports) being
> forwarded by a router. If I didn't load ip_conntrack at all everything
> was fine and I had lots of cpu left. When I loaded ip_conntrack the cpu
> usage went to 100%. Then I tried blocking the flood in filter/FORWARD
> and that helped _a lot_ and in this case the allocation is still
> happening for each packet. But of course blocking it in raw/PREROUTING
> is even better (before the packets reach ip_conntrack)

yup.  Why don't we have a 'raw' or 'prestate' table in patch-o-matic
yet?  This is a very easy job, can't anybody please add this missing
feature?

> /Martin

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)


* Re: ip_conntrack performance issues - also semantic issues
  2003-01-31 11:35           ` Harald Welte
@ 2003-01-31 22:58             ` Martin Josefsson
  0 siblings, 0 replies; 6+ messages in thread
From: Martin Josefsson @ 2003-01-31 22:58 UTC (permalink / raw)
  To: Harald Welte; +Cc: Patrick Schaaf, Don Cohen, Netfilter-devel

On Fri, 2003-01-31 at 12:35, Harald Welte wrote:

> > My small unscientific tests (performed quite some time ago) say that the
> > allocation isn't the most costly operation performed.
> 
> It would also surprise me if allocation cost was _that_ high. Remember,
> we are already using the slab cache for this.

I'll ask Robert Olsson (of NAPI fame) to run a few tests with conntrack
to see where we spend most time in the case of a single (or two) udp
stream(s) on both UP and SMP. I think we should start optimizing for the
already_in_conntrack case first and then move on to fixing the
new_conntrack problem (the global lock is probably the worst problem for
SMP; on UP I have no idea as of now).

> yup.  Why don't we have a 'raw' or 'prestate' table in patch-o-matic
> yet?  This is a very easy job, can't anybody please add this missing
> feature?

We do have such a table in patch-o-matic/userspace: raw.patch and
raw.patch.ipv6. It was Jozsef who made those patches. They contain the
raw table, TRACE (every further match is printk'd) and NOTRACK (assign a
dummy conntrack to the skb) targets.
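
The NOTRACK idea can be modelled in a few lines of userspace C. This
is a sketch under assumptions, not the code from raw.patch: the names
untracked_ct, target_notrack() and conntrack_in() are hypothetical,
and the sketch assumes the conntrack hook leaves alone any packet
that already carries a conntrack reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct conntrack {
    int untracked;              /* 1 only for the shared dummy entry */
};

struct packet {
    struct conntrack *ct;       /* plays the role of the skb reference */
};

/* One shared dummy conntrack for all untracked packets. */
static struct conntrack untracked_ct = { 1 };

static unsigned long n_alloc;   /* real conntrack allocations */

/* raw-table target: mark the packet before conntrack sees it. */
static void target_notrack(struct packet *p)
{
    assert(p != NULL);
    if (p->ct == NULL)
        p->ct = &untracked_ct;
}

/* Conntrack hook, running after the raw table. 0 on success. */
static int conntrack_in(struct packet *p)
{
    assert(p != NULL);
    if (p->ct != NULL)
        return 0;               /* already has a reference: skip */
    p->ct = calloc(1, sizeof(*p->ct));
    if (p->ct == NULL)
        return -1;
    n_alloc++;
    return 0;
}
```

Packets matched by such a target cost neither an allocation nor a
hash lookup, which is why filtering a flood in the raw table is
cheaper than dropping it in filter/FORWARD.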

I've used the raw table to block a few DDoS attacks (some were >140kpps)
without any problems. I've also tested the TRACE and NOTRACK targets
(only on a test machine, not in production) and they also worked fine.

-- 
/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.
