Table duplication on smp machine

* Table duplication on smp machine
@ 2003-03-31 10:52 Thomas Heinz
  2003-03-31 11:07 ` Harald Welte
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Heinz @ 2003-03-31 10:52 UTC (permalink / raw)
  To: netfilter-devel

Hi

Netfilter maintains different tables for each cpu which is especially
interesting for smp ;-)

This is done - according to the docs - to avoid write locking.
Now, what would be write locked in case we had only one set of
rules:

a) counters
b) match/target write operations to their match/target data

a) is clear. b) not really: if a target/match needs to store
data it would do it the way the limit match does it which
requires an in-module lock mechanism and an ugly pointer in the
private data of the match/target.

Hm, remains a). There must be another reason. Otherwise a lot
of space is wasted just to keep separate per-cpu counter values.

Is there another reason?

Regards

Thomas

* Re: Table duplication on smp machine
  2003-03-31 10:52 Table duplication on smp machine Thomas Heinz
@ 2003-03-31 11:07 ` Harald Welte
  2003-03-31 12:06   ` Thomas Heinz
  0 siblings, 1 reply; 10+ messages in thread
From: Harald Welte @ 2003-03-31 11:07 UTC (permalink / raw)
  To: Thomas Heinz; +Cc: netfilter-devel

On Mon, Mar 31, 2003 at 12:52:06PM +0200, Thomas Heinz wrote:
> Hi
> 
> Netfilter maintains different tables for each cpu which is especially
> interesting for smp ;-)

yes.

> This is done - according to the docs - to avoid write locking.
> Now, what would be write locked in case we had only one set of
> rules:

it's not only write locking.  Even if there was a locking mechanism in
place, and we didn't care about its performance, the problem would remain.

What we really care about is cacheline ping-pong.  If you have a
single piece of data (like a counter) that is written by different CPUs
all the time, you will always invalidate the cached values in the other
CPUs' cachelines.

> Hm, remains a). There must be another reason. Otherwise a lot
> of space is wasted just to keep separate per-cpu counter values.

Depends on what you think of 'a lot of space'.  In all my practical
setups, the connection tracking table is significantly larger (by orders
of magnitude) than the iptables ruleset.

> Regards
> Thomas

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: Table duplication on smp machine
  2003-03-31 11:07 ` Harald Welte
@ 2003-03-31 12:06   ` Thomas Heinz
  2003-03-31 12:33     ` Harald Welte
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Heinz @ 2003-03-31 12:06 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel

Hi Harald

Thanks for your quick reply.

You wrote:
> it's not only write locking.  Even if there was a locking mechanism in
> place, and we didn't care about its performance, the problem would remain.
> 
> What we really care about is cacheline ping-pong.  If you have a
> single piece of data (like a counter) that is written by different CPUs
> all the time, you will always invalidate the cached values in the other
> CPUs' cachelines.

I see. If this is the main reason it would suffice to have _one_ ruleset
for all CPUs but a block of counters for each rule like this:
                   ----------------------------------------
                   |          counter for cpu 0           |
                   |--------------------------------------|
                   |          cache-align padding         |
                   |--------------------------------------|
                   |                                      |
                  ...                                    ...
                   |--------------------------------------|
                   |          cache-align padding         |
                   |--------------------------------------|
                   |          counter for cpu n           |
                   |--------------------------------------|

Right?
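
A minimal sketch of the layout in the diagram above, written as plain
userspace C rather than kernel code. The names (struct rule,
struct percpu_slot, rule_hit, rule_read), the CPU count and the 128-byte
cache line are assumptions made for illustration; none of them come from
the netfilter sources. The rule body is shared and read-only, and each
CPU writes only to its own padded counter slot:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define NR_CPUS      4     /* hypothetical CPU count */
#define CACHE_LINE 128     /* assumed cache line size in bytes */

/* Packet and byte counter pair (16 bytes). */
struct counters {
    uint64_t pcnt;
    uint64_t bcnt;
};

/* One slot per CPU; alignas pads each slot out to a full cache line. */
struct percpu_slot {
    alignas(CACHE_LINE) struct counters c;
};

/* A single shared rule: read-only match data plus per-CPU counter slots. */
struct rule {
    uint32_t src, dst;                  /* hypothetical match data */
    struct percpu_slot slot[NR_CPUS];
};

/* Account one packet on the given CPU; touches only that CPU's slot. */
static void rule_hit(struct rule *r, unsigned cpu, uint64_t bytes)
{
    assert(cpu < NR_CPUS);
    r->slot[cpu].c.pcnt += 1;
    r->slot[cpu].c.bcnt += bytes;
}

/* Reading the counters sums over all slots (the rare, slow path). */
static struct counters rule_read(const struct rule *r)
{
    struct counters sum = {0, 0};
    for (unsigned cpu = 0; cpu < NR_CPUS; cpu++) {
        sum.pcnt += r->slot[cpu].c.pcnt;
        sum.bcnt += r->slot[cpu].c.bcnt;
    }
    return sum;
}
```

With this layout a counter update never writes to a cache line that
another CPU also writes to, at the cost of summing NR_CPUS slots whenever
the counters are read.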

> Depends on what you think of 'a lot of space'.  In all my practical
> setups, the connection tracking table is significantly larger (by orders
> of magnitude) than the iptables ruleset.

Hm, in the above scheme the padding could be up to 112 bytes on x86
which means that the netfilter scheme does not waste a byte if the
average rule is <= 112 bytes.
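
The arithmetic behind that figure can be sketched as follows. The 128-byte
cache line and the 16-byte counter pair (two 64-bit values) are
assumptions chosen because they reproduce the 112 bytes mentioned above;
the helper names are hypothetical, and alignment of the rule body itself
is ignored:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE   128   /* assumed cache line size in bytes */
#define COUNTER_SIZE  16   /* assumed: two 64-bit counters */

/* Padding needed after one counter pair to fill a cache line. */
static size_t padding_per_cpu(void)
{
    return CACHE_LINE - COUNTER_SIZE;
}

/* Bytes for one rule when the whole rule is copied for every CPU.
 * rule_body is the size of the rule without its counters. */
static size_t duplicated_bytes(size_t rule_body, size_t ncpus)
{
    return ncpus * (rule_body + COUNTER_SIZE);
}

/* Bytes for one rule when the body is shared and only the counters
 * are kept per CPU, each padded to a full cache line. */
static size_t shared_bytes(size_t rule_body, size_t ncpus)
{
    return rule_body + ncpus * CACHE_LINE;
}
```

Under these assumptions a 112-byte rule body on 4 CPUs costs 512 bytes
when duplicated and 624 bytes with shared body and padded counters, while
a 400-byte body costs 1664 versus 912 bytes.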

Anyway, I was more concerned about the reasons behind the table
duplication than waste of memory.


Thomas

* Re: Table duplication on smp machine
  2003-03-31 12:06   ` Thomas Heinz
@ 2003-03-31 12:33     ` Harald Welte
  2003-03-31 12:50       ` Thomas Heinz
  2003-03-31 12:56       ` Patrick Schaaf
  0 siblings, 2 replies; 10+ messages in thread
From: Harald Welte @ 2003-03-31 12:33 UTC (permalink / raw)
  To: Thomas Heinz; +Cc: netfilter-devel

On Mon, Mar 31, 2003 at 02:06:42PM +0200, Thomas Heinz wrote:
> Hi Harald
> 
> Thanks for your quick reply.
> 
> You wrote:
> >it's not only write locking.  Even if there was a locking mechanism in
> >place, and we didn't care about its performance, the problem would remain.
> >
> >What we really care about is cacheline ping-pong.  If you have a
> >single piece of data (like a counter) that is written by different CPUs
> >all the time, you will always invalidate the cached values in the other
> >CPUs' cachelines.
> 
> I see. If this is the main reason it would suffice to have _one_ ruleset
> for all CPUs but a block of counters for each rule like this:

then you start having cacheline ping-pong problems when match/target
private data is written to (like in the limit match, ...)

apart from that, there is no reason why we shouldn't do what you are
proposing.  In fact, this is exactly what my initial pkttables/iptables2
design does ;)

 
> Thomas

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: Table duplication on smp machine
  2003-03-31 12:33     ` Harald Welte
@ 2003-03-31 12:50       ` Thomas Heinz
  2003-03-31 12:56       ` Patrick Schaaf
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Heinz @ 2003-03-31 12:50 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel

Hi Harald

You wrote:
> then you start having cacheline ping-pong problems when match/target
> private data is written to (like in the limit match, ...)

Hm, I don't understand. The limit match performs the write operations
on the same private data block for all CPUs so the cacheline ping-pong
occurs in the current scheme too [in addition to the locking overhead].

Of course your objection would hold if a match/target used its per-cpu
data block to store information, but I cannot imagine a scenario
where that would make sense.

> apart from that, there is no reason why we shouldn't do what you are
> proposing.  In fact, this is exactly what my initial pkttables/iptables2
> design does ;)

Ok :)

Thomas

* Re: Table duplication on smp machine
  2003-03-31 12:33     ` Harald Welte
  2003-03-31 12:50       ` Thomas Heinz
@ 2003-03-31 12:56       ` Patrick Schaaf
  2003-03-31 13:34         ` Thomas Heinz
  1 sibling, 1 reply; 10+ messages in thread
From: Patrick Schaaf @ 2003-03-31 12:56 UTC (permalink / raw)
  To: Harald Welte, Thomas Heinz, netfilter-devel

> > I see. If this is the main reason it would suffice to have _one_ ruleset
> > for all CPUs but a block of counters for each rule like this:
> 
> then you start having cacheline ping-pong problems when match/target
> private data is written to (like in the limit match, ...)

There are other, generic APIs for allocation of per-cpu data. In a new
world, the rare match/target module with such a requirement could use
that API.

But there is another, less common, incentive to have per-cpu copies
of all the data structures: non-uniform-memory-architecture machines.
An example (hopefully) not far from widespread adoption, would be the
new AMD Hammer SMP architecture. On such an architecture, there is
a performance benefit even from CPU (or node) local copies of even
readonly data, because of the reduced local memory latency. In the
Hammer example, there would be one copy of the ruleset in each
single Hammer's locally attached DRAM.

Not so common today, but I always thought that the existing separation
in the current iptables code was good preparation for that future.

best regards
  Patrick

* Re: Table duplication on smp machine
  2003-03-31 12:56       ` Patrick Schaaf
@ 2003-03-31 13:34         ` Thomas Heinz
  2003-03-31 13:45           ` Patrick Schaaf
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Heinz @ 2003-03-31 13:34 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Harald Welte, netfilter-devel

Hi Patrick

You wrote:
> There are other, generic APIs for allocation of per-cpu data. In a new
> world, the rare match/target module with such a requirement could use
> that API.

Ok but with the current scheme I don't see any reason why a match/target
would want to write to per-cpu data. Is there any practical application
for this?

> But there is another, less common, incentive to have per-cpu copies
> of all the data structures: non-uniform-memory-architecture machines.
> An example (hopefully) not far from widespread adoption, would be the
> new AMD Hammer SMP architecture. On such an architecture, there is
> a performance benefit even from CPU (or node) local copies of even
> readonly data, because of the reduced local memory latency. In the
> Hammer example, there would be one copy of the ruleset in each
> single Hammer's locally attached DRAM.
> 
> Not so common today, but I always thought that the existing separation
> in the current iptables code was good preparation for that future.

Very interesting. I guess one has to use a special API to address the
per cpu memory so that the current implementation wouldn't work out
of the box. Right?


Regards,

Thomas

* Re: Table duplication on smp machine
  2003-03-31 13:34         ` Thomas Heinz
@ 2003-03-31 13:45           ` Patrick Schaaf
  2003-03-31 13:51             ` Patrick Schaaf
  2003-03-31 14:06             ` Thomas Heinz
  0 siblings, 2 replies; 10+ messages in thread
From: Patrick Schaaf @ 2003-03-31 13:45 UTC (permalink / raw)
  To: Thomas Heinz; +Cc: Patrick Schaaf, Harald Welte, netfilter-devel

> >of all the data structures: non-uniform-memory-architecture machines.
> >An example (hopefully) not far from widespread adoption, would be the
> >new AMD Hammer SMP architecture. On such an architecture, there is
> >a performance benefit even from CPU (or node) local copies of even
> >readonly data, because of the reduced local memory latency. In the
> >Hammer example, there would be one copy of the ruleset in each
> >single Hammer's locally attached DRAM.
> >
> >Not so common today, but I always thought that the existing separation
> >in the current iptables code was good preparation for that future.
> 
> Very interesting. I guess one has to use a special API to address the
> per cpu memory so that the current implementation wouldn't work out
> of the box. Right?

I assume that the existing per-cpu API in Linux 2.5.xxx will suffice
to do the right thing on NUMA boxen. It's just a question of carefully
selecting what physical pages each per-cpu slab is put into. The problem
can and should be solved in generic allocators.

I have not followed the relevant development very closely, but there
were lively discussions on linux-kernel during the last year.

Regarding the limit match: not the current implementation, but seen
a bit more abstractly, it is possible to arrange things so each
CPU has its own copy of a partial token bucket, and refills it
from locked shared pingponging storage only when it empties.
I have no idea whether it makes sense to add the required
code complexity, but it is a distinct optimization possible
for high-volume applications.
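
A rough sketch of that idea, not the code of the existing limit module.
All names (struct limit_state, limit_init, limit_take) and the constants
NR_CPUS and BATCH are hypothetical. The sketch is single-threaded:
the place where a real implementation would take the lock is only marked
by a comment, and time-based refilling of the shared bucket is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4   /* hypothetical CPU count */
#define BATCH   8   /* hypothetical number of tokens moved per refill */

/* Shared bucket: the only storage that ping-pongs between CPUs. */
struct shared_bucket {
    uint32_t tokens;            /* would be protected by a lock */
};

struct limit_state {
    struct shared_bucket shared;
    uint32_t local[NR_CPUS];    /* per-CPU partial buckets; a real
                                 * layout would pad these per cache line */
};

static void limit_init(struct limit_state *s, uint32_t tokens)
{
    s->shared.tokens = tokens;
    for (unsigned cpu = 0; cpu < NR_CPUS; cpu++)
        s->local[cpu] = 0;
}

/* Returns true if a packet seen on `cpu` may pass. */
static bool limit_take(struct limit_state *s, unsigned cpu)
{
    assert(cpu < NR_CPUS);
    if (s->local[cpu] == 0) {
        /* Slow path: a real implementation would lock the shared
         * bucket here and unlock it after the transfer. */
        uint32_t grab = s->shared.tokens < BATCH ? s->shared.tokens : BATCH;
        s->shared.tokens -= grab;
        s->local[cpu] = grab;
    }
    if (s->local[cpu] == 0)
        return false;           /* shared bucket is empty as well */
    s->local[cpu]--;            /* fast path: CPU-local write only */
    return true;
}
```

The shared storage is touched once per BATCH packets instead of once per
packet. The price is that tokens parked in one CPU's local bucket are not
available to the other CPUs, so the limit is enforced less precisely.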

I feel that there's a lot more to worry about before that becomes
a problem in real life, though.

best regards
  Patrick

* Re: Table duplication on smp machine
  2003-03-31 13:45           ` Patrick Schaaf
@ 2003-03-31 13:51             ` Patrick Schaaf
  2003-03-31 14:06             ` Thomas Heinz
  1 sibling, 0 replies; 10+ messages in thread
From: Patrick Schaaf @ 2003-03-31 13:51 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Thomas Heinz, Harald Welte, netfilter-devel

> > Very interesting. I guess one has to use a special API to address the
> > per cpu memory so that the current implementation wouldn't work out
> > of the box. Right?
> 
> I assume that the existing per-cpu API in Linux 2.5.xxx will suffice
> to do the right thing on NUMA boxen. It's just a question of carefully
> selecting what physical pages each per-cpu slab is put into. The problem
> can and should be solved in generic allocators.

Now I understand your question :)

Regarding the current implementation, the whole of table memory
is allocated in one vmalloc() call, and that will very likely
NOT do the right thing WRT NUMA placement. I am pretty sure that
Rusty Russell knows how to get the code there, as he has participated
in the percpu development on linux-kernel.

best regards
  Patrick

* Re: Table duplication on smp machine
  2003-03-31 13:45           ` Patrick Schaaf
  2003-03-31 13:51             ` Patrick Schaaf
@ 2003-03-31 14:06             ` Thomas Heinz
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Heinz @ 2003-03-31 14:06 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Harald Welte, netfilter-devel

Hi Patrick

You wrote:
> Regarding the limit match: not the current implementation, but seen
> a bit more abstractly, it is possible to arrange things so each
> CPU has its own copy of a partial token bucket, and refills it
> from locked shared pingponging storage only when it empties.

Indeed a good example. Thank you.


Regards

Thomas
