netfilter-devel.vger.kernel.org archive mirror
* nftables with thousands of chains is unreasonably slow
@ 2024-05-06 11:59 Neels Hofmeyr
  2024-05-07 22:32 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 6+ messages in thread
From: Neels Hofmeyr @ 2024-05-06 11:59 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Hi!

I am trying to figure out why adding/deleting chains and retrieving counters
via nftables takes a long time when many entries are present. I'd like to share
some findings, and to ask whether this is known, whether a solution exists, etc...

I'm writing two programs that interact with nftables via
nft_run_cmd_from_buffer(). (osmo-upf and osmo-hnbgw)

One use case is setting up / tearing down thousands of GTP tunnels via nftables
rule sets one by one, for reference:
https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty

The other use case is retrieving all counters that are currently present in the
inet table that my client program owns.

Adding the first few hundred chains / reading a few hundred counters finishes
in an acceptable amount of time. But the more chains already exist in the
table, the longer it takes to add another one, etc.

By the time there are ~4000 chains, adding another chain becomes prohibitively
slow. Some numbers: setting up 1000 GTP tunnels in osmo-upf, which means adding
4000 chains to a table owned by the program (osmo-upf), takes 49 seconds in
total. This includes the overhead of talking PFCP, logging, etc., but by far
the most time is spent waiting for nft_run_cmd_from_buffer().

We'd like to scale up to about 100 times the above -- completely beyond all
reason currently, since the wait time seems to increase exponentially.

I have two angles:

1) workaround: structure the chains and rules differently?
2) analysis: bpftracing tells me that most time is spent in chain_free().

(1) Currently I have one flat table with all the chains in that "top level".
Would it make sense to try breaking that long list up into smaller groups,
maybe batches of 100? As in, pseudocode:

  table osmo-upf {
      group-0 {
          chain-0 {}
          chain-1 {}
          ...
      }
      group-1 {
          chain-100 {}
          chain-101 {}
          ...
      }
      ...
      group-1000 {
          chain-100000 {}
          chain-100001 {}
          ...
      }
  }

Then I can also retrieve the counters in batches of 100, which might be more
efficient and easier to coordinate with concurrent tasks.
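
Note that nftables chains cannot actually nest as in the pseudocode above, so
the "groups" would have to be ordinary chains reached via jump verdicts. A
rough sketch of what that could look like -- all names here are hypothetical,
and this only restructures the ruleset; it is not known to address the
userspace cache cost analyzed below:

```
table inet osmo-upf {
    map groupmap { typeof meta mark : verdict; }

    chain dispatch {
        type filter hook prerouting priority -300; policy accept;
        meta mark vmap @groupmap;          # first level: pick a group chain
    }

    # one second-level map + chain per group of ~100 tunnels
    map tunmap-group-0 { typeof meta mark : verdict; }
    chain group-0 {
        meta mark vmap @tunmap-group-0;    # second level: pick the tunnel chain
    }

    # per-tunnel chains chain-0, chain-1, ... are jumped to from @tunmap-group-0
    chain chain-0 { counter accept; }
    chain chain-1 { counter accept; }
}
```

Counters could then be listed one group chain at a time instead of dumping the
whole table at once.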

(2) Using bpftrace, I drilled a bit into where the time is spent. Two
interesting findings for me so far:

It seems most time is spent in

  nft_run_cmd_from_buffer
    nft_evaluate
      nft_cache_update
        nft_cache_release
          nft_cache_flush
            table_free
              chain_free

The bpftrace shows:

nft_cache_update() was called 6000 times.
(I have 1000 GTP tunnels and expect four chains per GTP tunnel, which would be
4000, not sure why I see 6k, but that's not really that significant.)

chain_free() was called more than 5 million times, that's 1000 times as often
as I would naively expect to see.

Most calls are fast (<16us), but there are some outliers of up to 8ms, and in
total the 5 million calls amount to 80 seconds. (One bpftrace dump below, FYI.)
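
For reference, histograms like the dump below can be collected with a bpftrace
invocation along these lines. This is a sketch, not the exact script I used:
the library path is an assumption (adjust for your distro), and attaching to an
internal symbol like chain_free() may require a non-stripped libnftables build:

```
bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libnftables.so.1:chain_free {
    @chain_free__count = count();       /* how often chain_free() runs */
    @start[tid] = nsecs;
}
uretprobe:/usr/lib/x86_64-linux-gnu/libnftables.so.1:chain_free /@start[tid]/ {
    $d = nsecs - @start[tid];
    @chain_free__us = hist($d / 1000);  /* per-call latency histogram */
    @chain_free__total_ns = sum($d);    /* total time spent in the function */
    delete(@start[tid]);
}'
```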

I'm continuing to think:

- Is this stuff known already, or does it make sense to continue on this path,
  share a reproduction recipe here, and so on?

- can we fix that? Is there some memory leak / unnecessary blow up happening
  that causes this apparent factor 1000 in effort?

- can we defer the free, so my client program gets the nftables results without
  having to wait for the freeing? Can the client program decide when to do the
  nftables freeing, i.e. not within nft_run_cmd_from_buffer()?

Thanks!

~N

foo_bar__count: function foo_bar() was called this many times
...__ms: milliseconds
...__us: microseconds

==========================================================
@handle_free__count: 5190516
@handle_free__total_ms: 16919          <--- in total 17 seconds spent in handle_free()
@handle_free__us: 
[2, 4)           4884915 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            280031 |@@                                                  |
[8, 16)            15323 |                                                    |
[16, 32)            9910 |                                                    |
[32, 64)             267 |                                                    |
[64, 128)             24 |                                                    |
[128, 256)             3 |                                                    |
[256, 512)             3 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               3 |                                                    |
[2K, 4K)              48 |                                                    |

@cache_free__count: 11100
@cache_free__total_ms: 230
@cache_free__us: 
[2, 4)              7327 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)              3461 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8, 16)              177 |@                                                   |
[16, 32)              73 |                                                    |
[32, 64)               5 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               9 |                                                    |
[2K, 4K)              39 |                                                    |
[4K, 8K)               9 |                                                    |

@scope_release__count: 5140608
@scope_release__total_ms: 12679          <--- in total 13 seconds spent in scope_release()
@scope_release__us: 
[1]              1580615 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[2, 4)           3527447 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             21717 |                                                    |
[8, 16)             4521 |                                                    |
[16, 32)            6165 |                                                    |
[32, 64)             140 |                                                    |
[64, 128)             10 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             5 |                                                    |




@set_free__count: 16650
@set_free__total_ms: 118
@set_free__us: 
[1]                 2527 |@@@@@@@@@@@@@@@@@@@                                 |
[2, 4)              6643 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)              1843 |@@@@@@@@@@@@@@                                      |
[8, 16)             3475 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[16, 32)            2119 |@@@@@@@@@@@@@@@@                                    |
[32, 64)              37 |                                                    |
[64, 128)              3 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               3 |                                                    |

@chain_free__count: 5137887          <--- called ~5M times
@chain_free__total_ms: 80114         <--- in total 80 (!) seconds spent in chain_free()
@chain_free__us:
[8, 16)          4556941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)          550786 |@@@@@@                                              |
[32, 64)           29604 |                                                    |
[64, 128)            397 |                                                    |
[128, 256)            12 |                                                    |
[256, 512)            21 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               3 |                                                    |
[2K, 4K)             117 |                                                    |
[4K, 8K)              15 |                                                    |

==========================================================


-- 
- Neels Hofmeyr <nhofmeyr@sysmocom.de>         https://www.sysmocom.de/
=======================================================================
* sysmocom - systems for mobile communications GmbH
* Siemensstr. 26a
* 10551 Berlin, Germany
* Sitz / Registered office: Berlin, HRB 134158 B
* Geschaeftsfuehrer / Managing Director: Harald Welte

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nftables with thousands of chains is unreasonably slow
  2024-05-06 11:59 nftables with thousands of chains is unreasonably slow Neels Hofmeyr
@ 2024-05-07 22:32 ` Pablo Neira Ayuso
  2024-05-08 13:58   ` Neels Hofmeyr
  2024-05-10  5:46   ` Neels Hofmeyr
  0 siblings, 2 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2024-05-07 22:32 UTC (permalink / raw)
  To: Neels Hofmeyr; +Cc: netfilter-devel

Hi Neels,

On Mon, May 06, 2024 at 01:59:55PM +0200, Neels Hofmeyr wrote:
> Hi!
> 
> I am trying to figure out why adding/deleting chains and retrieving counters
> via nftables takes long for high nr of entries. I'd like to share some
> findings, to ask whether this is known, whether a solution exists, etc...
> 
> I'm writing two programs that interact with nftables via
> nft_run_cmd_from_buffer(). (osmo-upf and osmo-hnbgw)
> 
> One use case is setting up / tearing down thousands of GTP tunnels via nftables
> rule sets one by one, for reference:
> https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty

You are focusing on the "set up / tear down thousands of GTP tunnels"
problems in this email, correct?

> The other use case is retrieving all counters that are currently present in the
> inet table that my client program owns.
> 
> Adding the first few hundred chains / reading a few hundred counters finishes
> in an ok amount of time. But the more chains are already submitted in the
> table, the longer it takes to add another one, etc.
> 
> By the time that there are ~4000 chains, adding another chain becomes
> prohibitively slow. Some numbers: in total, when setting up 1000 GTP tunnels in
> osmo-upf, which means adding 4000 chains into a table owned by the program
> (osmo-upf), in total takes 49 seconds. This includes the overhead for talking
> PFCP and logging, etc., but by far the most time is taken waiting for
> nft_run_cmd_from_buffer().

In your scenario: is there an nft_run_cmd_from_buffer() call for each
new chain?

> We'd like to scale up to like 100 times the above -- completely beyond all
> reason currently, since the wait time seems to increase exponentially.
> 
> I have two angles:
> 
> 1) workaround: structure the chains and rules differently?
> 2) analysis: bpftracing tells me that most time is spent in chain_free().
> 
> (1) Currently I have one flat table with all the chains in that "top level".
> Would it make sense to try breaking that long list up into smaller groups,
> maybe batches of 100? As in, pseudocode:
> 
>   table osmo-upf {
>       group-0 {
>          chain-0 {}
> 	 chain-1 {}
> 	 ...
>       }
>       group-1 {
>          chain-100 {}
>          chain-101 {}
> 	 ...
>       }
>       ...
>       group-1000 {
>          chain-100000 {}
>          chain-100001 {}
> 	 ...
>       }
>   }
> 
> Then I can also retrieve the counters in batches of 100, which might be more
> efficient and better to coordinate with concurrent tasks.

This means your existing approach is not batching updates?

> (2) Using bpftrace, I drilled a bit into where the time is spent. Two
> interesting findings for me so far:
> 
> It seems most time is spent in
> 
>   nft_run_cmd_from_buffer
>     nft_evaluate
>       nft_cache_update
>         nft_cache_release
>           nft_cache_flush
>             table_free
>               chain_free
> 
> The bpftrace shows:
> 
> nft_cache_update() was called 6000 times.
> (I have 1000 GTP tunnels and expect four chains per GTP tunnel, which would be
> 4000, not sure why I see 6k, but that's not really that significant.)
> 
> chain_free() was called more than 5 million times, that's 1000 times as often
> as I would naively expect to see.
> 
> Most calls are fast (<16us), but there are some break-outs of up to 8ms, and in
> total the 5 million calls amount to 80 seconds. (One bpftrace dump below, FYI)
> 
> I'm continuing to think:
> 
> - Is this stuff known already, or does it make sense to continue on this path,
>   share a reproduction recipe here, and so on?

Yes, a simple reproducer would be good to have. Users post reproducers on
bugzilla from time to time to trigger issues; sometimes those reproducers
report the time it takes, so it should be possible to get numbers before
and after patches.

> - can we fix that? Is there some memory leak / unnecessary blow up happening
>   that causes this apparent factor 1000 in effort?

Tests regularly run under valgrind and ASAN to ensure there are no memleaks.

> - can we defer the free, so my client program gets the nftables results without
>   having to wait for the freeing? Can the client program decide when to do the
>   nftables freeing, i.e. not within nft_run_cmd_from_buffer()?

I am not sure yet I understand the pattern you follow to trigger the
issue.

Thanks for reporting.


* Re: nftables with thousands of chains is unreasonably slow
  2024-05-07 22:32 ` Pablo Neira Ayuso
@ 2024-05-08 13:58   ` Neels Hofmeyr
  2024-05-08 16:48     ` Pablo Neira Ayuso
  2024-05-10  5:46   ` Neels Hofmeyr
  1 sibling, 1 reply; 6+ messages in thread
From: Neels Hofmeyr @ 2024-05-08 13:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Wed, May 08, 2024 at 12:32:38AM +0200, Pablo Neira Ayuso wrote:
> Yes, a simple reproducer would be good to have.

ack, will send one soon.
How does that typically look, a shell script?

> You are focusing on the "set up / tear down thousands of GTP tunnels"
> problems in this email, correct?

A bit of context: the general interest came about from my job, working on
osmo-upf and osmo-hnbgw, but this is now shifting to a personal interest that
isn't necessarily backed by my employer.

I think the most narrow focus is this:

compare time taken adding
- the first chain
to
- the 1001st chain
to
- the 10001st chain.

My experience shows that each additional chain takes longer than the one
before.

Likewise, deleting one of 10000 chains takes significantly longer than deleting
one of 1000 chains, which in turn takes significantly longer than deleting one
of 10 chains in a table.

I suspect inefficiency in the very basic handling of chains per se, and not in
particular my type of chain (rewriting UDP packets). But for reference of what
kinds of chains I am using, you can look at this link:
> > https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty

My aim here is a general question: how efficiently does nftables work for
large numbers of chains and rules?

- could we scale down use of temporary memory?
- could we more efficiently free temporary memory?
- could we do the freeing maintenance work *after* returning the results?

And the most important question I am asking here is: are things like this
already known issues or are we opening a new angle to things?

> In your scenario: is there an nft_run_cmd_from_buffer() call for each
> new chain?

I tried batching of something like 400 chain additions, but it does not make
any significant difference.

> > Then I can also retrieve the counters in batches of 100, which might be more
> > efficient and better to coordinate with concurrent tasks.
> 
> This means, your existing approach is not batching updates?

Reconsidering, I'd rather not mix this aspect into this mail thread, my main
interest is in the quite general performance considerations above. (This is
referring to another use case, sorry for the confusion. That use case is about
retrieving counters from all chains in a rule set efficiently. fetching all at
once is slow, so fetching in batches might help. But let's ignore that aspect
for now).

> > - can we fix that? Is there some memory leak / unnecessary blow up happening
> >   that causes this apparent factor 1000 in effort?
> 
> tests regularly run valgrind and ASAN to ensure no memleaks.

Sorry, wrong word; I didn't mean a memleak proper, but a sort of overkill
use of memory: even if all temporary allocations are properly cleaned up
later, it can still be a sort of "leak" if the number of those allocations
explodes. For example, a rapid loop could maybe use a single allocation
throughout instead of one per iteration, or some data tree need not
necessarily be copied for just some read-only access... that kind of thing
in general.

~N


* Re: nftables with thousands of chains is unreasonably slow
  2024-05-08 13:58   ` Neels Hofmeyr
@ 2024-05-08 16:48     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2024-05-08 16:48 UTC (permalink / raw)
  To: Neels Hofmeyr; +Cc: netfilter-devel

On Wed, May 08, 2024 at 03:58:25PM +0200, Neels Hofmeyr wrote:
> On Wed, May 08, 2024 at 12:32:38AM +0200, Pablo Neira Ayuso wrote:
> > Yes, a simple reproducer would be good to have.
> 
> ack, will send one soon.
> How does that typically look, a shell script?

That would be great, yes.

> > You are focusing on the "set up / tear down thousands of GTP tunnels"
> > problems in this email, correct?
> 
> A bit of context, the general interest came about from my job, working on
> osmo-upf and osmo-hnbgw, but this is now shifting to a personal interest that
> isn't necessarily backed by my employer.
> 
> I think the most narrow focus is this:
> 
> compare time taken adding
> - the first chain
> to
> - the 1001st chain
> to
> - the 10001st chain.
> 
> My experience shows that each additional chain takes longer than the one
> before.

I made this script, and I don't observe any variation in the numbers:

nft add table x
for ((i=0;i<4000;i++))
do
        time nft add chain x y$i
done

maybe your reproducer requires a map?

> Likewise, deleting one of 10000 chains takes significantly longer than deleting
> one of 1000 chains, which in turn takes significantly longer than deleting one
> of 10 chains in a table.
> 
> I suspect inefficiency in the very basic handling of chains per se, and not in
> particular my type of chain (rewriting UDP packets). But for reference of what
> kinds of chains I am using, you can look at this link:
> > > https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty
> 
> My aim here is a general approach: how efficient does nftables work for large
> numbers of chains and rules?
> 
> - could we scale down use of temporary memory?
> - could we more efficiently free temporary memory?
> - could we do the freeing maintenance work *after* returning the results?

Shrinking memory consumption is possible. I suspect you see a frequent
cache fill-up, then tear-down, with your pattern.

> And the most important question I am asking here is: are things like this
> already known issues or are we opening a new angle to things?
> 
> > In your scenario: is there an nft_run_cmd_from_buffer() call for each
> > new chain?
> 
> I tried batching of something like 400 chain additions, but it does not make
> any significant difference.
>
> > > Then I can also retrieve the counters in batches of 100, which might be more
> > > efficient and better to coordinate with concurrent tasks.
> > 
> > This means, your existing approach is not batching updates?
> 
> Reconsidering, I'd rather not mix this aspect into this mail thread, my main
> interest is in the quite general performance considerations above. (This is
> referring to another use case, sorry for the confusion. That use case is about
> retrieving counters from all chains in a rule set efficiently. fetching all at
> once is slow, so fetching in batches might help. But let's ignore that aspect
> for now).

OK.

> > > - can we fix that? Is there some memory leak / unnecessary blow up happening
> > >   that causes this apparent factor 1000 in effort?
> > 
> > tests regularly run valgrind and ASAN to ensure no memleaks.
> 
> Sorry, wrong word, I meant not really a memleak proper, but a sort of overkill
> use of memory: given that all temporary allocations are properly cleaned up
> later, it can still be a sort of "leak" if the nr of those allocations
> explodes. For example, a rapid loop could maybe use a single allocation all
> over instead of one per iteration, or some data tree need not necessarily be
> copied for just some read-only access... that kind of thing in general.

Set elements in particular consume more memory; it gets worse with maps.

If this is a general concern about memory consumption, then I can take
a look at this globally, running existing tests and look at memory
consumption numbers.

I already have a few out-of-tree patches I need to revisit to shrink
memory consumption.


* Re: nftables with thousands of chains is unreasonably slow
  2024-05-07 22:32 ` Pablo Neira Ayuso
  2024-05-08 13:58   ` Neels Hofmeyr
@ 2024-05-10  5:46   ` Neels Hofmeyr
  2024-05-27 22:31     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 6+ messages in thread
From: Neels Hofmeyr @ 2024-05-10  5:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

[-- Attachment #1: Type: text/plain, Size: 5572 bytes --]

Hi,

I have some more news on my complaints about nftables speed =)
Before, I assumed the performance loss was exponential, but it turns out
that it is linear.

Attached is a reproduction program in C.
Below is the output I get; first, the explanation.

An aim is to crank the n in the program up to 200k and still complete in
under 2 minutes.

I have taken the nft ruleset as used in osmo-upf; I have not yet spent time
finding a more minimal nft script.

One "item" here is a batch of two chains with a rule each, plus two map entries
(exactly as done for one direction of a GTP-U tunnel in osmo-upf).

The program calls one nft_run_cmd_from_buffer() per such item.
Roughly every 128th time, the program prints the time it takes for a single
nft_run_cmd_from_buffer() call adding one such item.

The first ~200 items take less than a millisecond each.
The 5000th item takes ~25 milliseconds, factor 34.
The increase is *linear* (attached a chart).

Then the program goes on to delete the same items one by one. The first one
takes ~21ms to delete, the last few only 0.33 ms each, factor 60.
The transition is also linear.

--Log output--

Legend:
'created idx <i> in <t> ns' shows one single nft_run_cmd_from_buffer() timing.
(in braces: how many nft_run_cmd_from_buffer() were called since the last log
output, and how much time has elapsed)

----
created idx    1 in  1114876 ns   (1 in 1 ms)
created idx  128 in   728617 ns   (127 in 77 ms)
created idx  256 in  1090610 ns   (128 in 121 ms)
created idx  384 in  1647839 ns   (128 in 169 ms)
created idx  512 in  2056034 ns   (128 in 231 ms)
created idx  640 in  2381756 ns   (128 in 281 ms)
created idx  768 in  2951399 ns   (128 in 333 ms)
created idx  896 in  3251231 ns   (128 in 395 ms)
created idx 1024 in  3623743 ns   (128 in 448 ms)
created idx 1152 in  4132313 ns   (128 in 506 ms)
created idx 1280 in  4833977 ns   (128 in 567 ms)
created idx 1408 in  6512871 ns   (128 in 761 ms)
created idx 1536 in  7053907 ns   (128 in 874 ms)
created idx 1664 in  7704614 ns   (128 in 953 ms)
created idx 1792 in  8203072 ns   (128 in 1031 ms)
created idx 1920 in  8978494 ns   (128 in 1110 ms)
created idx 2048 in  9616001 ns   (128 in 1189 ms)
created idx 2176 in 10496205 ns   (128 in 1266 ms)
created idx 2304 in 10877421 ns   (128 in 1367 ms)
created idx 2432 in 11407523 ns   (128 in 1429 ms)
created idx 2560 in 12016916 ns   (128 in 1509 ms)
created idx 2688 in 12740551 ns   (128 in 1594 ms)
created idx 2816 in 13171042 ns   (128 in 1672 ms)
created idx 2944 in 13911451 ns   (128 in 1754 ms)
created idx 3072 in 16008519 ns   (128 in 1861 ms)
created idx 3200 in 15957528 ns   (128 in 1968 ms)
created idx 3328 in 16310521 ns   (128 in 2063 ms)
created idx 3456 in 16731960 ns   (128 in 2124 ms)
created idx 3584 in 17517371 ns   (128 in 2202 ms)
created idx 3712 in 18090952 ns   (128 in 2272 ms)
created idx 3840 in 18867300 ns   (128 in 2361 ms)
created idx 3968 in 19330995 ns   (128 in 2432 ms)
created idx 4096 in 19871577 ns   (128 in 2519 ms)
created idx 4224 in 20686331 ns   (128 in 2593 ms)
created idx 4352 in 21175337 ns   (128 in 2700 ms)
created idx 4480 in 22022105 ns   (128 in 2766 ms)
created idx 4608 in 22466535 ns   (128 in 2846 ms)
created idx 4736 in 23259487 ns   (128 in 2939 ms)
created idx 4864 in 23976185 ns   (128 in 3015 ms)
created idx 4992 in 24488874 ns   (128 in 3102 ms)
created idx 4999 in 24558272 ns   (7 in 172 ms)
created idx 5000 in 24876752 ns   (1 in 24 ms)
deleted idx    1 in 21248834 ns   (1 in 21 ms)
deleted idx  128 in 19360199 ns   (127 in 2485 ms)
deleted idx  256 in 18756603 ns   (128 in 2437 ms)
deleted idx  384 in 18838228 ns   (128 in 2616 ms)
deleted idx  512 in 17583553 ns   (128 in 2347 ms)
deleted idx  640 in 17042570 ns   (128 in 2217 ms)
deleted idx  768 in 16833165 ns   (128 in 2162 ms)
deleted idx  896 in 15824594 ns   (128 in 2101 ms)
deleted idx 1024 in 15680040 ns   (128 in 2035 ms)
deleted idx 1152 in 14934728 ns   (128 in 1967 ms)
deleted idx 1280 in 14373106 ns   (128 in 1907 ms)
deleted idx 1408 in 14227968 ns   (128 in 1834 ms)
deleted idx 1536 in 13650017 ns   (128 in 1772 ms)
deleted idx 1664 in 13268944 ns   (128 in 1710 ms)
deleted idx 1792 in 12703344 ns   (128 in 1662 ms)
deleted idx 1920 in 12091815 ns   (128 in 1574 ms)
deleted idx 2048 in 11260211 ns   (128 in 1482 ms)
deleted idx 2176 in 10610571 ns   (128 in 1409 ms)
deleted idx 2304 in 10201007 ns   (128 in 1341 ms)
deleted idx 2432 in  9894120 ns   (128 in 1273 ms)
deleted idx 2560 in  8968704 ns   (128 in 1205 ms)
deleted idx 2688 in  8569120 ns   (128 in 1133 ms)
deleted idx 2816 in  8340385 ns   (128 in 1083 ms)
deleted idx 2944 in  8278684 ns   (128 in 1002 ms)
deleted idx 3072 in  7010837 ns   (128 in 926 ms)
deleted idx 3200 in  6341329 ns   (128 in 859 ms)
deleted idx 3328 in  4675342 ns   (128 in 681 ms)
deleted idx 3456 in  4301439 ns   (128 in 578 ms)
deleted idx 3584 in  3840153 ns   (128 in 534 ms)
deleted idx 3712 in  3667216 ns   (128 in 490 ms)
deleted idx 3840 in  3304109 ns   (128 in 446 ms)
deleted idx 3968 in  2882374 ns   (128 in 400 ms)
deleted idx 4096 in  2620334 ns   (128 in 361 ms)
deleted idx 4224 in  2231384 ns   (128 in 323 ms)
deleted idx 4352 in  1858886 ns   (128 in 274 ms)
deleted idx 4480 in  1882147 ns   (128 in 230 ms)
deleted idx 4608 in  1262297 ns   (128 in 186 ms)
deleted idx 4736 in   921315 ns   (128 in 143 ms)
deleted idx 4864 in   617549 ns   (128 in 102 ms)
deleted idx 4992 in   339924 ns   (128 in 63 ms)
deleted idx 4999 in   330487 ns   (7 in 2 ms)
deleted idx 5000 in   392075 ns   (1 in 0 ms)
----

[-- Attachment #2: Makefile --]
[-- Type: text/plain, Size: 126 bytes --]

run: nft_slew
	./nft_slew

nft_slew: nft_slew.c
	gcc -o nft_slew nft_slew.c -lnftables
	sudo setcap cap_net_admin+pe nft_slew

[-- Attachment #3: nft_slew.c --]
[-- Type: text/x-csrc, Size: 3671 bytes --]

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>
#include <time.h>
#include <sys/time.h>
#include <nftables/libnftables.h>

void timespec_diff(struct timespec *a, const struct timespec *b)
{
	a->tv_sec -= b->tv_sec;
	if (a->tv_nsec < b->tv_nsec) {
		a->tv_nsec += 1000000000L - b->tv_nsec;
		a->tv_sec--;
	} else {
		a->tv_nsec -= b->tv_nsec;
	}
}

static const char *create =
"add table inet slew {"
"       flags owner;"
"	map tunmap-pre { typeof ip daddr . @ih,32,32 : verdict; };"
"	map tunmap-post { typeof meta mark : verdict; };"
"	chain pre { type filter hook prerouting priority -300; policy accept;"
"	        udp dport 2152 ip daddr . @ih,32,32 vmap @tunmap-pre;"
"       };"
"	chain post { type filter hook postrouting priority 400; policy accept;"
"		meta mark vmap @tunmap-post;"
"	};"
"}"
;

static struct nft_ctx *nft_ctx;

int main(void) {
	nft_ctx = nft_ctx_new(NFT_CTX_DEFAULT);

	if (!nft_ctx) {
		printf("cannot create nft_ctx\n");
		return 1;
	}

	if (nft_run_cmd_from_buffer(nft_ctx, create)) {
		printf("cannot init table\n");
		return 1;
	}

	char cmd[1024];

	const unsigned int n = 5000;

	struct timespec last_start;
	struct timespec last_diff;

	struct timespec start;
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &last_start);
	int count_since_last = 0;

	for (unsigned int i = 1; i <= n; i++) {
		snprintf(cmd, sizeof(cmd),
			 "add chain inet slew tunmap-pre-%u;\n"
			 "add rule inet slew tunmap-pre-%u"
			 " ip daddr set 127.0.1.1"
			 " meta mark set %u counter accept;\n"
			 "add chain inet slew tunmap-post-%u;\n"
			 "add rule inet slew tunmap-post-%u"
			 " ip saddr set 127.0.2.1"
			 " udp sport set 2152"
			 " @ih,32,32 set 0x%x"
			 " counter accept;\n"
			 "add element inet slew tunmap-pre { "
			 "127.0.1.1 . 0x%x : jump tunmap-pre-%u };\n"
			 "add element inet slew tunmap-post { %u : jump tunmap-post-%u };\n",
			 i, i, i, i, i,
			 UINT_MAX - i,
			 i,
			 i, i, i);

		bool show_time = ((i & 0x7f) == 0) || (i == 1) || (i == n) || (i == n - 1);

		if (show_time)
			clock_gettime(CLOCK_MONOTONIC, &start);

		if (nft_run_cmd_from_buffer(nft_ctx, cmd)) {
			printf("failed on %u\n", i);
			return 1;
		}

		count_since_last++;
		if (show_time) {
			clock_gettime(CLOCK_MONOTONIC, &now);

			last_diff = now;
			timespec_diff(&last_diff, &last_start);
			last_start = now;

			timespec_diff(&now, &start);
			printf("created idx %4u in %8ld ns   (%d in %ld ms)\n",
			       i, now.tv_sec * 1000000000L + now.tv_nsec,
			       count_since_last, last_diff.tv_nsec / 1000000 + last_diff.tv_sec * 1000);
			fflush(stdout);
			count_since_last = 0;
			clock_gettime(CLOCK_MONOTONIC, &last_start);
		}
	}

	for (unsigned int i = 1; i <= n; i++) {
		snprintf(cmd, sizeof(cmd),
			 "delete element inet slew tunmap-pre { 127.0.1.1 . 0x%x };\n"
			 "delete element inet slew tunmap-post { %u };\n"
			 "delete chain inet slew tunmap-pre-%u;\n"
			 "delete chain inet slew tunmap-post-%u;\n",
			 i, i, i, i);

		bool show_time = ((i & 0x7f) == 0) || (i == 1) || (i == n) || (i == n - 1);

		if (show_time)
			clock_gettime(CLOCK_MONOTONIC, &start);

		if (nft_run_cmd_from_buffer(nft_ctx, cmd)) {
			printf("failed on deleting %u\n", i);
			return 1;
		}

		count_since_last++;
		if (show_time) {
			clock_gettime(CLOCK_MONOTONIC, &now);

			last_diff = now;
			timespec_diff(&last_diff, &last_start);
			last_start = now;

			timespec_diff(&now, &start);
			printf("deleted idx %4u in %8ld ns   (%d in %ld ms)\n",
			       i, now.tv_sec * 1000000000L + now.tv_nsec,
			       count_since_last, last_diff.tv_nsec / 1000000 + last_diff.tv_sec * 1000);
			fflush(stdout);
			count_since_last = 0;
			clock_gettime(CLOCK_MONOTONIC, &last_start);
		}
	}

	nft_ctx_free(nft_ctx);

	return 0;
}

[-- Attachment #4: chart.png --]
[-- Type: image/png, Size: 8614 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nftables with thousands of chains is unreasonably slow
  2024-05-10  5:46   ` Neels Hofmeyr
@ 2024-05-27 22:31     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2024-05-27 22:31 UTC (permalink / raw)
  To: Neels Hofmeyr; +Cc: netfilter-devel

Hi Neels,

On Fri, May 10, 2024 at 07:46:50AM +0200, Neels Hofmeyr wrote:
> Hi,
> 
> I have some more news on my complaints about nftables speed =)
> Before, I assumed the performance loss was exponential, but it turns out that
> it is linear.
> 
> Attached is a reproduction program in C.

Thanks for your reproducer.

> Below is the output I get, first the explanation.
> 
> The aim is to crank the n in the program up to 200k
> and still complete in under 2 minutes.
> 
> I have taken the nft ruleset as used in osmo-upf; I have not yet spent time
> to find a more minimal nft script.
> 
> One "item" here is a batch of two chains with a rule each, plus two map entries
> (exactly as done for one direction of a GTP-U tunnel in osmo-upf).
> 
> The program calls one nft_run_cmd_from_buffer() per such item.
> Roughly every 128th time, the program prints the time it takes for a single
> nft_run_cmd_from_buffer() call adding one such item.
> 
> The first ~200 items take less than a millisecond each.
> The 5000th item takes ~25 milliseconds, factor 34.
> The increase is *linear* (attached a chart).
> 
> Then the program goes on to delete the same items one by one. The first one
> takes ~21ms to delete, the last few only 0.33 ms each, factor 60.
> The decrease is also linear.

The problem is two-fold.

One of the issues is that the cache gets cancelled (i.e. flushed and
refilled) for each new incremental update. I made a small improvement to
address this:

https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240527221757.834892-1-pablo@netfilter.org/
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240527221757.834892-2-pablo@netfilter.org/

which improves the results shown by your nft_slew.c test program.

(the excessive number of chain_free() calls you observed is gone).

The next issue that pops up after applying these two patches is chain
validation in the kernel. With a quick kernel hack to disable chain
validation (that is a bit of cheating, but the goal is to find the baseline
in case chain validation becomes negligible), I can see constant time to
incrementally update a ruleset using your program.

> --Log output--
> 
> Legend:
> 'created idx <i> in <t> ns' shows the timing of a single nft_run_cmd_from_buffer() call.
> (in parentheses: how many nft_run_cmd_from_buffer() calls were made since the
> last log output, and how much time elapsed)
> 
> ----
> created idx    1 in  1114876 ns   (1 in 1 ms)
> created idx  128 in   728617 ns   (127 in 77 ms)
> created idx  256 in  1090610 ns   (128 in 121 ms)
> created idx  384 in  1647839 ns   (128 in 169 ms)
> created idx  512 in  2056034 ns   (128 in 231 ms)
> created idx  640 in  2381756 ns   (128 in 281 ms)
> created idx  768 in  2951399 ns   (128 in 333 ms)
> created idx  896 in  3251231 ns   (128 in 395 ms)
> created idx 1024 in  3623743 ns   (128 in 448 ms)
> created idx 1152 in  4132313 ns   (128 in 506 ms)
> created idx 1280 in  4833977 ns   (128 in 567 ms)
> created idx 1408 in  6512871 ns   (128 in 761 ms)
> created idx 1536 in  7053907 ns   (128 in 874 ms)
> created idx 1664 in  7704614 ns   (128 in 953 ms)
> created idx 1792 in  8203072 ns   (128 in 1031 ms)
> created idx 1920 in  8978494 ns   (128 in 1110 ms)
> created idx 2048 in  9616001 ns   (128 in 1189 ms)
> created idx 2176 in 10496205 ns   (128 in 1266 ms)
> created idx 2304 in 10877421 ns   (128 in 1367 ms)
> created idx 2432 in 11407523 ns   (128 in 1429 ms)
> created idx 2560 in 12016916 ns   (128 in 1509 ms)
> created idx 2688 in 12740551 ns   (128 in 1594 ms)
> created idx 2816 in 13171042 ns   (128 in 1672 ms)
> created idx 2944 in 13911451 ns   (128 in 1754 ms)
> created idx 3072 in 16008519 ns   (128 in 1861 ms)
> created idx 3200 in 15957528 ns   (128 in 1968 ms)
> created idx 3328 in 16310521 ns   (128 in 2063 ms)
> created idx 3456 in 16731960 ns   (128 in 2124 ms)
> created idx 3584 in 17517371 ns   (128 in 2202 ms)
> created idx 3712 in 18090952 ns   (128 in 2272 ms)
> created idx 3840 in 18867300 ns   (128 in 2361 ms)
> created idx 3968 in 19330995 ns   (128 in 2432 ms)
> created idx 4096 in 19871577 ns   (128 in 2519 ms)
> created idx 4224 in 20686331 ns   (128 in 2593 ms)
> created idx 4352 in 21175337 ns   (128 in 2700 ms)
> created idx 4480 in 22022105 ns   (128 in 2766 ms)
> created idx 4608 in 22466535 ns   (128 in 2846 ms)
> created idx 4736 in 23259487 ns   (128 in 2939 ms)
> created idx 4864 in 23976185 ns   (128 in 3015 ms)
> created idx 4992 in 24488874 ns   (128 in 3102 ms)
> created idx 4999 in 24558272 ns   (7 in 172 ms)
> created idx 5000 in 24876752 ns   (1 in 24 ms)
> deleted idx    1 in 21248834 ns   (1 in 21 ms)
> deleted idx  128 in 19360199 ns   (127 in 2485 ms)
> deleted idx  256 in 18756603 ns   (128 in 2437 ms)
> deleted idx  384 in 18838228 ns   (128 in 2616 ms)
> deleted idx  512 in 17583553 ns   (128 in 2347 ms)
> deleted idx  640 in 17042570 ns   (128 in 2217 ms)
> deleted idx  768 in 16833165 ns   (128 in 2162 ms)
> deleted idx  896 in 15824594 ns   (128 in 2101 ms)
> deleted idx 1024 in 15680040 ns   (128 in 2035 ms)
> deleted idx 1152 in 14934728 ns   (128 in 1967 ms)
> deleted idx 1280 in 14373106 ns   (128 in 1907 ms)
> deleted idx 1408 in 14227968 ns   (128 in 1834 ms)
> deleted idx 1536 in 13650017 ns   (128 in 1772 ms)
> deleted idx 1664 in 13268944 ns   (128 in 1710 ms)
> deleted idx 1792 in 12703344 ns   (128 in 1662 ms)
> deleted idx 1920 in 12091815 ns   (128 in 1574 ms)
> deleted idx 2048 in 11260211 ns   (128 in 1482 ms)
> deleted idx 2176 in 10610571 ns   (128 in 1409 ms)
> deleted idx 2304 in 10201007 ns   (128 in 1341 ms)
> deleted idx 2432 in  9894120 ns   (128 in 1273 ms)
> deleted idx 2560 in  8968704 ns   (128 in 1205 ms)
> deleted idx 2688 in  8569120 ns   (128 in 1133 ms)
> deleted idx 2816 in  8340385 ns   (128 in 1083 ms)
> deleted idx 2944 in  8278684 ns   (128 in 1002 ms)
> deleted idx 3072 in  7010837 ns   (128 in 926 ms)
> deleted idx 3200 in  6341329 ns   (128 in 859 ms)
> deleted idx 3328 in  4675342 ns   (128 in 681 ms)
> deleted idx 3456 in  4301439 ns   (128 in 578 ms)
> deleted idx 3584 in  3840153 ns   (128 in 534 ms)
> deleted idx 3712 in  3667216 ns   (128 in 490 ms)
> deleted idx 3840 in  3304109 ns   (128 in 446 ms)
> deleted idx 3968 in  2882374 ns   (128 in 400 ms)
> deleted idx 4096 in  2620334 ns   (128 in 361 ms)
> deleted idx 4224 in  2231384 ns   (128 in 323 ms)
> deleted idx 4352 in  1858886 ns   (128 in 274 ms)
> deleted idx 4480 in  1882147 ns   (128 in 230 ms)
> deleted idx 4608 in  1262297 ns   (128 in 186 ms)
> deleted idx 4736 in   921315 ns   (128 in 143 ms)
> deleted idx 4864 in   617549 ns   (128 in 102 ms)
> deleted idx 4992 in   339924 ns   (128 in 63 ms)
> deleted idx 4999 in   330487 ns   (7 in 2 ms)
> deleted idx 5000 in   392075 ns   (1 in 0 ms)
> ----

> run: nft_slew
> 	./nft_slew
> 
> nft_slew: nft_slew.c
> 	gcc -o nft_slew nft_slew.c -lnftables
> 	sudo setcap cap_net_admin+pe nft_slew

> #include <stdio.h>
> #include <limits.h>
> #include <stdbool.h>
> #include <time.h>
> #include <sys/time.h>
> #include <nftables/libnftables.h>
> 
> void timespec_diff(struct timespec *a, const struct timespec *b)
> {
> 	a->tv_sec -= b->tv_sec;
> 	if (a->tv_nsec < b->tv_nsec) {
> 		a->tv_nsec += 1000000000L - b->tv_nsec;
> 		a->tv_sec--;
> 	} else {
> 		a->tv_nsec -= b->tv_nsec;
> 	}
> }
> 
> static const char *create =
> "add table inet slew {"
> "       flags owner;"
> "	map tunmap-pre { typeof ip daddr . @ih,32,32 : verdict; };"
> "	map tunmap-post { typeof meta mark : verdict; };"
> "	chain pre { type filter hook prerouting priority -300; policy accept;"
> "	        udp dport 2152 ip daddr . @ih,32,32 vmap @tunmap-pre;"
> "       };"
> "	chain post { type filter hook postrouting priority 400; policy accept;"
> "		meta mark vmap @tunmap-post;"
> "	};"
> "}"
> ;
> 
> static struct nft_ctx *nft_ctx;
> 
> int main(void) {
> 	nft_ctx = nft_ctx_new(NFT_CTX_DEFAULT);
> 
> 	if (!nft_ctx) {
> 		printf("cannot create nft_ctx\n");
> 		return 1;
> 	}
> 
> 	if (nft_run_cmd_from_buffer(nft_ctx, create)) {
> 		printf("cannot init table\n");
> 		return 1;
> 	}
> 
> 	char cmd[1024];
> 
> 	const unsigned int n = 5000;
> 
> 	struct timespec last_start;
> 	struct timespec last_diff;
> 
> 	struct timespec start;
> 	struct timespec now;
> 
> 	clock_gettime(CLOCK_MONOTONIC, &last_start);
> 	int count_since_last = 0;
> 
> 	for (unsigned int i = 1; i <= n; i++) {
> 		snprintf(cmd, sizeof(cmd),
> 			 "add chain inet slew tunmap-pre-%u;\n"
> 			 "add rule inet slew tunmap-pre-%u"
> 			 " ip daddr set 127.0.1.1"
> 			 " meta mark set %u counter accept;\n"
> 			 "add chain inet slew tunmap-post-%u;\n"
> 			 "add rule inet slew tunmap-post-%u"
> 			 " ip saddr set 127.0.2.1"
> 			 " udp sport set 2152"
> 			 " @ih,32,32 set 0x%x"
> 			 " counter accept;\n"
> 			 "add element inet slew tunmap-pre { "
> 			 "127.0.1.1 . 0x%x : jump tunmap-pre-%u };\n"
> 			 "add element inet slew tunmap-post { %u : jump tunmap-post-%u };\n",
> 			 i, i, i, i, i,
> 			 UINT_MAX - i,
> 			 i,
> 			 i, i, i);
> 
> 		bool show_time = ((i & 0x7f) == 0) || (i == 1) || (i == n) || (i == n - 1);
> 
> 		if (show_time)
> 			clock_gettime(CLOCK_MONOTONIC, &start);
> 
> 		if (nft_run_cmd_from_buffer(nft_ctx, cmd)) {
> 			printf("failed on %u\n", i);
> 			return 1;
> 		}
> 
> 		count_since_last++;
> 		if (show_time) {
> 			clock_gettime(CLOCK_MONOTONIC, &now);
> 
> 			last_diff = now;
> 			timespec_diff(&last_diff, &last_start);
> 			last_start = now;
> 
> 			timespec_diff(&now, &start);
> 			printf("created idx %4u in %8ld ns   (%d in %ld ms)\n",
> 			       i, now.tv_nsec,
> 			       count_since_last, last_diff.tv_nsec / 1000000 + last_diff.tv_sec * 1000);
> 			fflush(stdout);
> 			count_since_last = 0;
> 			clock_gettime(CLOCK_MONOTONIC, &last_start);
> 		}
> 	}
> 
> 	for (unsigned int i = 1; i <= n; i++) {
> 		snprintf(cmd, sizeof(cmd),
> 			 "delete element inet slew tunmap-pre { 127.0.1.1 . 0x%x };\n"
> 			 "delete element inet slew tunmap-post { %u };\n"
> 			 "delete chain inet slew tunmap-pre-%u;\n"
> 			 "delete chain inet slew tunmap-post-%u;\n",
> 			 i, i, i, i);
> 
> 		bool show_time = ((i & 0x7f) == 0) || (i == 1) || (i == n) || (i == n - 1);
> 
> 		if (show_time)
> 			clock_gettime(CLOCK_MONOTONIC, &start);
> 
> 		if (nft_run_cmd_from_buffer(nft_ctx, cmd)) {
> 			printf("failed on deleting %u\n", i);
> 			return 1;
> 		}
> 
> 		count_since_last++;
> 		if (show_time) {
> 			clock_gettime(CLOCK_MONOTONIC, &now);
> 
> 			last_diff = now;
> 			timespec_diff(&last_diff, &last_start);
> 			last_start = now;
> 
> 			timespec_diff(&now, &start);
> 			printf("deleted idx %4u in %8ld ns   (%d in %ld ms)\n",
> 			       i, now.tv_nsec,
> 			       count_since_last, last_diff.tv_nsec / 1000000 + last_diff.tv_sec * 1000);
> 			fflush(stdout);
> 			count_since_last = 0;
> 			clock_gettime(CLOCK_MONOTONIC, &last_start);
> 		}
> 	}
> 
> 	nft_ctx_free(nft_ctx);
> 
> 	return 0;
> }




end of thread, other threads:[~2024-05-27 22:31 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
2024-05-06 11:59 nftables with thousands of chains is unreasonably slow Neels Hofmeyr
2024-05-07 22:32 ` Pablo Neira Ayuso
2024-05-08 13:58   ` Neels Hofmeyr
2024-05-08 16:48     ` Pablo Neira Ayuso
2024-05-10  5:46   ` Neels Hofmeyr
2024-05-27 22:31     ` Pablo Neira Ayuso
