[Q] The usage of xt

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Q] The usage of xt_recseq.
@ 2024-08-13 14:01 Sebastian Andrzej Siewior
  2024-08-13 14:37 ` [netfilter-core] " Florian Westphal
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-08-13 14:01 UTC (permalink / raw)
  To: netfilter-devel, coreteam
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Eric Dumazet,
	Thomas Gleixner

xt_recseq is per-CPU sequence counter which is not entirely using the
seqcount API.
The writer side of the sequence counter is updating the packet and byte
counter (64bit) while processing a packet. The reader simply retrieves
the two counter.
Based on the code, the writer side can be recursive which is probably
why the "regular" write side isn't used or maybe because there is no
"lock".
The seqcount is per-CPU and disabling BH is used as the "lock". On
PREEMPT_RT code in local_bh_disable()ed section is preemptible and this
means that a seqcount reader with higher priority can preempt the writer
which leads to a deadlock. 

While trying to trigger the writer side, I managed only to trigger a
single reader and only while using iptables-legacy/ arptables-legacy
commands. The nft did not trigger it. So it is legacy code only.

Would it work to convert the counters to u64_stats_sync? On 32bit
there would be a seqcount_t with preemption disabling during the
update which means the xt_write_recseq_begin()/ xt_write_recseq_end()
has to be limited the counter update only. On 64bit architectures there
would be just the update. This means that number of packets and bytes
might be "off" (the one got updated, the other not "yet") but I don't
think that this is a problem here.

I could send a RFC patch if there is nothing obvious I overlooked ;)

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-13 14:01 [Q] The usage of xt_recseq Sebastian Andrzej Siewior
@ 2024-08-13 14:37 ` Florian Westphal
  2024-08-13 15:28   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2024-08-13 14:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netfilter-devel, coreteam, Eric Dumazet, Thomas Gleixner,
	Jozsef Kadlecsik

Hi Sebastian!

Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> xt_recseq is per-CPU sequence counter which is not entirely using the
> seqcount API.
> The writer side of the sequence counter is updating the packet and byte
> counter (64bit) while processing a packet. The reader simply retrieves
> the two counter.
> Based on the code, the writer side can be recursive which is probably
> why the "regular" write side isn't used or maybe because there is no
> "lock".

Yes, recursive entry is possible even with local_bh_disable(), as
some of the xt_FOO extensions can send a packet (REJECT and TEE come
to mind), which can re-enter into ip_tables' traverser (*_do_table).

> The seqcount is per-CPU and disabling BH is used as the "lock". On
> PREEMPT_RT code in local_bh_disable()ed section is preemptible and this
> means that a seqcount reader with higher priority can preempt the writer
> which leads to a deadlock. 
> 
> While trying to trigger the writer side, I managed only to trigger a
> single reader and only while using iptables-legacy/ arptables-legacy
> commands. The nft did not trigger it. So it is legacy code only.

Yes, this is legacy only.

> Would it work to convert the counters to u64_stats_sync? On 32bit
> there would be a seqcount_t with preemption disabling during the
> update which means the xt_write_recseq_begin()/ xt_write_recseq_end()
> has to be limited the counter update only. On 64bit architectures there
> would be just the update. This means that number of packets and bytes
> might be "off" (the one got updated, the other not "yet") but I don't
> think that this is a problem here.

Unfortunately its not only about counters; local_bh_disable() is also
used to prevent messing up the chain jump stack.

For local hooks, this is called from process context, so in order
to avoid timers kicking in and then re-using the jumpstack, this
local_bh_disable avoids that.

The chain stack is percpu in -legacy, and on-stack in nf_tables.

Then, there is also recursion via xt_TEE.c, hence this strange
        if (static_key_false(&xt_tee_enabled))

in ipt_do_table() (We'll switch to a shadow-stack for that case).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-13 14:37 ` [netfilter-core] " Florian Westphal
@ 2024-08-13 15:28   ` Sebastian Andrzej Siewior
  2024-08-13 18:32     ` Florian Westphal
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-08-13 15:28 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, coreteam, Eric Dumazet, Thomas Gleixner,
	Jozsef Kadlecsik

On 2024-08-13 16:37:19 [+0200], Florian Westphal wrote:
> Hi Sebastian!
Hi Florian,

> > While trying to trigger the writer side, I managed only to trigger a
> > single reader and only while using iptables-legacy/ arptables-legacy
> > commands. The nft did not trigger it. So it is legacy code only.
> 
> Yes, this is legacy only.

perfect.

> > Would it work to convert the counters to u64_stats_sync? On 32bit
> > there would be a seqcount_t with preemption disabling during the
> > update which means the xt_write_recseq_begin()/ xt_write_recseq_end()
> > has to be limited the counter update only. On 64bit architectures there
> > would be just the update. This means that number of packets and bytes
> > might be "off" (the one got updated, the other not "yet") but I don't
> > think that this is a problem here.
> 
> Unfortunately its not only about counters; local_bh_disable() is also
> used to prevent messing up the chain jump stack.

Okay. But I could get rid of the counters/ seqcount and worry about the
other things later on?

> For local hooks, this is called from process context, so in order
> to avoid timers kicking in and then re-using the jumpstack, this
> local_bh_disable avoids that.
> 
> The chain stack is percpu in -legacy, and on-stack in nf_tables.
> 
> Then, there is also recursion via xt_TEE.c, hence this strange
>         if (static_key_false(&xt_tee_enabled))
> 
> in ipt_do_table() (We'll switch to a shadow-stack for that case).

Yeah. That is another per-CPU thingy. 
So my plan was to replace the seqcount/ counters with u64_stats_sync,
make this per-CPU nf_skb_duplicated per-TASK and then come back and
check what local_bh_disable() is actually protecting other than skb
alloc/ free during callback invocation (as in TEE).

So jumpstack. This is exclusively used by ipt_do_table(). Not sure how a
timer comes here but I goes any softirq (as in NAPI) would do the job
without actually disabling BH.
Can this be easily transformed to the on-stack thingy that nft is using
or is it completely different? If the latter I would just a add a lock. 
local_lock_nested_bh() would be the easiest to not upset anyone but this
is using hand crafted per-CPU memory instead of alloc_percpu(). Can
stacksize get extremely huge?

In case I made a wrong turn somewhere, do you have better suggestions?

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-13 15:28   ` Sebastian Andrzej Siewior
@ 2024-08-13 18:32     ` Florian Westphal
  2024-08-14  7:13       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2024-08-13 18:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Florian Westphal, netfilter-devel, coreteam, Eric Dumazet,
	Thomas Gleixner, Jozsef Kadlecsik

Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > > Would it work to convert the counters to u64_stats_sync? On 32bit
> > > there would be a seqcount_t with preemption disabling during the
> > > update which means the xt_write_recseq_begin()/ xt_write_recseq_end()
> > > has to be limited the counter update only. On 64bit architectures there
> > > would be just the update. This means that number of packets and bytes
> > > might be "off" (the one got updated, the other not "yet") but I don't
> > > think that this is a problem here.
> > 
> > Unfortunately its not only about counters; local_bh_disable() is also
> > used to prevent messing up the chain jump stack.
> 
> Okay. But I could get rid of the counters/ seqcount and worry about the
> other things later on?

Unfortunately no, see xt_replace_table().  A seqcnt transition is seen
as "done" signal for releasing the ruleset.

See d3d40f237480 ("Revert "netfilter: x_tables: Switch synchronization to RCU"")
and its history.

First step would be to revert back to rcu and then replace
synchronize_rcu with call_rcu based release of the blob.

Or, just tag the x_tables traversers as incompatible with
CONFIG_PREEMPT_RT in Kconfig...

Its possible to build a kernel that can support iptables-nft
(iptables-over-nftables api) but not classic iptables, so the
problematic table traverse code isn't built.

> So jumpstack. This is exclusively used by ipt_do_table(). Not sure how a
> timer comes here but I goes any softirq (as in NAPI) would do the job
> without actually disabling BH.

Think eg. tcp retransmit timer kicking in while kernel executed ip
output path on behalf of write() on some socket.

We're in process context, inside ipt_do_table, local_bh_disable is
needed to delay sirq from coming in at wrong time due to pcpu jumpstack
area.

> Can this be easily transformed to the on-stack thingy that nft is using
> or is it completely different?

In first step, blob validation needs to be changed to validate that jump
depth can't exceed 16 (i.e. 64byte extra scratch space on stack for
storage of return addresses).

Then it could be updated. It will probably break some test cases but
I don't think there are real production rulesets that would fail to load
with a chainstack limit of 16.

> local_lock_nested_bh() would be the easiest to not upset anyone but this
> is using hand crafted per-CPU memory instead of alloc_percpu(). Can
> stacksize get extremely huge?

Classic iptables allows as many calls as there are jumps in the ruleset,
so theroretically they can be huge.
If that happens outside of test case scripts -- i do not think so.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-13 18:32     ` Florian Westphal
@ 2024-08-14  7:13       ` Sebastian Andrzej Siewior
  2024-08-14 15:09         ` Florian Westphal
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-08-14  7:13 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, coreteam, Eric Dumazet, Thomas Gleixner,
	Jozsef Kadlecsik

On 2024-08-13 20:32:02 [+0200], Florian Westphal wrote:
> Or, just tag the x_tables traversers as incompatible with
> CONFIG_PREEMPT_RT in Kconfig...

After reading all this I am kind of leaning in for the Kconfig option
because it is legacy code. Do you have any schedule for removal? There
is nft and there is an iptables wrapper for it. It is around in Debian
since Buster/ 2019 and only because it wasn't packaged in the previous
releases.

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-14  7:13       ` Sebastian Andrzej Siewior
@ 2024-08-14 15:09         ` Florian Westphal
  2024-08-14 15:22           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2024-08-14 15:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Florian Westphal, netfilter-devel, coreteam, Eric Dumazet,
	Thomas Gleixner, Jozsef Kadlecsik

Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> On 2024-08-13 20:32:02 [+0200], Florian Westphal wrote:
> > Or, just tag the x_tables traversers as incompatible with
> > CONFIG_PREEMPT_RT in Kconfig...
> 
> After reading all this I am kind of leaning in for the Kconfig option
> because it is legacy code. Do you have any schedule for removal?

No.  I added a hidden kconfig symbol to allow for disabling it in
a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds")

I think we should wait a bit more before exposing the symbol
in Kconfig to reduce oldconfig breakage chances.

But I think all pieces are in place to allow for the removal.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [netfilter-core] [Q] The usage of xt_recseq.
  2024-08-14 15:09         ` Florian Westphal
@ 2024-08-14 15:22           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-08-14 15:22 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, coreteam, Eric Dumazet, Thomas Gleixner,
	Jozsef Kadlecsik

On 2024-08-14 17:09:19 [+0200], Florian Westphal wrote:
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > On 2024-08-13 20:32:02 [+0200], Florian Westphal wrote:
> > > Or, just tag the x_tables traversers as incompatible with
> > > CONFIG_PREEMPT_RT in Kconfig...
> > 
> > After reading all this I am kind of leaning in for the Kconfig option
> > because it is legacy code. Do you have any schedule for removal?
> 
> No.  I added a hidden kconfig symbol to allow for disabling it in
> a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds")

thank you for the pointer.

> I think we should wait a bit more before exposing the symbol
> in Kconfig to reduce oldconfig breakage chances.
> 
> But I think all pieces are in place to allow for the removal.

Awesome. 

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-08-14 15:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-13 14:01 [Q] The usage of xt_recseq Sebastian Andrzej Siewior
2024-08-13 14:37 ` [netfilter-core] " Florian Westphal
2024-08-13 15:28   ` Sebastian Andrzej Siewior
2024-08-13 18:32     ` Florian Westphal
2024-08-14  7:13       ` Sebastian Andrzej Siewior
2024-08-14 15:09         ` Florian Westphal
2024-08-14 15:22           ` Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.