[PATCH net] net_sched: act_ct: drop all packets when not attached to ingress

public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
@ 2026-02-17  9:38 Paolo Abeni
  2026-02-17 10:42 ` Paolo Abeni
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Paolo Abeni @ 2026-02-17  9:38 UTC (permalink / raw)
  To: netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim

Since the blamed commit below, classify can return TC_ACT_CONSUMED while
the current skb being held by the defragmentation engine. As reported by
GangMin Kim, if such packet is that may cause a UaF when the defrag engine
later on tries to tuch again such packet.

act_ct was never meant to be used outside of the ingress path. Making
defrag really works for act_ct outside such constraints range from very
difficult to completely impossible.

Address the issue making act_ct drop any packet when not attached to the
ingress path and additionally emit a warning about the bad
configuration.

Reported-by: GangMin Kim <km.kim1503@gmail.com>
Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
CC: stable@vger.kernel.org
Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Catching the bad configuration at runtime instead of init time to reduce
complexity
---
 net/sched/act_ct.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 81d488655793..e8eb0d195f4a 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
 	tcf_lastuse_update(&c->tcf_tm);
 	tcf_action_update_bstats(&c->common, skb);
 
+	if (!skb_at_tc_ingress(skb)) {
+		pr_warn_once("act_CT should be attached at ingress!\n");
+		goto drop;
+	}
+
 	if (clear) {
 		qdisc_skb_cb(skb)->post_ct = false;
 		ct = nf_ct_get(skb, &ctinfo);
@@ -1109,6 +1114,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
 out_frag:
 	if (err != -EINPROGRESS)
 		tcf_action_inc_drop_qstats(&c->common);
+
 	return TC_ACT_CONSUMED;
 
 drop:
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17  9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni
@ 2026-02-17 10:42 ` Paolo Abeni
  2026-02-17 14:49 ` Ilya Maximets
  2026-02-17 15:28 ` Jakub Kicinski
  2 siblings, 0 replies; 17+ messages in thread
From: Paolo Abeni @ 2026-02-17 10:42 UTC (permalink / raw)
  To: netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim

On 2/17/26 10:38 AM, Paolo Abeni wrote:
> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> the current skb being held by the defragmentation engine. As reported by
> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> later on tries to tuch again such packet.
> 
> act_ct was never meant to be used outside of the ingress path. Making
> defrag really works for act_ct outside such constraints range from very
> difficult to completely impossible.
> 
> Address the issue making act_ct drop any packet when not attached to the
> ingress path and additionally emit a warning about the bad
> configuration.
> 
> Reported-by: GangMin Kim <km.kim1503@gmail.com>
> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")

Whoops, no idea how that wrong tag landed there

> CC: stable@vger.kernel.org
> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> Catching the bad configuration at runtime instead of init time to reduce
> complexity
> ---
>  net/sched/act_ct.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 81d488655793..e8eb0d195f4a 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>  	tcf_lastuse_update(&c->tcf_tm);
>  	tcf_action_update_bstats(&c->common, skb);
>  
> +	if (!skb_at_tc_ingress(skb)) {
> +		pr_warn_once("act_CT should be attached at ingress!\n");
> +		goto drop;
> +	}
> +
>  	if (clear) {
>  		qdisc_skb_cb(skb)->post_ct = false;
>  		ct = nf_ct_get(skb, &ctinfo);
> @@ -1109,6 +1114,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>  out_frag:
>  	if (err != -EINPROGRESS)
>  		tcf_action_inc_drop_qstats(&c->common);
> +

This also must be dropped.

I'll send a v2 at due time.

/P


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17  9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni
  2026-02-17 10:42 ` Paolo Abeni
@ 2026-02-17 14:49 ` Ilya Maximets
  2026-02-17 15:52   ` Paolo Abeni
  2026-02-17 15:28 ` Jakub Kicinski
  2 siblings, 1 reply; 17+ messages in thread
From: Ilya Maximets @ 2026-02-17 14:49 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, i.maximets,
	Eelco Chaudron, Aaron Conole

On 2/17/26 10:38 AM, Paolo Abeni wrote:
> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> the current skb being held by the defragmentation engine. As reported by
> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> later on tries to tuch again such packet.
> 
> act_ct was never meant to be used outside of the ingress path. Making
> defrag really works for act_ct outside such constraints range from very
> difficult to completely impossible.
> 
> Address the issue making act_ct drop any packet when not attached to the
> ingress path and additionally emit a warning about the bad
> configuration.
> 
> Reported-by: GangMin Kim <km.kim1503@gmail.com>
> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> CC: stable@vger.kernel.org
> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> Catching the bad configuration at runtime instead of init time to reduce
> complexity
> ---
>  net/sched/act_ct.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 81d488655793..e8eb0d195f4a 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>  	tcf_lastuse_update(&c->tcf_tm);
>  	tcf_action_update_bstats(&c->common, skb);
>  
> +	if (!skb_at_tc_ingress(skb)) {
> +		pr_warn_once("act_CT should be attached at ingress!\n");

Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
change.  For TC offload of OVS tunnels, we create egress qdisc for internal
ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
A typical setup looks like this:

    Bridge br-int (10.0.0.1)
        Port br-int
            Interface br-int
                type: internal
        Port vm
            Interface vm
        Port vxlan0
            Interface vxlan0
                type: vxlan
                options: {remote_ip="172.31.1.1"}

    Bridge br-phy (172.31.1.100)
        Port br-phy
            Interface br-phy
                type: internal
        Port eth0
            Interface eth0

When a VM sends a packet that supposed to go through the tunnel, it enters
from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
tunnel metadata.  Then it gets encapsulated and put into kernel routing to
find the next hop, which is via br-phy bridge.  Packet enters br-phy and
OVS forwards the packet to eth0.  There can be stateful processing in the
br-phy bridge involving passing packets through conntrack.

With the TC offload enabled, OVS creates following filters:

1. Ingress filter on vm interface that forwards packets to vxlan0.
2. Egress filter on br-phy interface that forwards encapsulated packets
   to eth0 interface.

If some stateful processing is involved, both of those could have act_ct
in them.

AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
egressing from the kernel routing via br-phy and the ingress qdisc is not
invoked.  Ingress will be at play when packets are flowing in the opposite
direction from eth0 to br-phy, as that's where they will ingress into
standard kernel routing.

If act_tc will not be allowed on Egress, then stateful processing will
not be possible in this case in br-phy bridge.

Thoughts?

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17 14:49 ` Ilya Maximets
@ 2026-02-17 15:52   ` Paolo Abeni
  2026-02-17 19:37     ` Ilya Maximets
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Abeni @ 2026-02-17 15:52 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

Adding Florian, too

On 2/17/26 3:49 PM, Ilya Maximets wrote:
> On 2/17/26 10:38 AM, Paolo Abeni wrote:
>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
>> the current skb being held by the defragmentation engine. As reported by
>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
>> later on tries to tuch again such packet.
>>
>> act_ct was never meant to be used outside of the ingress path. Making
>> defrag really works for act_ct outside such constraints range from very
>> difficult to completely impossible.
>>
>> Address the issue making act_ct drop any packet when not attached to the
>> ingress path and additionally emit a warning about the bad
>> configuration.
>>
>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
>> CC: stable@vger.kernel.org
>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> Catching the bad configuration at runtime instead of init time to reduce
>> complexity
>> ---
>>  net/sched/act_ct.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
>> index 81d488655793..e8eb0d195f4a 100644
>> --- a/net/sched/act_ct.c
>> +++ b/net/sched/act_ct.c
>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>>  	tcf_lastuse_update(&c->tcf_tm);
>>  	tcf_action_update_bstats(&c->common, skb);
>>  
>> +	if (!skb_at_tc_ingress(skb)) {
>> +		pr_warn_once("act_CT should be attached at ingress!\n");
> 
> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
> A typical setup looks like this:
> 
>     Bridge br-int (10.0.0.1)
>         Port br-int
>             Interface br-int
>                 type: internal
>         Port vm
>             Interface vm
>         Port vxlan0
>             Interface vxlan0
>                 type: vxlan
>                 options: {remote_ip="172.31.1.1"}
> 
>     Bridge br-phy (172.31.1.100)
>         Port br-phy
>             Interface br-phy
>                 type: internal
>         Port eth0
>             Interface eth0
> 
> When a VM sends a packet that supposed to go through the tunnel, it enters
> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
> OVS forwards the packet to eth0.  There can be stateful processing in the
> br-phy bridge involving passing packets through conntrack.
> 
> With the TC offload enabled, OVS creates following filters:
> 
> 1. Ingress filter on vm interface that forwards packets to vxlan0.
> 2. Egress filter on br-phy interface that forwards encapsulated packets
>    to eth0 interface.
> 
> If some stateful processing is involved, both of those could have act_ct
> in them.
> 
> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
> egressing from the kernel routing via br-phy and the ingress qdisc is not
> invoked.  Ingress will be at play when packets are flowing in the opposite
> direction from eth0 to br-phy, as that's where they will ingress into
> standard kernel routing.
> 
> If act_tc will not be allowed on Egress, then stateful processing will
> not be possible in this case in br-phy bridge.
> 
> Thoughts?

This looks very problematic. A slightly different patch tried to
somewhat preserve functionality, but it simply can't work in presence of
IP fragments.

Why stateful processing on  br-int/port vm ingress is not enough?

/P


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17 15:52   ` Paolo Abeni
@ 2026-02-17 19:37     ` Ilya Maximets
  2026-02-18 14:28       ` Jamal Hadi Salim
  0 siblings, 1 reply; 17+ messages in thread
From: Ilya Maximets @ 2026-02-17 19:37 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: i.maximets, Jamal Hadi Salim, Jiri Pirko, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen,
	Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim,
	Eelco Chaudron, Aaron Conole, Florian Westphal

On 2/17/26 4:52 PM, Paolo Abeni wrote:
> Adding Florian, too
> 
> On 2/17/26 3:49 PM, Ilya Maximets wrote:
>> On 2/17/26 10:38 AM, Paolo Abeni wrote:
>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
>>> the current skb being held by the defragmentation engine. As reported by
>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
>>> later on tries to tuch again such packet.
>>>
>>> act_ct was never meant to be used outside of the ingress path. Making
>>> defrag really works for act_ct outside such constraints range from very
>>> difficult to completely impossible.
>>>
>>> Address the issue making act_ct drop any packet when not attached to the
>>> ingress path and additionally emit a warning about the bad
>>> configuration.
>>>
>>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
>>> CC: stable@vger.kernel.org
>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> Catching the bad configuration at runtime instead of init time to reduce
>>> complexity
>>> ---
>>>  net/sched/act_ct.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
>>> index 81d488655793..e8eb0d195f4a 100644
>>> --- a/net/sched/act_ct.c
>>> +++ b/net/sched/act_ct.c
>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>>>  	tcf_lastuse_update(&c->tcf_tm);
>>>  	tcf_action_update_bstats(&c->common, skb);
>>>  
>>> +	if (!skb_at_tc_ingress(skb)) {
>>> +		pr_warn_once("act_CT should be attached at ingress!\n");
>>
>> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
>> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
>> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
>> A typical setup looks like this:
>>
>>     Bridge br-int (10.0.0.1)
>>         Port br-int
>>             Interface br-int
>>                 type: internal
>>         Port vm
>>             Interface vm
>>         Port vxlan0
>>             Interface vxlan0
>>                 type: vxlan
>>                 options: {remote_ip="172.31.1.1"}
>>
>>     Bridge br-phy (172.31.1.100)
>>         Port br-phy
>>             Interface br-phy
>>                 type: internal
>>         Port eth0
>>             Interface eth0
>>
>> When a VM sends a packet that supposed to go through the tunnel, it enters
>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
>> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
>> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
>> OVS forwards the packet to eth0.  There can be stateful processing in the
>> br-phy bridge involving passing packets through conntrack.
>>
>> With the TC offload enabled, OVS creates following filters:
>>
>> 1. Ingress filter on vm interface that forwards packets to vxlan0.
>> 2. Egress filter on br-phy interface that forwards encapsulated packets
>>    to eth0 interface.
>>
>> If some stateful processing is involved, both of those could have act_ct
>> in them.
>>
>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
>> egressing from the kernel routing via br-phy and the ingress qdisc is not
>> invoked.  Ingress will be at play when packets are flowing in the opposite
>> direction from eth0 to br-phy, as that's where they will ingress into
>> standard kernel routing.
>>
>> If act_tc will not be allowed on Egress, then stateful processing will
>> not be possible in this case in br-phy bridge.
>>
>> Thoughts?
> 
> This looks very problematic. A slightly different patch tried to
> somewhat preserve functionality, but it simply can't work in presence of
> IP fragments.
> 
> Why stateful processing on  br-int/port vm ingress is not enough?

In general we don't know where the packet will go after encapsulation,
so we do not know beforehand if it will end up in another OVS bridge
or not.  Hence the processing has to be split in two parts, the ingress
on vm port and egress on the other bridge port.

At the same time, thinking more about potential use cases, I don't see
a good one for running encapsulated traffic through conntrack.  Most
of the time it's a UDP unidirectional traffic with hashes for source
ports.  Not sure what functionality would require to do anything
stateful on such packets.  So, maybe it's fine to prohibit act_ct on
the egress after all.


For the implementation however, let's look at a different very simple
use case:

   Application on the host -> br-phy -> NAT -> eth0.

This is a very common thing to NAT traffic before it leaves the host.
It can be for load balancing reasons, e.g. if a destination is a service
IP with multiple backends and we want multiple applications to hit
different backends.

There is no actual reason to try and offload such traffic to TC, as it
would not be possible to offload it to hardware (since the traffic
source is an application on the host).

However, since OVS doesn't really know that it's not possible to offload
to hardware and TC by itself allows creation of such filter, it will
create the filter on the egress qdisc on the bridge port and that egress
filter will be performing act_ct with the NAT.

With this patch applied, all this traffic will be just dropped.  If we're
going to prohibit act_ct on egress, that should be done at the moment of
creation of this action on the egress qdisc, so the user knows that it's
not supposed to be there.  E.g. if OVS will try to create such action and
fail, it will keep forwarding the traffic successfully through the normal
openvswitch datapath.  But if the TC filter creation succeeds, all traffic
will be just dropped by it.  That doesn't sound right.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17 19:37     ` Ilya Maximets
@ 2026-02-18 14:28       ` Jamal Hadi Salim
  2026-02-18 16:15         ` Ilya Maximets
  0 siblings, 1 reply; 17+ messages in thread
From: Jamal Hadi Salim @ 2026-02-18 14:28 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 2/17/26 4:52 PM, Paolo Abeni wrote:
> > Adding Florian, too
> >
> > On 2/17/26 3:49 PM, Ilya Maximets wrote:
> >> On 2/17/26 10:38 AM, Paolo Abeni wrote:
> >>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> >>> the current skb being held by the defragmentation engine. As reported by
> >>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> >>> later on tries to tuch again such packet.
> >>>
> >>> act_ct was never meant to be used outside of the ingress path. Making
> >>> defrag really works for act_ct outside such constraints range from very
> >>> difficult to completely impossible.
> >>>
> >>> Address the issue making act_ct drop any packet when not attached to the
> >>> ingress path and additionally emit a warning about the bad
> >>> configuration.
> >>>
> >>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
> >>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> >>> CC: stable@vger.kernel.org
> >>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
> >>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> >>> ---
> >>> Catching the bad configuration at runtime instead of init time to reduce
> >>> complexity
> >>> ---
> >>>  net/sched/act_ct.c | 6 ++++++
> >>>  1 file changed, 6 insertions(+)
> >>>
> >>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> >>> index 81d488655793..e8eb0d195f4a 100644
> >>> --- a/net/sched/act_ct.c
> >>> +++ b/net/sched/act_ct.c
> >>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
> >>>     tcf_lastuse_update(&c->tcf_tm);
> >>>     tcf_action_update_bstats(&c->common, skb);
> >>>
> >>> +   if (!skb_at_tc_ingress(skb)) {
> >>> +           pr_warn_once("act_CT should be attached at ingress!\n");
> >>
> >> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
> >> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
> >> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
> >> A typical setup looks like this:
> >>
> >>     Bridge br-int (10.0.0.1)
> >>         Port br-int
> >>             Interface br-int
> >>                 type: internal
> >>         Port vm
> >>             Interface vm
> >>         Port vxlan0
> >>             Interface vxlan0
> >>                 type: vxlan
> >>                 options: {remote_ip="172.31.1.1"}
> >>
> >>     Bridge br-phy (172.31.1.100)
> >>         Port br-phy
> >>             Interface br-phy
> >>                 type: internal
> >>         Port eth0
> >>             Interface eth0
> >>
> >> When a VM sends a packet that supposed to go through the tunnel, it enters
> >> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
> >> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
> >> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
> >> OVS forwards the packet to eth0.  There can be stateful processing in the
> >> br-phy bridge involving passing packets through conntrack.
> >>
> >> With the TC offload enabled, OVS creates following filters:
> >>
> >> 1. Ingress filter on vm interface that forwards packets to vxlan0.
> >> 2. Egress filter on br-phy interface that forwards encapsulated packets
> >>    to eth0 interface.
> >>
> >> If some stateful processing is involved, both of those could have act_ct
> >> in them.
> >>
> >> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
> >> egressing from the kernel routing via br-phy and the ingress qdisc is not
> >> invoked.  Ingress will be at play when packets are flowing in the opposite
> >> direction from eth0 to br-phy, as that's where they will ingress into
> >> standard kernel routing.
> >>
> >> If act_tc will not be allowed on Egress, then stateful processing will
> >> not be possible in this case in br-phy bridge.
> >>
> >> Thoughts?
> >
> > This looks very problematic. A slightly different patch tried to
> > somewhat preserve functionality, but it simply can't work in presence of
> > IP fragments.
> >
> > Why stateful processing on  br-int/port vm ingress is not enough?
>
> In general we don't know where the packet will go after encapsulation,
> so we do not know beforehand if it will end up in another OVS bridge
> or not.  Hence the processing has to be split in two parts, the ingress
> on vm port and egress on the other bridge port.
>
> At the same time, thinking more about potential use cases, I don't see
> a good one for running encapsulated traffic through conntrack.  Most
> of the time it's a UDP unidirectional traffic with hashes for source
> ports.  Not sure what functionality would require to do anything
> stateful on such packets.  So, maybe it's fine to prohibit act_ct on
> the egress after all.
>

Another alternative is to drop packets that have fragments? i.e
Paolo's original patch but instead of accepting to drop
Or perhaps if there are no consequences there is nothing wrong with
Paolo's original patch..

From a validation/testability/debugability POV, assuming its bad to
ignore frags processing:
The issue is for the user to distinguish between a bad setup (like
getting fragments in ct) and a good one where there were no frags. If
I were to write a tdc test to send two packets - one with fragments
and another without - a good way to check is to look at the tc
action/qdisc counters.. Paolo's original patch was accepting in both
cases the counter so the accept counter would read as 2 but if it was
dropped it would say 1 dropped and 1 accepted, so its easy to test (or
the user to debug)

cheers,
jamal

> For the implementation however, let's look at a different very simple
> use case:
>
>    Application on the host -> br-phy -> NAT -> eth0.
>
> This is a very common thing to NAT traffic before it leaves the host.
> It can be for load balancing reasons, e.g. if a destination is a service
> IP with multiple backends and we want multiple applications to hit
> different backends.
>
> There is no actual reason to try and offload such traffic to TC, as it
> would not be possible to offload it to hardware (since the traffic
> source is an application on the host).
>
> However, since OVS doesn't really know that it's not possible to offload
> to hardware and TC by itself allows creation of such filter, it will
> create the filter on the egress qdisc on the bridge port and that egress
> filter will be performing act_ct with the NAT.
>
> With this patch applied, all this traffic will be just dropped.  If we're
> going to prohibit act_ct on egress, that should be done at the moment of
> creation of this action on the egress qdisc, so the user knows that it's
> not supposed to be there.  E.g. if OVS will try to create such action and
> fail, it will keep forwarding the traffic successfully through the normal
> openvswitch datapath.  But if the TC filter creation succeeds, all traffic
> will be just dropped by it.  That doesn't sound right.
>
> Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 14:28       ` Jamal Hadi Salim
@ 2026-02-18 16:15         ` Ilya Maximets
  2026-02-18 18:31           ` Jamal Hadi Salim
  0 siblings, 1 reply; 17+ messages in thread
From: Ilya Maximets @ 2026-02-18 16:15 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: i.maximets, Paolo Abeni, netdev, Jiri Pirko, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen,
	Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim,
	Eelco Chaudron, Aaron Conole, Florian Westphal

On 2/18/26 3:28 PM, Jamal Hadi Salim wrote:
> On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 2/17/26 4:52 PM, Paolo Abeni wrote:
>>> Adding Florian, too
>>>
>>> On 2/17/26 3:49 PM, Ilya Maximets wrote:
>>>> On 2/17/26 10:38 AM, Paolo Abeni wrote:
>>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
>>>>> the current skb being held by the defragmentation engine. As reported by
>>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
>>>>> later on tries to tuch again such packet.
>>>>>
>>>>> act_ct was never meant to be used outside of the ingress path. Making
>>>>> defrag really works for act_ct outside such constraints range from very
>>>>> difficult to completely impossible.
>>>>>
>>>>> Address the issue making act_ct drop any packet when not attached to the
>>>>> ingress path and additionally emit a warning about the bad
>>>>> configuration.
>>>>>
>>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
>>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
>>>>> CC: stable@vger.kernel.org
>>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>>> ---
>>>>> Catching the bad configuration at runtime instead of init time to reduce
>>>>> complexity
>>>>> ---
>>>>>  net/sched/act_ct.c | 6 ++++++
>>>>>  1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
>>>>> index 81d488655793..e8eb0d195f4a 100644
>>>>> --- a/net/sched/act_ct.c
>>>>> +++ b/net/sched/act_ct.c
>>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>>>>>     tcf_lastuse_update(&c->tcf_tm);
>>>>>     tcf_action_update_bstats(&c->common, skb);
>>>>>
>>>>> +   if (!skb_at_tc_ingress(skb)) {
>>>>> +           pr_warn_once("act_CT should be attached at ingress!\n");
>>>>
>>>> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
>>>> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
>>>> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
>>>> A typical setup looks like this:
>>>>
>>>>     Bridge br-int (10.0.0.1)
>>>>         Port br-int
>>>>             Interface br-int
>>>>                 type: internal
>>>>         Port vm
>>>>             Interface vm
>>>>         Port vxlan0
>>>>             Interface vxlan0
>>>>                 type: vxlan
>>>>                 options: {remote_ip="172.31.1.1"}
>>>>
>>>>     Bridge br-phy (172.31.1.100)
>>>>         Port br-phy
>>>>             Interface br-phy
>>>>                 type: internal
>>>>         Port eth0
>>>>             Interface eth0
>>>>
>>>> When a VM sends a packet that supposed to go through the tunnel, it enters
>>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
>>>> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
>>>> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
>>>> OVS forwards the packet to eth0.  There can be stateful processing in the
>>>> br-phy bridge involving passing packets through conntrack.
>>>>
>>>> With the TC offload enabled, OVS creates following filters:
>>>>
>>>> 1. Ingress filter on vm interface that forwards packets to vxlan0.
>>>> 2. Egress filter on br-phy interface that forwards encapsulated packets
>>>>    to eth0 interface.
>>>>
>>>> If some stateful processing is involved, both of those could have act_ct
>>>> in them.
>>>>
>>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
>>>> egressing from the kernel routing via br-phy and the ingress qdisc is not
>>>> invoked.  Ingress will be at play when packets are flowing in the opposite
>>>> direction from eth0 to br-phy, as that's where they will ingress into
>>>> standard kernel routing.
>>>>
>>>> If act_tc will not be allowed on Egress, then stateful processing will
>>>> not be possible in this case in br-phy bridge.
>>>>
>>>> Thoughts?
>>>
>>> This looks very problematic. A slightly different patch tried to
>>> somewhat preserve functionality, but it simply can't work in presence of
>>> IP fragments.
>>>
>>> Why stateful processing on  br-int/port vm ingress is not enough?
>>
>> In general we don't know where the packet will go after encapsulation,
>> so we do not know beforehand if it will end up in another OVS bridge
>> or not.  Hence the processing has to be split in two parts, the ingress
>> on vm port and egress on the other bridge port.
>>
>> At the same time, thinking more about potential use cases, I don't see
>> a good one for running encapsulated traffic through conntrack.  Most
>> of the time it's a UDP unidirectional traffic with hashes for source
>> ports.  Not sure what functionality would require to do anything
>> stateful on such packets.  So, maybe it's fine to prohibit act_ct on
>> the egress after all.
>>
> 
> Another alternative is to drop packets that have fragments? i.e
> Paolo's original patch but instead of accepting to drop
> Or perhaps if there are no consequences there is nothing wrong with
> Paolo's original patch..
> 
> From a validation/testability/debugability POV, assuming its bad to
> ignore frags processing:
> The issue is for the user to distinguish between a bad setup (like
> getting fragments in ct) and a good one where there were no frags. If
> I were to write a tdc test to send two packets - one with fragments
> and another without - a good way to check is to look at the tc
> action/qdisc counters.. Paolo's original patch was accepting in both
> cases the counter so the accept counter would read as 2 but if it was
> dropped it would say 1 dropped and 1 accepted, so its easy to test (or
> the user to debug)
> 

From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
for the application that makes a request and for the user if they make
the request manually with 'tc filter ...'.

If the filter with act_ct is allowed to be created, but only works for
non-fragmented traffic, that would be pretty hard to debug for someone
who is not an expert in the area, but also would be a pretty inconsistent
user API.

Best regards, Ilya Maximets.

> cheers,
> jamal
> 
>> For the implementation however, let's look at a different very simple
>> use case:
>>
>>    Application on the host -> br-phy -> NAT -> eth0.
>>
>> This is a very common thing to NAT traffic before it leaves the host.
>> It can be for load balancing reasons, e.g. if a destination is a service
>> IP with multiple backends and we want multiple applications to hit
>> different backends.
>>
>> There is no actual reason to try and offload such traffic to TC, as it
>> would not be possible to offload it to hardware (since the traffic
>> source is an application on the host).
>>
>> However, since OVS doesn't really know that it's not possible to offload
>> to hardware and TC by itself allows creation of such filter, it will
>> create the filter on the egress qdisc on the bridge port and that egress
>> filter will be performing act_ct with the NAT.
>>
>> With this patch applied, all this traffic will be just dropped.  If we're
>> going to prohibit act_ct on egress, that should be done at the moment of
>> creation of this action on the egress qdisc, so the user knows that it's
>> not supposed to be there.  E.g. if OVS will try to create such action and
>> fail, it will keep forwarding the traffic successfully through the normal
>> openvswitch datapath.  But if the TC filter creation succeeds, all traffic
>> will be just dropped by it.  That doesn't sound right.
>>
>> Best regards, Ilya Maximets.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 16:15         ` Ilya Maximets
@ 2026-02-18 18:31           ` Jamal Hadi Salim
  2026-02-18 18:44             ` Jamal Hadi Salim
  0 siblings, 1 reply; 17+ messages in thread
From: Jamal Hadi Salim @ 2026-02-18 18:31 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 2/18/26 3:28 PM, Jamal Hadi Salim wrote:
> > On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 2/17/26 4:52 PM, Paolo Abeni wrote:
> >>> Adding Florian, too
> >>>
> >>> On 2/17/26 3:49 PM, Ilya Maximets wrote:
> >>>> On 2/17/26 10:38 AM, Paolo Abeni wrote:
> >>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> >>>>> the current skb being held by the defragmentation engine. As reported by
> >>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> >>>>> later on tries to tuch again such packet.
> >>>>>
> >>>>> act_ct was never meant to be used outside of the ingress path. Making
> >>>>> defrag really works for act_ct outside such constraints range from very
> >>>>> difficult to completely impossible.
> >>>>>
> >>>>> Address the issue making act_ct drop any packet when not attached to the
> >>>>> ingress path and additionally emit a warning about the bad
> >>>>> configuration.
> >>>>>
> >>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
> >>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> >>>>> CC: stable@vger.kernel.org
> >>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
> >>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> >>>>> ---
> >>>>> Catching the bad configuration at runtime instead of init time to reduce
> >>>>> complexity
> >>>>> ---
> >>>>>  net/sched/act_ct.c | 6 ++++++
> >>>>>  1 file changed, 6 insertions(+)
> >>>>>
> >>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> >>>>> index 81d488655793..e8eb0d195f4a 100644
> >>>>> --- a/net/sched/act_ct.c
> >>>>> +++ b/net/sched/act_ct.c
> >>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
> >>>>>     tcf_lastuse_update(&c->tcf_tm);
> >>>>>     tcf_action_update_bstats(&c->common, skb);
> >>>>>
> >>>>> +   if (!skb_at_tc_ingress(skb)) {
> >>>>> +           pr_warn_once("act_CT should be attached at ingress!\n");
> >>>>
> >>>> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
> >>>> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
> >>>> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
> >>>> A typical setup looks like this:
> >>>>
> >>>>     Bridge br-int (10.0.0.1)
> >>>>         Port br-int
> >>>>             Interface br-int
> >>>>                 type: internal
> >>>>         Port vm
> >>>>             Interface vm
> >>>>         Port vxlan0
> >>>>             Interface vxlan0
> >>>>                 type: vxlan
> >>>>                 options: {remote_ip="172.31.1.1"}
> >>>>
> >>>>     Bridge br-phy (172.31.1.100)
> >>>>         Port br-phy
> >>>>             Interface br-phy
> >>>>                 type: internal
> >>>>         Port eth0
> >>>>             Interface eth0
> >>>>
> >>>> When a VM sends a packet that supposed to go through the tunnel, it enters
> >>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
> >>>> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
> >>>> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
> >>>> OVS forwards the packet to eth0.  There can be stateful processing in the
> >>>> br-phy bridge involving passing packets through conntrack.
> >>>>
> >>>> With the TC offload enabled, OVS creates following filters:
> >>>>
> >>>> 1. Ingress filter on vm interface that forwards packets to vxlan0.
> >>>> 2. Egress filter on br-phy interface that forwards encapsulated packets
> >>>>    to eth0 interface.
> >>>>
> >>>> If some stateful processing is involved, both of those could have act_ct
> >>>> in them.
> >>>>
> >>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
> >>>> egressing from the kernel routing via br-phy and the ingress qdisc is not
> >>>> invoked.  Ingress will be at play when packets are flowing in the opposite
> >>>> direction from eth0 to br-phy, as that's where they will ingress into
> >>>> standard kernel routing.
> >>>>
> >>>> If act_tc will not be allowed on Egress, then stateful processing will
> >>>> not be possible in this case in br-phy bridge.
> >>>>
> >>>> Thoughts?
> >>>
> >>> This looks very problematic. A slightly different patch tried to
> >>> somewhat preserve functionality, but it simply can't work in presence of
> >>> IP fragments.
> >>>
> >>> Why stateful processing on  br-int/port vm ingress is not enough?
> >>
> >> In general we don't know where the packet will go after encapsulation,
> >> so we do not know beforehand if it will end up in another OVS bridge
> >> or not.  Hence the processing has to be split in two parts, the ingress
> >> on vm port and egress on the other bridge port.
> >>
> >> At the same time, thinking more about potential use cases, I don't see
> >> a good one for running encapsulated traffic through conntrack.  Most
> >> of the time it's a UDP unidirectional traffic with hashes for source
> >> ports.  Not sure what functionality would require to do anything
> >> stateful on such packets.  So, maybe it's fine to prohibit act_ct on
> >> the egress after all.
> >>
> >
> > Another alternative is to drop packets that have fragments? i.e
> > Paolo's original patch but instead of accepting to drop
> > Or perhaps if there are no consequences there is nothing wrong with
> > Paolo's original patch..
> >
> > From a validation/testability/debugability POV, assuming its bad to
> > ignore frags processing:
> > The issue is for the user to distinguish between a bad setup (like
> > getting fragments in ct) and a good one where there were no frags. If
> > I were to write a tdc test to send two packets - one with fragments
> > and another without - a good way to check is to look at the tc
> > action/qdisc counters.. Paolo's original patch was accepting in both
> > cases the counter so the accept counter would read as 2 but if it was
> > dropped it would say 1 dropped and 1 accepted, so its easy to test (or
> > the user to debug)
> >
>
> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
> for the application that makes a request and for the user if they make
> the request manually with 'tc filter ...'.
>

The challenge is actions could be created as a standalone i.e "tc
actions add action ct..." then later bound via tc filter. Don't forget
actions can also be shared by multiple filters (which could be a mix
of egress/ingress)...

> If the filter with act_ct is allowed to be created, but only works for
> non-fragmented traffic, that would be pretty hard to debug for someone
> who is not an expert in the area, but also would be a pretty inconsistent
> user API.
>

True. The drop counter is better than "nothing"..

cheers,
jamal


> Best regards, Ilya Maximets.
>
> > cheers,
> > jamal
> >
> >> For the implementation however, let's look at a different very simple
> >> use case:
> >>
> >>    Application on the host -> br-phy -> NAT -> eth0.
> >>
> >> This is a very common thing to NAT traffic before it leaves the host.
> >> It can be for load balancing reasons, e.g. if a destination is a service
> >> IP with multiple backends and we want multiple applications to hit
> >> different backends.
> >>
> >> There is no actual reason to try and offload such traffic to TC, as it
> >> would not be possible to offload it to hardware (since the traffic
> >> source is an application on the host).
> >>
> >> However, since OVS doesn't really know that it's not possible to offload
> >> to hardware and TC by itself allows creation of such filter, it will
> >> create the filter on the egress qdisc on the bridge port and that egress
> >> filter will be performing act_ct with the NAT.
> >>
> >> With this patch applied, all this traffic will be just dropped.  If we're
> >> going to prohibit act_ct on egress, that should be done at the moment of
> >> creation of this action on the egress qdisc, so the user knows that it's
> >> not supposed to be there.  E.g. if OVS will try to create such action and
> >> fail, it will keep forwarding the traffic successfully through the normal
> >> openvswitch datapath.  But if the TC filter creation succeeds, all traffic
> >> will be just dropped by it.  That doesn't sound right.
> >>
> >> Best regards, Ilya Maximets.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 18:31           ` Jamal Hadi Salim
@ 2026-02-18 18:44             ` Jamal Hadi Salim
  2026-02-18 20:43               ` Paolo Abeni
  0 siblings, 1 reply; 17+ messages in thread
From: Jamal Hadi Salim @ 2026-02-18 18:44 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 2/18/26 3:28 PM, Jamal Hadi Salim wrote:
> > > On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>
> > >> On 2/17/26 4:52 PM, Paolo Abeni wrote:
> > >>> Adding Florian, too
> > >>>
> > >>> On 2/17/26 3:49 PM, Ilya Maximets wrote:
> > >>>> On 2/17/26 10:38 AM, Paolo Abeni wrote:
> > >>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> > >>>>> the current skb being held by the defragmentation engine. As reported by
> > >>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> > >>>>> later on tries to tuch again such packet.
> > >>>>>
> > >>>>> act_ct was never meant to be used outside of the ingress path. Making
> > >>>>> defrag really works for act_ct outside such constraints range from very
> > >>>>> difficult to completely impossible.
> > >>>>>
> > >>>>> Address the issue making act_ct drop any packet when not attached to the
> > >>>>> ingress path and additionally emit a warning about the bad
> > >>>>> configuration.
> > >>>>>
> > >>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com>
> > >>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> > >>>>> CC: stable@vger.kernel.org
> > >>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
> > >>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > >>>>> ---
> > >>>>> Catching the bad configuration at runtime instead of init time to reduce
> > >>>>> complexity
> > >>>>> ---
> > >>>>>  net/sched/act_ct.c | 6 ++++++
> > >>>>>  1 file changed, 6 insertions(+)
> > >>>>>
> > >>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> > >>>>> index 81d488655793..e8eb0d195f4a 100644
> > >>>>> --- a/net/sched/act_ct.c
> > >>>>> +++ b/net/sched/act_ct.c
> > >>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
> > >>>>>     tcf_lastuse_update(&c->tcf_tm);
> > >>>>>     tcf_action_update_bstats(&c->common, skb);
> > >>>>>
> > >>>>> +   if (!skb_at_tc_ingress(skb)) {
> > >>>>> +           pr_warn_once("act_CT should be attached at ingress!\n");
> > >>>>
> > >>>> Hi, Paolo.  I didn't test this yet, but I'm a little concerned about this
> > >>>> change.  For TC offload of OVS tunnels, we create egress qdisc for internal
> > >>>> ports, a.k.a. bridge ports.  This is necessary for tunnels to work.
> > >>>> A typical setup looks like this:
> > >>>>
> > >>>>     Bridge br-int (10.0.0.1)
> > >>>>         Port br-int
> > >>>>             Interface br-int
> > >>>>                 type: internal
> > >>>>         Port vm
> > >>>>             Interface vm
> > >>>>         Port vxlan0
> > >>>>             Interface vxlan0
> > >>>>                 type: vxlan
> > >>>>                 options: {remote_ip="172.31.1.1"}
> > >>>>
> > >>>>     Bridge br-phy (172.31.1.100)
> > >>>>         Port br-phy
> > >>>>             Interface br-phy
> > >>>>                 type: internal
> > >>>>         Port eth0
> > >>>>             Interface eth0
> > >>>>
> > >>>> When a VM sends a packet that supposed to go through the tunnel, it enters
> > >>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate
> > >>>> tunnel metadata.  Then it gets encapsulated and put into kernel routing to
> > >>>> find the next hop, which is via br-phy bridge.  Packet enters br-phy and
> > >>>> OVS forwards the packet to eth0.  There can be stateful processing in the
> > >>>> br-phy bridge involving passing packets through conntrack.
> > >>>>
> > >>>> With the TC offload enabled, OVS creates following filters:
> > >>>>
> > >>>> 1. Ingress filter on vm interface that forwards packets to vxlan0.
> > >>>> 2. Egress filter on br-phy interface that forwards encapsulated packets
> > >>>>    to eth0 interface.
> > >>>>
> > >>>> If some stateful processing is involved, both of those could have act_ct
> > >>>> in them.
> > >>>>
> > >>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is
> > >>>> egressing from the kernel routing via br-phy and the ingress qdisc is not
> > >>>> invoked.  Ingress will be at play when packets are flowing in the opposite
> > >>>> direction from eth0 to br-phy, as that's where they will ingress into
> > >>>> standard kernel routing.
> > >>>>
> > >>>> If act_tc will not be allowed on Egress, then stateful processing will
> > >>>> not be possible in this case in br-phy bridge.
> > >>>>
> > >>>> Thoughts?
> > >>>
> > >>> This looks very problematic. A slightly different patch tried to
> > >>> somewhat preserve functionality, but it simply can't work in presence of
> > >>> IP fragments.
> > >>>
> > >>> Why stateful processing on  br-int/port vm ingress is not enough?
> > >>
> > >> In general we don't know where the packet will go after encapsulation,
> > >> so we do not know beforehand if it will end up in another OVS bridge
> > >> or not.  Hence the processing has to be split in two parts, the ingress
> > >> on vm port and egress on the other bridge port.
> > >>
> > >> At the same time, thinking more about potential use cases, I don't see
> > >> a good one for running encapsulated traffic through conntrack.  Most
> > >> of the time it's a UDP unidirectional traffic with hashes for source
> > >> ports.  Not sure what functionality would require to do anything
> > >> stateful on such packets.  So, maybe it's fine to prohibit act_ct on
> > >> the egress after all.
> > >>
> > >
> > > Another alternative is to drop packets that have fragments? i.e
> > > Paolo's original patch but instead of accepting to drop
> > > Or perhaps if there are no consequences there is nothing wrong with
> > > Paolo's original patch..
> > >
> > > From a validation/testability/debugability POV, assuming its bad to
> > > ignore frags processing:
> > > The issue is for the user to distinguish between a bad setup (like
> > > getting fragments in ct) and a good one where there were no frags. If
> > > I were to write a tdc test to send two packets - one with fragments
> > > and another without - a good way to check is to look at the tc
> > > action/qdisc counters.. Paolo's original patch was accepting in both
> > > cases the counter so the accept counter would read as 2 but if it was
> > > dropped it would say 1 dropped and 1 accepted, so its easy to test (or
> > > the user to debug)
> > >
> >
> > From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
> > it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
> > for the application that makes a request and for the user if they make
> > the request manually with 'tc filter ...'.
> >
>

Actually, looking closely at the code - this is doable. Let's see if a
patch can be cooked.


cheers,
jamal

> The challenge is actions could be created as a standalone i.e "tc
> actions add action ct..." then later bound via tc filter. Don't forget
> actions can also be shared by multiple filters (which could be a mix
> of egress/ingress)...
>


> > If the filter with act_ct is allowed to be created, but only works for
> > non-fragmented traffic, that would be pretty hard to debug for someone
> > who is not an expert in the area, but also would be a pretty inconsistent
> > user API.
> >
>
> True. The drop counter is better than "nothing"..
>
> cheers,
> jamal
>
>
> > Best regards, Ilya Maximets.
> >
> > > cheers,
> > > jamal
> > >
> > >> For the implementation however, let's look at a different very simple
> > >> use case:
> > >>
> > >>    Application on the host -> br-phy -> NAT -> eth0.
> > >>
> > >> This is a very common thing to NAT traffic before it leaves the host.
> > >> It can be for load balancing reasons, e.g. if a destination is a service
> > >> IP with multiple backends and we want multiple applications to hit
> > >> different backends.
> > >>
> > >> There is no actual reason to try and offload such traffic to TC, as it
> > >> would not be possible to offload it to hardware (since the traffic
> > >> source is an application on the host).
> > >>
> > >> However, since OVS doesn't really know that it's not possible to offload
> > >> to hardware and TC by itself allows creation of such filter, it will
> > >> create the filter on the egress qdisc on the bridge port and that egress
> > >> filter will be performing act_ct with the NAT.
> > >>
> > >> With this patch applied, all this traffic will be just dropped.  If we're
> > >> going to prohibit act_ct on egress, that should be done at the moment of
> > >> creation of this action on the egress qdisc, so the user knows that it's
> > >> not supposed to be there.  E.g. if OVS will try to create such action and
> > >> fail, it will keep forwarding the traffic successfully through the normal
> > >> openvswitch datapath.  But if the TC filter creation succeeds, all traffic
> > >> will be just dropped by it.  That doesn't sound right.
> > >>
> > >> Best regards, Ilya Maximets.
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 18:44             ` Jamal Hadi Salim
@ 2026-02-18 20:43               ` Paolo Abeni
  2026-02-19 11:46                 ` Ilya Maximets
  2026-02-19 14:16                 ` Jamal Hadi Salim
  0 siblings, 2 replies; 17+ messages in thread
From: Paolo Abeni @ 2026-02-18 20:43 UTC (permalink / raw)
  To: Jamal Hadi Salim, Ilya Maximets
  Cc: netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe,
	Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole,
	Florian Westphal

On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
>>> for the application that makes a request and for the user if they make
>>> the request manually with 'tc filter ...'.
> 
>> The challenge is actions could be created as a standalone i.e "tc
>> actions add action ct..." then later bound via tc filter. Don't forget
>> actions can also be shared by multiple filters (which could be a mix
>> of egress/ingress)...
>
> Actually, looking closely at the code - this is doable. Let's see if a
> patch can be cooked.

I discussed a bit the topic with Davide, it looks like the thing you
mentioned above could be handled. The problematic part is AFAICS the
additional indirection level added by (possibly shared) blocks. AFAICS
whatever check we do at ct_init() time, shared block could later
circumvent it - unless act_ct is always forbidden for shared blocks.

TL;DR: I think admission check at init time can be implemented only in a
quite (too much?) restrictive way.

/P


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 20:43               ` Paolo Abeni
@ 2026-02-19 11:46                 ` Ilya Maximets
  2026-02-19 14:16                 ` Jamal Hadi Salim
  1 sibling, 0 replies; 17+ messages in thread
From: Ilya Maximets @ 2026-02-19 11:46 UTC (permalink / raw)
  To: Paolo Abeni, Jamal Hadi Salim
  Cc: i.maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On 2/18/26 9:43 PM, Paolo Abeni wrote:
> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
>>>> for the application that makes a request and for the user if they make
>>>> the request manually with 'tc filter ...'.
>>
>>> The challenge is actions could be created as a standalone i.e "tc
>>> actions add action ct..." then later bound via tc filter. Don't forget
>>> actions can also be shared by multiple filters (which could be a mix
>>> of egress/ingress)...
>>
>> Actually, looking closely at the code - this is doable. Let's see if a
>> patch can be cooked.
> 
> I discussed a bit the topic with Davide, it looks like the thing you
> mentioned above could be handled. The problematic part is AFAICS the
> additional indirection level added by (possibly shared) blocks. AFAICS
> whatever check we do at ct_init() time, shared block could later
> circumvent it - unless act_ct is always forbidden for shared blocks.

I'm obviously biased here, but if it's not possible to cover all the
cases, I'd still prefer a partial solution over the no solution.
If we can handle just the RTM_NEWTFILTER case, that will allow us to
at least avoid breaking a known use case with OVS.  If it will be
possible to circumvent that with block sharing, then we can still drop
fragmented traffic there.  I'm not sure how widely sharing is used.

Best regards, Ilya Maximets.

> 
> TL;DR: I think admission check at init time can be implemented only in a
> quite (too much?) restrictive way.
> 
> /P
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-18 20:43               ` Paolo Abeni
  2026-02-19 11:46                 ` Ilya Maximets
@ 2026-02-19 14:16                 ` Jamal Hadi Salim
  2026-02-19 20:13                   ` Jamal Hadi Salim
  1 sibling, 1 reply; 17+ messages in thread
From: Jamal Hadi Salim @ 2026-02-19 14:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
> > On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
> >>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
> >>> for the application that makes a request and for the user if they make
> >>> the request manually with 'tc filter ...'.
> >
> >> The challenge is actions could be created as a standalone i.e "tc
> >> actions add action ct..." then later bound via tc filter. Don't forget
> >> actions can also be shared by multiple filters (which could be a mix
> >> of egress/ingress)...
> >
> > Actually, looking closely at the code - this is doable. Let's see if a
> > patch can be cooked.
>
> I discussed a bit the topic with Davide, it looks like the thing you
> mentioned above could be handled. The problematic part is AFAICS the
> additional indirection level added by (possibly shared) blocks. AFAICS
> whatever check we do at ct_init() time, shared block could later
> circumvent it - unless act_ct is always forbidden for shared blocks.
>
> TL;DR: I think admission check at init time can be implemented only in a
> quite (too much?) restrictive way.
>

True - the blocks will require more surgery but seems doable. We may
end up again in ->quirks() land here. I.e when an action is bound to a
filter (not necessarily when created) it should check its fit in the
graph and see if it is legit. But since this is only the second quirk
we know of, we can simplify. Note: We already have one such check,
albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS).
I will talk to Victor and see if he can cook at minimal a preliminary
patch without considering blocks, and at best one that does consider
blocks.

cheers,
jamal
> /P
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-19 14:16                 ` Jamal Hadi Salim
@ 2026-02-19 20:13                   ` Jamal Hadi Salim
  2026-02-20 12:24                     ` Victor Nogueira
  0 siblings, 1 reply; 17+ messages in thread
From: Jamal Hadi Salim @ 2026-02-19 20:13 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
> > > On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
> > >>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
> > >>> for the application that makes a request and for the user if they make
> > >>> the request manually with 'tc filter ...'.
> > >
> > >> The challenge is actions could be created as a standalone i.e "tc
> > >> actions add action ct..." then later bound via tc filter. Don't forget
> > >> actions can also be shared by multiple filters (which could be a mix
> > >> of egress/ingress)...
> > >
> > > Actually, looking closely at the code - this is doable. Let's see if a
> > > patch can be cooked.
> >
> > I discussed a bit the topic with Davide, it looks like the thing you
> > mentioned above could be handled. The problematic part is AFAICS the
> > additional indirection level added by (possibly shared) blocks. AFAICS
> > whatever check we do at ct_init() time, shared block could later
> > circumvent it - unless act_ct is always forbidden for shared blocks.
> >
> > TL;DR: I think admission check at init time can be implemented only in a
> > quite (too much?) restrictive way.
> >
>
> True - the blocks will require more surgery but seems doable. We may
> end up again in ->quirks() land here. I.e when an action is bound to a
> filter (not necessarily when created) it should check its fit in the
> graph and see if it is legit. But since this is only the second quirk
> we know of, we can simplify. Note: We already have one such check,
> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS).
> I will talk to Victor and see if he can cook at minimal a preliminary
> patch without considering blocks, and at best one that does consider
> blocks.

After discussions with Victor we concluded that blocks have no effect
on this specific issue.
Blocks are only useful in clsact(both ingress/egress) and ingress.
Both of these handle ACT_CONSUMED fine.
Code sample looks as follows:

        case TC_ACT_STOLEN:
        case TC_ACT_QUEUED:
        case TC_ACT_TRAP:
                consume_skb(skb);
                fallthrough;
        case TC_ACT_CONSUMED:
                *ret = NET_RX_SUCCESS;
                bpf_net_ctx_clear(bpf_net_ctx);
                return NULL;

The reason this works is because they don't have to deal with
queues(unlike the other qdiscs).
So, ignoring the block should work just fine. Ilya, for your case, if
you must egress, use clsact egress - the only problem is it will not
support the more feature-rich qdiscs...

cheers,
jamal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-19 20:13                   ` Jamal Hadi Salim
@ 2026-02-20 12:24                     ` Victor Nogueira
  2026-02-20 13:41                       ` Ilya Maximets
  0 siblings, 1 reply; 17+ messages in thread
From: Victor Nogueira @ 2026-02-20 12:24 UTC (permalink / raw)
  To: Jamal Hadi Salim, Paolo Abeni
  Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On 19/02/2026 17:13, Jamal Hadi Salim wrote:
> On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>>>
>>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
>>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>  From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
>>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
>>>>>> for the application that makes a request and for the user if they make
>>>>>> the request manually with 'tc filter ...'.
>>>>
>>>>> The challenge is actions could be created as a standalone i.e "tc
>>>>> actions add action ct..." then later bound via tc filter. Don't forget
>>>>> actions can also be shared by multiple filters (which could be a mix
>>>>> of egress/ingress)...
>>>>
>>>> Actually, looking closely at the code - this is doable. Let's see if a
>>>> patch can be cooked.
>>>
>>> I discussed a bit the topic with Davide, it looks like the thing you
>>> mentioned above could be handled. The problematic part is AFAICS the
>>> additional indirection level added by (possibly shared) blocks. AFAICS
>>> whatever check we do at ct_init() time, shared block could later
>>> circumvent it - unless act_ct is always forbidden for shared blocks.
>>>
>>> TL;DR: I think admission check at init time can be implemented only in a
>>> quite (too much?) restrictive way.
>>>
>>
>> True - the blocks will require more surgery but seems doable. We may
>> end up again in ->quirks() land here. I.e when an action is bound to a
>> filter (not necessarily when created) it should check its fit in the
>> graph and see if it is legit. But since this is only the second quirk
>> we know of, we can simplify. Note: We already have one such check,
>> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS).
>> I will talk to Victor and see if he can cook at minimal a preliminary
>> patch without considering blocks, and at best one that does consider
>> blocks.
> 
> After discussions with Victor we concluded that blocks have no effect
> on this specific issue.
> Blocks are only useful in clsact(both ingress/egress) and ingress.
> Both of these handle ACT_CONSUMED fine.
> Code sample looks as follows:
> 
>          case TC_ACT_STOLEN:
>          case TC_ACT_QUEUED:
>          case TC_ACT_TRAP:
>                  consume_skb(skb);
>                  fallthrough;
>          case TC_ACT_CONSUMED:
>                  *ret = NET_RX_SUCCESS;
>                  bpf_net_ctx_clear(bpf_net_ctx);
>                  return NULL;
> 
> The reason this works is because they don't have to deal with
> queues(unlike the other qdiscs).
> So, ignoring the block should work just fine. Ilya, for your case, if
> you must egress, use clsact egress - the only problem is it will not
> support the more feature-rich qdiscs...

If there is no objection, we are thinking of sending something along
these lines:

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 91a24b5e0b93..2ba40eb45aad 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -70,6 +70,7 @@ struct tc_action {
  #define TCA_ACT_FLAGS_REPLACE  (1U << (TCA_ACT_FLAGS_USER_BITS + 2))
  #define TCA_ACT_FLAGS_NO_RTNL  (1U << (TCA_ACT_FLAGS_USER_BITS + 3))
  #define TCA_ACT_FLAGS_AT_INGRESS       (1U << (TCA_ACT_FLAGS_USER_BITS 
+ 4))
+#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT     (1U << 
(TCA_ACT_FLAGS_USER_BITS + 5))

  /* Update lastuse only if needed, to avoid dirtying a cache line.
   * We use a temp variable to avoid fetching jiffies twice.
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 81d488655793..a79479faf0f8 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct 
nlattr *nla,
                 return -EINVAL;
         }

+       if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) {
+               NL_SET_ERR_MSG_MOD(extack,
+                               "May only attach ct to ingress or clsact");
+               return -EPERM;
+       }
+
         err = nla_parse_nested(tb, TCA_CT_MAX, nla, ct_policy, extack);
         if (err < 0)
                 return err;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ebca4b926dcf..8c72faf3314d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2228,6 +2228,11 @@ static bool is_qdisc_ingress(__u32 classid)
         return (TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS));
  }

+static bool is_ingress_or_clsact(struct tcf_block *block, struct Qdisc *q)
+{
+       return tcf_block_shared(block) || (q && !!(q->flags & 
TCQ_F_INGRESS));
+}
+
  static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
                           struct netlink_ext_ack *extack)
  {
@@ -2420,6 +2425,8 @@ static int tc_new_tfilter(struct sk_buff *skb, 
struct nlmsghdr *n,
                 flags |= TCA_ACT_FLAGS_NO_RTNL;
         if (is_qdisc_ingress(parent))
                 flags |= TCA_ACT_FLAGS_AT_INGRESS;
+       if (is_ingress_or_clsact(block, q))
+               flags |= TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT;
         err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
                               flags, extack);
         if (err == 0) {



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-20 12:24                     ` Victor Nogueira
@ 2026-02-20 13:41                       ` Ilya Maximets
  2026-02-20 16:12                         ` Victor Nogueira
  0 siblings, 1 reply; 17+ messages in thread
From: Ilya Maximets @ 2026-02-20 13:41 UTC (permalink / raw)
  To: Victor Nogueira, Jamal Hadi Salim, Paolo Abeni
  Cc: i.maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron,
	Aaron Conole, Florian Westphal

On 2/20/26 1:24 PM, Victor Nogueira wrote:
> On 19/02/2026 17:13, Jamal Hadi Salim wrote:
>> On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>>
>>> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>
>>>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
>>>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>  From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
>>>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
>>>>>>> for the application that makes a request and for the user if they make
>>>>>>> the request manually with 'tc filter ...'.
>>>>>
>>>>>> The challenge is actions could be created as a standalone i.e "tc
>>>>>> actions add action ct..." then later bound via tc filter. Don't forget
>>>>>> actions can also be shared by multiple filters (which could be a mix
>>>>>> of egress/ingress)...
>>>>>
>>>>> Actually, looking closely at the code - this is doable. Let's see if a
>>>>> patch can be cooked.
>>>>
>>>> I discussed a bit the topic with Davide, it looks like the thing you
>>>> mentioned above could be handled. The problematic part is AFAICS the
>>>> additional indirection level added by (possibly shared) blocks. AFAICS
>>>> whatever check we do at ct_init() time, shared block could later
>>>> circumvent it - unless act_ct is always forbidden for shared blocks.
>>>>
>>>> TL;DR: I think admission check at init time can be implemented only in a
>>>> quite (too much?) restrictive way.
>>>>
>>>
>>> True - the blocks will require more surgery but seems doable. We may
>>> end up again in ->quirks() land here. I.e when an action is bound to a
>>> filter (not necessarily when created) it should check its fit in the
>>> graph and see if it is legit. But since this is only the second quirk
>>> we know of, we can simplify. Note: We already have one such check,
>>> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS).
>>> I will talk to Victor and see if he can cook at minimal a preliminary
>>> patch without considering blocks, and at best one that does consider
>>> blocks.
>>
>> After discussions with Victor we concluded that blocks have no effect
>> on this specific issue.
>> Blocks are only useful in clsact(both ingress/egress) and ingress.
>> Both of these handle ACT_CONSUMED fine.
>> Code sample looks as follows:
>>
>>          case TC_ACT_STOLEN:
>>          case TC_ACT_QUEUED:
>>          case TC_ACT_TRAP:
>>                  consume_skb(skb);
>>                  fallthrough;
>>          case TC_ACT_CONSUMED:
>>                  *ret = NET_RX_SUCCESS;
>>                  bpf_net_ctx_clear(bpf_net_ctx);
>>                  return NULL;
>>
>> The reason this works is because they don't have to deal with
>> queues(unlike the other qdiscs).
>> So, ignoring the block should work just fine. Ilya, for your case, if
>> you must egress, use clsact egress - the only problem is it will not
>> support the more feature-rich qdiscs...
> 
> If there is no objection, we are thinking of sending something along
> these lines:
> 
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index 91a24b5e0b93..2ba40eb45aad 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -70,6 +70,7 @@ struct tc_action {
>   #define TCA_ACT_FLAGS_REPLACE  (1U << (TCA_ACT_FLAGS_USER_BITS + 2))
>   #define TCA_ACT_FLAGS_NO_RTNL  (1U << (TCA_ACT_FLAGS_USER_BITS + 3))
>   #define TCA_ACT_FLAGS_AT_INGRESS       (1U << (TCA_ACT_FLAGS_USER_BITS 
> + 4))
> +#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT     (1U << 
> (TCA_ACT_FLAGS_USER_BITS + 5))
> 
>   /* Update lastuse only if needed, to avoid dirtying a cache line.
>    * We use a temp variable to avoid fetching jiffies twice.
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 81d488655793..a79479faf0f8 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct 
> nlattr *nla,
>                  return -EINVAL;
>          }
> 
> +       if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) {
> +               NL_SET_ERR_MSG_MOD(extack,
> +                               "May only attach ct to ingress or clsact");
> +               return -EPERM;

I haven't tried this change yet, but I'd say this error code should
be EOPNOTSUPP instead of EPERM, as it's not a permissions problem,
it's that we decided to not support this use case.

From the OVS side EOPNOTSUPP will be handled by ovs-vswitchd normally
with a fallback to a standard openvswitch datapath.  While EPERM, though
still handled, will cause an extra error reported to the user, which
is not desirable.

Best regards, Ilya Maximets.

> +       }
> +
>          err = nla_parse_nested(tb, TCA_CT_MAX, nla, ct_policy, extack);
>          if (err < 0)
>                  return err;
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index ebca4b926dcf..8c72faf3314d 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -2228,6 +2228,11 @@ static bool is_qdisc_ingress(__u32 classid)
>          return (TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS));
>   }
> 
> +static bool is_ingress_or_clsact(struct tcf_block *block, struct Qdisc *q)
> +{
> +       return tcf_block_shared(block) || (q && !!(q->flags & 
> TCQ_F_INGRESS));
> +}
> +
>   static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>                            struct netlink_ext_ack *extack)
>   {
> @@ -2420,6 +2425,8 @@ static int tc_new_tfilter(struct sk_buff *skb, 
> struct nlmsghdr *n,
>                  flags |= TCA_ACT_FLAGS_NO_RTNL;
>          if (is_qdisc_ingress(parent))
>                  flags |= TCA_ACT_FLAGS_AT_INGRESS;
> +       if (is_ingress_or_clsact(block, q))
> +               flags |= TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT;
>          err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>                                flags, extack);
>          if (err == 0) {
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-20 13:41                       ` Ilya Maximets
@ 2026-02-20 16:12                         ` Victor Nogueira
  0 siblings, 0 replies; 17+ messages in thread
From: Victor Nogueira @ 2026-02-20 16:12 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Jamal Hadi Salim, Paolo Abeni, netdev, Jiri Pirko,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman,
	Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser,
	GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal

On Fri, Feb 20, 2026 at 10:41 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 2/20/26 1:24 PM, Victor Nogueira wrote:
> > On 19/02/2026 17:13, Jamal Hadi Salim wrote:
> >> On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>>
> >>> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>>
> >>>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote:
> >>>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>  From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when
> >>>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS.  This is clear
> >>>>>>> for the application that makes a request and for the user if they make
> >>>>>>> the request manually with 'tc filter ...'.
> >>>>>
> >>>>>> The challenge is actions could be created as a standalone i.e "tc
> >>>>>> actions add action ct..." then later bound via tc filter. Don't forget
> >>>>>> actions can also be shared by multiple filters (which could be a mix
> >>>>>> of egress/ingress)...
> >>>>>
> >>>>> Actually, looking closely at the code - this is doable. Let's see if a
> >>>>> patch can be cooked.
> >>>>
> >>>> I discussed a bit the topic with Davide, it looks like the thing you
> >>>> mentioned above could be handled. The problematic part is AFAICS the
> >>>> additional indirection level added by (possibly shared) blocks. AFAICS
> >>>> whatever check we do at ct_init() time, shared block could later
> >>>> circumvent it - unless act_ct is always forbidden for shared blocks.
> >>>>
> >>>> TL;DR: I think admission check at init time can be implemented only in a
> >>>> quite (too much?) restrictive way.
> >>>>
> >>>
> >>> True - the blocks will require more surgery but seems doable. We may
> >>> end up again in ->quirks() land here. I.e when an action is bound to a
> >>> filter (not necessarily when created) it should check its fit in the
> >>> graph and see if it is legit. But since this is only the second quirk
> >>> we know of, we can simplify. Note: We already have one such check,
> >>> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS).
> >>> I will talk to Victor and see if he can cook at minimal a preliminary
> >>> patch without considering blocks, and at best one that does consider
> >>> blocks.
> >>
> >> After discussions with Victor we concluded that blocks have no effect
> >> on this specific issue.
> >> Blocks are only useful in clsact(both ingress/egress) and ingress.
> >> Both of these handle ACT_CONSUMED fine.
> >> Code sample looks as follows:
> >>
> >>          case TC_ACT_STOLEN:
> >>          case TC_ACT_QUEUED:
> >>          case TC_ACT_TRAP:
> >>                  consume_skb(skb);
> >>                  fallthrough;
> >>          case TC_ACT_CONSUMED:
> >>                  *ret = NET_RX_SUCCESS;
> >>                  bpf_net_ctx_clear(bpf_net_ctx);
> >>                  return NULL;
> >>
> >> The reason this works is because they don't have to deal with
> >> queues(unlike the other qdiscs).
> >> So, ignoring the block should work just fine. Ilya, for your case, if
> >> you must egress, use clsact egress - the only problem is it will not
> >> support the more feature-rich qdiscs...
> >
> > If there is no objection, we are thinking of sending something along
> > these lines:
> >
> > diff --git a/include/net/act_api.h b/include/net/act_api.h
> > index 91a24b5e0b93..2ba40eb45aad 100644
> > --- a/include/net/act_api.h
> > +++ b/include/net/act_api.h
> > @@ -70,6 +70,7 @@ struct tc_action {
> >   #define TCA_ACT_FLAGS_REPLACE  (1U << (TCA_ACT_FLAGS_USER_BITS + 2))
> >   #define TCA_ACT_FLAGS_NO_RTNL  (1U << (TCA_ACT_FLAGS_USER_BITS + 3))
> >   #define TCA_ACT_FLAGS_AT_INGRESS       (1U << (TCA_ACT_FLAGS_USER_BITS
> > + 4))
> > +#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT     (1U <<
> > (TCA_ACT_FLAGS_USER_BITS + 5))
> >
> >   /* Update lastuse only if needed, to avoid dirtying a cache line.
> >    * We use a temp variable to avoid fetching jiffies twice.
> > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> > index 81d488655793..a79479faf0f8 100644
> > --- a/net/sched/act_ct.c
> > +++ b/net/sched/act_ct.c
> > @@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct
> > nlattr *nla,
> >                  return -EINVAL;
> >          }
> >
> > +       if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) {
> > +               NL_SET_ERR_MSG_MOD(extack,
> > +                               "May only attach ct to ingress or clsact");
> > +               return -EPERM;
>
> I haven't tried this change yet, but I'd say this error code should
> be EOPNOTSUPP instead of EPERM, as it's not a permissions problem,
> it's that we decided to not support this use case.
>
> From the OVS side EOPNOTSUPP will be handled by ovs-vswitchd normally
> with a fallback to a standard openvswitch datapath.  While EPERM, though
> still handled, will cause an extra error reported to the user, which
> is not desirable.

That's a fair point, we can make that change.

cheers,
Victor

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
  2026-02-17  9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni
  2026-02-17 10:42 ` Paolo Abeni
  2026-02-17 14:49 ` Ilya Maximets
@ 2026-02-17 15:28 ` Jakub Kicinski
  2 siblings, 0 replies; 17+ messages in thread
From: Jakub Kicinski @ 2026-02-17 15:28 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Jamal Hadi Salim, Jiri Pirko, David S. Miller,
	Eric Dumazet, Simon Horman, Henrik Steen, Olivier Tilmans,
	Bob Briscoe, Olga Albisser, GangMin Kim

On Tue, 17 Feb 2026 10:38:52 +0100 Paolo Abeni wrote:
> Since the blamed commit below, classify can return TC_ACT_CONSUMED while
> the current skb being held by the defragmentation engine. As reported by
> GangMin Kim, if such packet is that may cause a UaF when the defrag engine
> later on tries to tuch again such packet.
> 
> act_ct was never meant to be used outside of the ingress path. Making
> defrag really works for act_ct outside such constraints range from very
> difficult to completely impossible.
> 
> Address the issue making act_ct drop any packet when not attached to the
> ingress path and additionally emit a warning about the bad
> configuration.

FWIW there is a test case in net/forwarding/tc_actions.sh that fails
now (needs to be adjusted/deleted/reconsidered).
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-02-20 16:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-17  9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni
2026-02-17 10:42 ` Paolo Abeni
2026-02-17 14:49 ` Ilya Maximets
2026-02-17 15:52   ` Paolo Abeni
2026-02-17 19:37     ` Ilya Maximets
2026-02-18 14:28       ` Jamal Hadi Salim
2026-02-18 16:15         ` Ilya Maximets
2026-02-18 18:31           ` Jamal Hadi Salim
2026-02-18 18:44             ` Jamal Hadi Salim
2026-02-18 20:43               ` Paolo Abeni
2026-02-19 11:46                 ` Ilya Maximets
2026-02-19 14:16                 ` Jamal Hadi Salim
2026-02-19 20:13                   ` Jamal Hadi Salim
2026-02-20 12:24                     ` Victor Nogueira
2026-02-20 13:41                       ` Ilya Maximets
2026-02-20 16:12                         ` Victor Nogueira
2026-02-17 15:28 ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox