* [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress
@ 2026-02-17 9:38 Paolo Abeni
2026-02-17 10:42 ` Paolo Abeni
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Paolo Abeni @ 2026-02-17 9:38 UTC (permalink / raw)
To: netdev
Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans,
Bob Briscoe, Olga Albisser, GangMin Kim
Since the blamed commit below, classify can return TC_ACT_CONSUMED while
the current skb being held by the defragmentation engine. As reported by
GangMin Kim, if such packet is that may cause a UaF when the defrag engine
later on tries to tuch again such packet.
act_ct was never meant to be used outside of the ingress path. Making
defrag really works for act_ct outside such constraints range from very
difficult to completely impossible.
Address the issue making act_ct drop any packet when not attached to the
ingress path and additionally emit a warning about the bad
configuration.
Reported-by: GangMin Kim <km.kim1503@gmail.com>
Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
CC: stable@vger.kernel.org
Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Catching the bad configuration at runtime instead of init time to reduce
complexity
---
net/sched/act_ct.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 81d488655793..e8eb0d195f4a 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
tcf_lastuse_update(&c->tcf_tm);
tcf_action_update_bstats(&c->common, skb);
+ if (!skb_at_tc_ingress(skb)) {
+ pr_warn_once("act_CT should be attached at ingress!\n");
+ goto drop;
+ }
+
if (clear) {
qdisc_skb_cb(skb)->post_ct = false;
ct = nf_ct_get(skb, &ctinfo);
@@ -1109,6 +1114,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
out_frag:
if (err != -EINPROGRESS)
tcf_action_inc_drop_qstats(&c->common);
+
return TC_ACT_CONSUMED;
drop:
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni @ 2026-02-17 10:42 ` Paolo Abeni 2026-02-17 14:49 ` Ilya Maximets 2026-02-17 15:28 ` Jakub Kicinski 2 siblings, 0 replies; 17+ messages in thread From: Paolo Abeni @ 2026-02-17 10:42 UTC (permalink / raw) To: netdev Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim On 2/17/26 10:38 AM, Paolo Abeni wrote: > Since the blamed commit below, classify can return TC_ACT_CONSUMED while > the current skb being held by the defragmentation engine. As reported by > GangMin Kim, if such packet is that may cause a UaF when the defrag engine > later on tries to tuch again such packet. > > act_ct was never meant to be used outside of the ingress path. Making > defrag really works for act_ct outside such constraints range from very > difficult to completely impossible. > > Address the issue making act_ct drop any packet when not attached to the > ingress path and additionally emit a warning about the bad > configuration. > > Reported-by: GangMin Kim <km.kim1503@gmail.com> > Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") Whoops, no idea how that wrong tag landed there > CC: stable@vger.kernel.org > Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > Catching the bad configuration at runtime instead of init time to reduce > complexity > --- > net/sched/act_ct.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > index 81d488655793..e8eb0d195f4a 100644 > --- a/net/sched/act_ct.c > +++ b/net/sched/act_ct.c > @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > tcf_lastuse_update(&c->tcf_tm); > tcf_action_update_bstats(&c->common, skb); > > + if (!skb_at_tc_ingress(skb)) { > + pr_warn_once("act_CT should be attached at ingress!\n"); > + goto drop; > + } > + > if (clear) { > qdisc_skb_cb(skb)->post_ct = false; > ct = nf_ct_get(skb, &ctinfo); > @@ -1109,6 +1114,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > out_frag: > if (err != -EINPROGRESS) > tcf_action_inc_drop_qstats(&c->common); > + This also must be dropped. I'll send a v2 at due time. /P ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni 2026-02-17 10:42 ` Paolo Abeni @ 2026-02-17 14:49 ` Ilya Maximets 2026-02-17 15:52 ` Paolo Abeni 2026-02-17 15:28 ` Jakub Kicinski 2 siblings, 1 reply; 17+ messages in thread From: Ilya Maximets @ 2026-02-17 14:49 UTC (permalink / raw) To: Paolo Abeni, netdev Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, i.maximets, Eelco Chaudron, Aaron Conole On 2/17/26 10:38 AM, Paolo Abeni wrote: > Since the blamed commit below, classify can return TC_ACT_CONSUMED while > the current skb being held by the defragmentation engine. As reported by > GangMin Kim, if such packet is that may cause a UaF when the defrag engine > later on tries to tuch again such packet. > > act_ct was never meant to be used outside of the ingress path. Making > defrag really works for act_ct outside such constraints range from very > difficult to completely impossible. > > Address the issue making act_ct drop any packet when not attached to the > ingress path and additionally emit a warning about the bad > configuration. > > Reported-by: GangMin Kim <km.kim1503@gmail.com> > Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") > CC: stable@vger.kernel.org > Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > Catching the bad configuration at runtime instead of init time to reduce > complexity > --- > net/sched/act_ct.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > index 81d488655793..e8eb0d195f4a 100644 > --- a/net/sched/act_ct.c > +++ b/net/sched/act_ct.c > @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > tcf_lastuse_update(&c->tcf_tm); > tcf_action_update_bstats(&c->common, skb); > > + if (!skb_at_tc_ingress(skb)) { > + pr_warn_once("act_CT should be attached at ingress!\n"); Hi, Paolo. I didn't test this yet, but I'm a little concerned about this change. For TC offload of OVS tunnels, we create egress qdisc for internal ports, a.k.a. bridge ports. This is necessary for tunnels to work. A typical setup looks like this: Bridge br-int (10.0.0.1) Port br-int Interface br-int type: internal Port vm Interface vm Port vxlan0 Interface vxlan0 type: vxlan options: {remote_ip="172.31.1.1"} Bridge br-phy (172.31.1.100) Port br-phy Interface br-phy type: internal Port eth0 Interface eth0 When a VM sends a packet that supposed to go through the tunnel, it enters from the vm port and OVS forwards it into vxlan0 LWT with an appropriate tunnel metadata. Then it gets encapsulated and put into kernel routing to find the next hop, which is via br-phy bridge. Packet enters br-phy and OVS forwards the packet to eth0. There can be stateful processing in the br-phy bridge involving passing packets through conntrack. With the TC offload enabled, OVS creates following filters: 1. Ingress filter on vm interface that forwards packets to vxlan0. 2. Egress filter on br-phy interface that forwards encapsulated packets to eth0 interface. If some stateful processing is involved, both of those could have act_ct in them. AFAIU, we have to use Egress qdisc for the br-phy, because the packet is egressing from the kernel routing via br-phy and the ingress qdisc is not invoked. Ingress will be at play when packets are flowing in the opposite direction from eth0 to br-phy, as that's where they will ingress into standard kernel routing. If act_tc will not be allowed on Egress, then stateful processing will not be possible in this case in br-phy bridge. Thoughts? Best regards, Ilya Maximets. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 14:49 ` Ilya Maximets @ 2026-02-17 15:52 ` Paolo Abeni 2026-02-17 19:37 ` Ilya Maximets 0 siblings, 1 reply; 17+ messages in thread From: Paolo Abeni @ 2026-02-17 15:52 UTC (permalink / raw) To: Ilya Maximets, netdev Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal Adding Florian, too On 2/17/26 3:49 PM, Ilya Maximets wrote: > On 2/17/26 10:38 AM, Paolo Abeni wrote: >> Since the blamed commit below, classify can return TC_ACT_CONSUMED while >> the current skb being held by the defragmentation engine. As reported by >> GangMin Kim, if such packet is that may cause a UaF when the defrag engine >> later on tries to tuch again such packet. >> >> act_ct was never meant to be used outside of the ingress path. Making >> defrag really works for act_ct outside such constraints range from very >> difficult to completely impossible. >> >> Address the issue making act_ct drop any packet when not attached to the >> ingress path and additionally emit a warning about the bad >> configuration. >> >> Reported-by: GangMin Kim <km.kim1503@gmail.com> >> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") >> CC: stable@vger.kernel.org >> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com >> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >> --- >> Catching the bad configuration at runtime instead of init time to reduce >> complexity >> --- >> net/sched/act_ct.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c >> index 81d488655793..e8eb0d195f4a 100644 >> --- a/net/sched/act_ct.c >> +++ b/net/sched/act_ct.c >> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, >> tcf_lastuse_update(&c->tcf_tm); >> tcf_action_update_bstats(&c->common, skb); >> >> + if (!skb_at_tc_ingress(skb)) { >> + pr_warn_once("act_CT should be attached at ingress!\n"); > > Hi, Paolo. I didn't test this yet, but I'm a little concerned about this > change. For TC offload of OVS tunnels, we create egress qdisc for internal > ports, a.k.a. bridge ports. This is necessary for tunnels to work. > A typical setup looks like this: > > Bridge br-int (10.0.0.1) > Port br-int > Interface br-int > type: internal > Port vm > Interface vm > Port vxlan0 > Interface vxlan0 > type: vxlan > options: {remote_ip="172.31.1.1"} > > Bridge br-phy (172.31.1.100) > Port br-phy > Interface br-phy > type: internal > Port eth0 > Interface eth0 > > When a VM sends a packet that supposed to go through the tunnel, it enters > from the vm port and OVS forwards it into vxlan0 LWT with an appropriate > tunnel metadata. Then it gets encapsulated and put into kernel routing to > find the next hop, which is via br-phy bridge. Packet enters br-phy and > OVS forwards the packet to eth0. There can be stateful processing in the > br-phy bridge involving passing packets through conntrack. > > With the TC offload enabled, OVS creates following filters: > > 1. Ingress filter on vm interface that forwards packets to vxlan0. > 2. Egress filter on br-phy interface that forwards encapsulated packets > to eth0 interface. > > If some stateful processing is involved, both of those could have act_ct > in them. > > AFAIU, we have to use Egress qdisc for the br-phy, because the packet is > egressing from the kernel routing via br-phy and the ingress qdisc is not > invoked. Ingress will be at play when packets are flowing in the opposite > direction from eth0 to br-phy, as that's where they will ingress into > standard kernel routing. > > If act_tc will not be allowed on Egress, then stateful processing will > not be possible in this case in br-phy bridge. > > Thoughts? This looks very problematic. A slightly different patch tried to somewhat preserve functionality, but it simply can't work in presence of IP fragments. Why stateful processing on br-int/port vm ingress is not enough? /P ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 15:52 ` Paolo Abeni @ 2026-02-17 19:37 ` Ilya Maximets 2026-02-18 14:28 ` Jamal Hadi Salim 0 siblings, 1 reply; 17+ messages in thread From: Ilya Maximets @ 2026-02-17 19:37 UTC (permalink / raw) To: Paolo Abeni, netdev Cc: i.maximets, Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 2/17/26 4:52 PM, Paolo Abeni wrote: > Adding Florian, too > > On 2/17/26 3:49 PM, Ilya Maximets wrote: >> On 2/17/26 10:38 AM, Paolo Abeni wrote: >>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while >>> the current skb being held by the defragmentation engine. As reported by >>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine >>> later on tries to tuch again such packet. >>> >>> act_ct was never meant to be used outside of the ingress path. Making >>> defrag really works for act_ct outside such constraints range from very >>> difficult to completely impossible. >>> >>> Address the issue making act_ct drop any packet when not attached to the >>> ingress path and additionally emit a warning about the bad >>> configuration. >>> >>> Reported-by: GangMin Kim <km.kim1503@gmail.com> >>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") >>> CC: stable@vger.kernel.org >>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com >>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >>> --- >>> Catching the bad configuration at runtime instead of init time to reduce >>> complexity >>> --- >>> net/sched/act_ct.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c >>> index 81d488655793..e8eb0d195f4a 100644 >>> --- a/net/sched/act_ct.c >>> +++ b/net/sched/act_ct.c >>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, >>> tcf_lastuse_update(&c->tcf_tm); >>> tcf_action_update_bstats(&c->common, skb); >>> >>> + if (!skb_at_tc_ingress(skb)) { >>> + pr_warn_once("act_CT should be attached at ingress!\n"); >> >> Hi, Paolo. I didn't test this yet, but I'm a little concerned about this >> change. For TC offload of OVS tunnels, we create egress qdisc for internal >> ports, a.k.a. bridge ports. This is necessary for tunnels to work. >> A typical setup looks like this: >> >> Bridge br-int (10.0.0.1) >> Port br-int >> Interface br-int >> type: internal >> Port vm >> Interface vm >> Port vxlan0 >> Interface vxlan0 >> type: vxlan >> options: {remote_ip="172.31.1.1"} >> >> Bridge br-phy (172.31.1.100) >> Port br-phy >> Interface br-phy >> type: internal >> Port eth0 >> Interface eth0 >> >> When a VM sends a packet that supposed to go through the tunnel, it enters >> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate >> tunnel metadata. Then it gets encapsulated and put into kernel routing to >> find the next hop, which is via br-phy bridge. Packet enters br-phy and >> OVS forwards the packet to eth0. There can be stateful processing in the >> br-phy bridge involving passing packets through conntrack. >> >> With the TC offload enabled, OVS creates following filters: >> >> 1. Ingress filter on vm interface that forwards packets to vxlan0. >> 2. Egress filter on br-phy interface that forwards encapsulated packets >> to eth0 interface. >> >> If some stateful processing is involved, both of those could have act_ct >> in them. >> >> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is >> egressing from the kernel routing via br-phy and the ingress qdisc is not >> invoked. Ingress will be at play when packets are flowing in the opposite >> direction from eth0 to br-phy, as that's where they will ingress into >> standard kernel routing. >> >> If act_tc will not be allowed on Egress, then stateful processing will >> not be possible in this case in br-phy bridge. >> >> Thoughts? > > This looks very problematic. A slightly different patch tried to > somewhat preserve functionality, but it simply can't work in presence of > IP fragments. > > Why stateful processing on br-int/port vm ingress is not enough? In general we don't know where the packet will go after encapsulation, so we do not know beforehand if it will end up in another OVS bridge or not. Hence the processing has to be split in two parts, the ingress on vm port and egress on the other bridge port. At the same time, thinking more about potential use cases, I don't see a good one for running encapsulated traffic through conntrack. Most of the time it's a UDP unidirectional traffic with hashes for source ports. Not sure what functionality would require to do anything stateful on such packets. So, maybe it's fine to prohibit act_ct on the egress after all. For the implementation however, let's look at a different very simple use case: Application on the host -> br-phy -> NAT -> eth0. This is a very common thing to NAT traffic before it leaves the host. It can be for load balancing reasons, e.g. if a destination is a service IP with multiple backends and we want multiple applications to hit different backends. There is no actual reason to try and offload such traffic to TC, as it would not be possible to offload it to hardware (since the traffic source is an application on the host). However, since OVS doesn't really know that it's not possible to offload to hardware and TC by itself allows creation of such filter, it will create the filter on the egress qdisc on the bridge port and that egress filter will be performing act_ct with the NAT. With this patch applied, all this traffic will be just dropped. If we're going to prohibit act_ct on egress, that should be done at the moment of creation of this action on the egress qdisc, so the user knows that it's not supposed to be there. E.g. if OVS will try to create such action and fail, it will keep forwarding the traffic successfully through the normal openvswitch datapath. But if the TC filter creation succeeds, all traffic will be just dropped by it. That doesn't sound right. Best regards, Ilya Maximets. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 19:37 ` Ilya Maximets @ 2026-02-18 14:28 ` Jamal Hadi Salim 2026-02-18 16:15 ` Ilya Maximets 0 siblings, 1 reply; 17+ messages in thread From: Jamal Hadi Salim @ 2026-02-18 14:28 UTC (permalink / raw) To: Ilya Maximets Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 2/17/26 4:52 PM, Paolo Abeni wrote: > > Adding Florian, too > > > > On 2/17/26 3:49 PM, Ilya Maximets wrote: > >> On 2/17/26 10:38 AM, Paolo Abeni wrote: > >>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while > >>> the current skb being held by the defragmentation engine. As reported by > >>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine > >>> later on tries to tuch again such packet. > >>> > >>> act_ct was never meant to be used outside of the ingress path. Making > >>> defrag really works for act_ct outside such constraints range from very > >>> difficult to completely impossible. > >>> > >>> Address the issue making act_ct drop any packet when not attached to the > >>> ingress path and additionally emit a warning about the bad > >>> configuration. > >>> > >>> Reported-by: GangMin Kim <km.kim1503@gmail.com> > >>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") > >>> CC: stable@vger.kernel.org > >>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com > >>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > >>> --- > >>> Catching the bad configuration at runtime instead of init time to reduce > >>> complexity > >>> --- > >>> net/sched/act_ct.c | 6 ++++++ > >>> 1 file changed, 6 insertions(+) > >>> > >>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > >>> index 81d488655793..e8eb0d195f4a 100644 > >>> --- a/net/sched/act_ct.c > >>> +++ b/net/sched/act_ct.c > >>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > >>> tcf_lastuse_update(&c->tcf_tm); > >>> tcf_action_update_bstats(&c->common, skb); > >>> > >>> + if (!skb_at_tc_ingress(skb)) { > >>> + pr_warn_once("act_CT should be attached at ingress!\n"); > >> > >> Hi, Paolo. I didn't test this yet, but I'm a little concerned about this > >> change. For TC offload of OVS tunnels, we create egress qdisc for internal > >> ports, a.k.a. bridge ports. This is necessary for tunnels to work. > >> A typical setup looks like this: > >> > >> Bridge br-int (10.0.0.1) > >> Port br-int > >> Interface br-int > >> type: internal > >> Port vm > >> Interface vm > >> Port vxlan0 > >> Interface vxlan0 > >> type: vxlan > >> options: {remote_ip="172.31.1.1"} > >> > >> Bridge br-phy (172.31.1.100) > >> Port br-phy > >> Interface br-phy > >> type: internal > >> Port eth0 > >> Interface eth0 > >> > >> When a VM sends a packet that supposed to go through the tunnel, it enters > >> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate > >> tunnel metadata. Then it gets encapsulated and put into kernel routing to > >> find the next hop, which is via br-phy bridge. Packet enters br-phy and > >> OVS forwards the packet to eth0. There can be stateful processing in the > >> br-phy bridge involving passing packets through conntrack. > >> > >> With the TC offload enabled, OVS creates following filters: > >> > >> 1. Ingress filter on vm interface that forwards packets to vxlan0. > >> 2. Egress filter on br-phy interface that forwards encapsulated packets > >> to eth0 interface. > >> > >> If some stateful processing is involved, both of those could have act_ct > >> in them. > >> > >> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is > >> egressing from the kernel routing via br-phy and the ingress qdisc is not > >> invoked. Ingress will be at play when packets are flowing in the opposite > >> direction from eth0 to br-phy, as that's where they will ingress into > >> standard kernel routing. > >> > >> If act_tc will not be allowed on Egress, then stateful processing will > >> not be possible in this case in br-phy bridge. > >> > >> Thoughts? > > > > This looks very problematic. A slightly different patch tried to > > somewhat preserve functionality, but it simply can't work in presence of > > IP fragments. > > > > Why stateful processing on br-int/port vm ingress is not enough? > > In general we don't know where the packet will go after encapsulation, > so we do not know beforehand if it will end up in another OVS bridge > or not. Hence the processing has to be split in two parts, the ingress > on vm port and egress on the other bridge port. > > At the same time, thinking more about potential use cases, I don't see > a good one for running encapsulated traffic through conntrack. Most > of the time it's a UDP unidirectional traffic with hashes for source > ports. Not sure what functionality would require to do anything > stateful on such packets. So, maybe it's fine to prohibit act_ct on > the egress after all. > Another alternative is to drop packets that have fragments? i.e Paolo's original patch but instead of accepting to drop Or perhaps if there are no consequences there is nothing wrong with Paolo's original patch.. From a validation/testability/debugability POV, assuming its bad to ignore frags processing: The issue is for the user to distinguish between a bad setup (like getting fragments in ct) and a good one where there were no frags. If I were to write a tdc test to send two packets - one with fragments and another without - a good way to check is to look at the tc action/qdisc counters.. Paolo's original patch was accepting in both cases the counter so the accept counter would read as 2 but if it was dropped it would say 1 dropped and 1 accepted, so its easy to test (or the user to debug) cheers, jamal > For the implementation however, let's look at a different very simple > use case: > > Application on the host -> br-phy -> NAT -> eth0. > > This is a very common thing to NAT traffic before it leaves the host. > It can be for load balancing reasons, e.g. if a destination is a service > IP with multiple backends and we want multiple applications to hit > different backends. > > There is no actual reason to try and offload such traffic to TC, as it > would not be possible to offload it to hardware (since the traffic > source is an application on the host). > > However, since OVS doesn't really know that it's not possible to offload > to hardware and TC by itself allows creation of such filter, it will > create the filter on the egress qdisc on the bridge port and that egress > filter will be performing act_ct with the NAT. > > With this patch applied, all this traffic will be just dropped. If we're > going to prohibit act_ct on egress, that should be done at the moment of > creation of this action on the egress qdisc, so the user knows that it's > not supposed to be there. E.g. if OVS will try to create such action and > fail, it will keep forwarding the traffic successfully through the normal > openvswitch datapath. But if the TC filter creation succeeds, all traffic > will be just dropped by it. That doesn't sound right. > > Best regards, Ilya Maximets. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 14:28 ` Jamal Hadi Salim @ 2026-02-18 16:15 ` Ilya Maximets 2026-02-18 18:31 ` Jamal Hadi Salim 0 siblings, 1 reply; 17+ messages in thread From: Ilya Maximets @ 2026-02-18 16:15 UTC (permalink / raw) To: Jamal Hadi Salim Cc: i.maximets, Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 2/18/26 3:28 PM, Jamal Hadi Salim wrote: > On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 2/17/26 4:52 PM, Paolo Abeni wrote: >>> Adding Florian, too >>> >>> On 2/17/26 3:49 PM, Ilya Maximets wrote: >>>> On 2/17/26 10:38 AM, Paolo Abeni wrote: >>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while >>>>> the current skb being held by the defragmentation engine. As reported by >>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine >>>>> later on tries to tuch again such packet. >>>>> >>>>> act_ct was never meant to be used outside of the ingress path. Making >>>>> defrag really works for act_ct outside such constraints range from very >>>>> difficult to completely impossible. >>>>> >>>>> Address the issue making act_ct drop any packet when not attached to the >>>>> ingress path and additionally emit a warning about the bad >>>>> configuration. >>>>> >>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com> >>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") >>>>> CC: stable@vger.kernel.org >>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com >>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >>>>> --- >>>>> Catching the bad configuration at runtime instead of init time to reduce >>>>> complexity >>>>> --- >>>>> net/sched/act_ct.c | 6 ++++++ >>>>> 1 file changed, 6 insertions(+) >>>>> >>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c >>>>> index 81d488655793..e8eb0d195f4a 100644 >>>>> --- a/net/sched/act_ct.c >>>>> +++ b/net/sched/act_ct.c >>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, >>>>> tcf_lastuse_update(&c->tcf_tm); >>>>> tcf_action_update_bstats(&c->common, skb); >>>>> >>>>> + if (!skb_at_tc_ingress(skb)) { >>>>> + pr_warn_once("act_CT should be attached at ingress!\n"); >>>> >>>> Hi, Paolo. I didn't test this yet, but I'm a little concerned about this >>>> change. For TC offload of OVS tunnels, we create egress qdisc for internal >>>> ports, a.k.a. bridge ports. This is necessary for tunnels to work. >>>> A typical setup looks like this: >>>> >>>> Bridge br-int (10.0.0.1) >>>> Port br-int >>>> Interface br-int >>>> type: internal >>>> Port vm >>>> Interface vm >>>> Port vxlan0 >>>> Interface vxlan0 >>>> type: vxlan >>>> options: {remote_ip="172.31.1.1"} >>>> >>>> Bridge br-phy (172.31.1.100) >>>> Port br-phy >>>> Interface br-phy >>>> type: internal >>>> Port eth0 >>>> Interface eth0 >>>> >>>> When a VM sends a packet that supposed to go through the tunnel, it enters >>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate >>>> tunnel metadata. Then it gets encapsulated and put into kernel routing to >>>> find the next hop, which is via br-phy bridge. Packet enters br-phy and >>>> OVS forwards the packet to eth0. There can be stateful processing in the >>>> br-phy bridge involving passing packets through conntrack. >>>> >>>> With the TC offload enabled, OVS creates following filters: >>>> >>>> 1. Ingress filter on vm interface that forwards packets to vxlan0. >>>> 2. Egress filter on br-phy interface that forwards encapsulated packets >>>> to eth0 interface. >>>> >>>> If some stateful processing is involved, both of those could have act_ct >>>> in them. >>>> >>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is >>>> egressing from the kernel routing via br-phy and the ingress qdisc is not >>>> invoked. Ingress will be at play when packets are flowing in the opposite >>>> direction from eth0 to br-phy, as that's where they will ingress into >>>> standard kernel routing. >>>> >>>> If act_tc will not be allowed on Egress, then stateful processing will >>>> not be possible in this case in br-phy bridge. >>>> >>>> Thoughts? >>> >>> This looks very problematic. A slightly different patch tried to >>> somewhat preserve functionality, but it simply can't work in presence of >>> IP fragments. >>> >>> Why stateful processing on br-int/port vm ingress is not enough? >> >> In general we don't know where the packet will go after encapsulation, >> so we do not know beforehand if it will end up in another OVS bridge >> or not. Hence the processing has to be split in two parts, the ingress >> on vm port and egress on the other bridge port. >> >> At the same time, thinking more about potential use cases, I don't see >> a good one for running encapsulated traffic through conntrack. Most >> of the time it's a UDP unidirectional traffic with hashes for source >> ports. Not sure what functionality would require to do anything >> stateful on such packets. So, maybe it's fine to prohibit act_ct on >> the egress after all. >> > > Another alternative is to drop packets that have fragments? i.e > Paolo's original patch but instead of accepting to drop > Or perhaps if there are no consequences there is nothing wrong with > Paolo's original patch.. > > From a validation/testability/debugability POV, assuming its bad to > ignore frags processing: > The issue is for the user to distinguish between a bad setup (like > getting fragments in ct) and a good one where there were no frags. If > I were to write a tdc test to send two packets - one with fragments > and another without - a good way to check is to look at the tc > action/qdisc counters.. Paolo's original patch was accepting in both > cases the counter so the accept counter would read as 2 but if it was > dropped it would say 1 dropped and 1 accepted, so its easy to test (or > the user to debug) > From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear for the application that makes a request and for the user if they make the request manually with 'tc filter ...'. If the filter with act_ct is allowed to be created, but only works for non-fragmented traffic, that would be pretty hard to debug for someone who is not an expert in the area, but also would be a pretty inconsistent user API. Best regards, Ilya Maximets. > cheers, > jamal > >> For the implementation however, let's look at a different very simple >> use case: >> >> Application on the host -> br-phy -> NAT -> eth0. >> >> This is a very common thing to NAT traffic before it leaves the host. >> It can be for load balancing reasons, e.g. if a destination is a service >> IP with multiple backends and we want multiple applications to hit >> different backends. >> >> There is no actual reason to try and offload such traffic to TC, as it >> would not be possible to offload it to hardware (since the traffic >> source is an application on the host). >> >> However, since OVS doesn't really know that it's not possible to offload >> to hardware and TC by itself allows creation of such filter, it will >> create the filter on the egress qdisc on the bridge port and that egress >> filter will be performing act_ct with the NAT. >> >> With this patch applied, all this traffic will be just dropped. If we're >> going to prohibit act_ct on egress, that should be done at the moment of >> creation of this action on the egress qdisc, so the user knows that it's >> not supposed to be there. E.g. if OVS will try to create such action and >> fail, it will keep forwarding the traffic successfully through the normal >> openvswitch datapath. But if the TC filter creation succeeds, all traffic >> will be just dropped by it. That doesn't sound right. >> >> Best regards, Ilya Maximets. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 16:15 ` Ilya Maximets @ 2026-02-18 18:31 ` Jamal Hadi Salim 2026-02-18 18:44 ` Jamal Hadi Salim 0 siblings, 1 reply; 17+ messages in thread From: Jamal Hadi Salim @ 2026-02-18 18:31 UTC (permalink / raw) To: Ilya Maximets Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 2/18/26 3:28 PM, Jamal Hadi Salim wrote: > > On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >> > >> On 2/17/26 4:52 PM, Paolo Abeni wrote: > >>> Adding Florian, too > >>> > >>> On 2/17/26 3:49 PM, Ilya Maximets wrote: > >>>> On 2/17/26 10:38 AM, Paolo Abeni wrote: > >>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while > >>>>> the current skb being held by the defragmentation engine. As reported by > >>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine > >>>>> later on tries to tuch again such packet. > >>>>> > >>>>> act_ct was never meant to be used outside of the ingress path. Making > >>>>> defrag really works for act_ct outside such constraints range from very > >>>>> difficult to completely impossible. > >>>>> > >>>>> Address the issue making act_ct drop any packet when not attached to the > >>>>> ingress path and additionally emit a warning about the bad > >>>>> configuration. > >>>>> > >>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com> > >>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") > >>>>> CC: stable@vger.kernel.org > >>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com > >>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > >>>>> --- > >>>>> Catching the bad configuration at runtime instead of init time to reduce > >>>>> complexity > >>>>> --- > >>>>> net/sched/act_ct.c | 6 ++++++ > >>>>> 1 file changed, 6 insertions(+) > >>>>> > >>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > >>>>> index 81d488655793..e8eb0d195f4a 100644 > >>>>> --- a/net/sched/act_ct.c > >>>>> +++ b/net/sched/act_ct.c > >>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > >>>>> tcf_lastuse_update(&c->tcf_tm); > >>>>> tcf_action_update_bstats(&c->common, skb); > >>>>> > >>>>> + if (!skb_at_tc_ingress(skb)) { > >>>>> + pr_warn_once("act_CT should be attached at ingress!\n"); > >>>> > >>>> Hi, Paolo. I didn't test this yet, but I'm a little concerned about this > >>>> change. For TC offload of OVS tunnels, we create egress qdisc for internal > >>>> ports, a.k.a. bridge ports. This is necessary for tunnels to work. > >>>> A typical setup looks like this: > >>>> > >>>> Bridge br-int (10.0.0.1) > >>>> Port br-int > >>>> Interface br-int > >>>> type: internal > >>>> Port vm > >>>> Interface vm > >>>> Port vxlan0 > >>>> Interface vxlan0 > >>>> type: vxlan > >>>> options: {remote_ip="172.31.1.1"} > >>>> > >>>> Bridge br-phy (172.31.1.100) > >>>> Port br-phy > >>>> Interface br-phy > >>>> type: internal > >>>> Port eth0 > >>>> Interface eth0 > >>>> > >>>> When a VM sends a packet that supposed to go through the tunnel, it enters > >>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate > >>>> tunnel metadata. Then it gets encapsulated and put into kernel routing to > >>>> find the next hop, which is via br-phy bridge. Packet enters br-phy and > >>>> OVS forwards the packet to eth0. There can be stateful processing in the > >>>> br-phy bridge involving passing packets through conntrack. > >>>> > >>>> With the TC offload enabled, OVS creates following filters: > >>>> > >>>> 1. Ingress filter on vm interface that forwards packets to vxlan0. > >>>> 2. Egress filter on br-phy interface that forwards encapsulated packets > >>>> to eth0 interface. > >>>> > >>>> If some stateful processing is involved, both of those could have act_ct > >>>> in them. > >>>> > >>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is > >>>> egressing from the kernel routing via br-phy and the ingress qdisc is not > >>>> invoked. Ingress will be at play when packets are flowing in the opposite > >>>> direction from eth0 to br-phy, as that's where they will ingress into > >>>> standard kernel routing. > >>>> > >>>> If act_tc will not be allowed on Egress, then stateful processing will > >>>> not be possible in this case in br-phy bridge. > >>>> > >>>> Thoughts? > >>> > >>> This looks very problematic. A slightly different patch tried to > >>> somewhat preserve functionality, but it simply can't work in presence of > >>> IP fragments. > >>> > >>> Why stateful processing on br-int/port vm ingress is not enough? > >> > >> In general we don't know where the packet will go after encapsulation, > >> so we do not know beforehand if it will end up in another OVS bridge > >> or not. Hence the processing has to be split in two parts, the ingress > >> on vm port and egress on the other bridge port. > >> > >> At the same time, thinking more about potential use cases, I don't see > >> a good one for running encapsulated traffic through conntrack. Most > >> of the time it's a UDP unidirectional traffic with hashes for source > >> ports. Not sure what functionality would require to do anything > >> stateful on such packets. So, maybe it's fine to prohibit act_ct on > >> the egress after all. > >> > > > > Another alternative is to drop packets that have fragments? i.e > > Paolo's original patch but instead of accepting to drop > > Or perhaps if there are no consequences there is nothing wrong with > > Paolo's original patch.. > > > > From a validation/testability/debugability POV, assuming its bad to > > ignore frags processing: > > The issue is for the user to distinguish between a bad setup (like > > getting fragments in ct) and a good one where there were no frags. If > > I were to write a tdc test to send two packets - one with fragments > > and another without - a good way to check is to look at the tc > > action/qdisc counters.. Paolo's original patch was accepting in both > > cases the counter so the accept counter would read as 2 but if it was > > dropped it would say 1 dropped and 1 accepted, so its easy to test (or > > the user to debug) > > > > From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when > it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear > for the application that makes a request and for the user if they make > the request manually with 'tc filter ...'. > The challenge is actions could be created as a standalone i.e "tc actions add action ct..." then later bound via tc filter. Don't forget actions can also be shared by multiple filters (which could be a mix of egress/ingress)... > If the filter with act_ct is allowed to be created, but only works for > non-fragmented traffic, that would be pretty hard to debug for someone > who is not an expert in the area, but also would be a pretty inconsistent > user API. > True. The drop counter is better than "nothing".. cheers, jamal > Best regards, Ilya Maximets. > > > cheers, > > jamal > > > >> For the implementation however, let's look at a different very simple > >> use case: > >> > >> Application on the host -> br-phy -> NAT -> eth0. > >> > >> This is a very common thing to NAT traffic before it leaves the host. > >> It can be for load balancing reasons, e.g. if a destination is a service > >> IP with multiple backends and we want multiple applications to hit > >> different backends. > >> > >> There is no actual reason to try and offload such traffic to TC, as it > >> would not be possible to offload it to hardware (since the traffic > >> source is an application on the host). > >> > >> However, since OVS doesn't really know that it's not possible to offload > >> to hardware and TC by itself allows creation of such filter, it will > >> create the filter on the egress qdisc on the bridge port and that egress > >> filter will be performing act_ct with the NAT. > >> > >> With this patch applied, all this traffic will be just dropped. If we're > >> going to prohibit act_ct on egress, that should be done at the moment of > >> creation of this action on the egress qdisc, so the user knows that it's > >> not supposed to be there. E.g. if OVS will try to create such action and > >> fail, it will keep forwarding the traffic successfully through the normal > >> openvswitch datapath. But if the TC filter creation succeeds, all traffic > >> will be just dropped by it. That doesn't sound right. > >> > >> Best regards, Ilya Maximets. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 18:31 ` Jamal Hadi Salim @ 2026-02-18 18:44 ` Jamal Hadi Salim 2026-02-18 20:43 ` Paolo Abeni 0 siblings, 1 reply; 17+ messages in thread From: Jamal Hadi Salim @ 2026-02-18 18:44 UTC (permalink / raw) To: Ilya Maximets Cc: Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 2/18/26 3:28 PM, Jamal Hadi Salim wrote: > > > On Tue, Feb 17, 2026 at 2:37 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > >> > > >> On 2/17/26 4:52 PM, Paolo Abeni wrote: > > >>> Adding Florian, too > > >>> > > >>> On 2/17/26 3:49 PM, Ilya Maximets wrote: > > >>>> On 2/17/26 10:38 AM, Paolo Abeni wrote: > > >>>>> Since the blamed commit below, classify can return TC_ACT_CONSUMED while > > >>>>> the current skb being held by the defragmentation engine. As reported by > > >>>>> GangMin Kim, if such packet is that may cause a UaF when the defrag engine > > >>>>> later on tries to tuch again such packet. > > >>>>> > > >>>>> act_ct was never meant to be used outside of the ingress path. Making > > >>>>> defrag really works for act_ct outside such constraints range from very > > >>>>> difficult to completely impossible. > > >>>>> > > >>>>> Address the issue making act_ct drop any packet when not attached to the > > >>>>> ingress path and additionally emit a warning about the bad > > >>>>> configuration. > > >>>>> > > >>>>> Reported-by: GangMin Kim <km.kim1503@gmail.com> > > >>>>> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") > > >>>>> CC: stable@vger.kernel.org > > >>>>> Link: https://patch.msgid.link/16f6b264373ad60ab18eb8525809e7267442afa7.1770394932.git.pabeni@redhat.com > > >>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > > >>>>> --- > > >>>>> Catching the bad configuration at runtime instead of init time to reduce > > >>>>> complexity > > >>>>> --- > > >>>>> net/sched/act_ct.c | 6 ++++++ > > >>>>> 1 file changed, 6 insertions(+) > > >>>>> > > >>>>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > > >>>>> index 81d488655793..e8eb0d195f4a 100644 > > >>>>> --- a/net/sched/act_ct.c > > >>>>> +++ b/net/sched/act_ct.c > > >>>>> @@ -987,6 +987,11 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, > > >>>>> tcf_lastuse_update(&c->tcf_tm); > > >>>>> tcf_action_update_bstats(&c->common, skb); > > >>>>> > > >>>>> + if (!skb_at_tc_ingress(skb)) { > > >>>>> + pr_warn_once("act_CT should be attached at ingress!\n"); > > >>>> > > >>>> Hi, Paolo. I didn't test this yet, but I'm a little concerned about this > > >>>> change. For TC offload of OVS tunnels, we create egress qdisc for internal > > >>>> ports, a.k.a. bridge ports. This is necessary for tunnels to work. > > >>>> A typical setup looks like this: > > >>>> > > >>>> Bridge br-int (10.0.0.1) > > >>>> Port br-int > > >>>> Interface br-int > > >>>> type: internal > > >>>> Port vm > > >>>> Interface vm > > >>>> Port vxlan0 > > >>>> Interface vxlan0 > > >>>> type: vxlan > > >>>> options: {remote_ip="172.31.1.1"} > > >>>> > > >>>> Bridge br-phy (172.31.1.100) > > >>>> Port br-phy > > >>>> Interface br-phy > > >>>> type: internal > > >>>> Port eth0 > > >>>> Interface eth0 > > >>>> > > >>>> When a VM sends a packet that supposed to go through the tunnel, it enters > > >>>> from the vm port and OVS forwards it into vxlan0 LWT with an appropriate > > >>>> tunnel metadata. Then it gets encapsulated and put into kernel routing to > > >>>> find the next hop, which is via br-phy bridge. Packet enters br-phy and > > >>>> OVS forwards the packet to eth0. There can be stateful processing in the > > >>>> br-phy bridge involving passing packets through conntrack. > > >>>> > > >>>> With the TC offload enabled, OVS creates following filters: > > >>>> > > >>>> 1. Ingress filter on vm interface that forwards packets to vxlan0. > > >>>> 2. Egress filter on br-phy interface that forwards encapsulated packets > > >>>> to eth0 interface. > > >>>> > > >>>> If some stateful processing is involved, both of those could have act_ct > > >>>> in them. > > >>>> > > >>>> AFAIU, we have to use Egress qdisc for the br-phy, because the packet is > > >>>> egressing from the kernel routing via br-phy and the ingress qdisc is not > > >>>> invoked. Ingress will be at play when packets are flowing in the opposite > > >>>> direction from eth0 to br-phy, as that's where they will ingress into > > >>>> standard kernel routing. > > >>>> > > >>>> If act_tc will not be allowed on Egress, then stateful processing will > > >>>> not be possible in this case in br-phy bridge. > > >>>> > > >>>> Thoughts? > > >>> > > >>> This looks very problematic. A slightly different patch tried to > > >>> somewhat preserve functionality, but it simply can't work in presence of > > >>> IP fragments. > > >>> > > >>> Why stateful processing on br-int/port vm ingress is not enough? > > >> > > >> In general we don't know where the packet will go after encapsulation, > > >> so we do not know beforehand if it will end up in another OVS bridge > > >> or not. Hence the processing has to be split in two parts, the ingress > > >> on vm port and egress on the other bridge port. > > >> > > >> At the same time, thinking more about potential use cases, I don't see > > >> a good one for running encapsulated traffic through conntrack. Most > > >> of the time it's a UDP unidirectional traffic with hashes for source > > >> ports. Not sure what functionality would require to do anything > > >> stateful on such packets. So, maybe it's fine to prohibit act_ct on > > >> the egress after all. > > >> > > > > > > Another alternative is to drop packets that have fragments? i.e > > > Paolo's original patch but instead of accepting to drop > > > Or perhaps if there are no consequences there is nothing wrong with > > > Paolo's original patch.. > > > > > > From a validation/testability/debugability POV, assuming its bad to > > > ignore frags processing: > > > The issue is for the user to distinguish between a bad setup (like > > > getting fragments in ct) and a good one where there were no frags. If > > > I were to write a tdc test to send two packets - one with fragments > > > and another without - a good way to check is to look at the tc > > > action/qdisc counters.. Paolo's original patch was accepting in both > > > cases the counter so the accept counter would read as 2 but if it was > > > dropped it would say 1 dropped and 1 accepted, so its easy to test (or > > > the user to debug) > > > > > > > From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when > > it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear > > for the application that makes a request and for the user if they make > > the request manually with 'tc filter ...'. > > > Actually, looking closely at the code - this is doable. Let's see if a patch can be cooked. cheers, jamal > The challenge is actions could be created as a standalone i.e "tc > actions add action ct..." then later bound via tc filter. Don't forget > actions can also be shared by multiple filters (which could be a mix > of egress/ingress)... > > > If the filter with act_ct is allowed to be created, but only works for > > non-fragmented traffic, that would be pretty hard to debug for someone > > who is not an expert in the area, but also would be a pretty inconsistent > > user API. > > > > True. The drop counter is better than "nothing".. > > cheers, > jamal > > > > Best regards, Ilya Maximets. > > > > > cheers, > > > jamal > > > > > >> For the implementation however, let's look at a different very simple > > >> use case: > > >> > > >> Application on the host -> br-phy -> NAT -> eth0. > > >> > > >> This is a very common thing to NAT traffic before it leaves the host. > > >> It can be for load balancing reasons, e.g. if a destination is a service > > >> IP with multiple backends and we want multiple applications to hit > > >> different backends. > > >> > > >> There is no actual reason to try and offload such traffic to TC, as it > > >> would not be possible to offload it to hardware (since the traffic > > >> source is an application on the host). > > >> > > >> However, since OVS doesn't really know that it's not possible to offload > > >> to hardware and TC by itself allows creation of such filter, it will > > >> create the filter on the egress qdisc on the bridge port and that egress > > >> filter will be performing act_ct with the NAT. > > >> > > >> With this patch applied, all this traffic will be just dropped. If we're > > >> going to prohibit act_ct on egress, that should be done at the moment of > > >> creation of this action on the egress qdisc, so the user knows that it's > > >> not supposed to be there. E.g. if OVS will try to create such action and > > >> fail, it will keep forwarding the traffic successfully through the normal > > >> openvswitch datapath. But if the TC filter creation succeeds, all traffic > > >> will be just dropped by it. That doesn't sound right. > > >> > > >> Best regards, Ilya Maximets. > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 18:44 ` Jamal Hadi Salim @ 2026-02-18 20:43 ` Paolo Abeni 2026-02-19 11:46 ` Ilya Maximets 2026-02-19 14:16 ` Jamal Hadi Salim 0 siblings, 2 replies; 17+ messages in thread From: Paolo Abeni @ 2026-02-18 20:43 UTC (permalink / raw) To: Jamal Hadi Salim, Ilya Maximets Cc: netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: > On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when >>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear >>> for the application that makes a request and for the user if they make >>> the request manually with 'tc filter ...'. > >> The challenge is actions could be created as a standalone i.e "tc >> actions add action ct..." then later bound via tc filter. Don't forget >> actions can also be shared by multiple filters (which could be a mix >> of egress/ingress)... > > Actually, looking closely at the code - this is doable. Let's see if a > patch can be cooked. I discussed a bit the topic with Davide, it looks like the thing you mentioned above could be handled. The problematic part is AFAICS the additional indirection level added by (possibly shared) blocks. AFAICS whatever check we do at ct_init() time, shared block could later circumvent it - unless act_ct is always forbidden for shared blocks. TL;DR: I think admission check at init time can be implemented only in a quite (too much?) restrictive way. /P ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 20:43 ` Paolo Abeni @ 2026-02-19 11:46 ` Ilya Maximets 2026-02-19 14:16 ` Jamal Hadi Salim 1 sibling, 0 replies; 17+ messages in thread From: Ilya Maximets @ 2026-02-19 11:46 UTC (permalink / raw) To: Paolo Abeni, Jamal Hadi Salim Cc: i.maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 2/18/26 9:43 PM, Paolo Abeni wrote: > On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: >> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when >>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear >>>> for the application that makes a request and for the user if they make >>>> the request manually with 'tc filter ...'. >> >>> The challenge is actions could be created as a standalone i.e "tc >>> actions add action ct..." then later bound via tc filter. Don't forget >>> actions can also be shared by multiple filters (which could be a mix >>> of egress/ingress)... >> >> Actually, looking closely at the code - this is doable. Let's see if a >> patch can be cooked. > > I discussed a bit the topic with Davide, it looks like the thing you > mentioned above could be handled. The problematic part is AFAICS the > additional indirection level added by (possibly shared) blocks. AFAICS > whatever check we do at ct_init() time, shared block could later > circumvent it - unless act_ct is always forbidden for shared blocks. I'm obviously biased here, but if it's not possible to cover all the cases, I'd still prefer a partial solution over the no solution. If we can handle just the RTM_NEWTFILTER case, that will allow us to at least avoid breaking a known use case with OVS. If it will be possible to circumvent that with block sharing, then we can still drop fragmented traffic there. I'm not sure how widely sharing is used. Best regards, Ilya Maximets. > > TL;DR: I think admission check at init time can be implemented only in a > quite (too much?) restrictive way. > > /P > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-18 20:43 ` Paolo Abeni 2026-02-19 11:46 ` Ilya Maximets @ 2026-02-19 14:16 ` Jamal Hadi Salim 2026-02-19 20:13 ` Jamal Hadi Salim 1 sibling, 1 reply; 17+ messages in thread From: Jamal Hadi Salim @ 2026-02-19 14:16 UTC (permalink / raw) To: Paolo Abeni Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote: > > On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: > > On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when > >>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear > >>> for the application that makes a request and for the user if they make > >>> the request manually with 'tc filter ...'. > > > >> The challenge is actions could be created as a standalone i.e "tc > >> actions add action ct..." then later bound via tc filter. Don't forget > >> actions can also be shared by multiple filters (which could be a mix > >> of egress/ingress)... > > > > Actually, looking closely at the code - this is doable. Let's see if a > > patch can be cooked. > > I discussed a bit the topic with Davide, it looks like the thing you > mentioned above could be handled. The problematic part is AFAICS the > additional indirection level added by (possibly shared) blocks. AFAICS > whatever check we do at ct_init() time, shared block could later > circumvent it - unless act_ct is always forbidden for shared blocks. > > TL;DR: I think admission check at init time can be implemented only in a > quite (too much?) restrictive way. > True - the blocks will require more surgery but seems doable. We may end up again in ->quirks() land here. I.e when an action is bound to a filter (not necessarily when created) it should check its fit in the graph and see if it is legit. But since this is only the second quirk we know of, we can simplify. Note: We already have one such check, albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS). I will talk to Victor and see if he can cook at minimal a preliminary patch without considering blocks, and at best one that does consider blocks. cheers, jamal > /P > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-19 14:16 ` Jamal Hadi Salim @ 2026-02-19 20:13 ` Jamal Hadi Salim 2026-02-20 12:24 ` Victor Nogueira 0 siblings, 1 reply; 17+ messages in thread From: Jamal Hadi Salim @ 2026-02-19 20:13 UTC (permalink / raw) To: Paolo Abeni Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote: > > > > On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: > > > On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > >> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > >>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when > > >>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear > > >>> for the application that makes a request and for the user if they make > > >>> the request manually with 'tc filter ...'. > > > > > >> The challenge is actions could be created as a standalone i.e "tc > > >> actions add action ct..." then later bound via tc filter. Don't forget > > >> actions can also be shared by multiple filters (which could be a mix > > >> of egress/ingress)... > > > > > > Actually, looking closely at the code - this is doable. Let's see if a > > > patch can be cooked. > > > > I discussed a bit the topic with Davide, it looks like the thing you > > mentioned above could be handled. The problematic part is AFAICS the > > additional indirection level added by (possibly shared) blocks. AFAICS > > whatever check we do at ct_init() time, shared block could later > > circumvent it - unless act_ct is always forbidden for shared blocks. > > > > TL;DR: I think admission check at init time can be implemented only in a > > quite (too much?) restrictive way. > > > > True - the blocks will require more surgery but seems doable. We may > end up again in ->quirks() land here. I.e when an action is bound to a > filter (not necessarily when created) it should check its fit in the > graph and see if it is legit. But since this is only the second quirk > we know of, we can simplify. Note: We already have one such check, > albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS). > I will talk to Victor and see if he can cook at minimal a preliminary > patch without considering blocks, and at best one that does consider > blocks. After discussions with Victor we concluded that blocks have no effect on this specific issue. Blocks are only useful in clsact(both ingress/egress) and ingress. Both of these handle ACT_CONSUMED fine. Code sample looks as follows: case TC_ACT_STOLEN: case TC_ACT_QUEUED: case TC_ACT_TRAP: consume_skb(skb); fallthrough; case TC_ACT_CONSUMED: *ret = NET_RX_SUCCESS; bpf_net_ctx_clear(bpf_net_ctx); return NULL; The reason this works is because they don't have to deal with queues(unlike the other qdiscs). So, ignoring the block should work just fine. Ilya, for your case, if you must egress, use clsact egress - the only problem is it will not support the more feature-rich qdiscs... cheers, jamal ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-19 20:13 ` Jamal Hadi Salim @ 2026-02-20 12:24 ` Victor Nogueira 2026-02-20 13:41 ` Ilya Maximets 0 siblings, 1 reply; 17+ messages in thread From: Victor Nogueira @ 2026-02-20 12:24 UTC (permalink / raw) To: Jamal Hadi Salim, Paolo Abeni Cc: Ilya Maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 19/02/2026 17:13, Jamal Hadi Salim wrote: > On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> >> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote: >>> >>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: >>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when >>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear >>>>>> for the application that makes a request and for the user if they make >>>>>> the request manually with 'tc filter ...'. >>>> >>>>> The challenge is actions could be created as a standalone i.e "tc >>>>> actions add action ct..." then later bound via tc filter. Don't forget >>>>> actions can also be shared by multiple filters (which could be a mix >>>>> of egress/ingress)... >>>> >>>> Actually, looking closely at the code - this is doable. Let's see if a >>>> patch can be cooked. >>> >>> I discussed a bit the topic with Davide, it looks like the thing you >>> mentioned above could be handled. The problematic part is AFAICS the >>> additional indirection level added by (possibly shared) blocks. AFAICS >>> whatever check we do at ct_init() time, shared block could later >>> circumvent it - unless act_ct is always forbidden for shared blocks. >>> >>> TL;DR: I think admission check at init time can be implemented only in a >>> quite (too much?) restrictive way. >>> >> >> True - the blocks will require more surgery but seems doable. We may >> end up again in ->quirks() land here. I.e when an action is bound to a >> filter (not necessarily when created) it should check its fit in the >> graph and see if it is legit. But since this is only the second quirk >> we know of, we can simplify. Note: We already have one such check, >> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS). >> I will talk to Victor and see if he can cook at minimal a preliminary >> patch without considering blocks, and at best one that does consider >> blocks. > > After discussions with Victor we concluded that blocks have no effect > on this specific issue. > Blocks are only useful in clsact(both ingress/egress) and ingress. > Both of these handle ACT_CONSUMED fine. > Code sample looks as follows: > > case TC_ACT_STOLEN: > case TC_ACT_QUEUED: > case TC_ACT_TRAP: > consume_skb(skb); > fallthrough; > case TC_ACT_CONSUMED: > *ret = NET_RX_SUCCESS; > bpf_net_ctx_clear(bpf_net_ctx); > return NULL; > > The reason this works is because they don't have to deal with > queues(unlike the other qdiscs). > So, ignoring the block should work just fine. Ilya, for your case, if > you must egress, use clsact egress - the only problem is it will not > support the more feature-rich qdiscs... If there is no objection, we are thinking of sending something along these lines: diff --git a/include/net/act_api.h b/include/net/act_api.h index 91a24b5e0b93..2ba40eb45aad 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -70,6 +70,7 @@ struct tc_action { #define TCA_ACT_FLAGS_REPLACE (1U << (TCA_ACT_FLAGS_USER_BITS + 2)) #define TCA_ACT_FLAGS_NO_RTNL (1U << (TCA_ACT_FLAGS_USER_BITS + 3)) #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4)) +#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT (1U << (TCA_ACT_FLAGS_USER_BITS + 5)) /* Update lastuse only if needed, to avoid dirtying a cache line. * We use a temp variable to avoid fetching jiffies twice. diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index 81d488655793..a79479faf0f8 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla, return -EINVAL; } + if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) { + NL_SET_ERR_MSG_MOD(extack, + "May only attach ct to ingress or clsact"); + return -EPERM; + } + err = nla_parse_nested(tb, TCA_CT_MAX, nla, ct_policy, extack); if (err < 0) return err; diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index ebca4b926dcf..8c72faf3314d 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -2228,6 +2228,11 @@ static bool is_qdisc_ingress(__u32 classid) return (TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS)); } +static bool is_ingress_or_clsact(struct tcf_block *block, struct Qdisc *q) +{ + return tcf_block_shared(block) || (q && !!(q->flags & TCQ_F_INGRESS)); +} + static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n, struct netlink_ext_ack *extack) { @@ -2420,6 +2425,8 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n, flags |= TCA_ACT_FLAGS_NO_RTNL; if (is_qdisc_ingress(parent)) flags |= TCA_ACT_FLAGS_AT_INGRESS; + if (is_ingress_or_clsact(block, q)) + flags |= TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT; err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, flags, extack); if (err == 0) { ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-20 12:24 ` Victor Nogueira @ 2026-02-20 13:41 ` Ilya Maximets 2026-02-20 16:12 ` Victor Nogueira 0 siblings, 1 reply; 17+ messages in thread From: Ilya Maximets @ 2026-02-20 13:41 UTC (permalink / raw) To: Victor Nogueira, Jamal Hadi Salim, Paolo Abeni Cc: i.maximets, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On 2/20/26 1:24 PM, Victor Nogueira wrote: > On 19/02/2026 17:13, Jamal Hadi Salim wrote: >> On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >>> >>> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote: >>>> >>>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: >>>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >>>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when >>>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear >>>>>>> for the application that makes a request and for the user if they make >>>>>>> the request manually with 'tc filter ...'. >>>>> >>>>>> The challenge is actions could be created as a standalone i.e "tc >>>>>> actions add action ct..." then later bound via tc filter. Don't forget >>>>>> actions can also be shared by multiple filters (which could be a mix >>>>>> of egress/ingress)... >>>>> >>>>> Actually, looking closely at the code - this is doable. Let's see if a >>>>> patch can be cooked. >>>> >>>> I discussed a bit the topic with Davide, it looks like the thing you >>>> mentioned above could be handled. The problematic part is AFAICS the >>>> additional indirection level added by (possibly shared) blocks. AFAICS >>>> whatever check we do at ct_init() time, shared block could later >>>> circumvent it - unless act_ct is always forbidden for shared blocks. >>>> >>>> TL;DR: I think admission check at init time can be implemented only in a >>>> quite (too much?) restrictive way. >>>> >>> >>> True - the blocks will require more surgery but seems doable. We may >>> end up again in ->quirks() land here. I.e when an action is bound to a >>> filter (not necessarily when created) it should check its fit in the >>> graph and see if it is legit. But since this is only the second quirk >>> we know of, we can simplify. Note: We already have one such check, >>> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS). >>> I will talk to Victor and see if he can cook at minimal a preliminary >>> patch without considering blocks, and at best one that does consider >>> blocks. >> >> After discussions with Victor we concluded that blocks have no effect >> on this specific issue. >> Blocks are only useful in clsact(both ingress/egress) and ingress. >> Both of these handle ACT_CONSUMED fine. >> Code sample looks as follows: >> >> case TC_ACT_STOLEN: >> case TC_ACT_QUEUED: >> case TC_ACT_TRAP: >> consume_skb(skb); >> fallthrough; >> case TC_ACT_CONSUMED: >> *ret = NET_RX_SUCCESS; >> bpf_net_ctx_clear(bpf_net_ctx); >> return NULL; >> >> The reason this works is because they don't have to deal with >> queues(unlike the other qdiscs). >> So, ignoring the block should work just fine. Ilya, for your case, if >> you must egress, use clsact egress - the only problem is it will not >> support the more feature-rich qdiscs... > > If there is no objection, we are thinking of sending something along > these lines: > > diff --git a/include/net/act_api.h b/include/net/act_api.h > index 91a24b5e0b93..2ba40eb45aad 100644 > --- a/include/net/act_api.h > +++ b/include/net/act_api.h > @@ -70,6 +70,7 @@ struct tc_action { > #define TCA_ACT_FLAGS_REPLACE (1U << (TCA_ACT_FLAGS_USER_BITS + 2)) > #define TCA_ACT_FLAGS_NO_RTNL (1U << (TCA_ACT_FLAGS_USER_BITS + 3)) > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS > + 4)) > +#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT (1U << > (TCA_ACT_FLAGS_USER_BITS + 5)) > > /* Update lastuse only if needed, to avoid dirtying a cache line. > * We use a temp variable to avoid fetching jiffies twice. > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > index 81d488655793..a79479faf0f8 100644 > --- a/net/sched/act_ct.c > +++ b/net/sched/act_ct.c > @@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct > nlattr *nla, > return -EINVAL; > } > > + if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) { > + NL_SET_ERR_MSG_MOD(extack, > + "May only attach ct to ingress or clsact"); > + return -EPERM; I haven't tried this change yet, but I'd say this error code should be EOPNOTSUPP instead of EPERM, as it's not a permissions problem, it's that we decided to not support this use case. From the OVS side EOPNOTSUPP will be handled by ovs-vswitchd normally with a fallback to a standard openvswitch datapath. While EPERM, though still handled, will cause an extra error reported to the user, which is not desirable. Best regards, Ilya Maximets. > + } > + > err = nla_parse_nested(tb, TCA_CT_MAX, nla, ct_policy, extack); > if (err < 0) > return err; > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c > index ebca4b926dcf..8c72faf3314d 100644 > --- a/net/sched/cls_api.c > +++ b/net/sched/cls_api.c > @@ -2228,6 +2228,11 @@ static bool is_qdisc_ingress(__u32 classid) > return (TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS)); > } > > +static bool is_ingress_or_clsact(struct tcf_block *block, struct Qdisc *q) > +{ > + return tcf_block_shared(block) || (q && !!(q->flags & > TCQ_F_INGRESS)); > +} > + > static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n, > struct netlink_ext_ack *extack) > { > @@ -2420,6 +2425,8 @@ static int tc_new_tfilter(struct sk_buff *skb, > struct nlmsghdr *n, > flags |= TCA_ACT_FLAGS_NO_RTNL; > if (is_qdisc_ingress(parent)) > flags |= TCA_ACT_FLAGS_AT_INGRESS; > + if (is_ingress_or_clsact(block, q)) > + flags |= TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT; > err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, > flags, extack); > if (err == 0) { > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-20 13:41 ` Ilya Maximets @ 2026-02-20 16:12 ` Victor Nogueira 0 siblings, 0 replies; 17+ messages in thread From: Victor Nogueira @ 2026-02-20 16:12 UTC (permalink / raw) To: Ilya Maximets Cc: Jamal Hadi Salim, Paolo Abeni, netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim, Eelco Chaudron, Aaron Conole, Florian Westphal On Fri, Feb 20, 2026 at 10:41 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 2/20/26 1:24 PM, Victor Nogueira wrote: > > On 19/02/2026 17:13, Jamal Hadi Salim wrote: > >> On Thu, Feb 19, 2026 at 9:16 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >>> > >>> On Wed, Feb 18, 2026 at 3:43 PM Paolo Abeni <pabeni@redhat.com> wrote: > >>>> > >>>> On 2/18/26 7:44 PM, Jamal Hadi Salim wrote: > >>>>> On Wed, Feb 18, 2026 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >>>>>> On Wed, Feb 18, 2026 at 11:15 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>> From a user's perspective I'd prefer if RTM_NEWTFILTER just fails when > >>>>>>> it contains TCA_ACT_KIND "ct" with TC_H_MIN_EGRESS. This is clear > >>>>>>> for the application that makes a request and for the user if they make > >>>>>>> the request manually with 'tc filter ...'. > >>>>> > >>>>>> The challenge is actions could be created as a standalone i.e "tc > >>>>>> actions add action ct..." then later bound via tc filter. Don't forget > >>>>>> actions can also be shared by multiple filters (which could be a mix > >>>>>> of egress/ingress)... > >>>>> > >>>>> Actually, looking closely at the code - this is doable. Let's see if a > >>>>> patch can be cooked. > >>>> > >>>> I discussed a bit the topic with Davide, it looks like the thing you > >>>> mentioned above could be handled. The problematic part is AFAICS the > >>>> additional indirection level added by (possibly shared) blocks. AFAICS > >>>> whatever check we do at ct_init() time, shared block could later > >>>> circumvent it - unless act_ct is always forbidden for shared blocks. > >>>> > >>>> TL;DR: I think admission check at init time can be implemented only in a > >>>> quite (too much?) restrictive way. > >>>> > >>> > >>> True - the blocks will require more surgery but seems doable. We may > >>> end up again in ->quirks() land here. I.e when an action is bound to a > >>> filter (not necessarily when created) it should check its fit in the > >>> graph and see if it is legit. But since this is only the second quirk > >>> we know of, we can simplify. Note: We already have one such check, > >>> albeit ugly-looking, with skbedit (see flag TCA_ACT_FLAGS_AT_INGRESS). > >>> I will talk to Victor and see if he can cook at minimal a preliminary > >>> patch without considering blocks, and at best one that does consider > >>> blocks. > >> > >> After discussions with Victor we concluded that blocks have no effect > >> on this specific issue. > >> Blocks are only useful in clsact(both ingress/egress) and ingress. > >> Both of these handle ACT_CONSUMED fine. > >> Code sample looks as follows: > >> > >> case TC_ACT_STOLEN: > >> case TC_ACT_QUEUED: > >> case TC_ACT_TRAP: > >> consume_skb(skb); > >> fallthrough; > >> case TC_ACT_CONSUMED: > >> *ret = NET_RX_SUCCESS; > >> bpf_net_ctx_clear(bpf_net_ctx); > >> return NULL; > >> > >> The reason this works is because they don't have to deal with > >> queues(unlike the other qdiscs). > >> So, ignoring the block should work just fine. Ilya, for your case, if > >> you must egress, use clsact egress - the only problem is it will not > >> support the more feature-rich qdiscs... > > > > If there is no objection, we are thinking of sending something along > > these lines: > > > > diff --git a/include/net/act_api.h b/include/net/act_api.h > > index 91a24b5e0b93..2ba40eb45aad 100644 > > --- a/include/net/act_api.h > > +++ b/include/net/act_api.h > > @@ -70,6 +70,7 @@ struct tc_action { > > #define TCA_ACT_FLAGS_REPLACE (1U << (TCA_ACT_FLAGS_USER_BITS + 2)) > > #define TCA_ACT_FLAGS_NO_RTNL (1U << (TCA_ACT_FLAGS_USER_BITS + 3)) > > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS > > + 4)) > > +#define TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT (1U << > > (TCA_ACT_FLAGS_USER_BITS + 5)) > > > > /* Update lastuse only if needed, to avoid dirtying a cache line. > > * We use a temp variable to avoid fetching jiffies twice. > > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c > > index 81d488655793..a79479faf0f8 100644 > > --- a/net/sched/act_ct.c > > +++ b/net/sched/act_ct.c > > @@ -1360,6 +1360,12 @@ static int tcf_ct_init(struct net *net, struct > > nlattr *nla, > > return -EINVAL; > > } > > > > + if (bind && !(flags & TCA_ACT_FLAGS_AT_INGRESS_OR_CLSACT)) { > > + NL_SET_ERR_MSG_MOD(extack, > > + "May only attach ct to ingress or clsact"); > > + return -EPERM; > > I haven't tried this change yet, but I'd say this error code should > be EOPNOTSUPP instead of EPERM, as it's not a permissions problem, > it's that we decided to not support this use case. > > From the OVS side EOPNOTSUPP will be handled by ovs-vswitchd normally > with a fallback to a standard openvswitch datapath. While EPERM, though > still handled, will cause an extra error reported to the user, which > is not desirable. That's a fair point, we can make that change. cheers, Victor ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress 2026-02-17 9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni 2026-02-17 10:42 ` Paolo Abeni 2026-02-17 14:49 ` Ilya Maximets @ 2026-02-17 15:28 ` Jakub Kicinski 2 siblings, 0 replies; 17+ messages in thread From: Jakub Kicinski @ 2026-02-17 15:28 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet, Simon Horman, Henrik Steen, Olivier Tilmans, Bob Briscoe, Olga Albisser, GangMin Kim On Tue, 17 Feb 2026 10:38:52 +0100 Paolo Abeni wrote: > Since the blamed commit below, classify can return TC_ACT_CONSUMED while > the current skb being held by the defragmentation engine. As reported by > GangMin Kim, if such packet is that may cause a UaF when the defrag engine > later on tries to tuch again such packet. > > act_ct was never meant to be used outside of the ingress path. Making > defrag really works for act_ct outside such constraints range from very > difficult to completely impossible. > > Address the issue making act_ct drop any packet when not attached to the > ingress path and additionally emit a warning about the bad > configuration. FWIW there is a test case in net/forwarding/tc_actions.sh that fails now (needs to be adjusted/deleted/reconsidered). -- pw-bot: cr ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-02-20 16:12 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-17 9:38 [PATCH net] net_sched: act_ct: drop all packets when not attached to ingress Paolo Abeni 2026-02-17 10:42 ` Paolo Abeni 2026-02-17 14:49 ` Ilya Maximets 2026-02-17 15:52 ` Paolo Abeni 2026-02-17 19:37 ` Ilya Maximets 2026-02-18 14:28 ` Jamal Hadi Salim 2026-02-18 16:15 ` Ilya Maximets 2026-02-18 18:31 ` Jamal Hadi Salim 2026-02-18 18:44 ` Jamal Hadi Salim 2026-02-18 20:43 ` Paolo Abeni 2026-02-19 11:46 ` Ilya Maximets 2026-02-19 14:16 ` Jamal Hadi Salim 2026-02-19 20:13 ` Jamal Hadi Salim 2026-02-20 12:24 ` Victor Nogueira 2026-02-20 13:41 ` Ilya Maximets 2026-02-20 16:12 ` Victor Nogueira 2026-02-17 15:28 ` Jakub Kicinski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox