netdev.vger.kernel.org archive mirror
* When routed to VRF, NF _output_ hook is run unexpectedly
@ 2025-06-20 13:38 Eugene Crosser
  2025-06-20 14:56 ` Nicolas Dichtel
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Crosser @ 2025-06-20 13:38 UTC (permalink / raw)
  To: netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Nicolas Dichtel,
	Florian Westphal, Pablo Neira Ayuso


Hello!

It is possible, and very useful, to implement "two-stage routing" by
installing a route that points to a VRF device:

    ip link add vrfNNN type vrf table NNN
    ...
    ip route add xxxxx/yy dev vrfNNN

however, this causes surprising behaviour with respect to netfilter
hooks. Namely, packets taking such a path traverse the _output_ nftables
chain, with conntrack information reset. So, for example, even
when "notrack" has been set in the prerouting chain, conntrack entries
will still be created. The script attached below demonstrates this behaviour.

So, in order to control conntracking behaviour, it is necessary to
install additional rules in the output chain, even though, logically,
only forwarding takes place. Also, because "iif" is not available
in the "output" chain, it is difficult to distinguish such VRF-routed
traffic from true "output" traffic in an nftables rule.
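
A minimal sketch of the kind of additional rule meant here, assuming the
same "inet filter" table that the script below uses. Note the side effect:
this rule also disables tracking for traffic that really is generated
locally, since the two cannot easily be told apart in this chain.

```
table inet filter {
	chain rawout {
		type filter hook output priority raw; policy accept;
		# Assumption: untrack everything seen in output, VRF-routed or not
		notrack
	}
}
```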

I suppose that if a packet is being processed by the VRF because it
followed a route pointing to the VRF interface, the output netfilter hook
should not be executed. Possibly(?) the forward hook should be run
instead, or none?

Thanks for consideration

Eugene

=====
#!/bin/sh

cleanup() {
	for ns in 1 2 3; do
		ip netns del tns$ns
	done
}

trap cleanup EXIT

for ns in 1 2 3; do
	ip netns add tns$ns
done
ip -n tns2 link add ve21 type veth peer ve12 netns tns1
ip -n tns2 link add ve23 type veth peer ve32 netns tns3

ip -n tns1 link set lo up
ip -n tns1 addr add 172.16.1.1/30 dev ve12
ip -n tns1 link set ve12 up
ip -n tns1 route add default via 172.16.1.2 dev ve12

ip -n tns3 link set lo up
ip -n tns3 addr add 172.16.3.1/30 dev ve32
ip -n tns3 addr add 172.16.9.1/30 dev ve32
ip -n tns3 link set ve32 up
ip -n tns3 route add default via 172.16.3.2 dev ve32

ip -n tns2 link set lo up
ip -n tns2 addr add 172.16.1.2/30 dev ve21
ip -n tns2 link set ve21 up
ip -n tns2 addr add 172.16.3.2/30 dev ve23
ip -n tns2 link set ve23 up

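# Two-stage routing in tns2: the main table sends 172.16.9.0/24 to the
# VRF device, and the VRF table (9999) then routes it on towards tns3.
ip -n tns2 link add tvrf1 type vrf table 9999
ip -n tns2 link set tvrf1 up
ip -n tns2 route add 172.16.9.0/24 dev tvrf1
ip -n tns2 route add 172.16.9.0/24 via 172.16.3.1 dev ve23 vrf tvrf1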

ip netns exec tns2 nft -f - <<__END__
table inet filter {
	chain rawout {
		type filter hook output priority raw; policy accept;
		counter # notrack  ### NEED THIS ADDITIONAL "notrack"
	}
	chain rawpre {
		type filter hook prerouting priority raw; policy accept;
		counter notrack
	}
	chain forward {
		type filter hook forward priority filter; policy accept;
		ct state established,related counter accept
		counter
	}
}
__END__

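# Control case: 172.16.3.1 is reached via the main table only.
ip netns exec tns1 ping -q -W1 -c1 172.16.3.1
# Test case: 172.16.9.1 is reached via the route pointing to tvrf1.
ip netns exec tns1 ping -q -W1 -c1 172.16.9.1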

ip netns exec tns2 nft list ruleset
ip netns exec tns2 conntrack -L
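
To read the result without scanning the conntrack listing by eye, a small
helper can count the entries for a given destination. This is only a
sketch: count_ct_dst is a hypothetical name that is not part of the script
above, and it assumes the usual "dst=ADDR" fields in the listing printed
by conntrack-tools.

```shell
# count_ct_dst ADDR: read `conntrack -L`-style text on stdin and print
# the number of entries that mention ADDR as a destination.
# (hypothetical helper, not part of the original script)
count_ct_dst() {
	# -F: match the address literally; the trailing space keeps
	# 172.16.9.1 from also matching 172.16.9.10.
	# grep -c exits non-zero when the count is 0, hence "|| true".
	grep -c -F -- "dst=$1 " || true
}
```

Possible usage, once the function is defined in the script:

    ip netns exec tns2 conntrack -L 2>/dev/null | count_ct_dst 172.16.9.1

Going by the report above, the count for 172.16.9.1 is expected to be
non-zero unless the extra "notrack" in the output chain is enabled.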


* Re: When routed to VRF, NF _output_ hook is run unexpectedly
  2025-06-20 13:38 When routed to VRF, NF _output_ hook is run unexpectedly Eugene Crosser
@ 2025-06-20 14:56 ` Nicolas Dichtel
  2025-06-20 16:04   ` Eugene Crosser
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Dichtel @ 2025-06-20 14:56 UTC (permalink / raw)
  To: Eugene Crosser, netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Florian Westphal,
	Pablo Neira Ayuso

On 20/06/2025 at 15:38, Eugene Crosser wrote:
> Hello!
Hello,

> 
> It is possible, and very useful, to implement "two-stage routing" by
> installing a route that points to a VRF device:
> 
>     ip link add vrfNNN type vrf table NNN
>     ...
>     ip route add xxxxx/yy dev vrfNNN
> 
> however this causes surprising behaviour with relation to netfilter
> hooks. Namely, packets taking such path traverse _output_ nftables
> chain, with conntracking information reset. So, for example, even
> when "notrack" has been set in the prerouting chain, conntrack entries
> will still be created. Script attached below demonstrates this behaviour.
You can have a look at this commit to better understand this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9

Regards,
Nicolas


* Re: When routed to VRF, NF _output_ hook is run unexpectedly
  2025-06-20 14:56 ` Nicolas Dichtel
@ 2025-06-20 16:04   ` Eugene Crosser
  2025-06-20 16:20     ` Nicolas Dichtel
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Crosser @ 2025-06-20 16:04 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Florian Westphal,
	Pablo Neira Ayuso


Thanks Nicolas,

On 20/06/2025 16:56, Nicolas Dichtel wrote:

>> It is possible, and very useful, to implement "two-stage routing" by
>> installing a route that points to a VRF device:
>>
>>     ip link add vrfNNN type vrf table NNN
>>     ...
>>     ip route add xxxxx/yy dev vrfNNN
>>
>> however this causes surprising behaviour with relation to netfilter
>> hooks. Namely, packets taking such path traverse _output_ nftables
>> chain, with conntracking information reset. So, for example, even
>> when "notrack" has been set in the prerouting chain, conntrack entries
>> will still be created. Script attached below demonstrates this behaviour.
> You can have a look at this commit to better understand this:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9

I've seen this commit.
My point is that the packets are _not locally generated_ in this case,
so it seems wrong to pass them to the _output_ hook, doesn't it?

Regards,

Eugene


* Re: When routed to VRF, NF _output_ hook is run unexpectedly
  2025-06-20 16:04   ` Eugene Crosser
@ 2025-06-20 16:20     ` Nicolas Dichtel
  2025-06-24 15:27       ` Eugene Crosser
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Dichtel @ 2025-06-20 16:20 UTC (permalink / raw)
  To: Eugene Crosser, netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Florian Westphal,
	Pablo Neira Ayuso

On 20/06/2025 at 18:04, Eugene Crosser wrote:
> Thanks Nicolas,
> 
> On 20/06/2025 16:56, Nicolas Dichtel wrote:
> 
>>> It is possible, and very useful, to implement "two-stage routing" by
>>> installing a route that points to a VRF device:
>>>
>>>     ip link add vrfNNN type vrf table NNN
>>>     ...
>>>     ip route add xxxxx/yy dev vrfNNN
>>>
>>> however this causes surprising behaviour with relation to netfilter
>>> hooks. Namely, packets taking such path traverse _output_ nftables
>>> chain, with conntracking information reset. So, for example, even
>>> when "notrack" has been set in the prerouting chain, conntrack entries
>>> will still be created. Script attached below demonstrates this behaviour.
>> You can have a look at this commit to better understand this:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9
> 
> I've seen this commit.
> My point is that the packets are _not locally generated_ in this case,
> so it seems wrong to pass them to the _output_ hook, doesn't it?
They are, from the POV of the vrf. The first route sends packets to the vrf
device, which acts like a loopback.
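
One way to see this on a live system is nftables tracing. The following is
a sketch only: the table and chain names are made up, and the prefix is the
one from the test script earlier in the thread. The trace flag is set in
both hooks because it is not assumed here that the flag survives the pass
through the VRF device. With these rules loaded, "nft monitor trace" lists
the base chains that each matching packet traverses, so it should show
whether the forward or the output chains are hit.

```
table inet tracedemo {
	chain pre {
		type filter hook prerouting priority raw; policy accept;
		# Mark packets for the VRF-routed prefix for tracing
		ip daddr 172.16.9.0/24 meta nftrace set 1
	}
	chain out {
		type filter hook output priority raw; policy accept;
		# Same again, in case the packet re-enters via the output hook
		ip daddr 172.16.9.0/24 meta nftrace set 1
	}
}
```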


Regards,
Nicolas


* Re: When routed to VRF, NF _output_ hook is run unexpectedly
  2025-06-20 16:20     ` Nicolas Dichtel
@ 2025-06-24 15:27       ` Eugene Crosser
  2025-08-06  9:00         ` Nicolas Dichtel
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Crosser @ 2025-06-24 15:27 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Florian Westphal,
	Pablo Neira Ayuso


On 20/06/2025 18:20, Nicolas Dichtel wrote:

>>>> It is possible, and very useful, to implement "two-stage routing" by
>>>> installing a route that points to a VRF device:
>>>>
>>>>     ip link add vrfNNN type vrf table NNN
>>>>     ...
>>>>     ip route add xxxxx/yy dev vrfNNN
>>>>
>>>> however this causes surprising behaviour with relation to netfilter
>>>> hooks. Namely, packets taking such path traverse _output_ nftables
>>>> chain, with conntracking information reset. So, for example, even
>>>> when "notrack" has been set in the prerouting chain, conntrack entries
>>>> will still be created. Script attached below demonstrates this behaviour.
>>> You can have a look at this commit to better understand this:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9
>>
>> I've seen this commit.
>> My point is that the packets are _not locally generated_ in this case,
>> so it seems wrong to pass them to the _output_ hook, doesn't it?
> They are, from the POV of the vrf. The first route sends packets to the vrf
> device, which acts like a loopback.

I see, this explains the behaviour that I observe.
I believe that there are two problems here though:

1. This behaviour is _surprising_. Packets are not really "locally
generated", they come from "outside", but are treated as if they were
locally generated. In my view, it deserves a section in
Documentation/networking/vrf.rst (see suggestion below).

2. Using the "output" hook makes it impossible(?) to define different
nftables rules depending on which VRF was used for routing (because iif
is not accessible in the "output" chain). For example, traffic from
different tenants that is routed via different VRFs but egresses over the
same uplink interface cannot be assigned different zones. Conntrack
entries of different tenants will be mixed. As another example, one
cannot disable conntracking of tenants' traffic while continuing to
track "true output" traffic from the processes running on the host.

Thanks for consideration,

Eugene

========================
Suggested update to the documentation:

diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
index 0a9a6f968cb9..74c6a69355df 100644
--- a/Documentation/networking/vrf.rst
+++ b/Documentation/networking/vrf.rst
@@ -61,6 +61,11 @@ domain as a whole.
        the VRF device. For egress POSTROUTING and OUTPUT rules can be written
        using either the VRF device or real egress device.

+.. [3] When a packet is forwarded to a VRF interface, it gets further
+       routed according to the route table associated with the VRF, but
+       is processed by the "output" netfilter hook instead of the
+       "forwarding" hook.
+
 Setup
 -----
 1. VRF device is created with an association to a FIB table.


* Re: When routed to VRF, NF _output_ hook is run unexpectedly
  2025-06-24 15:27       ` Eugene Crosser
@ 2025-08-06  9:00         ` Nicolas Dichtel
  0 siblings, 0 replies; 6+ messages in thread
From: Nicolas Dichtel @ 2025-08-06  9:00 UTC (permalink / raw)
  To: Eugene Crosser, netdev
  Cc: netfilter-devel@vger.kernel.org, David Ahern, Florian Westphal,
	Pablo Neira Ayuso

On 24/06/2025 at 17:27, Eugene Crosser wrote:
> On 20/06/2025 18:20, Nicolas Dichtel wrote:
> 
>>>>> It is possible, and very useful, to implement "two-stage routing" by
>>>>> installing a route that points to a VRF device:
>>>>>
>>>>>     ip link add vrfNNN type vrf table NNN
>>>>>     ...
>>>>>     ip route add xxxxx/yy dev vrfNNN
>>>>>
>>>>> however this causes surprising behaviour with relation to netfilter
>>>>> hooks. Namely, packets taking such path traverse _output_ nftables
>>>>> chain, with conntracking information reset. So, for example, even
>>>>> when "notrack" has been set in the prerouting chain, conntrack entries
>>>>> will still be created. Script attached below demonstrates this behaviour.
>>>> You can have a look at this commit to better understand this:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9
>>>
>>> I've seen this commit.
>>> My point is that the packets are _not locally generated_ in this case,
>>> so it seems wrong to pass them to the _output_ hook, doesn't it?
>> They are, from the POV of the vrf. The first route sends packets to the vrf
>> device, which acts like a loopback.
> 
> I see, this explains the behaviour that I observe.
> I believe that there are two problems here though:
> 
> 1. This behaviour is _surprising_. Packets are not really "locally
> generated", they come from "outside", but are treated as if they were
> locally generated. In my view, it deserves a section in
> Documentation/networking/vrf.rst (see suggestion below).
> 
> 2. Using the "output" hook makes it impossible(?) to define different
> nftables rules depending on which VRF was used for routing (because iif
> is not accessible in the "output" chain). For example, traffic from
> different tenants that is routed via different VRFs but egresses over the
> same uplink interface cannot be assigned different zones. Conntrack
> entries of different tenants will be mixed. As another example, one
> cannot disable conntracking of tenants' traffic while continuing to
> track "true output" traffic from the processes running on the host.
> 
Sorry for the late reply. I'll let the netfilter/VRF experts answer these points.

> Thanks for consideration,
> 
> Eugene
> 
> ========================
> Suggested update to the documentation:
You can send a formal patch for this.


Regards,
Nicolas

> 
> diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
> index 0a9a6f968cb9..74c6a69355df 100644
> --- a/Documentation/networking/vrf.rst
> +++ b/Documentation/networking/vrf.rst
> @@ -61,6 +61,11 @@ domain as a whole.
>         the VRF device. For egress POSTROUTING and OUTPUT rules can be written
>         using either the VRF device or real egress device.
> 
> +.. [3] When a packet is forwarded to a VRF interface, it gets further
> +       routed according to the route table associated with the VRF, but
> +       is processed by the "output" netfilter hook instead of the
> +       "forwarding" hook.
> +
>  Setup
>  -----
>  1. VRF device is created with an association to a FIB table.

