From: Ralf Lici <ralf@mandelbit.com>
To: "Toke Høiland-Jørgensen" <toke@kernel.org>
Cc: netdev@vger.kernel.org, "Daniel Gröber" <dxld@darkboxed.org>,
	"Antonio Quartulli" <antonio@mandelbit.com>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
Date: Sat, 13 Jun 2026 15:17:17 +0200
Message-ID: <20260613131720.253936-1-ralf@mandelbit.com>
In-Reply-To: <87y0gm8x5k.fsf@toke.dk>

On Wed, 10 Jun 2026 13:14:47 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> Ralf Lici <ralf@mandelbit.com> writes:
>
> > Hi Toke,
> >
> > On Thu, 04 Jun 2026 20:23:51 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> Ralf Lici <ralf@mandelbit.com> writes:
> >>
> >> > This commit introduces the core start_xmit processing flow: validate,
> >> > select action, translate, and forward. It centralizes action resolution
> >> > in the dispatch layer and keeps per-direction translation logic separate
> >> > from device glue. The result is a single data-path entry point with
> >> > explicit control over drop/forward/emit behavior.
> >> >
> >> > Signed-off-by: Ralf Lici <ralf@mandelbit.com>
> >>
> >> This is very cool! Going quickly through the series, this seems like
> >> thorough work that will be cool to have available in the kernel, so
> >> thanks for doing this! I'll be quite happy to retire my barebones
> >> BPF-based implementation once this lands :)
> >>
> >
> > Thanks, glad to hear this looks useful. I have not had much time to work
> > on ipxlat lately, but I hope to respin the RFC soon.
> >
> >> One comment on the device model below (which is also why I chose this
> >> patch to reply to):
> >>
> >> > +static void ipxlat_forward_pkt(struct ipxlat_priv *ipxlat, struct sk_buff *skb)
> >> > +{
> >> > +	const unsigned int len = skb->len;
> >> > +	int err;
> >> > +
> >> > +	/* reinject as a fresh packet with scrubbed metadata */
> >> > +	skb_set_queue_mapping(skb, 0);
> >> > +	skb_scrub_packet(skb, false);
> >> > +
> >> > +	err = gro_cells_receive(&ipxlat->gro_cells, skb);
> >>
> >> So given that you're not resetting skb->dev here, IIUC, this means that
> >> the translated packet will magically re-appear as if it arrived on the
> >> interface it first came in on, right?
> >>
> >> That seems... a bit too magical? Sending a packet to one device making
> >> it suddenly reappear on a different, unrelated, device seems like it
> >> will just create confusion. It's like the ipxlat device can't really
> >> decide if it's a device or a tunnel? :)
> >>
> >
> > That's not quite what happens in the routed xmit path. There the stack
> > sets skb->dev to the selected output device before handing the skb to
> > the device. For IPv4 and IPv6 this happens in ip_output/ip6_output,
> > where the output device is taken from the skb dst. So when the route
> > selects the ipxlat device, the skb reaches ndo_start_xmit with skb->dev
> > already pointing at the ipxlat device, not at the original ingress
> > device.
> >
> > The internal 4-to-6 pre-fragmentation path should preserve the same
> > property as well: ip_do_fragment copies the skb metadata to the
> > generated fragments, including skb->dev, and the temporary dst used for
> > that path also points at the ipxlat device. The fragment callback then
> > feeds those fragments back into the same ipxlat processing path.
> >
> > That said, I agree that relying on this implicitly is not great.
> > gro_cells_receive uses skb->dev directly, and the intended receive-side
> > re-injection model should be obvious at the call site. I will set
> > skb->dev = ipxlat->dev explicitly before gro_cells_receive in the next
> > version.
>
> Right, sounds good. I'm also wondering if you actually need the gro_cells
> infrastructure at all? IIUC, the purpose of that is to allow tunnels to
> create GRO superframes of packets after they are decapsulated (and thus
> their l4 commonality becomes apparent). But you're not decapsulating
> anything, you're just translating between protocols the kernel already
> understands. So presumably any opportunity to coalesce GRO packets would
> already have happened pre-translation? So any reason why you can't just
> do what loopback.c does, and do a straight __netif_rx() call in the
> transmit function?
>

No, I think you're right that gro_cells is not justified here, I was
probably biased by my work on tunnel interfaces. Unlike a tunnel decap
path, ipxlat does not reveal a new same-family L4 flow after
decapsulation, so I don't see a translation-specific GRO opportunity
there, and a loopback-style receive handoff would be the simpler version
of that design.

That said, after thinking more about the rest of your feedback, I think
the right fix is probably not just replacing gro_cells with __netif_rx.
The deeper issue is the netdevice/RX-reinjection model itself.

> >> I think a better model is to treat the device as basically a loopback
> >> device that translates packets before looping them back (so when they
> >> come back they appear to be coming from that device).
> >>
> >> Any reason why that wouldn't work?
> >>
> >
> > That's indeed the intended model for the ipxlat netdevice: route packets
> > to it, translate them, then loop them back into the stack as packets
> > received from that same device. That seemed like the simplest model and
> > the one that exposes the translation point most clearly.
>
> Right. I think this could be made a bit more explicit in the
> documentation as well, since it's a bit of an unusual model.
>
> And, well, taking a step back: is it really the right model? Regular NAT
> lives in netfilter, why can't this be a netfilter module as well? Seems
> to me you could have something like:
>
> table ip xlat4 {
>         chain postrouting {
>                 type nat hook postrouting priority srcnat; policy accept;
>                 ip daddr 0.0.0.0/0 oifname "eth0" xlat to 64:ff9b::/96
>         }
> }
> table ip6 xlat6 {
>         chain prerouting {
>                 type nat hook prerouting priority dstnat; policy accept;
>                 ip6 saddr 64:ff9b::/96 iifname "eth0" xlat from 64:ff9b::/96
>         }
> }
>
> and that would provide the functionality without having to implement a
> new interface type and the associated multiple traversals through the
> stack? Did you consider this as an alternative to the new device type?
>

We did consider netfilter, and your example is syntactically attractive,
but I am no longer convinced it is the cleanest model for SIIT.

An nft expression cannot simply rewrite ETH_P_IP <-> ETH_P_IPV6 and
return ACCEPT as if this were normal NAT because the current hook
invocation, dst, and conntrack-related state were established for the
packet as it entered that hook. A cross-family translator would need to
consume the skb, clear or rebuild route and ct metadata as appropriate,
do an other-family route lookup, and resume at a well-defined point in
that family. That seems possible, but it would be a new stateless
cross-family action, not just a new mode of the existing nft nat
expression (which is built around nf_nat_setup_info and assumes the
packet's L3 family does not change AFAICT).

My second concern is that the SIIT boundary would be a property of rule
and hook placement. That gives flexibility, but it also means the
translation point has to be constrained and documented very carefully to
avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior. For
this use case I would rather have the route that matches the translation
prefix also be the object that says: leave this family here and continue
in the other one.

After looking at the available kernel mechanisms again, I think the
better model is probably LWT: routes carry an ipxlat encap referencing a
named translator domain configured over netlink. That should represent
the stateless, prefix-based and symmetric nature of ipxlat.

Very roughly, userspace could look like:

    ip xlat add siit0 prefix6 64:ff9b::/96
    ip route add ... encap ipxlat id siit0
    ip -6 route add ... encap ipxlat id siit0
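
To make the "stateless, prefix-based and symmetric" part concrete, here
is a small userspace sketch of the RFC 6052 mapping for a /96 prefix
such as the one in the example above. It is illustrative only: the
function names xlat_4to6() and xlat_6to4() are made up for this sketch
and are not the helpers from the series, and only the /96 case (IPv4
address in the last 32 bits) is handled.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Hypothetical helper: embed an IPv4 address into a /96 translation
 * prefix. With a /96 prefix the 32 IPv4 bits simply occupy the last
 * four bytes of the IPv6 address (RFC 6052, section 2.2).
 */
static void xlat_4to6(const struct in6_addr *prefix96,
		      const struct in_addr *v4, struct in6_addr *out)
{
	memcpy(out->s6_addr, prefix96->s6_addr, 12);
	memcpy(out->s6_addr + 12, &v4->s_addr, 4);
}

/*
 * Hypothetical helper: the reverse direction. Returns 0 on success and
 * -1 if the IPv6 address does not fall inside the /96 prefix.
 */
static int xlat_6to4(const struct in6_addr *prefix96,
		     const struct in6_addr *v6, struct in_addr *out)
{
	if (memcmp(v6->s6_addr, prefix96->s6_addr, 12) != 0)
		return -1;
	memcpy(&out->s_addr, v6->s6_addr + 12, 4);
	return 0;
}
```

Both directions are pure functions of the configured prefix and the
address, which is why no per-flow state is needed: the same domain can
be referenced from the IPv4 route and from the IPv6 route.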

There are some useful precedents for this: ILA is stateless address
translation as LWT, seg6_local already has cross-family LWT actions, and
ioam6 has a similar split between separately configured objects and
route attachments.

The invariant I would like v2 to follow is that the original-family
route lookup selects translation as its terminal route action. The
translated skb then gets a fresh lookup in the other family. From that
point on, TTL/Hop Limit where applicable, PMTU, ICMP errors, and
netfilter visibility belong to the translated family.
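
As a toy model of that invariant (plain userspace C, not kernel code),
the control flow could be pictured as below. Everything here is
hypothetical: the types, process(), and the two demo lookup functions
exist only for this sketch, and dropping a packet that hits a second
translation route is an assumption of the sketch, not something decided
for v2.

```c
enum family { FAM_V4 = 0, FAM_V6 = 1 };
enum verdict { V_FORWARD, V_XLAT, V_DROP };

struct pkt {
	enum family fam;	/* family the packet currently belongs to */
};

/* Hypothetical per-family route lookup: what the matching route says. */
typedef enum verdict (*route_lookup_fn)(const struct pkt *p);

/* Toy tables: the IPv4 route carries the xlat action, IPv6 forwards. */
static enum verdict demo_lookup_v4(const struct pkt *p)
{
	(void)p;
	return V_XLAT;
}

static enum verdict demo_lookup_v6(const struct pkt *p)
{
	(void)p;
	return V_FORWARD;
}

static enum verdict process(struct pkt *p, const route_lookup_fn lookup[2])
{
	enum verdict v = lookup[p->fam](p);

	if (v != V_XLAT)
		return v;

	/* Translation is the terminal action in the original family... */
	p->fam = (p->fam == FAM_V4) ? FAM_V6 : FAM_V4;

	/* ...and the translated packet gets a fresh lookup in the other. */
	v = lookup[p->fam](p);

	/* Assumption of this sketch: never translate twice, drop instead. */
	return (v == V_XLAT) ? V_DROP : v;
}
```

With lookup[] = { demo_lookup_v4, demo_lookup_v6 }, an IPv4 packet
leaves process() as an IPv6 packet with a forward verdict, and
everything after the second lookup only ever sees the IPv6 packet.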

So I think your question addresses the core design issue in this RFC. My
current preference is to rework the next version around an LWT/domain
model instead of the virtual netdevice model, unless prototyping shows a
fundamental problem with that approach.

Does that model make sense to you?

Thanks for pushing on this.

--
Ralf Lici
Mandelbit Srl