* [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
@ 2025-05-28 9:03 Yafang Shao
2025-05-28 11:22 ` Florian Westphal
2025-05-28 23:43 ` Shaun Brady
0 siblings, 2 replies; 16+ messages in thread
From: Yafang Shao @ 2025-05-28 9:03 UTC (permalink / raw)
To: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
Cc: netfilter-devel, coreteam
Hello,
We recently encountered an SNAT-related issue in our Kubernetes
environment and have successfully reproduced it with the following
configuration:
kernel
--------
Our kernel is 6.1.y (also reproduced on 6.14)
Host Network Configuration:
--------------------------------------
We run a DNS proxy on our Kubernetes servers with the following iptables rules:
-A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
-A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
-A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
-A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
-A DNS-DNAT -j DNAT --to-destination 127.0.0.1
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A POSTROUTING -j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
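For readers following the rule set, the DNAT decision made by the PREROUTING and DNS-DNAT rules can be modeled in a few lines of Python. This is only an illustrative sketch, not how iptables evaluates rules internally: the names prerouting_dnat, UPLINKS and DNAT_TARGET are made up for the example, and it assumes the rules listed here are the only ones that touch 169.254.1.2.

```python
import ipaddress
from typing import Optional

# Values copied from the rules in this report.
DNS_VIP = ipaddress.ip_address("169.254.1.2")
UPLINKS = {"eth0", "eth1", "bond0"}   # interfaces that hit a RETURN rule
DNAT_TARGET = "127.0.0.1"             # --to-destination (port is unchanged)


def prerouting_dnat(dst: str, in_iface: str) -> Optional[str]:
    """Return the rewritten destination, or None if the packet is left alone.

    Hypothetical helper: mirrors the order of the rules above, nothing more.
    """
    if ipaddress.ip_address(dst) != DNS_VIP:
        return None          # "-A PREROUTING -d 169.254.1.2/32" does not match
    if in_iface in UPLINKS:
        return None          # "-i eth0|eth1|bond0 -j RETURN"
    return DNAT_TARGET       # "-A DNS-DNAT -j DNAT --to-destination 127.0.0.1"


if __name__ == "__main__":
    for iface in ("bridge0", "eth0"):
        print(iface, prerouting_dnat("169.254.1.2", iface))
```

Under these assumptions a query that enters through a container-side interface such as bridge0 is redirected to 127.0.0.1, while the same destination seen on eth0, eth1 or bond0 is not rewritten.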
Container Network Configuration:
--------------------------------------------
Containers use 169.254.1.2 as their DNS resolver:
$ cat /etc/resolv.conf
nameserver 169.254.1.2
Issue Description
------------------------
When performing DNS lookups from a container, the query fails with an
unexpected source port:
$ dig +short @169.254.1.2 A www.google.com
;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
The tcpdump output is as follows:
16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 >
169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53:
298+ [1au] A? www.google.com. (55)
16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562:
298 1/0/1 A 142.250.71.228 (59)
16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
1/0/1 A 142.250.71.228 (59)
16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
1/0/1 A 142.250.71.228 (59)
16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
1/0/1 A 142.250.71.228 (59)
16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
1/0/1 A 142.250.71.228 (59)
16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 >
10.242.249.78.37562: UDP, length 59
The source port of the DNS response is unexpectedly rewritten from 53 to 124,
so the application never receives the response.
We suspected the issue might be related to commit d8f84a9bc7c4
("netfilter: nf_nat: don't try nat source port reallocation for
reverse dir clash"). After applying this commit, the port remapping no
longer occurs, but the DNS response is still dropped.
16:52:00.968814 veth9cffd2a4 P IP 10.242.249.78.54482 >
169.254.1.2.53: 15035+ [1au] A? www.google.com. (55)
16:52:00.968814 bridge0 In IP 10.242.249.78.54482 > 127.0.0.1.53:
15035+ [1au] A? www.google.com. (55)
16:52:00.996661 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.54482:
15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996664 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.54482:
15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996665 eth0 Out IP 169.254.1.2.53 > 10.242.249.78.54482:
15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996682 eth0 P IP 169.254.1.2.53 > 10.242.249.78.54482:
15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996682 bond0 P IP 169.254.1.2.53 > 10.242.249.78.54482:
15035 1/0/1 A 142.250.198.100 (59)
The response now correctly keeps source port 53, but it is dropped in
__nf_conntrack_confirm().
We bypassed the issue by modifying __nf_conntrack_confirm() to skip
the conflicting conntrack entry check:
diff --git a/net/netfilter/nf_conntrack_core.c
b/net/netfilter/nf_conntrack_core.c
index 7bee5bd22be2..3481e9d333b0 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
chainlen = 0;
hlist_nulls_for_each_entry(h, n,
&nf_conntrack_hash[reply_hash], hnnode) {
- if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
- zone, net))
- goto out;
+ //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+ // zone, net))
+ // goto out;
if (chainlen++ > max_chainlen) {
chaintoolong:
NF_CT_STAT_INC(net, chaintoolong);
DNS resolution now works as expected.
$ dig +short @169.254.1.2 A www.google.com
142.250.198.100
The tcpdump output is as follows:
16:54:43.618509 veth9cffd2a4 P IP 10.242.249.78.56805 >
169.254.1.2.53: 7503+ [1au] A? www.google.com. (55)
16:54:43.618509 bridge0 In IP 10.242.249.78.56805 > 127.0.0.1.53:
7503+ [1au] A? www.google.com. (55)
16:54:43.618666 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.56805:
7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618677 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.56805:
7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618683 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.56805:
7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618700 eth1 P IP 169.254.1.2.53 > 10.242.249.78.56805:
7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618700 bond0 P IP 169.254.1.2.53 > 10.242.249.78.56805:
7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618765 veth9cffd2a4 Out IP 169.254.1.2.53 >
10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
The issue remains present in kernel 6.14 as well.
Since we are not deeply familiar with NAT behavior, we would
appreciate guidance on a proper fix or any further debugging.
--
Regards
Yafang
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 9:03 [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment Yafang Shao
@ 2025-05-28 11:22 ` Florian Westphal
2025-05-28 11:41 ` Yafang Shao
2025-05-28 23:43 ` Shaun Brady
1 sibling, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2025-05-28 11:22 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
Yafang Shao <laoar.shao@gmail.com> wrote:
> Our kernel is 6.1.y (also reproduced on 6.14)
>
> Host Network Configuration:
> --------------------------------------
>
> We run a DNS proxy on our Kubernetes servers with the following iptables rules:
>
> -A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
> -A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
> -A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
> -A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
> -A DNS-DNAT -j DNAT --to-destination 127.0.0.1
> -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
> -A POSTROUTING -j KUBE-POSTROUTING
> -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
>
> Container Network Configuration:
> --------------------------------------------
> Containers use 169.254.1.2 as their DNS resolver:
>
> $ cat /etc/resolv.conf
> nameserver 169.254.1.2
>
> Issue Description
> ------------------------
>
> When performing DNS lookups from a container, the query fails with an
> unexpected source port:
>
> $ dig +short @169.254.1.2 A www.google.com
> ;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
>
> The tcpdump is as follows,
>
> 16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 >
> 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
> 16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53:
> 298+ [1au] A? www.google.com. (55)
> 16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562:
> 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 >
> 10.242.249.78.37562: UDP, length 59
>
> The source port of the DNS response is unexpectedly rewritten from 53 to 124,
> so the application never receives the response.
>
> We suspected the issue might be related to commit d8f84a9bc7c4
> ("netfilter: nf_nat: don't try nat source port reallocation for
> reverse dir clash"). After applying this commit, the port remapping no
> longer occurs, but the DNS response is still dropped.
That's suspicious; I don't see how this is related. d8f84a9bc7c4
deals with independent actions, i.e.
A sends to B and B sends to A, but *at the same time*.
With a request-response protocol like DNS this should obviously never
happen -- B can't reply before A's request has passed through the stack.
> The response now correctly keeps source port 53, but it is dropped in
> __nf_conntrack_confirm().
>
> We bypassed the issue by modifying __nf_conntrack_confirm() to skip
> the conflicting conntrack entry check:
>
> diff --git a/net/netfilter/nf_conntrack_core.c
> b/net/netfilter/nf_conntrack_core.c
> index 7bee5bd22be2..3481e9d333b0 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>
> chainlen = 0;
> hlist_nulls_for_each_entry(h, n,
> &nf_conntrack_hash[reply_hash], hnnode) {
> - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> - zone, net))
> - goto out;
> + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> + // zone, net))
> + // goto out;
> if (chainlen++ > max_chainlen) {
> chaintoolong:
> NF_CT_STAT_INC(net, chaintoolong);
I don't understand this bit either. For A/AAAA requests racing in same
direction, nf_ct_resolve_clash() machinery should have handled this
situation.
And I don't see how you can encounter a DNS reply before at least one
request has been committed to the table -- i.e., the conntrack being
confirmed here should not exist -- the packet should have been picked up
as a reply packet.
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 11:22 ` Florian Westphal
@ 2025-05-28 11:41 ` Yafang Shao
2025-05-28 12:14 ` Florian Westphal
0 siblings, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-05-28 11:41 UTC (permalink / raw)
To: Florian Westphal
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Wed, May 28, 2025 at 7:23 PM Florian Westphal <fw@strlen.de> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> wrote:
> > Our kernel is 6.1.y (also reproduced on 6.14)
> >
> > Host Network Configuration:
> > --------------------------------------
> >
> > We run a DNS proxy on our Kubernetes servers with the following iptables rules:
> >
> > -A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
> > -A DNS-DNAT -j DNAT --to-destination 127.0.0.1
> > -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
> > -A POSTROUTING -j KUBE-POSTROUTING
> > -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
> >
> > Container Network Configuration:
> > --------------------------------------------
> > Containers use 169.254.1.2 as their DNS resolver:
> >
> > $ cat /etc/resolv.conf
> > nameserver 169.254.1.2
> >
> > Issue Description
> > ------------------------
> >
> > When performing DNS lookups from a container, the query fails with an
> > unexpected source port:
> >
> > $ dig +short @169.254.1.2 A www.google.com
> > ;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
> >
> > The tcpdump is as follows,
> >
> > 16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 >
> > 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53:
> > 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562:
> > 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> > 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> > 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> > 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298
> > 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 >
> > 10.242.249.78.37562: UDP, length 59
> >
> > The source port of the DNS response is unexpectedly rewritten from 53 to 124,
> > so the application never receives the response.
> >
> > We suspected the issue might be related to commit d8f84a9bc7c4
> > ("netfilter: nf_nat: don't try nat source port reallocation for
> > reverse dir clash"). After applying this commit, the port remapping no
> > longer occurs, but the DNS response is still dropped.
>
> That's suspicious; I don't see how this is related. d8f84a9bc7c4
> deals with independent actions, i.e.
> A sends to B and B sends to A, but *at the same time*.
>
> With a request-response protocol like DNS this should obviously never
> happen -- B can't reply before A's request has passed through the stack.
Correct, these operations cannot occur simultaneously. However, after
implementing this commit, port reallocation no longer occurs.
>
> > The response now correctly keeps source port 53, but it is dropped in
> > __nf_conntrack_confirm().
> >
> > We bypassed the issue by modifying __nf_conntrack_confirm() to skip
> > the conflicting conntrack entry check:
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c
> > b/net/netfilter/nf_conntrack_core.c
> > index 7bee5bd22be2..3481e9d333b0 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >
> > chainlen = 0;
> > hlist_nulls_for_each_entry(h, n,
> > &nf_conntrack_hash[reply_hash], hnnode) {
> > - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > - zone, net))
> > - goto out;
> > + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > + // zone, net))
> > + // goto out;
> > if (chainlen++ > max_chainlen) {
> > chaintoolong:
> > NF_CT_STAT_INC(net, chaintoolong);
>
> I don't understand this bit either. For A/AAAA requests racing in same
> direction, nf_ct_resolve_clash() machinery should have handled this
> situation.
>
> And I don't see how you can encounter a DNS reply before at least one
> request has been committed to the table -- i.e., the conntrack being
> confirmed here should not exist -- the packet should have been picked up
> as a reply packet.
We've been able to consistently reproduce this behavior. Would you
have any recommended debugging approaches we could try?
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 11:41 ` Yafang Shao
@ 2025-05-28 12:14 ` Florian Westphal
2025-05-28 12:31 ` Yafang Shao
0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2025-05-28 12:14 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
Yafang Shao <laoar.shao@gmail.com> wrote:
> > And I don't see how you can encounter a DNS reply before at least one
> > request has been committed to the table -- i.e., the conntrack being
> > confirmed here should not exist -- the packet should have been picked up
> > as a reply packet.
>
> We've been able to consistently reproduce this behavior. Would you
> have any recommended debugging approaches we could try?
Can you figure out why nf_ct_resolve_clash_harder() doesn't handle the
clash?
AFAIU the reply tuple is identical while the original isn't. It would be good
to confirm. If they were the same, I'd have expected
nf_ct_resolve_clash_harder() to merge the conntracks (nf_ct_can_merge()
branch in __nf_ct_resolve_clash).
Could you also dump/show the origin and reply tuples for the existing
entry and the clashing (new) entry?
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 12:14 ` Florian Westphal
@ 2025-05-28 12:31 ` Yafang Shao
2025-05-28 12:43 ` Yafang Shao
2025-05-28 13:20 ` Florian Westphal
0 siblings, 2 replies; 16+ messages in thread
From: Yafang Shao @ 2025-05-28 12:31 UTC (permalink / raw)
To: Florian Westphal
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Wed, May 28, 2025 at 8:15 PM Florian Westphal <fw@strlen.de> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> wrote:
> > > And I don't see how you can encounter a DNS reply before at least one
> > > request has been committed to the table -- i.e., the conntrack being
> > > confirmed here should not exist -- the packet should have been picked up
> > > as a reply packet.
> >
> > We've been able to consistently reproduce this behavior. Would you
> > have any recommended debugging approaches we could try?
>
> Can you figure out why nf_ct_resolve_clash_harder() doesn't handle the
> clash?
I will try it.
>
> AFAIU reply tuple is identical while original isn't. It would be good
> to confirm. If they were the same, I'd have expected
> nf_ct_resolve_clash_harder() to merge the conntracks (nf_ct_can_merge()
> branch in __nf_ct_resolve_clash).
>
> Could you also dump/show the origin and reply tuples for the existing
> entry and the clashing (new) entry?
On the original 6.1.y kernel (unmodified), there are two entries:
$ cat /proc/net/nf_conntrack| grep 10.242.249.78
ipv4 2 udp 17 119 src=10.242.249.78 dst=169.254.1.2
sport=49469 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
dport=49469 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp 17 29 src=169.254.1.2 dst=10.242.249.78 sport=53
dport=49469 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=49469
dport=477 mark=0 zone=0 use=2
After applying commit d8f84a9bc7c4, only one entry remains:
$ cat /proc/net/nf_conntrack| grep 10.242.249.78
ipv4 2 udp 17 106 src=10.242.249.78 dst=169.254.1.2
sport=34616 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
dport=34616 [ASSURED] mark=0 zone=0 use=2
After the additional custom hack, the entries now show two records:
$ cat /proc/net/nf_conntrack| grep 10.242.249.78
ipv4 2 udp 17 27 src=169.254.1.2 dst=10.242.249.78 sport=53
dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858
dport=53 mark=0 zone=0 use=2
ipv4 2 udp 17 27 src=10.242.249.78 dst=169.254.1.2
sport=46858 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
dport=46858 mark=0 zone=0 use=2
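
The relationship between the two entries in the last listing can be checked mechanically. The Python sketch below splits an entry into its original and reply tuples. The names FlowTuple and parse_entry are hypothetical helpers written for this note, and the sketch assumes each entry sits on a single line, as /proc/net/nf_conntrack prints it (the listing above is wrapped by the mail client).

```python
import re
from typing import NamedTuple, Tuple


class FlowTuple(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int


# Matches the four per-direction fields of a conntrack entry.
_FIELD = re.compile(r"\b(src|dst|sport|dport)=(\S+)")


def parse_entry(line: str) -> Tuple[FlowTuple, FlowTuple]:
    """Return (original, reply) tuples from one /proc/net/nf_conntrack line."""
    fields = _FIELD.findall(line)
    if len(fields) != 8:
        raise ValueError("expected src/dst/sport/dport for both directions")

    def build(chunk):
        d = dict(chunk)
        return FlowTuple(d["src"], d["dst"], int(d["sport"]), int(d["dport"]))

    # The first four fields describe the original direction, the last four the reply.
    return build(fields[:4]), build(fields[4:])


# The two entries seen with the custom hack applied, joined back into single lines.
UNREPLIED = ("ipv4 2 udp 17 27 src=169.254.1.2 dst=10.242.249.78 sport=53 "
             "dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 "
             "sport=46858 dport=53 mark=0 zone=0 use=2")
DNATED = ("ipv4 2 udp 17 27 src=10.242.249.78 dst=169.254.1.2 "
          "sport=46858 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53 "
          "dport=46858 mark=0 zone=0 use=2")

new_orig, new_reply = parse_entry(UNREPLIED)
old_orig, old_reply = parse_entry(DNATED)
print(new_reply == old_orig)  # True
```

In this data the reply tuple of the UNREPLIED entry is identical to the original tuple of the DNATed entry, which appears to be the kind of collision that the check removed from __nf_conntrack_confirm() looks for.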
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 12:31 ` Yafang Shao
@ 2025-05-28 12:43 ` Yafang Shao
2025-05-28 13:10 ` Florian Westphal
2025-05-28 13:20 ` Florian Westphal
1 sibling, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-05-28 12:43 UTC (permalink / raw)
To: Florian Westphal
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Wed, May 28, 2025 at 8:31 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 8:15 PM Florian Westphal <fw@strlen.de> wrote:
> >
> > Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > And I don't see how you can encounter a DNS reply before at least one
> > > > request has been committed to the table -- i.e., the conntrack being
> > > > confirmed here should not exist -- the packet should have been picked up
> > > > as a reply packet.
> > >
> > > We've been able to consistently reproduce this behavior. Would you
> > > have any recommended debugging approaches we could try?
> >
> > Can you figure out why nf_ct_resolve_clash_harder() doesn't handle the
> > clash?
>
> I will try it.
After tracing with bpftrace, I found that the __nf_ct_resolve_clash()
function returns NF_DROP. Should I provide any additional details?
>
> >
> > AFAIU reply tuple is identical while original isn't. It would be good
> > to confirm. If they were the same, I'd have expected
> > nf_ct_resolve_clash_harder() to merge the conntracks (nf_ct_can_merge()
> > branch in __nf_ct_resolve_clash).
> >
> > Could you also dump/show the origin and reply tuples for the existing
> > entry and the clashing (new) entry?
>
> On the original 6.1.y kernel (unmodified), there are two entries:
> $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> ipv4 2 udp 17 119 src=10.242.249.78 dst=169.254.1.2
> sport=49469 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> dport=49469 [ASSURED] mark=0 zone=0 use=2
> ipv4 2 udp 17 29 src=169.254.1.2 dst=10.242.249.78 sport=53
> dport=49469 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=49469
> dport=477 mark=0 zone=0 use=2
>
>
> After applying commit d8f84a9bc7c4, only one entry remains:
> $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> ipv4 2 udp 17 106 src=10.242.249.78 dst=169.254.1.2
> sport=34616 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> dport=34616 [ASSURED] mark=0 zone=0 use=2
>
>
> After the additional custom hack, the entries now show two records:
> $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> ipv4 2 udp 17 27 src=169.254.1.2 dst=10.242.249.78 sport=53
> dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858
> dport=53 mark=0 zone=0 use=2
> ipv4 2 udp 17 27 src=10.242.249.78 dst=169.254.1.2
> sport=46858 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> dport=46858 mark=0 zone=0 use=2
>
>
> --
> Regards
> Yafang
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 12:43 ` Yafang Shao
@ 2025-05-28 13:10 ` Florian Westphal
0 siblings, 0 replies; 16+ messages in thread
From: Florian Westphal @ 2025-05-28 13:10 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
Yafang Shao <laoar.shao@gmail.com> wrote:
> After tracing with bpftrace, I found that the __nf_ct_resolve_clash()
> function returns NF_DROP. Should I provide any additional details?
No need, I have no more ideas.
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 12:31 ` Yafang Shao
2025-05-28 12:43 ` Yafang Shao
@ 2025-05-28 13:20 ` Florian Westphal
2025-05-28 14:07 ` Yafang Shao
1 sibling, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2025-05-28 13:20 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
Yafang Shao <laoar.shao@gmail.com> wrote:
> After applying commit d8f84a9bc7c4, only one entry remains:
> $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> ipv4 2 udp 17 106 src=10.242.249.78 dst=169.254.1.2
> sport=34616 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> dport=34616 [ASSURED] mark=0 zone=0 use=2
Makes sense to me, that's what would be expected, at least from ct state, no?
(I understand that things are not working as expected from DNS pov).
> After the additional custom hack, the entries now show two records:
> $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> ipv4 2 udp 17 27 src=169.254.1.2 dst=10.242.249.78 sport=53
> dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858
> dport=53 mark=0 zone=0 use=2
> ipv4 2 udp 17 27 src=10.242.249.78 dst=169.254.1.2
> sport=46858 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> dport=46858 mark=0 zone=0 use=2
That makes no sense to me whatsoever.
The second entry looks correct/as expected:
10.242.249.78:46858 -> 169.254.1.2:53, DNATed to 127.0.0.1:53, replies go to 10.242.249.78:46858
... so we would expect replies coming from 127.0.0.1:53.
But the other entry makes no sense to me.
src=169.254.1.2 dst=10.242.249.78 sport=53 dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858 dport=53 mark=0 zone=0 use=2
This means conntrack saw a packet, not matching any existing entry for this:
169.254.1.2:53 -> 10.242.249.78:46858
... and that makes no sense to me.
The reply should be coming from 127.0.0.1:53.
I suspect the stack refuses to send a packet from 127.0.0.1 to a foreign/nonlocal address?
As far as conntrack is concerned, the origin 169.254.1.2:53 is a new flow.
We do expect this:
127.0.0.1:53 -> 10.242.249.78:46858, which would be classified as matching response to the
existing entry.
Do you have any load balancing, bridging etc. going on that would result in cloned
packets leaving the system, where one is going out unmodified?
Is route_localnet sysctl enabled? I have never tried such lo stunts myself.
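
The classification described in this message can be illustrated with a toy lookup. This is a simplified Python model, not the kernel's hash-based conntrack lookup: FlowTuple, Conn and classify are names invented for the sketch, and the addresses and ports come from the listing quoted above.

```python
from typing import List, NamedTuple


class FlowTuple(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int


class Conn(NamedTuple):
    original: FlowTuple
    reply: FlowTuple


def classify(table: List[Conn], pkt: FlowTuple) -> str:
    """Say whether pkt matches an existing entry, and in which direction."""
    for conn in table:
        if pkt == conn.original:
            return "original"
        if pkt == conn.reply:
            return "reply"
    return "new"  # no match: conntrack would treat this as a new flow


# The DNATed entry created by the container's query.
table = [Conn(
    original=FlowTuple("10.242.249.78", "169.254.1.2", 46858, 53),
    reply=FlowTuple("127.0.0.1", "10.242.249.78", 53, 46858),
)]

expected = FlowTuple("127.0.0.1", "10.242.249.78", 53, 46858)
observed = FlowTuple("169.254.1.2", "10.242.249.78", 53, 46858)

print(classify(table, expected))  # reply
print(classify(table, observed))  # new
```

A response sourced from 127.0.0.1:53 matches the reply direction of the existing entry, whereas one sourced from 169.254.1.2:53 matches nothing and is therefore seen as a new flow.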
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 13:20 ` Florian Westphal
@ 2025-05-28 14:07 ` Yafang Shao
2025-05-28 21:48 ` Florian Westphal
0 siblings, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-05-28 14:07 UTC (permalink / raw)
To: Florian Westphal
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Wed, May 28, 2025 at 9:20 PM Florian Westphal <fw@strlen.de> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> wrote:
> > After applying commit d8f84a9bc7c4, only one entry remains:
> > $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> > ipv4 2 udp 17 106 src=10.242.249.78 dst=169.254.1.2
> > sport=34616 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> > dport=34616 [ASSURED] mark=0 zone=0 use=2
>
> Makes sense to me, that's what would be expected, at least from ct state, no?
> (I understand that things are not working as expected from DNS pov).
>
> > After the additional custom hack, the entries now show two records:
> > $ cat /proc/net/nf_conntrack| grep 10.242.249.78
> > ipv4 2 udp 17 27 src=169.254.1.2 dst=10.242.249.78 sport=53
> > dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858
> > dport=53 mark=0 zone=0 use=2
> > ipv4 2 udp 17 27 src=10.242.249.78 dst=169.254.1.2
> > sport=46858 dport=53 src=127.0.0.1 dst=10.242.249.78 sport=53
> > dport=46858 mark=0 zone=0 use=2
>
> That makes no sense to me whatsoever.
>
> The second entry looks correct/as expected:
> 10.242.249.78:46858 -> 169.254.1.2:53, DNATed to 127.0.0.1:53, replies go to 10.242.249.78:46858
>
> ... so we would expect replies coming from 127.0.0.1:53.
>
> But the other entry makes no sense to me.
>
> src=169.254.1.2 dst=10.242.249.78 sport=53 dport=46858 [UNREPLIED] src=10.242.249.78 dst=169.254.1.2 sport=46858 dport=53 mark=0 zone=0 use=2
>
> This means conntrack saw a packet, not matching any existing entry for this:
> 169.254.1.2:53 -> 10.242.249.78:46858
>
> ... and that makes no sense to me.
> The reply should be coming from 127.0.0.1:53.
>
> I suspect stack refuses to send a packet from 127.0.0.1 to foreign/nonlocal address?
>
> As far as conntrack is concerned, the origin 169.254.1.2:53 is a new flow.
>
> We do expect this:
> 127.0.0.1:53 -> 10.242.249.78:46858, which would be classified as matching response to the
> existing entry.
Could this issue be caused by misconfigured SNAT/DNAT rules? However,
I haven't been able to identify any problematic rules in my
investigation.
>
> Do you have any load balancing, bridging etc. going on that would result in cloned
> packets leaving the system, where one is going out unmodified?
No, we don't have cloned packets.
>
> Is route_localnet sysctl enabled? I have never tried such lo stunts myself.
The configuration is as follows:
net.ipv4.conf.all.route_localnet = 1
net.ipv4.conf.bond0.route_localnet = 0
net.ipv4.conf.bridge0.route_localnet = 0
net.ipv4.conf.default.route_localnet = 0
net.ipv4.conf.docker0.route_localnet = 0
net.ipv4.conf.eth0.route_localnet = 0
net.ipv4.conf.eth1.route_localnet = 0
net.ipv4.conf.lo.route_localnet = 0
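
A note on reading these values: the Python sketch below computes the effective per-interface setting under the assumption that the kernel ORs conf/all with the per-interface value for route_localnet (my reading of IN_DEV_ROUTE_LOCALNET in include/linux/inetdevice.h; worth verifying against the running kernel). The function name effective_route_localnet is hypothetical.

```python
from typing import Dict

# Values copied from the sysctl output above.
SETTINGS: Dict[str, int] = {
    "all": 1,
    "bond0": 0,
    "bridge0": 0,
    "default": 0,
    "docker0": 0,
    "eth0": 0,
    "eth1": 0,
    "lo": 0,
}


def effective_route_localnet(settings: Dict[str, int]) -> Dict[str, bool]:
    """Effective value per real interface, ASSUMING all-OR-interface semantics."""
    all_on = bool(settings.get("all", 0))
    return {
        iface: all_on or bool(value)
        for iface, value in settings.items()
        if iface not in ("all", "default")  # pseudo entries, not interfaces
    }


if __name__ == "__main__":
    for iface, enabled in sorted(effective_route_localnet(SETTINGS).items()):
        print(iface, enabled)
```

If that assumption holds, route_localnet is effectively enabled on every interface here, because conf.all is 1 even though each per-interface value is 0.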
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 14:07 ` Yafang Shao
@ 2025-05-28 21:48 ` Florian Westphal
2025-05-29 2:20 ` Yafang Shao
0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2025-05-28 21:48 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
Yafang Shao <laoar.shao@gmail.com> wrote:
> On Wed, May 28, 2025 at 9:20 PM Florian Westphal <fw@strlen.de> wrote:
> > ... and that makes no sense to me.
> > The reply should be coming from 127.0.0.1:53.
> >
> > I suspect stack refuses to send a packet from 127.0.0.1 to foreign/nonlocal address?
> >
> > As far as conntrack is concerned, the origin 169.254.1.2:53 is a new flow.
> >
> > We do expect this:
> > 127.0.0.1:53 -> 10.242.249.78:46858, which would be classified as matching response to the
> > existing entry.
>
> Could this issue be caused by misconfigured SNAT/DNAT rules? However,
> I haven't been able to identify any problematic rules in my
> investigation.
No, because even if there was an SNAT rule it would not be used
for a reply packet.
Can you check the dns proxy and confirm that it is using the "wrong",
i.e. the public address as source for the udp packets?
Alternatively you could also try adding a NOTRACK rule in -t raw OUTPUT, for
udp packets coming from sport 53. It should prevent this problem and
make your setup work.
Assuming the dns proxy already uses the public address, no dnat reversal
is needed.
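
For reference, the NOTRACK rule mentioned above could look like the command assembled by this Python sketch. The helper name notrack_rule is made up for illustration, the rule is appended rather than inserted at a specific position, and running it requires root, so the example only prints the command instead of executing it.

```python
import shlex
from typing import List


def notrack_rule(sport: int = 53, proto: str = "udp") -> List[str]:
    """Build an iptables command that exempts matching locally generated
    packets from connection tracking (raw table, OUTPUT chain)."""
    return [
        "iptables", "-t", "raw", "-A", "OUTPUT",
        "-p", proto, "--sport", str(sport),
        "-j", "NOTRACK",
    ]


if __name__ == "__main__":
    # Print only; pass the list to subprocess.run() as root to apply it.
    print(shlex.join(notrack_rule()))
```

The printed command is `iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK`. It matches every locally generated UDP packet with source port 53, so a narrower match (for example on the source address) may be preferable on hosts that run other DNS services.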
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 9:03 [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment Yafang Shao
2025-05-28 11:22 ` Florian Westphal
@ 2025-05-28 23:43 ` Shaun Brady
2025-05-29 3:46 ` Yafang Shao
2025-05-30 0:45 ` Florian Westphal
1 sibling, 2 replies; 16+ messages in thread
From: Shaun Brady @ 2025-05-28 23:43 UTC (permalink / raw)
To: Yafang Shao
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Wed, May 28, 2025 at 05:03:56PM +0800, Yafang Shao wrote:
> diff --git a/net/netfilter/nf_conntrack_core.c
> b/net/netfilter/nf_conntrack_core.c
> index 7bee5bd22be2..3481e9d333b0 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>
> chainlen = 0;
> hlist_nulls_for_each_entry(h, n,
> &nf_conntrack_hash[reply_hash], hnnode) {
> - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> - zone, net))
> - goto out;
> + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> + // zone, net))
> + // goto out;
> if (chainlen++ > max_chainlen) {
> chaintoolong:
> NF_CT_STAT_INC(net, chaintoolong);
Forgive me for jumping in with very little information, but on a hunch I
tried something. I applied the above patch to another bug I've been
investigating:
https://bugzilla.netfilter.org/show_bug.cgi?id=1795
and Ubuntu reference
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
The Ubuntu reproduction steps were easier to follow, so I mimicked
them:
# cat add_ip.sh
ip addr add 10.0.1.200/24 dev enp1s0
# cat nft.sh
nft -f - <<EOF
table ip dnat-test {
chain prerouting {
type nat hook prerouting priority dstnat; policy accept;
ip daddr 10.0.1.200 udp dport 1234 counter dnat to 10.0.1.180:1234
}
}
EOF
# cat listen.sh
echo pong|nc -l -u 10.0.1.180 1234
# ./add_ip.sh ; ./nft.sh ; ./listen.sh (and then just ./listen.sh again)
On a client machine I ran:
$ echo ping|nc -u -p 4321 10.0.1.200 1234
$ echo ping|nc -u -p 4321 10.0.1.180 1234
And sure enough the listen.sh never completes (demonstrates the bug).
When I apply the above patch, the problem goes away.
What I _also_ was able to do to make the problem go away was to apply
the following patch:
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index aad84aabd7f1..fecf5591f424 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -727,7 +727,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	    !(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) {
 		/* try the original tuple first */
 		if (nf_in_range(orig_tuple, range)) {
-			if (!nf_nat_used_tuple_new(orig_tuple, ct)) {
+			if (!nf_nat_used_tuple(orig_tuple, ct)) {
 				*tuple = *orig_tuple;
 				return;
 			}
This was suggested to me by the bug report. I had not brought this up
yet, as I had little understanding of why and what else was broken by
reverting to nf_nat_used_tuple from _new.
I thought it might be of interest that both patches fix the problem. I'll
keep digging to improve my understanding.
SB
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 21:48 ` Florian Westphal
@ 2025-05-29 2:20 ` Yafang Shao
0 siblings, 0 replies; 16+ messages in thread
From: Yafang Shao @ 2025-05-29 2:20 UTC (permalink / raw)
To: Florian Westphal
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Thu, May 29, 2025 at 5:48 AM Florian Westphal <fw@strlen.de> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> wrote:
> > On Wed, May 28, 2025 at 9:20 PM Florian Westphal <fw@strlen.de> wrote:
> > > ... and that makes no sense to me.
> > > The reply should be coming from 127.0.0.1:53.
> > >
> > > I suspect stack refuses to send a packet from 127.0.0.1 to foreign/nonlocal address?
> > >
> > > As far as conntrack is concerned, the origin 169.254.1.2:53 is a new flow.
> > >
> > > We do expect this:
> > > 127.0.0.1:53 -> 10.242.249.78:46858, which would be classified as matching response to the
> > > existing entry.
> >
> > Could this issue be caused by misconfigured SNAT/DNAT rules? However,
> > I haven't been able to identify any problematic rules in my
> > investigation.
>
> No, because even if there was an SNAT rule it would not be used
> for a reply packet.
>
> Can you check the dns proxy and confirm that it is using the "wrong",
> i.e. the public address as source for the udp packets?
No, it is not using the public address. The DNS server address
169.254.1.2 is a link-local address, not a routable public IP.
>
> Alternatively you could also try adding a NOTRACK rule in -t raw OUTPUT, for
> udp packets coming from sport 53. It should prevent this problem and
> make your setup work.
This is how we’re handling it in production right now. Without this
workaround, the issue would occur intermittently.
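For reference, the workaround discussed here could look roughly like the rule built below. This is a sketch under assumptions: the thread does not show the exact rule used in production, and the helper names notrack_rule_argv and install_notrack_rule are hypothetical. It uses the iptables NOTRACK target in the raw table's OUTPUT chain, matching UDP packets with source port 53.

```python
import subprocess
from typing import List


def notrack_rule_argv(sport: int = 53) -> List[str]:
    """Build the iptables invocation for a raw/OUTPUT NOTRACK rule."""
    return ["iptables", "-t", "raw", "-A", "OUTPUT",
            "-p", "udp", "--sport", str(sport), "-j", "NOTRACK"]


def install_notrack_rule(sport: int = 53) -> None:
    """Apply the rule; needs root and the iptables binary."""
    subprocess.run(notrack_rule_argv(sport), check=True)


if __name__ == "__main__":
    # Print only; call install_notrack_rule() to actually apply the rule.
    print(" ".join(notrack_rule_argv()))
```

Packets matched by such a rule bypass connection tracking, so conntrack never treats the proxy's reply as a new flow.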
>
> Assuming the dns proxy already uses the public address, no dnat reversal
> is needed.
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 23:43 ` Shaun Brady
@ 2025-05-29 3:46 ` Yafang Shao
2025-05-30 0:45 ` Florian Westphal
1 sibling, 0 replies; 16+ messages in thread
From: Yafang Shao @ 2025-05-29 3:46 UTC (permalink / raw)
To: Shaun Brady
Cc: pablo, kadlec, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam
On Thu, May 29, 2025 at 7:43 AM Shaun Brady <brady.1345@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 05:03:56PM +0800, Yafang Shao wrote:
> > diff --git a/net/netfilter/nf_conntrack_core.c
> > b/net/netfilter/nf_conntrack_core.c
> > index 7bee5bd22be2..3481e9d333b0 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >
> > chainlen = 0;
> > hlist_nulls_for_each_entry(h, n,
> > &nf_conntrack_hash[reply_hash], hnnode) {
> > - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > - zone, net))
> > - goto out;
> > + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > + // zone, net))
> > + // goto out;
> > if (chainlen++ > max_chainlen) {
> > chaintoolong:
> > NF_CT_STAT_INC(net, chaintoolong);
>
> Forgive me for jumping in with very little information, but on a hunch I
> tried something. I applied the above patch to another bug I've been
> investigating:
>
> https://bugzilla.netfilter.org/show_bug.cgi?id=1795
> and Ubuntu reference
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
>
> The Ubuntu reproduction steps were easier to follow, so I mimicked
> them:
>
> # cat add_ip.sh
> ip addr add 10.0.1.200/24 dev enp1s0
> # cat nft.sh
> nft -f - <<EOF
> table ip dnat-test {
> chain prerouting {
> type nat hook prerouting priority dstnat; policy accept;
> ip daddr 10.0.1.200 udp dport 1234 counter dnat to 10.0.1.180:1234
> }
> }
> EOF
> # cat listen.sh
> echo pong|nc -l -u 10.0.1.180 1234
> # ./add_ip.sh ; ./nft.sh ; listen.sh (and then just ./listen.sh again)
>
> On a client machine I ran:
> $ echo ping|nc -u -p 4321 10.0.1.200 1234
> $ echo ping|nc -u -p 4321 10.0.1.180 1234
>
> And sure enough the listen.sh never completes (demonstrates the bug).
>
> When I apply the above patch, the problem goes away.
>
> What I _also_ was able to do to make the problem go away was to apply
> the following patch:
>
> diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
> index aad84aabd7f1..fecf5591f424 100644
> --- a/net/netfilter/nf_nat_core.c
> +++ b/net/netfilter/nf_nat_core.c
> @@ -727,7 +727,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
> !(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) {
> /* try the original tuple first */
> if (nf_in_range(orig_tuple, range)) {
> - if (!nf_nat_used_tuple_new(orig_tuple, ct)) {
> + if (!nf_nat_used_tuple(orig_tuple, ct)) {
> *tuple = *orig_tuple;
> return;
> }
>
> This was suggested to me by the bug report. I had not brought this up
> yet, as I had little understanding of why and what else was broken by
> reverting to nf_nat_used_tuple from _new.
>
> I thought that both patches fix the problem might be of interest. I'll
> keep digging in to my understanding.....
Could you please extract and share the /proc/net/nf_conntrack entries
for the affected IP address?
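A small helper along these lines could pull the relevant entries out of the file. It is a sketch, not a tested tool: the function name entries_for_ip is hypothetical, and the sample lines are made up for illustration (they are not output from either reporter's machine).

```python
import re
from typing import Iterable, List

# Matches the address part of every src=... and dst=... field on a line.
ADDR_FIELD = re.compile(r"\b(?:src|dst)=(\S+)")


def entries_for_ip(lines: Iterable[str], ip: str) -> List[str]:
    """Return conntrack lines whose src= or dst= fields equal `ip`."""
    return [ln.rstrip("\n") for ln in lines if ip in ADDR_FIELD.findall(ln)]


# Hypothetical sample lines, for illustration only (not from the thread).
SAMPLE = [
    "ipv4     2 udp      17 25 src=10.0.1.50 dst=10.0.1.200 sport=4321 "
    "dport=1234 [UNREPLIED] src=10.0.1.180 dst=10.0.1.50 sport=1234 "
    "dport=4321 mark=0 zone=0 use=2",
    "ipv4     2 tcp      6 300 ESTABLISHED src=10.0.1.7 dst=10.0.1.9 "
    "sport=50000 dport=22 src=10.0.1.9 dst=10.0.1.7 sport=22 dport=50000 "
    "[ASSURED] mark=0 zone=0 use=2",
]

if __name__ == "__main__":
    for entry in entries_for_ip(SAMPLE, "10.0.1.200"):
        print(entry)
```

On a real host, SAMPLE would be replaced by the open file object for /proc/net/nf_conntrack, which typically requires root. Matching whole field values avoids false hits such as 10.0.1.20 matching 10.0.1.200.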
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-28 23:43 ` Shaun Brady
2025-05-29 3:46 ` Yafang Shao
@ 2025-05-30 0:45 ` Florian Westphal
2025-05-30 2:44 ` Yafang Shao
1 sibling, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2025-05-30 0:45 UTC (permalink / raw)
To: Shaun Brady
Cc: Yafang Shao, pablo, kadlec, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, netfilter-devel,
coreteam
Shaun Brady <brady.1345@gmail.com> wrote:
> On Wed, May 28, 2025 at 05:03:56PM +0800, Yafang Shao wrote:
> > diff --git a/net/netfilter/nf_conntrack_core.c
> > b/net/netfilter/nf_conntrack_core.c
> > index 7bee5bd22be2..3481e9d333b0 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >
> > chainlen = 0;
> > hlist_nulls_for_each_entry(h, n,
> > &nf_conntrack_hash[reply_hash], hnnode) {
> > - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > - zone, net))
> > - goto out;
> > + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > + // zone, net))
> > + // goto out;
> > if (chainlen++ > max_chainlen) {
> > chaintoolong:
> > NF_CT_STAT_INC(net, chaintoolong);
>
> Forgive me for jumping in with very little information, but on a hunch I
> tried something. I applied the above patch to another bug I've been
> investigating:
>
> https://bugzilla.netfilter.org/show_bug.cgi?id=1795
> and Ubuntu reference
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
>
> The Ubuntu reproduction steps were easier to follow, so I mimicked
> them:
>
> # cat add_ip.sh
> ip addr add 10.0.1.200/24 dev enp1s0
> # cat nft.sh
> nft -f - <<EOF
> table ip dnat-test {
> chain prerouting {
> type nat hook prerouting priority dstnat; policy accept;
> ip daddr 10.0.1.200 udp dport 1234 counter dnat to 10.0.1.180:1234
> }
> }
> EOF
> # cat listen.sh
> echo pong|nc -l -u 10.0.1.180 1234
> # ./add_ip.sh ; ./nft.sh ; listen.sh (and then just ./listen.sh again)
We don't have a selftest for this, I'll add one.
The following patch should help: we fail to check for a reverse collision
before concluding that we don't need PAT to handle this.
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -248,7 +248,7 @@ static noinline bool
 nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple,
 		      const struct nf_conn *ignored_ct)
 {
-	static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST_BIT;
+	static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST;
 	const struct nf_conntrack_tuple_hash *thash;
 	const struct nf_conntrack_zone *zone;
 	struct nf_conn *ct;
@@ -287,8 +287,14 @@ nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple,
 	zone = nf_ct_zone(ignored_ct);
 
 	thash = nf_conntrack_find_get(net, zone, tuple);
-	if (unlikely(!thash)) /* clashing entry went away */
-		return false;
+	if (unlikely(!thash)) {
+		struct nf_conntrack_tuple reply;
+
+		nf_ct_invert_tuple(&reply, tuple);
+		thash = nf_conntrack_find_get(net, zone, &reply);
+		if (!thash) /* clashing entry went away */
+			return false;
+	}
 
 	ct = nf_ct_tuplehash_to_ctrack(thash);
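The effect of the second hunk can be sketched in user space. The Python toy model below is an illustration under assumptions, not the kernel implementation: the table is a plain dict keyed by tuple, find stands in for nf_conntrack_find_get(), invert for nf_ct_invert_tuple(), and the client address 10.0.1.50 is hypothetical. It models only the lookup shown in the diff, using the flows from the reproduction steps.

```python
from typing import NamedTuple, Optional


class Tuple(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


def invert(t: Tuple) -> Tuple:
    """Swap endpoints, standing in for nf_ct_invert_tuple()."""
    return Tuple(t.dst, t.dport, t.src, t.sport)


class Table:
    """Toy conntrack table: each flow is indexed by both of its tuples."""

    def __init__(self) -> None:
        self.entries: dict = {}

    def add(self, name: str, orig: Tuple, reply: Tuple) -> None:
        self.entries[orig] = name
        self.entries[reply] = name

    def find(self, t: Tuple) -> Optional[str]:
        """Exact-match lookup, standing in for nf_conntrack_find_get()."""
        return self.entries.get(t)


def find_clash_old(table: Table, candidate: Tuple) -> Optional[str]:
    # Before the patch: only the candidate tuple itself is looked up.
    return table.find(candidate)


def find_clash_new(table: Table, candidate: Tuple) -> Optional[str]:
    # After the patch: fall back to the inverted (reply) tuple.
    hit = table.find(candidate)
    if hit is None:
        hit = table.find(invert(candidate))
    return hit


CLIENT = "10.0.1.50"  # hypothetical client address
table = Table()
# Existing DNATed flow: original dst 10.0.1.200, replies come from 10.0.1.180.
table.add("dnat-flow",
          Tuple(CLIENT, 4321, "10.0.1.200", 1234),
          Tuple("10.0.1.180", 1234, CLIENT, 4321))

# New flow sent directly to 10.0.1.180 from the same source port.
candidate = Tuple(CLIENT, 4321, "10.0.1.180", 1234)

print(find_clash_old(table, candidate))  # None: the clash is missed
print(find_clash_new(table, candidate))  # dnat-flow: the clash is found
```

In this model, a missed lookup corresponds to the "clashing entry went away" path returning false, which is the case the patch description calls concluding too early that PAT is not needed.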
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-30 0:45 ` Florian Westphal
@ 2025-05-30 2:44 ` Yafang Shao
2025-05-30 3:37 ` Shaun Brady
0 siblings, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-05-30 2:44 UTC (permalink / raw)
To: Florian Westphal
Cc: Shaun Brady, pablo, kadlec, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, netfilter-devel,
coreteam
On Fri, May 30, 2025 at 7:21 AM Florian Westphal <fw@strlen.de> wrote:
>
> Shaun Brady <brady.1345@gmail.com> wrote:
> > On Wed, May 28, 2025 at 05:03:56PM +0800, Yafang Shao wrote:
> > > diff --git a/net/netfilter/nf_conntrack_core.c
> > > b/net/netfilter/nf_conntrack_core.c
> > > index 7bee5bd22be2..3481e9d333b0 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> > >
> > > chainlen = 0;
> > > hlist_nulls_for_each_entry(h, n,
> > > &nf_conntrack_hash[reply_hash], hnnode) {
> > > - if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > > - zone, net))
> > > - goto out;
> > > + //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > > + // zone, net))
> > > + // goto out;
> > > if (chainlen++ > max_chainlen) {
> > > chaintoolong:
> > > NF_CT_STAT_INC(net, chaintoolong);
> >
> > Forgive me for jumping in with very little information, but on a hunch I
> > tried something. I applied the above patch to another bug I've been
> > investigating:
> >
> > https://bugzilla.netfilter.org/show_bug.cgi?id=1795
> > and Ubuntu reference
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109889
> >
> > The Ubuntu reproduction steps were easier to follow, so I mimicked
> > them:
> >
> > # cat add_ip.sh
> > ip addr add 10.0.1.200/24 dev enp1s0
> > # cat nft.sh
> > nft -f - <<EOF
> > table ip dnat-test {
> > chain prerouting {
> > type nat hook prerouting priority dstnat; policy accept;
> > ip daddr 10.0.1.200 udp dport 1234 counter dnat to 10.0.1.180:1234
> > }
> > }
> > EOF
> > # cat listen.sh
> > echo pong|nc -l -u 10.0.1.180 1234
> > # ./add_ip.sh ; ./nft.sh ; listen.sh (and then just ./listen.sh again)
>
> We don't have a selftest for this, I'll add one.
>
> Following patch should help, we fail to check for reverse collision
> before concluding we don't need PAT to handle this.
>
> diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
> --- a/net/netfilter/nf_nat_core.c
> +++ b/net/netfilter/nf_nat_core.c
> @@ -248,7 +248,7 @@ static noinline bool
> nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple,
> const struct nf_conn *ignored_ct)
> {
> - static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST_BIT;
> + static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST;
> const struct nf_conntrack_tuple_hash *thash;
> const struct nf_conntrack_zone *zone;
> struct nf_conn *ct;
> @@ -287,8 +287,14 @@ nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple,
> zone = nf_ct_zone(ignored_ct);
>
> thash = nf_conntrack_find_get(net, zone, tuple);
> - if (unlikely(!thash)) /* clashing entry went away */
> - return false;
> + if (unlikely(!thash)) {
> + struct nf_conntrack_tuple reply;
> +
> + nf_ct_invert_tuple(&reply, tuple);
> + thash = nf_conntrack_find_get(net, zone, &reply);
> + if (!thash) /* clashing entry went away */
> + return false;
> + }
>
> ct = nf_ct_tuplehash_to_ctrack(thash);
>
JFYI
After applying this additional patch, the NAT source port is
reallocated to a new random port again in my case.
--
Regards
Yafang
* Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment
2025-05-30 2:44 ` Yafang Shao
@ 2025-05-30 3:37 ` Shaun Brady
0 siblings, 0 replies; 16+ messages in thread
From: Shaun Brady @ 2025-05-30 3:37 UTC (permalink / raw)
To: Yafang Shao
Cc: Florian Westphal, pablo, kadlec, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, netfilter-devel,
coreteam
On Fri, May 30, 2025 at 10:44:32AM +0800, Yafang Shao wrote:
>
> JFYI
> After applying this additional patch, the NAT source port is
> reallocated to a new random port again in my case.
>
Not surprisingly, it fixed the test case I had too (I think they are
functionally equivalent). I'll update the bugzilla to report a fix
inbound.
SB
Thread overview: 16+ messages
2025-05-28 9:03 [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment Yafang Shao
2025-05-28 11:22 ` Florian Westphal
2025-05-28 11:41 ` Yafang Shao
2025-05-28 12:14 ` Florian Westphal
2025-05-28 12:31 ` Yafang Shao
2025-05-28 12:43 ` Yafang Shao
2025-05-28 13:10 ` Florian Westphal
2025-05-28 13:20 ` Florian Westphal
2025-05-28 14:07 ` Yafang Shao
2025-05-28 21:48 ` Florian Westphal
2025-05-29 2:20 ` Yafang Shao
2025-05-28 23:43 ` Shaun Brady
2025-05-29 3:46 ` Yafang Shao
2025-05-30 0:45 ` Florian Westphal
2025-05-30 2:44 ` Yafang Shao
2025-05-30 3:37 ` Shaun Brady