stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Florian Westphal <fw@strlen.de>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	Sasha Levin <sashal@kernel.org>,
	kadlec@netfilter.org, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	netdev@vger.kernel.org
Subject: [PATCH AUTOSEL 6.11 59/76] netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
Date: Fri,  4 Oct 2024 14:17:16 -0400	[thread overview]
Message-ID: <20241004181828.3669209-59-sashal@kernel.org> (raw)
In-Reply-To: <20241004181828.3669209-1-sashal@kernel.org>

From: Florian Westphal <fw@strlen.de>

[ Upstream commit d8f84a9bc7c4e07fdc4edc00f9e868b8db974ccb ]

A conntrack entry can be inserted to the connection tracking table if there
is no existing entry with an identical tuple in either direction.

Example:
INITIATOR -> NAT/PAT -> RESPONDER

Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
Then, later, NAT/PAT machine itself also wants to connect to RESPONDER.

This will not work if the SNAT done earlier has same IP:PORT source pair.

Conntrack table has:
ORIGINAL: $IP_INITATOR:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

This is handled by the NAT engine which will do a source port reallocation
for the locally originating connection that is colliding with an existing
tuple by attempting a source port rewrite.

This is done even if this new connection attempt did not go through a
masquerade/snat rule.

There is a rare race condition with connection-less protocols like UDP,
where we do the port reallocation even though its not needed.

This happens when new packets from the same, pre-existing flow are received
in both directions at the exact same time on different CPUs after the
conntrack table was flushed (or conntrack becomes active for first time).

With strict ordering/single cpu, the first packet creates new ct entry and
second packet is resolved as established reply packet.

With parallel processing, both packets are picked up as new and both get
their own ct entry.

In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
NAT engine because a port collision is detected.

This change isn't enough to prevent a packet drop later during
nf_conntrack_confirm(), the existing clash resolution strategy will not
detect such reverse clash case.  This is resolved by a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nf_nat_core.c | 120 +++++++++++++++++++++++++++++++++++-
 1 file changed, 118 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 016c816d91cbc..c212b1b137222 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -183,7 +183,35 @@ hash_by_src(const struct net *net,
 	return reciprocal_scale(hash, nf_nat_htable_size);
 }
 
-/* Is this tuple already taken? (not by us) */
+/**
+ * nf_nat_used_tuple - check if proposed nat tuple clashes with existing entry
+ * @tuple: proposed NAT binding
+ * @ignored_conntrack: our (unconfirmed) conntrack entry
+ *
+ * A conntrack entry can be inserted to the connection tracking table
+ * if there is no existing entry with an identical tuple in either direction.
+ *
+ * Example:
+ * INITIATOR -> NAT/PAT -> RESPONDER
+ *
+ * INITIATOR passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
+ * Then, later, NAT/PAT itself also connects to RESPONDER.
+ *
+ * This will not work if the SNAT done earlier has same IP:PORT source pair.
+ *
+ * Conntrack table has:
+ * ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT
+ * REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
+ *
+ * and new locally originating connection wants:
+ * ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
+ * REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
+ *
+ * ... which would mean incoming packets cannot be distinguished between
+ * the existing and the newly added entry (identical IP_CT_DIR_REPLY tuple).
+ *
+ * @return: true if the proposed NAT mapping collides with an existing entry.
+ */
 static int
 nf_nat_used_tuple(const struct nf_conntrack_tuple *tuple,
 		  const struct nf_conn *ignored_conntrack)
@@ -200,6 +228,94 @@ nf_nat_used_tuple(const struct nf_conntrack_tuple *tuple,
 	return nf_conntrack_tuple_taken(&reply, ignored_conntrack);
 }
 
+static bool nf_nat_allow_clash(const struct nf_conn *ct)
+{
+	return nf_ct_l4proto_find(nf_ct_protonum(ct))->allow_clash;
+}
+
+/**
+ * nf_nat_used_tuple_new - check if to-be-inserted conntrack collides with existing entry
+ * @tuple: proposed NAT binding
+ * @ignored_ct: our (unconfirmed) conntrack entry
+ *
+ * Same as nf_nat_used_tuple, but also check for rare clash in reverse
+ * direction. Should be called only when @tuple has not been altered, i.e.
+ * @ignored_conntrack will not be subject to NAT.
+ *
+ * @return: true if the proposed NAT mapping collides with existing entry.
+ */
+static noinline bool
+nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple,
+		      const struct nf_conn *ignored_ct)
+{
+	static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST_BIT;
+	const struct nf_conntrack_tuple_hash *thash;
+	const struct nf_conntrack_zone *zone;
+	struct nf_conn *ct;
+	bool taken = true;
+	struct net *net;
+
+	if (!nf_nat_used_tuple(tuple, ignored_ct))
+		return false;
+
+	if (!nf_nat_allow_clash(ignored_ct))
+		return true;
+
+	/* Initial choice clashes with existing conntrack.
+	 * Check for (rare) reverse collision.
+	 *
+	 * This can happen when new packets are received in both directions
+	 * at the exact same time on different CPUs.
+	 *
+	 * Without SMP, first packet creates new conntrack entry and second
+	 * packet is resolved as established reply packet.
+	 *
+	 * With parallel processing, both packets could be picked up as
+	 * new and both get their own ct entry allocated.
+	 *
+	 * If ignored_conntrack and colliding ct are not subject to NAT then
+	 * pretend the tuple is available and let later clash resolution
+	 * handle this at insertion time.
+	 *
+	 * Without it, the 'reply' packet has its source port rewritten
+	 * by nat engine.
+	 */
+	if (READ_ONCE(ignored_ct->status) & uses_nat)
+		return true;
+
+	net = nf_ct_net(ignored_ct);
+	zone = nf_ct_zone(ignored_ct);
+
+	thash = nf_conntrack_find_get(net, zone, tuple);
+	if (unlikely(!thash)) /* clashing entry went away */
+		return false;
+
+	ct = nf_ct_tuplehash_to_ctrack(thash);
+
+	/* NB: IP_CT_DIR_ORIGINAL should be impossible because
+	 * nf_nat_used_tuple() handles origin collisions.
+	 *
+	 * Handle remote chance other CPU confirmed its ct right after.
+	 */
+	if (thash->tuple.dst.dir != IP_CT_DIR_REPLY)
+		goto out;
+
+	/* clashing connection subject to NAT? Retry with new tuple. */
+	if (READ_ONCE(ct->status) & uses_nat)
+		goto out;
+
+	if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+			      &ignored_ct->tuplehash[IP_CT_DIR_REPLY].tuple) &&
+	    nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+			      &ignored_ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple)) {
+		taken = false;
+		goto out;
+	}
+out:
+	nf_ct_put(ct);
+	return taken;
+}
+
 static bool nf_nat_may_kill(struct nf_conn *ct, unsigned long flags)
 {
 	static const unsigned long flags_refuse = IPS_FIXED_TIMEOUT |
@@ -611,7 +727,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	    !(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) {
 		/* try the original tuple first */
 		if (nf_in_range(orig_tuple, range)) {
-			if (!nf_nat_used_tuple(orig_tuple, ct)) {
+			if (!nf_nat_used_tuple_new(orig_tuple, ct)) {
 				*tuple = *orig_tuple;
 				return;
 			}
-- 
2.43.0


  parent reply	other threads:[~2024-10-04 18:20 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-04 18:16 [PATCH AUTOSEL 6.11 01/76] bpf: Call the missed btf_record_free() when map creation fails Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 02/76] selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 03/76] bpf: Fix a sdiv overflow issue Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 04/76] bpf: Check percpu map value size first Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 05/76] bpftool: Fix undefined behavior in qsort(NULL, 0, ...) Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 06/76] bpftool: Fix undefined behavior caused by shifting into the sign bit Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 07/76] s390/boot: Compile all files with the same march flag Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 08/76] s390/facility: Disable compile time optimization for decompressor code Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 09/76] s390/mm: Add cond_resched() to cmm_alloc/free_pages() Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 10/76] bpf, x64: Fix a jit convergence issue Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 11/76] ext4: fix i_data_sem unlock order in ext4_ind_migrate() Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 12/76] ext4: avoid use-after-free in ext4_ext_show_leaf() Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 13/76] ext4: ext4_search_dir should return a proper error Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 14/76] iomap: fix iomap_dio_zero() for fs bs > system page size Sasha Levin
2024-10-04 22:46   ` Dave Chinner
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 15/76] ext4: don't set SB_RDONLY after filesystem errors Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 16/76] ext4: nested locking for xattr inode Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 17/76] s390/cpum_sf: Remove WARN_ON_ONCE statements Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 18/76] ext4: filesystems without casefold feature cannot be mounted with siphash Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 19/76] s390/traps: Handle early warnings gracefully Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 20/76] bpf: Prevent tail call between progs attached to different hooks Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 21/76] ktest.pl: Avoid false positives with grub2 skip regex Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 22/76] RDMA/mad: Improve handling of timed out WRs of mad agent Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 23/76] soundwire: intel_bus_common: enable interrupts before exiting reset Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 24/76] PCI: Add function 0 DMA alias quirk for Glenfly Arise chip Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 25/76] RDMA/rtrs-srv: Avoid null pointer deref during path establishment Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 26/76] clk: bcm: bcm53573: fix OF node leak in init Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 27/76] PCI: Add ACS quirk for Qualcomm SA8775P Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 28/76] i2c: i801: Use a different adapter-name for IDF adapters Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 29/76] PCI: Mark Creative Labs EMU20k2 INTx masking as broken Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 30/76] i3c: master: cdns: Fix use after free vulnerability in cdns_i3c_master Driver Due to Race Condition Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 31/76] RISC-V: Don't have MAX_PHYSMEM_BITS exceed phys_addr_t Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 32/76] io_uring: check if we need to reschedule during overflow flush Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 33/76] ntb: ntb_hw_switchtec: Fix use after free vulnerability in switchtec_ntb_remove due to race condition Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 34/76] mfd: intel_soc_pmic_chtwc: Make Lenovo Yoga Tab 3 X90F DMI match less strict Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 35/76] mfd: intel-lpss: Add Intel Arrow Lake-H LPSS PCI IDs Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 36/76] mfd: intel-lpss: Add Intel Panther Lake " Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 37/76] riscv: Omit optimized string routines when using KASAN Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 38/76] riscv: avoid Imbalance in RAS Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 39/76] RDMA/mlx5: Enforce umem boundaries for explicit ODP page faults Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 40/76] PCI: qcom: Disable mirroring of DBI and iATU register space in BAR region Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 41/76] PCI: endpoint: Assign PCI domain number for endpoint controllers Sasha Levin
2024-10-04 18:16 ` [PATCH AUTOSEL 6.11 42/76] soundwire: cadence: re-check Peripheral status with delayed_work Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 43/76] riscv/kexec_file: Fix relocation type R_RISCV_ADD16 and R_RISCV_SUB16 unknown Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 44/76] media: videobuf2-core: clear memory related fields in __vb2_plane_dmabuf_put() Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 45/76] remoteproc: imx_rproc: Use imx specific hook for find_loaded_rsc_table Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 46/76] clk: imx: Remove CLK_SET_PARENT_GATE for DRAM mux for i.MX7D Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 47/76] fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 48/76] fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN Sasha Levin
2024-10-07 10:15   ` Miklos Szeredi
2024-10-11 13:51     ` Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 49/76] fuse: handle idmappings properly in ->write_iter() Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 50/76] serial: protect uart_port_dtr_rts() in uart_shutdown() too Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 51/76] usb: typec: tipd: Free IRQ only if it was requested before Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 52/76] usb: typec: ucsi: Don't truncate the reads Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 53/76] usb: chipidea: udc: enable suspend interrupt after usb reset Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 54/76] usb: dwc2: Adjust the timing of USB Driver Interrupt Registration in the Crashkernel Scenario Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 55/76] xhci: dbc: Fix STALL transfer event handling Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 56/76] usb: host: xhci-plat: Parse xhci-missing_cas_quirk and apply quirk Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 57/76] comedi: ni_routing: tools: Check when the file could not be opened Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 58/76] LoongArch: Fix memleak in pci_acpi_scan_root() Sasha Levin
2024-10-04 18:17 ` Sasha Levin [this message]
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 60/76] netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 61/76] virtio_pmem: Check device status before requesting flush Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 62/76] tools/iio: Add memory allocation failure check for trigger_name Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 63/76] staging: vme_user: added bound check to geoid Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 64/76] usb: gadget: uvc: Fix ERR_PTR dereference in uvc_v4l2.c Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 65/76] dm vdo: don't refer to dedupe_context after releasing it Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 66/76] driver core: bus: Fix double free in driver API bus_register() Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 67/76] driver core: bus: Return -EIO instead of 0 when show/store invalid bus attribute Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 68/76] scsi: lpfc: Add ELS_RSP cmd to the list of WQEs to flush in lpfc_els_flush_cmd() Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 69/76] scsi: lpfc: Ensure DA_ID handling completion before deleting an NPIV instance Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 70/76] scsi: lpfc: Revise TRACE_EVENT log flag severities from KERN_ERR to KERN_WARNING Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 71/76] drm/xe/oa: Fix overflow in oa batch buffer Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 72/76] drm/amdgpu: nuke the VM PD/PT shadow handling Sasha Levin
2024-10-08  6:46   ` Christian König
2024-10-08 13:04     ` Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 73/76] drm/amd/display: Check null pointer before dereferencing se Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 74/76] fbcon: Fix a NULL pointer dereference issue in fbcon_putcs Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 75/76] smb: client: fix UAF in async decryption Sasha Levin
2024-10-04 18:17 ` [PATCH AUTOSEL 6.11 76/76] fbdev: sisfb: Fix strbuf array overflow Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241004181828.3669209-59-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=coreteam@netfilter.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fw@strlen.de \
    --cc=kadlec@netfilter.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pablo@netfilter.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).