* [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction
@ 2024-09-24 19:44 Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 1/7] netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath Florian Westphal
` (6 more replies)
0 siblings, 7 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
This series resolves a few problems with flowtables when entries are
moved from offload (hw/sw offload) back to the conntrack slowpath.
The first patch fixes conntrack reset validation: we must clear the MAXACK
flag in the sw flow offload path, because the conntrack state is stale
and cannot be used to validate the reset sequence number.
Second patch adds mandatory locking when manipulating ct state flags.
The third patch is a cleanup so the existing API can be re-used when
we lack an skb.
Patch 4 is a small preparation patch that reuses the existing API so a
redundant helper can be removed later.
Patch 5 moves timeout extension logic from conntrack GC to flowtable
GC worker.
Patch 6 prevents unwanted growth of the conntrack timeout when packets
of the same flow are handled in the slowpath at the same time.
Patch 7 is an optimization that keeps the entry in the software
flowtable when a fin is received.
NB: the nftables flowtable selftest needs a minor fixup to expect a 300s
timeout instead of 5 days after the initial move to slowpath; this is the
only observed failure with the nf kselftests or the nftables shell tests.
Florian Westphal (7):
netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to
slowpath
netfilter: nft_flow_offload: update tcp state flags under lock
netfilter: conntrack: remove skb argument from nf_ct_refresh
netfilter: flowtable: prefer plain nf_ct_refresh for setting initial
timeout
netfilter: conntrack: rework offload nf_conn timeout extension logic
netfilter: nft_flow_offload: never grow the timeout when moving
packets back to slowpath
netfilter: nft_flow_offload: do not remove flowtable entry for fin
packets
Patches vs. nf-next, but could be applied to nf too.
include/net/netfilter/nf_conntrack.h | 18 +--
net/netfilter/nf_conntrack_amanda.c | 2 +-
net/netfilter/nf_conntrack_broadcast.c | 2 +-
net/netfilter/nf_conntrack_core.c | 13 +-
net/netfilter/nf_conntrack_h323_main.c | 4 +-
net/netfilter/nf_conntrack_sip.c | 4 +-
net/netfilter/nf_flow_table_core.c | 200 ++++++++++++++++++++++---
net/netfilter/nf_flow_table_ip.c | 5 +-
net/netfilter/nft_ct.c | 2 +-
net/netfilter/nft_flow_offload.c | 16 +-
10 files changed, 207 insertions(+), 59 deletions(-)
--
2.44.2
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH nf-next 1/7] netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 2/7] netfilter: nft_flow_offload: update tcp state flags under lock Florian Westphal
` (5 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
This state reset is racy: no locks are held here.
Since commit
8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"),
the window checks are disabled for normal data packets, but the MAXACK
flag is still checked when validating TCP resets.
Clear the flag so the tcp reset validation checks are skipped.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_flow_table_core.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index df72b0376970..bdde469bbbd1 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -161,10 +161,20 @@ void flow_offload_route_init(struct flow_offload *flow,
}
EXPORT_SYMBOL_GPL(flow_offload_route_init);
-static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
+static void flow_offload_fixup_tcp(struct nf_conn *ct)
{
+ struct ip_ct_tcp *tcp = &ct->proto.tcp;
+
+ spin_lock_bh(&ct->lock);
+ /* Conntrack state is outdated due to offload bypass.
+ * Clear IP_CT_TCP_FLAG_MAXACK_SET, otherwise conntracks
+ * TCP reset validation will fail.
+ */
tcp->seen[0].td_maxwin = 0;
+ tcp->seen[0].flags &= ~IP_CT_TCP_FLAG_MAXACK_SET;
tcp->seen[1].td_maxwin = 0;
+ tcp->seen[1].flags &= ~IP_CT_TCP_FLAG_MAXACK_SET;
+ spin_unlock_bh(&ct->lock);
}
static void flow_offload_fixup_ct(struct nf_conn *ct)
@@ -176,7 +186,7 @@ static void flow_offload_fixup_ct(struct nf_conn *ct)
if (l4num == IPPROTO_TCP) {
struct nf_tcp_net *tn = nf_tcp_pernet(net);
- flow_offload_fixup_tcp(&ct->proto.tcp);
+ flow_offload_fixup_tcp(ct);
timeout = tn->timeouts[ct->proto.tcp.state];
timeout -= tn->offload_timeout;
--
2.44.2
* [PATCH nf-next 2/7] netfilter: nft_flow_offload: update tcp state flags under lock
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 1/7] netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 3/7] netfilter: conntrack: remove skb argument from nf_ct_refresh Florian Westphal
` (4 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
The conntrack entry is already public, so there is a small chance that
another CPU is handling a packet in the reply direction and racing with
the tcp state update.
Move this under the ct spinlock.
This is done once, when the ct is about to be offloaded, so it should
not result in a noticeable performance hit.
Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nft_flow_offload.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index 2f732fae5a83..da9ebd00b198 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -289,6 +289,15 @@ static bool nft_flow_offload_skip(struct sk_buff *skb, int family)
return false;
}
+static void flow_offload_ct_tcp(struct nf_conn *ct)
+{
+ /* conntrack will not see all packets, disable tcp window validation. */
+ spin_lock_bh(&ct->lock);
+ ct->proto.tcp.seen[0].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
+ ct->proto.tcp.seen[1].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
+ spin_unlock_bh(&ct->lock);
+}
+
static void nft_flow_offload_eval(const struct nft_expr *expr,
struct nft_regs *regs,
const struct nft_pktinfo *pkt)
@@ -356,11 +365,8 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
goto err_flow_alloc;
flow_offload_route_init(flow, &route);
-
- if (tcph) {
- ct->proto.tcp.seen[0].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
- ct->proto.tcp.seen[1].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
- }
+ if (tcph)
+ flow_offload_ct_tcp(ct);
__set_bit(NF_FLOW_HW_BIDIRECTIONAL, &flow->flags);
ret = flow_offload_add(flowtable, flow);
--
2.44.2
* [PATCH nf-next 3/7] netfilter: conntrack: remove skb argument from nf_ct_refresh
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 1/7] netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 2/7] netfilter: nft_flow_offload: update tcp state flags under lock Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 4/7] netfilter: flowtable: prefer plain nf_ct_refresh for setting initial timeout Florian Westphal
` (3 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
It's not used (and could be NULL), so remove it.
This allows using nf_ct_refresh in places where we don't have an skb,
without having to double-check that skb == NULL would be safe.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/netfilter/nf_conntrack.h | 8 +++-----
net/netfilter/nf_conntrack_amanda.c | 2 +-
net/netfilter/nf_conntrack_broadcast.c | 2 +-
net/netfilter/nf_conntrack_core.c | 7 +++----
net/netfilter/nf_conntrack_h323_main.c | 4 ++--
net/netfilter/nf_conntrack_sip.c | 4 ++--
net/netfilter/nft_ct.c | 2 +-
7 files changed, 13 insertions(+), 16 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index cba3ccf03fcc..3cbf29dd0b71 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -204,8 +204,7 @@ bool nf_ct_get_tuplepr(const struct sk_buff *skb, unsigned int nhoff,
struct nf_conntrack_tuple *tuple);
void __nf_ct_refresh_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
- const struct sk_buff *skb,
- u32 extra_jiffies, bool do_acct);
+ u32 extra_jiffies, unsigned int bytes);
/* Refresh conntrack for this many jiffies and do accounting */
static inline void nf_ct_refresh_acct(struct nf_conn *ct,
@@ -213,15 +212,14 @@ static inline void nf_ct_refresh_acct(struct nf_conn *ct,
const struct sk_buff *skb,
u32 extra_jiffies)
{
- __nf_ct_refresh_acct(ct, ctinfo, skb, extra_jiffies, true);
+ __nf_ct_refresh_acct(ct, ctinfo, extra_jiffies, skb->len);
}
/* Refresh conntrack for this many jiffies */
static inline void nf_ct_refresh(struct nf_conn *ct,
- const struct sk_buff *skb,
u32 extra_jiffies)
{
- __nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, false);
+ __nf_ct_refresh_acct(ct, 0, extra_jiffies, 0);
}
/* kill conntrack and do accounting */
diff --git a/net/netfilter/nf_conntrack_amanda.c b/net/netfilter/nf_conntrack_amanda.c
index d011d2eb0848..7be4c35e4795 100644
--- a/net/netfilter/nf_conntrack_amanda.c
+++ b/net/netfilter/nf_conntrack_amanda.c
@@ -106,7 +106,7 @@ static int amanda_help(struct sk_buff *skb,
/* increase the UDP timeout of the master connection as replies from
* Amanda clients to the server can be quite delayed */
- nf_ct_refresh(ct, skb, master_timeout * HZ);
+ nf_ct_refresh(ct, master_timeout * HZ);
/* No data? */
dataoff = protoff + sizeof(struct udphdr);
diff --git a/net/netfilter/nf_conntrack_broadcast.c b/net/netfilter/nf_conntrack_broadcast.c
index cfa0fe0356de..a7552a46d6ac 100644
--- a/net/netfilter/nf_conntrack_broadcast.c
+++ b/net/netfilter/nf_conntrack_broadcast.c
@@ -75,7 +75,7 @@ int nf_conntrack_broadcast_help(struct sk_buff *skb,
nf_ct_expect_related(exp, 0);
nf_ct_expect_put(exp);
- nf_ct_refresh(ct, skb, timeout * HZ);
+ nf_ct_refresh(ct, timeout * HZ);
out:
return NF_ACCEPT;
}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index d3cb53b008f5..62d97d320114 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2043,9 +2043,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_in);
/* Refresh conntrack for this many jiffies and do accounting if do_acct is 1 */
void __nf_ct_refresh_acct(struct nf_conn *ct,
enum ip_conntrack_info ctinfo,
- const struct sk_buff *skb,
u32 extra_jiffies,
- bool do_acct)
+ unsigned int bytes)
{
/* Only update if this is not a fixed timeout */
if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
@@ -2058,8 +2057,8 @@ void __nf_ct_refresh_acct(struct nf_conn *ct,
if (READ_ONCE(ct->timeout) != extra_jiffies)
WRITE_ONCE(ct->timeout, extra_jiffies);
acct:
- if (do_acct)
- nf_ct_acct_update(ct, CTINFO2DIR(ctinfo), skb->len);
+ if (bytes)
+ nf_ct_acct_update(ct, CTINFO2DIR(ctinfo), bytes);
}
EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct);
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 5a9bce24f3c3..14f73872f647 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -1385,7 +1385,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct,
if (info->timeout > 0) {
pr_debug("nf_ct_ras: set RAS connection timeout to "
"%u seconds\n", info->timeout);
- nf_ct_refresh(ct, skb, info->timeout * HZ);
+ nf_ct_refresh(ct, info->timeout * HZ);
/* Set expect timeout */
spin_lock_bh(&nf_conntrack_expect_lock);
@@ -1433,7 +1433,7 @@ static int process_urq(struct sk_buff *skb, struct nf_conn *ct,
info->sig_port[!dir] = 0;
/* Give it 30 seconds for UCF or URJ */
- nf_ct_refresh(ct, skb, 30 * HZ);
+ nf_ct_refresh(ct, 30 * HZ);
return 0;
}
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index d0eac27f6ba0..ca748f8dbff1 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1553,7 +1553,7 @@ static int sip_help_tcp(struct sk_buff *skb, unsigned int protoff,
if (dataoff >= skb->len)
return NF_ACCEPT;
- nf_ct_refresh(ct, skb, sip_timeout * HZ);
+ nf_ct_refresh(ct, sip_timeout * HZ);
if (unlikely(skb_linearize(skb)))
return NF_DROP;
@@ -1624,7 +1624,7 @@ static int sip_help_udp(struct sk_buff *skb, unsigned int protoff,
if (dataoff >= skb->len)
return NF_ACCEPT;
- nf_ct_refresh(ct, skb, sip_timeout * HZ);
+ nf_ct_refresh(ct, sip_timeout * HZ);
if (unlikely(skb_linearize(skb)))
return NF_DROP;
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 67a41cd2baaf..2e59aba681a1 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -929,7 +929,7 @@ static void nft_ct_timeout_obj_eval(struct nft_object *obj,
*/
values = nf_ct_timeout_data(timeout);
if (values)
- nf_ct_refresh(ct, pkt->skb, values[0]);
+ nf_ct_refresh(ct, values[0]);
}
static int nft_ct_timeout_obj_init(const struct nft_ctx *ctx,
--
2.44.2
* [PATCH nf-next 4/7] netfilter: flowtable: prefer plain nf_ct_refresh for setting initial timeout
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
` (2 preceding siblings ...)
2024-09-24 19:44 ` [PATCH nf-next 3/7] netfilter: conntrack: remove skb argument from nf_ct_refresh Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 5/7] netfilter: conntrack: rework offload nf_conn timeout extension logic Florian Westphal
` (2 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
This allows removing the nf_ct_offload_timeout helper in follow-up
patches.
It's safe to use in the add case, but not on teardown.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_flow_table_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index bdde469bbbd1..4569917dbe0a 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -304,7 +304,7 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
return err;
}
- nf_ct_offload_timeout(flow->ct);
+ nf_ct_refresh(flow->ct, NF_CT_DAY);
if (nf_flowtable_hw_offload(flow_table)) {
__set_bit(NF_FLOW_HW, &flow->flags);
--
2.44.2
* [PATCH nf-next 5/7] netfilter: conntrack: rework offload nf_conn timeout extension logic
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
` (3 preceding siblings ...)
2024-09-24 19:44 ` [PATCH nf-next 4/7] netfilter: flowtable: prefer plain nf_ct_refresh for setting initial timeout Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-26 11:11 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 6/7] netfilter: nft_flow_offload: never grow the timeout when moving packets back to slowpath Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 7/7] netfilter: nft_flow_offload: do not remove flowtable entry for fin packets Florian Westphal
6 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
Offloaded nf_conn entries may not see traffic for a very long time.
To prevent incorrect 'ct is stale' checks during nf_conntrack table
lookups, the gc worker extends the timeout of nf_conn entries marked
for offload to a large value.
The existing logic suffers from a few problems.
Garbage collection runs without locks; it's unlikely but possible
that @ct is removed right after the 'offload' bit test.
In that case, the timeout of a new/reallocated nf_conn entry would
be increased.
Prevent this by taking a reference on the ct object and re-checking
the confirmed and offload bits.
If those are not set, the ct is being removed; skip the timeout
extension in this case.
Parallel teardown is also problematic:
cpu1 cpu2
gc_worker
calls flow_offload_teardown()
tests OFFLOAD bit, set
clear OFFLOAD bit
ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED])
nf_ct_offload_timeout() called
expire value is fetched
<INTERRUPT>
-> NF_CT_DAY timeout for flow that isn't offloaded
(and might not see any further packets).
Use cmpxchg: if ct->timeout was repaired after the second 'offload bit'
test passed, then ct->timeout is only updated if it was not altered
in between.
As we already have a gc worker for flowtable entries, ct->timeout repair
can be handled from the flowtable gc worker.
This avoids having flowtable specific logic in the conntrack core
and avoids checking entries that were never offloaded.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/netfilter/nf_conntrack.h | 10 ---
net/netfilter/nf_conntrack_core.c | 6 --
net/netfilter/nf_flow_table_core.c | 99 ++++++++++++++++++++++++++++
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3cbf29dd0b71..3f02a45773e8 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -312,16 +312,6 @@ static inline bool nf_ct_should_gc(const struct nf_conn *ct)
#define NF_CT_DAY (86400 * HZ)
-/* Set an arbitrary timeout large enough not to ever expire, this save
- * us a check for the IPS_OFFLOAD_BIT from the packet path via
- * nf_ct_is_expired().
- */
-static inline void nf_ct_offload_timeout(struct nf_conn *ct)
-{
- if (nf_ct_expires(ct) < NF_CT_DAY / 2)
- WRITE_ONCE(ct->timeout, nfct_time_stamp + NF_CT_DAY);
-}
-
struct kernel_param;
int nf_conntrack_set_hashsize(const char *val, const struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 62d97d320114..0305d2cf41f6 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1498,12 +1498,6 @@ static void gc_worker(struct work_struct *work)
tmp = nf_ct_tuplehash_to_ctrack(h);
- if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
- nf_ct_offload_timeout(tmp);
- if (!nf_conntrack_max95)
- continue;
- }
-
if (expired_count > GC_SCAN_EXPIRED_MAX) {
rcu_read_unlock();
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 4569917dbe0a..10a0fb83a01a 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -424,6 +424,103 @@ static bool nf_flow_custom_gc(struct nf_flowtable *flow_table,
return flow_table->type->gc && flow_table->type->gc(flow);
}
+/**
+ * nf_flow_table_tcp_timeout() - new timeout of offloaded tcp entry
+ * @ct: Flowtable offloaded tcp ct
+ *
+ * Return number of seconds when ct entry should expire.
+ */
+static u32 nf_flow_table_tcp_timeout(const struct nf_conn *ct)
+{
+ u8 state = READ_ONCE(ct->proto.tcp.state);
+
+ switch (state) {
+ case TCP_CONNTRACK_SYN_SENT:
+ case TCP_CONNTRACK_SYN_RECV:
+ return 0;
+ case TCP_CONNTRACK_ESTABLISHED:
+ return NF_CT_DAY;
+ case TCP_CONNTRACK_FIN_WAIT:
+ case TCP_CONNTRACK_CLOSE_WAIT:
+ case TCP_CONNTRACK_LAST_ACK:
+ case TCP_CONNTRACK_TIME_WAIT:
+ return 5 * 60 * HZ;
+ case TCP_CONNTRACK_CLOSE:
+ return 0;
+ }
+
+ return 0;
+}
+
+/**
+ * nf_flow_table_extend_ct_timeout() - Extend ct timeout of offloaded conntrack entry
+ * @ct: Flowtable offloaded ct
+ *
+ * Datapath lookups in the conntrack table will evict nf_conn entries
+ * if they have expired.
+ *
+ * Once nf_conn entries have been offloaded, nf_conntrack might not see any
+ * packets anymore. Thus ct->timeout is no longer refreshed and ct can
+ * be evicted.
+ *
+ * To avoid the need for an additional check on the offload bit for every
+ * packet processed via nf_conntrack_in(), set an arbitrary timeout large
+ * enough not to ever expire, this save us a check for the IPS_OFFLOAD_BIT
+ * from the packet path via nf_ct_is_expired().
+ */
+static void nf_flow_table_extend_ct_timeout(struct nf_conn *ct)
+{
+ static const u32 min_timeout = 5 * 60 * HZ;
+ u32 expires = nf_ct_expires(ct);
+
+ /* normal case: large enough timeout, nothing to do. */
+ if (likely(expires >= min_timeout))
+ return;
+
+ /* must check offload bit after this, we do not hold any locks.
+ * flowtable and ct entries could have been removed on another CPU.
+ */
+ if (!refcount_inc_not_zero(&ct->ct_general.use))
+ return;
+
+ /* load ct->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
+ if (nf_ct_is_confirmed(ct) &&
+ test_bit(IPS_OFFLOAD_BIT, &ct->status)) {
+ u8 l4proto = nf_ct_protonum(ct);
+ u32 new_timeout = true;
+
+ switch (l4proto) {
+ case IPPROTO_UDP:
+ new_timeout = NF_CT_DAY;
+ break;
+ case IPPROTO_TCP:
+ new_timeout = nf_flow_table_tcp_timeout(ct);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
+
+ /* Update to ct->timeout from nf_conntrack happens
+ * without holding ct->lock.
+ *
+ * Use cmpxchg to ensure timeout extension doesn't
+ * happen when we race with conntrack datapath.
+ *
+ * The inverse -- datapath updating ->timeout right
+ * after this -- is fine, datapath is authoritative.
+ */
+ if (new_timeout) {
+ new_timeout += nfct_time_stamp;
+ cmpxchg(&ct->timeout, expires, new_timeout);
+ }
+ }
+
+ nf_ct_put(ct);
+}
+
static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
struct flow_offload *flow, void *data)
{
@@ -431,6 +528,8 @@ static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
nf_ct_is_dying(flow->ct) ||
nf_flow_custom_gc(flow_table, flow))
flow_offload_teardown(flow);
+ else
+ nf_flow_table_extend_ct_timeout(flow->ct);
if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
if (test_bit(NF_FLOW_HW, &flow->flags)) {
--
2.44.2
* [PATCH nf-next 6/7] netfilter: nft_flow_offload: never grow the timeout when moving packets back to slowpath
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
` (4 preceding siblings ...)
2024-09-24 19:44 ` [PATCH nf-next 5/7] netfilter: conntrack: rework offload nf_conn timeout extension logic Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
2024-09-24 19:44 ` [PATCH nf-next 7/7] netfilter: nft_flow_offload: do not remove flowtable entry for fin packets Florian Westphal
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
The timeout for offloaded packets is set to a large value. Once packets
are moved back to slowpath again, we must make sure that we don't end up
with an nf_conn entry with a huge (i.e., hours) timeout:
It's possible that we do not see any further traffic; in that case we
end up with a stale entry for a very long time.
flow_offload_fixup_ct() should reduce the timeout to avoid this, but
there is a possible race when another packet is already being handled
in the software slowpath, e.g. in the reverse direction.
This could happen e.g. when the packet in question exceeds the
MTU or has IP options set while other direction sees a tcp reset or
expired route.
cpu1 (skb pushed to slowpath) cpu2 (dst expired):
calls nf_conntrack_tcp_packet
calls flow_offload_teardown()
timeout = tn->timeouts[ct->proto.tcp.state];
sets ct->proto.tcp.state = new_state
update timeout
WRITE_ONCE(ct->timeout, nfct_time_stamp + timeout);
This could be avoided via cmpxchg tricks, but let's keep it simpler
and clamp the new timeout to the TCP_CONNTRACK_UNACK one, similar to
what we do in classic conntrack for mid-stream pickups.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_flow_table_core.c | 75 ++++++++++++++++++++++++------
1 file changed, 62 insertions(+), 13 deletions(-)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 10a0fb83a01a..d0267fe3eb37 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -177,36 +177,71 @@ static void flow_offload_fixup_tcp(struct nf_conn *ct)
spin_unlock_bh(&ct->lock);
}
-static void flow_offload_fixup_ct(struct nf_conn *ct)
+/**
+ * flow_offload_fixup_ct() - fix ct timeout on flow entry removal
+ * @ct: conntrack entry
+ * @expired: true if flowtable entry timed out
+ *
+ * Offload nf_conn entries have their timeout inflated to a very large
+ * value to prevent garbage collection from kicking in during lookup.
+ *
+ * When the flowtable entry is removed for whatever reason, then the
+ * ct entry must have the timeout set to a saner value to prevent it
+ * from remaining in place for a very long time.
+ *
+ * If the offload flow expired, also subtract the flow offload timeout,
+ * this helps to expire conntrack entries faster, especially for UDP.
+ */
+static void flow_offload_fixup_ct(struct nf_conn *ct, bool expired)
{
+ u32 expires, offload_timeout = 0;
struct net *net = nf_ct_net(ct);
int l4num = nf_ct_protonum(ct);
s32 timeout;
if (l4num == IPPROTO_TCP) {
- struct nf_tcp_net *tn = nf_tcp_pernet(net);
+ const struct nf_tcp_net *tn = nf_tcp_pernet(net);
+ u8 tcp_state = READ_ONCE(ct->proto.tcp.state);
+ u32 unacked_timeout;
flow_offload_fixup_tcp(ct);
- timeout = tn->timeouts[ct->proto.tcp.state];
- timeout -= tn->offload_timeout;
+ timeout = READ_ONCE(tn->timeouts[tcp_state]);
+ unacked_timeout = READ_ONCE(tn->timeouts[TCP_CONNTRACK_UNACK]);
+
+ /* Limit to unack, in case we missed a possible
+ * ESTABLISHED -> UNACK transition right before,
+ * forward path could have updated tcp.state now.
+ *
+ * This also won't leave nf_conn hanging around forever
+ * in case no further packets are received while in
+ * established state.
+ */
+ if (timeout > unacked_timeout)
+ timeout = unacked_timeout;
+
+ offload_timeout = READ_ONCE(tn->offload_timeout);
} else if (l4num == IPPROTO_UDP) {
- struct nf_udp_net *tn = nf_udp_pernet(net);
- enum udp_conntrack state =
- test_bit(IPS_SEEN_REPLY_BIT, &ct->status) ?
- UDP_CT_REPLIED : UDP_CT_UNREPLIED;
+ const struct nf_udp_net *tn = nf_udp_pernet(net);
+ enum udp_conntrack state = test_bit(IPS_SEEN_REPLY_BIT, &ct->status) ?
+ UDP_CT_REPLIED : UDP_CT_UNREPLIED;
- timeout = tn->timeouts[state];
- timeout -= tn->offload_timeout;
+ offload_timeout = READ_ONCE(tn->offload_timeout);
+ timeout = READ_ONCE(tn->timeouts[state]);
} else {
return;
}
+ if (expired)
+ timeout -= offload_timeout;
+
if (timeout < 0)
timeout = 0;
- if (nf_flow_timeout_delta(READ_ONCE(ct->timeout)) > (__s32)timeout)
- WRITE_ONCE(ct->timeout, nfct_time_stamp + timeout);
+ expires = nf_ct_expires(ct);
+ /* never increase ct->timeout */
+ if (expires > timeout)
+ nf_ct_refresh(ct, timeout);
}
static void flow_offload_route_release(struct flow_offload *flow)
@@ -350,11 +385,25 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
flow_offload_free(flow);
}
+/**
+ * flow_offload_teardown() - un-offload a flow
+ * @flow: flow that should not be offload anymore
+ *
+ * This is called when @flow has been idle for too long or
+ * when there is a permanent processing problem during
+ * flow offload processing.
+ *
+ * Examples of such errors are:
+ * - offloaded route/dst entry is stale
+ * - tx function returns error
+ * - RST flag set (TCP only)
+ */
void flow_offload_teardown(struct flow_offload *flow)
{
clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
set_bit(NF_FLOW_TEARDOWN, &flow->flags);
- flow_offload_fixup_ct(flow->ct);
+ flow_offload_fixup_ct(flow->ct,
+ nf_flow_has_expired(flow));
}
EXPORT_SYMBOL_GPL(flow_offload_teardown);
--
2.44.2
* [PATCH nf-next 7/7] netfilter: nft_flow_offload: do not remove flowtable entry for fin packets
2024-09-24 19:44 [PATCH nf-next 0/7] netfilter: rework conntrack/flowtable interaction Florian Westphal
` (5 preceding siblings ...)
2024-09-24 19:44 ` [PATCH nf-next 6/7] netfilter: nft_flow_offload: never grow the timeout when moving packets back to slowpath Florian Westphal
@ 2024-09-24 19:44 ` Florian Westphal
6 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-24 19:44 UTC (permalink / raw)
To: netfilter-devel; +Cc: cmi, nbd, sven.auhagen, Florian Westphal
Treat fin packets like tcp packets with IP options or packets that would
need fragmentation: pass them to the slow path, but keep the flowtable
entry around.
This keeps connections, where one peer closes early but keeps receiving
data for a long time, in the forwarding fast path.
Conntrack should move the nf_conn entry to a much lower timeout
(fin_wait default: 2 minutes).
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_flow_table_ip.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 98edcaa37b38..94d83003acf0 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -28,11 +28,14 @@ static int nf_flow_state_check(struct flow_offload *flow, int proto,
return 0;
tcph = (void *)(skb_network_header(skb) + thoff);
- if (unlikely(tcph->fin || tcph->rst)) {
+ if (unlikely(tcph->rst)) {
flow_offload_teardown(flow);
return -1;
}
+ if (unlikely(tcph->fin))
+ return -1;
+
return 0;
}
--
2.44.2
* Re: [PATCH nf-next 5/7] netfilter: conntrack: rework offload nf_conn timeout extension logic
2024-09-24 19:44 ` [PATCH nf-next 5/7] netfilter: conntrack: rework offload nf_conn timeout extension logic Florian Westphal
@ 2024-09-26 11:11 ` Florian Westphal
0 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2024-09-26 11:11 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel, cmi, nbd, sven.auhagen
Florian Westphal <fw@strlen.de> wrote:
> +/**
> + * nf_flow_table_tcp_timeout() - new timeout of offloaded tcp entry
> + * @ct: Flowtable offloaded tcp ct
> + *
> + * Return number of seconds when ct entry should expire.
This needs following kdoc fixup:
- * Return number of seconds when ct entry should expire.
+ * @return: number of seconds when ct entry should expire.
Rest of series doesn't add kdoc warnings.