[PATCH nf-next 00/19] netfilter: nftables: dscp modification offload

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
@ 2023-05-03 12:55 Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
                   ` (21 more replies)
  0 siblings, 22 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 4489 bytes --]

Hi,

Consider ruleset such as:

table inet filter {
        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3
                ct state established,related accept
        }
}

As expected, all of the packets from 10.0.2.99 to 10.0.1.99 have IPv4 tos field
changed to 0x60:

...
13:36:42.474591 fe:dc:b3:e2:dc:3b > 5a:45:4d:2a:25:65, ethertype IPv4 (0x0800), length 1090: (tos 0x60, ttl 62, id 39855, offset 0, flags [none], proto TCP (6), length 1076)
    10.0.2.99.12345 > 10.0.1.99.44084: Flags [P.], cksum 0x1bec (incorrect -> 0x44c3), seq 1:1025, ack 1025, win 1987, options [nop,nop,TS val 2854899766 ecr 3249774499], length 1024
...

Now lets try to add flow offload:

table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3
                ip protocol { tcp, udp, gre } flow add 
                ct state established,related accept
        }
}

Although some of the packets still have their TOS being correct, some are not:

...
13:41:17.138782 5e:d5:1f:a3:ba:d1 > d2:d2:73:e6:5b:92, ethertype IPv4 (0x0800), length 1090: (tos 0x0, ttl 62, id 20142, offset 0, flags [none], proto TCP (6), length 1076)
    10.0.2.99.12345 > 10.0.1.99.34230: Flags [P.], cksum 0x1bec (incorrect -> 0xc090), seq 1:1025, ack 1, win 2009, options [nop,nop,TS val 2855174430 ecr 3250049157], length 1024
...

The root cause for the bug seems to be that nft_payload_set_eval (which sets the
dscp tos field) isn't being called on the offload fast path in
nf_flow_offload_ip_hook.

The fix in this patch series is to have payload modifications recorded in the
new conntrack extension. Then we apply those modifications on the fast path.

To signal intent to record payload changes, we add offload flag to the nft
userspace tool (separate patches follow). For example the dscp set line becomes:

....
ip dscp set cs3 offload
...

Some high level description of the patches:

* patches 1-4 fix small but annoying infelicities in nft_flowtable.sh test script
* patches 5-7 export payload modification functionality from nft_payload.c
* patches 8-10 add new NFT_PAYLOAD_CAN_OFFLOAD flag being set by the userspace
* patches 11-13 are technical changes to add the new conntrack extension
* patches 14-16 add payload context to the conntrack and apply them on the fast path
* patches 17-18 save the payload context if NFT_PAYLOAD_CAN_OFFLOAD flag is set.
* patch 19 adds dscp modification offload test to the nft_payload.sh test script.

Thanks,
Boris.

Boris Sukholitko (19):
  selftest: netfilter: use /proc for pid checking
  selftest: netfilter: no need for ps -x option
  selftest: netfilter: wait for specific nc pids
  selftest: netfilter: monitor result file sizes
  netfilter: nft_payload: refactor mangle operation
  netfilter: nft_payload: publish nft_payload_set
  netfilter: nft_payload: export mangle
  netfilter: nft_payload: use flag for checksum need
  netfilter: nft_payload: add offload flag define
  netfilter: nft_payload: allow offload in the netlink
  netfilter: conntrack: nft extension Kconfig
  netfilter: nft: empty nft conntrack extension
  netfilter: conntrack: register nft extension
  netfilter: nft: add payload context into extension
  netfilter: nft: add payload application
  netfilter: nftables: fast path payload mangle
  netfilter: nftables: payload save mechanism
  netfilter: nft_payload: save payload if needed
  selftests: netfilter: dscp offload test

 include/net/netfilter/nf_conntrack_extend.h   |  3 +
 include/net/netfilter/nf_tables.h             | 68 +++++++++++++++++++
 include/uapi/linux/netfilter/nf_tables.h      |  1 +
 net/netfilter/Kconfig                         | 10 +++
 net/netfilter/Makefile                        |  2 +
 net/netfilter/nf_conntrack_core.c             |  2 +
 net/netfilter/nf_conntrack_extend.c           |  9 ++-
 net/netfilter/nf_conntrack_netlink.c          |  2 +
 net/netfilter/nf_flow_table_ip.c              |  3 +
 net/netfilter/nft_conntrack_ext.c             | 56 +++++++++++++++
 net/netfilter/nft_payload.c                   | 46 +++++++------
 .../selftests/netfilter/nft_flowtable.sh      | 61 +++++++++++++++--
 12 files changed, 237 insertions(+), 26 deletions(-)
 create mode 100644 net/netfilter/nft_conntrack_ext.c

-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 18:47   ` Florian Westphal
  2023-05-03 12:55 ` [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option Boris Sukholitko
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 796 bytes --]

Some ps commands (e.g. busybox derived) have no -p option. Use /proc for
pid existence check.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 tools/testing/selftests/netfilter/nft_flowtable.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
index 7060bae04ec8..4d8bc51b7a7b 100755
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@@ -288,11 +288,11 @@ test_tcp_forwarding_ip()
 
 	sleep 3
 
-	if ps -p $lpid > /dev/null;then
+	if test -d /proc/"$lpid"/; then
 		kill $lpid
 	fi
 
-	if ps -p $cpid > /dev/null;then
+	if test -d /proc/"$cpid"/; then
 		kill $cpid
 	fi
 
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 18:53   ` Florian Westphal
  2023-05-03 12:55 ` [PATCH nf-next 03/19] selftest: netfilter: wait for specific nc pids Boris Sukholitko
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

Some ps commands (e.g. busybox derived) have no -x option. For the
purposes of hash calculation of the list of processes this option is
inessential.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 tools/testing/selftests/netfilter/nft_flowtable.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
index 4d8bc51b7a7b..3cf20e9bd3a6 100755
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@@ -489,8 +489,8 @@ ip -net $nsr1 addr add 10.0.1.1/24 dev veth0
 ip -net $nsr1 addr add dead:1::1/64 dev veth0
 ip -net $nsr1 link set up dev veth0
 
-KEY_SHA="0x"$(ps -xaf | sha1sum | cut -d " " -f 1)
-KEY_AES="0x"$(ps -xaf | md5sum | cut -d " " -f 1)
+KEY_SHA="0x"$(ps -af | sha1sum | cut -d " " -f 1)
+KEY_AES="0x"$(ps -af | md5sum | cut -d " " -f 1)
 SPI1=$RANDOM
 SPI2=$RANDOM
 
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 03/19] selftest: netfilter: wait for specific nc pids
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes Boris Sukholitko
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

Doing wait with no parameters may interfere with some of the tests
having their own background processes.

Although no such test is currently present, the cleanup is useful
to rely on the nft_flowtable.sh for local development (e.g. running
background tcpdump command during the tests).

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 tools/testing/selftests/netfilter/nft_flowtable.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
index 3cf20e9bd3a6..92bc308bf168 100755
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@@ -296,7 +296,8 @@ test_tcp_forwarding_ip()
 		kill $cpid
 	fi
 
-	wait
+	wait $lpid
+	wait $cpid
 
 	if ! check_transfer "$nsin" "$ns2out" "ns1 -> ns2"; then
 		lret=1
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (2 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 03/19] selftest: netfilter: wait for specific nc pids Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 18:54   ` Florian Westphal
  2023-05-03 12:55 ` [PATCH nf-next 05/19] netfilter: nft_payload: refactor mangle operation Boris Sukholitko
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

When running nft_flowtable.sh in VM on a busy server we've found that
the time of the netcat file transfers vary wildly.

Therefore replace hardcoded 3 second sleep with the loop checking for
a change in the file sizes. Once no change in detected we test the results.

Nice side effect is that we shave 1 second sleep in the fast case
(hard-coded 3 second sleep vs two 1 second sleeps).

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 tools/testing/selftests/netfilter/nft_flowtable.sh | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
index 92bc308bf168..51f986f19fee 100755
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@@ -286,7 +286,15 @@ test_tcp_forwarding_ip()
 	ip netns exec $nsa nc -w 4 "$dstip" "$dstport" < "$nsin" > "$ns1out" &
 	cpid=$!
 
-	sleep 3
+	sleep 1
+
+	prev="$(ls -l $ns1out $ns2out)"
+	sleep 1
+
+	while [[ "$prev" != "$(ls -l $ns1out $ns2out)" ]]; do
+		sleep 1;
+		prev="$(ls -l $ns1out $ns2out)"
+	done
 
 	if test -d /proc/"$lpid"/; then
 		kill $lpid
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 05/19] netfilter: nft_payload: refactor mangle operation
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (3 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 06/19] netfilter: nft_payload: publish nft_payload_set Boris Sukholitko
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

Make nft_payload_mangle helper function applying the payload changes.
It does not depend on expr context.

Change nft_payload_set_eval to use nft_payload_mangle helper.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/nft_payload.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 3a3c7746e88f..2a41e7e0b3b7 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -798,13 +798,11 @@ struct nft_payload_set {
 	u8			csum_flags;
 };
 
-static void nft_payload_set_eval(const struct nft_expr *expr,
-				 struct nft_regs *regs,
-				 const struct nft_pktinfo *pkt)
+static int nft_payload_mangle(const struct nft_payload_set *priv,
+			      const struct nft_pktinfo *pkt,
+			      const u32 *src)
 {
-	const struct nft_payload_set *priv = nft_expr_priv(expr);
 	struct sk_buff *skb = pkt->skb;
-	const u32 *src = &regs->data[priv->sreg];
 	int offset, csum_offset;
 	__wsum fsum, tsum;
 
@@ -862,6 +860,20 @@ static void nft_payload_set_eval(const struct nft_expr *expr,
 		    nft_payload_csum_sctp(skb, nft_thoff(pkt)))
 			goto err;
 	}
+	return 0;
+err:
+	return -1;
+}
+
+static void nft_payload_set_eval(const struct nft_expr *expr,
+				 struct nft_regs *regs,
+				 const struct nft_pktinfo *pkt)
+{
+	const struct nft_payload_set *priv = nft_expr_priv(expr);
+	const u32 *src = &regs->data[priv->sreg];
+
+	if (nft_payload_mangle(priv, pkt, src))
+		goto err;
 
 	return;
 err:
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 06/19] netfilter: nft_payload: publish nft_payload_set
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (4 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 05/19] netfilter: nft_payload: refactor mangle operation Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 07/19] netfilter: nft_payload: export mangle Boris Sukholitko
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1559 bytes --]

Move struct nft_payload_set into nf_tables.h.

In conjunction with nft_payload_mangle function we can apply the
payload changes independent of nft_expr context.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h | 10 ++++++++++
 net/netfilter/nft_payload.c       | 10 ----------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 1b8e305bb54a..b027200caf5b 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -863,6 +863,16 @@ static inline void nft_set_gc_batch_add(struct nft_set_gc_batch *gcb,
 	gcb->elems[gcb->head.cnt++] = elem;
 }
 
+struct nft_payload_set {
+	enum nft_payload_bases	base:8;
+	u8			offset;
+	u8			len;
+	u8			sreg;
+	u8			csum_type;
+	u8			csum_offset;
+	u8			csum_flags;
+};
+
 struct nft_expr_ops;
 /**
  *	struct nft_expr_type - nf_tables expression type
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 2a41e7e0b3b7..ed4a65ac68d5 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -788,16 +788,6 @@ static int nft_payload_csum_inet(struct sk_buff *skb, const u32 *src,
 	return 0;
 }
 
-struct nft_payload_set {
-	enum nft_payload_bases	base:8;
-	u8			offset;
-	u8			len;
-	u8			sreg;
-	u8			csum_type;
-	u8			csum_offset;
-	u8			csum_flags;
-};
-
 static int nft_payload_mangle(const struct nft_payload_set *priv,
 			      const struct nft_pktinfo *pkt,
 			      const u32 *src)
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 07/19] netfilter: nft_payload: export mangle
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (5 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 06/19] netfilter: nft_payload: publish nft_payload_set Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 08/19] netfilter: nft_payload: use flag for checksum need Boris Sukholitko
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]

Having struct nft_payload_set public already we now acquire the
ability to apply the payload changes separately from nft_payload.c

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h | 3 +++
 net/netfilter/nft_payload.c       | 7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index b027200caf5b..cba06ea3fedd 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1729,4 +1729,7 @@ static inline bool nft_reg_track_cmp(struct nft_regs_track *track,
 	       track->regs[dreg].num_reg == 0;
 }
 
+int nft_payload_mangle(const struct nft_payload_set *priv,
+		       const struct nft_pktinfo *pkt,
+		       const u32 *src);
 #endif /* _NET_NF_TABLES_H */
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index ed4a65ac68d5..50109663bb13 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -788,9 +788,9 @@ static int nft_payload_csum_inet(struct sk_buff *skb, const u32 *src,
 	return 0;
 }
 
-static int nft_payload_mangle(const struct nft_payload_set *priv,
-			      const struct nft_pktinfo *pkt,
-			      const u32 *src)
+int nft_payload_mangle(const struct nft_payload_set *priv,
+		       const struct nft_pktinfo *pkt,
+		       const u32 *src)
 {
 	struct sk_buff *skb = pkt->skb;
 	int offset, csum_offset;
@@ -854,6 +854,7 @@ static int nft_payload_mangle(const struct nft_payload_set *priv,
 err:
 	return -1;
 }
+EXPORT_SYMBOL_GPL(nft_payload_mangle);
 
 static void nft_payload_set_eval(const struct nft_expr *expr,
 				 struct nft_regs *regs,
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 08/19] netfilter: nft_payload: use flag for checksum need
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (6 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 07/19] netfilter: nft_payload: export mangle Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 09/19] netfilter: nft_payload: add offload flag define Boris Sukholitko
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1600 bytes --]

Although NFT_PAYLOAD_L4CSUM_PSEUDOHDR flag is part of the enum,
we use it as a boolean value.

Change the flag usage to be more appropriate for an enum bitmask.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/nft_payload.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 50109663bb13..9e11df7389ca 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -792,8 +792,8 @@ int nft_payload_mangle(const struct nft_payload_set *priv,
 		       const struct nft_pktinfo *pkt,
 		       const u32 *src)
 {
+	int offset, csum_offset, needs_csum;
 	struct sk_buff *skb = pkt->skb;
-	int offset, csum_offset;
 	__wsum fsum, tsum;
 
 	switch (priv->base) {
@@ -822,8 +822,9 @@ int nft_payload_mangle(const struct nft_payload_set *priv,
 
 	csum_offset = offset + priv->csum_offset;
 	offset += priv->offset;
+	needs_csum = priv->csum_flags & NFT_PAYLOAD_L4CSUM_PSEUDOHDR;
 
-	if ((priv->csum_type == NFT_PAYLOAD_CSUM_INET || priv->csum_flags) &&
+	if ((priv->csum_type == NFT_PAYLOAD_CSUM_INET || needs_csum) &&
 	    ((priv->base != NFT_PAYLOAD_TRANSPORT_HEADER &&
 	      priv->base != NFT_PAYLOAD_INNER_HEADER) ||
 	     skb->ip_summed != CHECKSUM_PARTIAL)) {
@@ -834,7 +835,7 @@ int nft_payload_mangle(const struct nft_payload_set *priv,
 		    nft_payload_csum_inet(skb, src, fsum, tsum, csum_offset))
 			goto err;
 
-		if (priv->csum_flags &&
+		if (needs_csum &&
 		    nft_payload_l4csum_update(pkt, skb, fsum, tsum) < 0)
 			goto err;
 	}
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 09/19] netfilter: nft_payload: add offload flag define
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (7 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 08/19] netfilter: nft_payload: use flag for checksum need Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 10/19] netfilter: nft_payload: allow offload in the netlink Boris Sukholitko
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1291 bytes --]

Add NFT_PAYLOAD_CAN_OFFLOAD to nft_payload_csum_flags enum.

The meaning of the flag is to allow payload application in the offload
fast path context.

This flag is being set by the userspace on the specific packet modification
statements. For example, look at the offload in the dscp statement below:

table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3 offload
                ip protocol { tcp, udp, gre } flow add @f1
                ct state established,related accept
        }
}

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/uapi/linux/netfilter/nf_tables.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index c4d4d8e42dc8..02c30da5de8c 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -794,6 +794,7 @@ enum nft_payload_csum_types {
 
 enum nft_payload_csum_flags {
 	NFT_PAYLOAD_L4CSUM_PSEUDOHDR = (1 << 0),
+	NFT_PAYLOAD_CAN_OFFLOAD = (1 << 1),
 };
 
 enum nft_inner_type {
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 10/19] netfilter: nft_payload: allow offload in the netlink
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (8 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 09/19] netfilter: nft_payload: add offload flag define Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 11/19] netfilter: conntrack: nft extension Kconfig Boris Sukholitko
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

Userspace may send NFT_PAYLOAD_CAN_OFFLOAD in NFTA_PAYLOAD_CSUM_FLAGS now.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/nft_payload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 9e11df7389ca..a633f851316e 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -898,7 +898,7 @@ static int nft_payload_set_init(const struct nft_ctx *ctx,
 		u32 flags;
 
 		flags = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_CSUM_FLAGS]));
-		if (flags & ~NFT_PAYLOAD_L4CSUM_PSEUDOHDR)
+		if (flags & ~(NFT_PAYLOAD_L4CSUM_PSEUDOHDR | NFT_PAYLOAD_CAN_OFFLOAD))
 			return -EINVAL;
 
 		priv->csum_flags = flags;
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 11/19] netfilter: conntrack: nft extension Kconfig
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (9 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 10/19] netfilter: nft_payload: allow offload in the netlink Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 12/19] netfilter: nft: empty nft conntrack extension Boris Sukholitko
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

To apply payload changes during fast path nftables processing we
need a place to keep the information regarding the changes.

Add new nf_tables conntrack extension NFT_CONNTRACK_EXT to do this.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 441d1f134110..30ee231df947 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -716,6 +716,16 @@ config NFT_REJECT_NETDEV
 
 endif # NF_TABLES_NETDEV
 
+config NFT_CONNTRACK_EXT
+	bool "Netfilter nf_tables conntrack extension support"
+	default n
+	help
+	  This option enables nf_tables conntrack extension. The extension is
+	  being used to help with some of nf_tables offload cases. For example,
+	  modifying dscp field of IP packet may be skipped during the offload.
+
+	  If unsure, choose N here.
+
 endif # NF_TABLES
 
 config NF_FLOW_TABLE_INET
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 12/19] netfilter: nft: empty nft conntrack extension
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (10 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 11/19] netfilter: conntrack: nft extension Kconfig Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 13/19] netfilter: conntrack: register nft extension Boris Sukholitko
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 2973 bytes --]

And do the related bookkeeping. The extension struct nf_conn_nft_ext
is empty for now.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_conntrack_extend.h | 3 +++
 include/net/netfilter/nf_tables.h           | 7 +++++++
 net/netfilter/nf_conntrack_extend.c         | 9 ++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index 0b247248b032..fa7321d71d98 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -30,6 +30,9 @@ enum nf_ct_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_NET_ACT_CT)
 	NF_CT_EXT_ACT_CT,
+#endif
+#if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
+	NF_CT_EXT_NFT_EXT,
 #endif
 	NF_CT_EXT_NUM,
 };
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index cba06ea3fedd..7d433f8db2e7 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -10,6 +10,7 @@
 #include <linux/netfilter/nf_tables.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/rhashtable.h>
+#include <net/netfilter/nf_conntrack_extend.h>
 #include <net/netfilter/nf_flow_table.h>
 #include <net/netlink.h>
 #include <net/flow_offload.h>
@@ -1732,4 +1733,10 @@ static inline bool nft_reg_track_cmp(struct nft_regs_track *track,
 int nft_payload_mangle(const struct nft_payload_set *priv,
 		       const struct nft_pktinfo *pkt,
 		       const u32 *src);
+
+#if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
+struct nf_conn_nft_ext {
+};
+#endif
+
 #endif /* _NET_NF_TABLES_H */
diff --git a/net/netfilter/nf_conntrack_extend.c b/net/netfilter/nf_conntrack_extend.c
index 0b513f7bf9f3..bb389042261e 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -23,6 +23,7 @@
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_synproxy.h>
 #include <net/netfilter/nf_conntrack_act_ct.h>
+#include <net/netfilter/nf_tables.h>
 #include <net/netfilter/nf_nat.h>
 
 #define NF_CT_EXT_PREALLOC	128u /* conntrack events are on by default */
@@ -54,12 +55,15 @@ static const u8 nf_ct_ext_type_len[NF_CT_EXT_NUM] = {
 #if IS_ENABLED(CONFIG_NET_ACT_CT)
 	[NF_CT_EXT_ACT_CT] = sizeof(struct nf_conn_act_ct_ext),
 #endif
+#if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
+	[NF_CT_EXT_NFT_EXT] = sizeof(struct nf_conn_nft_ext),
+#endif
 };
 
 static __always_inline unsigned int total_extension_size(void)
 {
 	/* remember to add new extensions below */
-	BUILD_BUG_ON(NF_CT_EXT_NUM > 10);
+	BUILD_BUG_ON(NF_CT_EXT_NUM > 11);
 
 	return sizeof(struct nf_ct_ext) +
 	       sizeof(struct nf_conn_help)
@@ -85,6 +89,9 @@ static __always_inline unsigned int total_extension_size(void)
 #endif
 #if IS_ENABLED(CONFIG_NET_ACT_CT)
 		+ sizeof(struct nf_conn_act_ct_ext)
+#endif
+#if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
+		+ sizeof(struct nf_conn_nft_ext)
 #endif
 	;
 }
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 13/19] netfilter: conntrack: register nft extension
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (11 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 12/19] netfilter: nft: empty nft conntrack extension Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 14/19] netfilter: nft: add payload context into extension Boris Sukholitko
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 2430 bytes --]

Add the new nftables extension in the core and netlink conntrack
initialization.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h    | 12 ++++++++++++
 net/netfilter/nf_conntrack_core.c    |  2 ++
 net/netfilter/nf_conntrack_netlink.c |  2 ++
 3 files changed, 16 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 7d433f8db2e7..8f34571fe345 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1737,6 +1737,18 @@ int nft_payload_mangle(const struct nft_payload_set *priv,
 #if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
 struct nf_conn_nft_ext {
 };
+
+static inline void nfct_nft_ext_add(struct nf_conn *ct)
+{
+	struct nf_conn_nft_ext *ext = nf_ct_ext_add(ct, NF_CT_EXT_NFT_EXT, GFP_ATOMIC);
+
+	if (ext)
+		memset(ext, 0, sizeof(*ext));
+}
+#else
+static inline void nfct_nft_ext_add(struct nf_conn *ct)
+{
+}
 #endif
 
 #endif /* _NET_NF_TABLES_H */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index db1ea361f2da..1614ea3e58da 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -49,6 +49,7 @@
 #include <net/netfilter/nf_conntrack_synproxy.h>
 #include <net/netfilter/nf_nat.h>
 #include <net/netfilter/nf_nat_helper.h>
+#include <net/netfilter/nf_tables.h>
 #include <net/netns/hash.h>
 #include <net/ip.h>
 
@@ -1747,6 +1748,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	nf_ct_acct_ext_add(ct, GFP_ATOMIC);
 	nf_ct_tstamp_ext_add(ct, GFP_ATOMIC);
 	nf_ct_labels_ext_add(ct);
+	nfct_nft_ext_add(ct);
 
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 	ecache = tmpl ? nf_ct_ecache_find(tmpl) : NULL;
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index fbc47e4b7bc3..4bc56a03d0a4 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -50,6 +50,7 @@
 #include <net/netfilter/nf_nat.h>
 #include <net/netfilter/nf_nat_helper.h>
 #endif
+#include <net/netfilter/nf_tables.h>
 
 #include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/nfnetlink_conntrack.h>
@@ -2307,6 +2308,7 @@ ctnetlink_create_conntrack(struct net *net,
 	nf_ct_labels_ext_add(ct);
 	nfct_seqadj_ext_add(ct);
 	nfct_synproxy_ext_add(ct);
+	nfct_nft_ext_add(ct);
 
 	if (cda[CTA_STATUS]) {
 		err = ctnetlink_change_status(ct, cda);
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 14/19] netfilter: nft: add payload context into extension
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (12 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 13/19] netfilter: conntrack: register nft extension Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

To apply the payload on both directions we need nft_payload_set struct
along with the source data.

There is also enum nft_ext_entry_type to signal the validity of the
context.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 8f34571fe345..ffcbe25d6bd2 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1735,7 +1735,19 @@ int nft_payload_mangle(const struct nft_payload_set *priv,
 		       const u32 *src);
 
 #if IS_ENABLED(CONFIG_NFT_CONNTRACK_EXT)
+enum nft_ext_entry_type {
+	NFT_EXT_UNDEFINED
+	, NFT_EXT_PAYLOAD_SET
+};
+
+struct nf_conn_nft_ext_entry {
+	enum nft_ext_entry_type nfte_type;
+	struct nft_payload_set nfte_payload;
+	u32 nfte_data;
+};
+
 struct nf_conn_nft_ext {
+	struct nf_conn_nft_ext_entry nfte_entries[IP_CT_DIR_MAX];
 };
 
 static inline void nfct_nft_ext_add(struct nf_conn *ct)
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 15/19] netfilter: nft: add payload application
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (13 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 14/19] netfilter: nft: add payload context into extension Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 23:32   ` kernel test robot
  2023-05-04  0:44   ` kernel test robot
  2023-05-03 12:55 ` [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle Boris Sukholitko
                   ` (6 subsequent siblings)
  21 siblings, 2 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 2587 bytes --]

nf_flow_offload_apply_payload function is defined in the new
nft_conntrack_ext.c file. It applies payload changes using
nft_payload_mangle helper.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h | 13 +++++++++++++
 net/netfilter/Makefile            |  2 ++
 net/netfilter/nft_conntrack_ext.c | 26 ++++++++++++++++++++++++++
 3 files changed, 41 insertions(+)
 create mode 100644 net/netfilter/nft_conntrack_ext.c

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index ffcbe25d6bd2..48357db14602 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1757,10 +1757,23 @@ static inline void nfct_nft_ext_add(struct nf_conn *ct)
 	if (ext)
 		memset(ext, 0, sizeof(*ext));
 }
+
+int nf_flow_offload_apply_payload(struct sk_buff *skb,
+				  struct nf_conn *ct,
+				  enum ip_conntrack_dir dir,
+				  unsigned int thoff);
 #else
 static inline void nfct_nft_ext_add(struct nf_conn *ct)
 {
 }
+
+static inline int nf_flow_offload_apply_payload(struct sk_buff *skb,
+						struct nf_conn *ct,
+						enum ip_conntrack_dir dir,
+						unsigned int thoff)
+{
+	return 0;
+}
 #endif
 
 #endif /* _NET_NF_TABLES_H */
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index d4958e7e7631..c28bf8eaa759 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -135,6 +135,8 @@ obj-$(CONFIG_NFT_SYNPROXY)	+= nft_synproxy.o
 
 obj-$(CONFIG_NFT_NAT)		+= nft_chain_nat.o
 
+obj-$(CONFIG_NFT_CONNTRACK_EXT)	+= nft_conntrack_ext.o
+
 # nf_tables netdev
 obj-$(CONFIG_NFT_DUP_NETDEV)	+= nft_dup_netdev.o
 obj-$(CONFIG_NFT_FWD_NETDEV)	+= nft_fwd_netdev.o
diff --git a/net/netfilter/nft_conntrack_ext.c b/net/netfilter/nft_conntrack_ext.c
new file mode 100644
index 000000000000..0dabd2a84422
--- /dev/null
+++ b/net/netfilter/nft_conntrack_ext.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <net/netfilter/nf_tables.h>
+
+int nf_flow_offload_apply_payload(struct sk_buff *skb,
+				  struct nf_conn *ct,
+				  enum ip_conntrack_dir dir,
+				  unsigned int thoff)
+{
+	struct nf_conn_nft_ext_entry *en;
+	struct nf_conn_nft_ext *ncft;
+	struct nft_pktinfo pkt;
+
+	ncft = nf_ct_ext_find(ct, NF_CT_EXT_NFT_EXT);
+	if (!ncft)
+		return 0;
+
+	en = &ncft->nfte_entries[dir];
+	if (en->nfte_type != NFT_EXT_PAYLOAD_SET)
+		return 0;
+
+	memset(&pkt, 0, sizeof(pkt));
+	pkt.skb = skb;
+	pkt.thoff = thoff;
+	return nft_payload_mangle(&en->nfte_payload, &pkt, &en->nfte_data);
+}
+EXPORT_SYMBOL_GPL(nf_flow_offload_apply_payload);
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (14 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 15:41   ` kernel test robot
  2023-05-03 12:55 ` [PATCH nf-next 17/19] netfilter: nftables: payload save mechanism Boris Sukholitko
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

Apply payload changes on the software fast path nf_flow_offload_ip_hook.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/nf_flow_table_ip.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 19efba1e51ef..5eb2ed8e1f74 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -13,6 +13,7 @@
 #include <net/ip6_route.h>
 #include <net/neighbour.h>
 #include <net/netfilter/nf_flow_table.h>
+#include <net/netfilter/nf_tables.h>
 #include <net/netfilter/nf_conntrack_acct.h>
 /* For layer 4 checksum field offset. */
 #include <linux/tcp.h>
@@ -391,6 +392,8 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 
 	iph = ip_hdr(skb);
 	nf_flow_nat_ip(flow, skb, thoff, dir, iph);
+	if (nf_flow_offload_apply_payload(skb, flow->ct, dir, thoff))
+		return NF_DROP;
 
 	ip_decrease_ttl(iph);
 	skb_clear_tstamp(skb);
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 17/19] netfilter: nftables: payload save mechanism
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (15 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 18/19] netfilter: nft_payload: save payload if needed Boris Sukholitko
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

nf_flow_offload_save_payload saves the payload in the nftables
conntrack extension so that nf_flow_offload_apply_payload can apply it
later.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 include/net/netfilter/nf_tables.h | 11 +++++++++++
 net/netfilter/nft_conntrack_ext.c | 30 ++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 48357db14602..6bfb38738838 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1762,6 +1762,10 @@ int nf_flow_offload_apply_payload(struct sk_buff *skb,
 				  struct nf_conn *ct,
 				  enum ip_conntrack_dir dir,
 				  unsigned int thoff);
+
+int nf_flow_offload_save_payload(struct sk_buff *skb,
+				 const struct nft_payload_set *priv,
+				 const u32 *src);
 #else
 static inline void nfct_nft_ext_add(struct nf_conn *ct)
 {
@@ -1774,6 +1778,13 @@ static inline int nf_flow_offload_apply_payload(struct sk_buff *skb,
 {
 	return 0;
 }
+
+static inline int nf_flow_offload_save_payload(struct sk_buff *skb,
+					       const struct nft_payload_set *priv,
+					       const u32 *src)
+{
+	return -1;
+}
 #endif
 
 #endif /* _NET_NF_TABLES_H */
diff --git a/net/netfilter/nft_conntrack_ext.c b/net/netfilter/nft_conntrack_ext.c
index 0dabd2a84422..750aeaaf2928 100644
--- a/net/netfilter/nft_conntrack_ext.c
+++ b/net/netfilter/nft_conntrack_ext.c
@@ -1,6 +1,36 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <net/netfilter/nf_tables.h>
 
+int nf_flow_offload_save_payload(struct sk_buff *skb,
+				 const struct nft_payload_set *priv,
+				 const u32 *src)
+{
+	struct nf_conn_nft_ext_entry *en;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn_nft_ext *ncft;
+	struct nf_conn *ct;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		goto err;
+
+	ncft = nf_ct_ext_find(ct, NF_CT_EXT_NFT_EXT);
+	if (!ncft)
+		goto err;
+
+	en = &ncft->nfte_entries[CTINFO2DIR(ctinfo)];
+	if (en->nfte_type != NFT_EXT_UNDEFINED)
+		goto err;
+
+	en->nfte_type = NFT_EXT_PAYLOAD_SET;
+	en->nfte_data = *src;
+	memcpy(&en->nfte_payload, priv, sizeof(*priv));
+	return 0;
+err:
+	return -1;
+}
+EXPORT_SYMBOL_GPL(nf_flow_offload_save_payload);
+
 int nf_flow_offload_apply_payload(struct sk_buff *skb,
 				  struct nf_conn *ct,
 				  enum ip_conntrack_dir dir,
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 18/19] netfilter: nft_payload: save payload if needed
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (16 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 17/19] netfilter: nftables: payload save mechanism Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 12:55 ` [PATCH nf-next 19/19] selftests: netfilter: dscp offload test Boris Sukholitko
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 733 bytes --]

If we have NFT_PAYLOAD_CAN_OFFLOAD flag set, save the payload for
later application.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 net/netfilter/nft_payload.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index a633f851316e..b8cb33316506 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -867,6 +867,10 @@ static void nft_payload_set_eval(const struct nft_expr *expr,
 	if (nft_payload_mangle(priv, pkt, src))
 		goto err;
 
+	if ((priv->csum_flags & NFT_PAYLOAD_CAN_OFFLOAD) &&
+	    nf_flow_offload_save_payload(pkt->skb, priv, src))
+		goto err;
+
 	return;
 err:
 	regs->verdict.code = NFT_BREAK;
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH nf-next 19/19] selftests: netfilter: dscp offload test
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (17 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 18/19] netfilter: nft_payload: save payload if needed Boris Sukholitko
@ 2023-05-03 12:55 ` Boris Sukholitko
  2023-05-03 18:46 ` [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Florian Westphal
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-03 12:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Ilya Lifshits, Boris Sukholitko

[-- Attachment #1: Type: text/plain, Size: 1729 bytes --]

Ensure that dscp offload modification nftables ruleset works.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
---
 .../selftests/netfilter/nft_flowtable.sh      | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
index 51f986f19fee..dc0980c64cc5 100755
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@@ -370,6 +370,46 @@ else
 	ret=1
 fi
 
+ip netns exec $nsr1 nft flush ruleset
+ip netns exec $nsr1 nft -f - <<EOF
+table inet filter {
+        flowtable f1 {
+                hook ingress priority filter
+		devices = { veth0, veth1 }
+        }
+
+        chain forward {
+                type filter hook forward priority filter; policy accept;
+                ip dscp set cs3 offload
+                ip protocol { tcp, udp, gre } flow add @f1
+                ct state established,related accept
+        }
+}
+EOF
+
+tf=/tmp/test_routing_dscp.dump
+rm -f $tf
+
+ip netns exec $ns1 nohup tcpdump -l -v -ne -i eth0 > $tf &
+td_pid=$!
+
+if test_tcp_forwarding $ns1 $ns2; then
+	kill $td_pid
+	wait $td_pid
+	out="$(grep -B1 '> 10.0.1.99' $tf | grep IP | grep -v 'tos 0x60' | head)"
+	if test -z "$out"; then
+		rm -f $tf
+		echo "PASS: dscp flow offloaded for ns1/ns2 is correct"
+	else
+		echo "non dscp packets found in $tf $out";
+		ret=1
+	fi
+else
+	echo "FAIL: dscp flow offload for ns1/ns2:" 1>&2
+	ip netns exec $nsr1 nft list ruleset
+	ret=1
+fi
+
 # delete default route, i.e. ns2 won't be able to reach ns1 and
 # will depend on ns1 being masqueraded in nsr1.
 # expect ns1 has nsr1 address.
-- 
2.32.0


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle
  2023-05-03 12:55 ` [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle Boris Sukholitko
@ 2023-05-03 15:41   ` kernel test robot
  0 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2023-05-03 15:41 UTC (permalink / raw)
  To: Boris Sukholitko, netfilter-devel
  Cc: oe-kbuild-all, Ilya Lifshits, Boris Sukholitko

Hi Boris,

kernel test robot noticed the following build warnings:

[auto build test WARNING on shuah-kselftest/next]
[also build test WARNING on shuah-kselftest/fixes linus/master v6.3 next-20230428]
[cannot apply to nf-next/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230503125552.41113-17-boris.sukholitko%40broadcom.com
patch subject: [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230503/202305032310.WgPNAhx8-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/35886376721f375b56fd9b99c079298c8e027a26
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
        git checkout 35886376721f375b56fd9b99c079298c8e027a26
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash net/netfilter/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305032310.WgPNAhx8-lkp@intel.com/

All warnings (new ones prefixed by >>):

   net/netfilter/nf_flow_table_ip.c: In function 'nf_flow_offload_ip_hook':
>> net/netfilter/nf_flow_table_ip.c:395:58: warning: implicit conversion from 'enum flow_offload_tuple_dir' to 'enum ip_conntrack_dir' [-Wenum-conversion]
     395 |         if (nf_flow_offload_apply_payload(skb, flow->ct, dir, thoff))
         |                                                          ^~~


vim +395 net/netfilter/nf_flow_table_ip.c

   339	
   340	unsigned int
   341	nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
   342				const struct nf_hook_state *state)
   343	{
   344		struct flow_offload_tuple_rhash *tuplehash;
   345		struct nf_flowtable *flow_table = priv;
   346		struct flow_offload_tuple tuple = {};
   347		enum flow_offload_tuple_dir dir;
   348		struct flow_offload *flow;
   349		struct net_device *outdev;
   350		u32 hdrsize, offset = 0;
   351		unsigned int thoff, mtu;
   352		struct rtable *rt;
   353		struct iphdr *iph;
   354		__be32 nexthop;
   355		int ret;
   356	
   357		if (skb->protocol != htons(ETH_P_IP) &&
   358		    !nf_flow_skb_encap_protocol(skb, htons(ETH_P_IP), &offset))
   359			return NF_ACCEPT;
   360	
   361		if (nf_flow_tuple_ip(skb, state->in, &tuple, &hdrsize, offset) < 0)
   362			return NF_ACCEPT;
   363	
   364		tuplehash = flow_offload_lookup(flow_table, &tuple);
   365		if (tuplehash == NULL)
   366			return NF_ACCEPT;
   367	
   368		dir = tuplehash->tuple.dir;
   369		flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
   370	
   371		mtu = flow->tuplehash[dir].tuple.mtu + offset;
   372		if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
   373			return NF_ACCEPT;
   374	
   375		iph = (struct iphdr *)(skb_network_header(skb) + offset);
   376		thoff = (iph->ihl * 4) + offset;
   377		if (nf_flow_state_check(flow, iph->protocol, skb, thoff))
   378			return NF_ACCEPT;
   379	
   380		if (!nf_flow_dst_check(&tuplehash->tuple)) {
   381			flow_offload_teardown(flow);
   382			return NF_ACCEPT;
   383		}
   384	
   385		if (skb_try_make_writable(skb, thoff + hdrsize))
   386			return NF_DROP;
   387	
   388		flow_offload_refresh(flow_table, flow);
   389	
   390		nf_flow_encap_pop(skb, tuplehash);
   391		thoff -= offset;
   392	
   393		iph = ip_hdr(skb);
   394		nf_flow_nat_ip(flow, skb, thoff, dir, iph);
 > 395		if (nf_flow_offload_apply_payload(skb, flow->ct, dir, thoff))
   396			return NF_DROP;
   397	
   398		ip_decrease_ttl(iph);
   399		skb_clear_tstamp(skb);
   400	
   401		if (flow_table->flags & NF_FLOWTABLE_COUNTER)
   402			nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
   403	
   404		if (unlikely(tuplehash->tuple.xmit_type == FLOW_OFFLOAD_XMIT_XFRM)) {
   405			rt = (struct rtable *)tuplehash->tuple.dst_cache;
   406			memset(skb->cb, 0, sizeof(struct inet_skb_parm));
   407			IPCB(skb)->iif = skb->dev->ifindex;
   408			IPCB(skb)->flags = IPSKB_FORWARDED;
   409			return nf_flow_xmit_xfrm(skb, state, &rt->dst);
   410		}
   411	
   412		switch (tuplehash->tuple.xmit_type) {
   413		case FLOW_OFFLOAD_XMIT_NEIGH:
   414			rt = (struct rtable *)tuplehash->tuple.dst_cache;
   415			outdev = rt->dst.dev;
   416			skb->dev = outdev;
   417			nexthop = rt_nexthop(rt, flow->tuplehash[!dir].tuple.src_v4.s_addr);
   418			skb_dst_set_noref(skb, &rt->dst);
   419			neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
   420			ret = NF_STOLEN;
   421			break;
   422		case FLOW_OFFLOAD_XMIT_DIRECT:
   423			ret = nf_flow_queue_xmit(state->net, skb, tuplehash, ETH_P_IP);
   424			if (ret == NF_DROP)
   425				flow_offload_teardown(flow);
   426			break;
   427		default:
   428			WARN_ON_ONCE(1);
   429			ret = NF_DROP;
   430			break;
   431		}
   432	
   433		return ret;
   434	}
   435	EXPORT_SYMBOL_GPL(nf_flow_offload_ip_hook);
   436	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (18 preceding siblings ...)
  2023-05-03 12:55 ` [PATCH nf-next 19/19] selftests: netfilter: dscp offload test Boris Sukholitko
@ 2023-05-03 18:46 ` Florian Westphal
  2023-05-07 15:22   ` Boris Sukholitko
  2023-05-03 20:30 ` Pablo Neira Ayuso
  2023-05-03 20:41 ` Pablo Neira Ayuso
  21 siblings, 1 reply; 41+ messages in thread
From: Florian Westphal @ 2023-05-03 18:46 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> Consider ruleset such as:
> 
> table inet filter {
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3
>                 ct state established,related accept
>         }
> }
> 
> As expected, all of the packets from 10.0.2.99 to 10.0.1.99 have IPv4 tos field
> changed to 0x60:
> 
> ...
> 13:36:42.474591 fe:dc:b3:e2:dc:3b > 5a:45:4d:2a:25:65, ethertype IPv4 (0x0800), length 1090: (tos 0x60, ttl 62, id 39855, offset 0, flags [none], proto TCP (6), length 1076)
>     10.0.2.99.12345 > 10.0.1.99.44084: Flags [P.], cksum 0x1bec (incorrect -> 0x44c3), seq 1:1025, ack 1025, win 1987, options [nop,nop,TS val 2854899766 ecr 3249774499], length 1024
> ...
> 
> Now lets try to add flow offload:
> 
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3
>                 ip protocol { tcp, udp, gre } flow add 
>                 ct state established,related accept
>         }
> }
> 
> Although some of the packets still have their TOS being correct, some are not:
> 
> ...
> 13:41:17.138782 5e:d5:1f:a3:ba:d1 > d2:d2:73:e6:5b:92, ethertype IPv4 (0x0800), length 1090: (tos 0x0, ttl 62, id 20142, offset 0, flags [none], proto TCP (6), length 1076)
>     10.0.2.99.12345 > 10.0.1.99.34230: Flags [P.], cksum 0x1bec (incorrect -> 0xc090), seq 1:1025, ack 1, win 2009, options [nop,nop,TS val 2855174430 ecr 3250049157], length 1024
> ...
> 
> The root cause for the bug seems to be that nft_payload_set_eval (which sets the
> dscp tos field) isn't being called on the offload fast path in
> nf_flow_offload_ip_hook.

I wish you would have reported this before you started to work on
this, because this is not a bug, this is expected behaviour.

Once you offload, the ruleset is bypassed, this is by design.
Lets not make the software offload more complex as it already is.

If you want to apply dscp payload modification, do not use flowtable
offload or hook those parts at netdev:ingress, it will be called before the
software offload pipeline.

I will reply to some of the changes to the shell tests because this
general reply above doesn't apply to those patches.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking
  2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
@ 2023-05-03 18:47   ` Florian Westphal
  2023-05-04  8:53     ` Boris Sukholitko
  0 siblings, 1 reply; 41+ messages in thread
From: Florian Westphal @ 2023-05-03 18:47 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> Some ps commands (e.g. busybox derived) have no -p option. Use /proc for
> pid existence check.
> 
> Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
> ---
>  tools/testing/selftests/netfilter/nft_flowtable.sh | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
> index 7060bae04ec8..4d8bc51b7a7b 100755
> --- a/tools/testing/selftests/netfilter/nft_flowtable.sh
> +++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
> @@ -288,11 +288,11 @@ test_tcp_forwarding_ip()
>  
>  	sleep 3
>  
> -	if ps -p $lpid > /dev/null;then
> +	if test -d /proc/"$lpid"/; then
>  		kill $lpid

I'd shorten both to

kill $lpid 2>/dev/null

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option
  2023-05-03 12:55 ` [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option Boris Sukholitko
@ 2023-05-03 18:53   ` Florian Westphal
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Westphal @ 2023-05-03 18:53 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> Some ps commands (e.g. busybox derived) have no -x option. For the
> purposes of hash calculation of the list of processes this option is
> inessential.
> 
> Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
> ---
>  tools/testing/selftests/netfilter/nft_flowtable.sh | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
> index 4d8bc51b7a7b..3cf20e9bd3a6 100755
> --- a/tools/testing/selftests/netfilter/nft_flowtable.sh
> +++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
> @@ -489,8 +489,8 @@ ip -net $nsr1 addr add 10.0.1.1/24 dev veth0
>  ip -net $nsr1 addr add dead:1::1/64 dev veth0
>  ip -net $nsr1 link set up dev veth0
>  
> -KEY_SHA="0x"$(ps -xaf | sha1sum | cut -d " " -f 1)
> -KEY_AES="0x"$(ps -xaf | md5sum | cut -d " " -f 1)
> +KEY_SHA="0x"$(ps -af | sha1sum | cut -d " " -f 1)

Could be

> +KEY_SHA="0x"$(ps xaf | sha1sum | cut -d " " -f 1)

instead.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes
  2023-05-03 12:55 ` [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes Boris Sukholitko
@ 2023-05-03 18:54   ` Florian Westphal
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Westphal @ 2023-05-03 18:54 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> When running nft_flowtable.sh in VM on a busy server we've found that
> the time of the netcat file transfers vary wildly.
> 
> Therefore replace hardcoded 3 second sleep with the loop checking for
> a change in the file sizes. Once no change in detected we test the results.
> 
> Nice side effect is that we shave 1 second sleep in the fast case
> (hard-coded 3 second sleep vs two 1 second sleeps).

Much better than the old 'sleep', thanks.

Acked-by: Florian Westphal <fw@strlen.de>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (19 preceding siblings ...)
  2023-05-03 18:46 ` [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Florian Westphal
@ 2023-05-03 20:30 ` Pablo Neira Ayuso
  2023-05-03 20:41 ` Pablo Neira Ayuso
  21 siblings, 0 replies; 41+ messages in thread
From: Pablo Neira Ayuso @ 2023-05-03 20:30 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

Hi,

On Wed, May 03, 2023 at 03:55:33PM +0300, Boris Sukholitko wrote:
[...]
> Now lets try to add flow offload:
> 
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3
>                 ip protocol { tcp, udp, gre } flow add 
>                 ct state established,related accept
>         }
> }

From user perspective, I think the way to go would be to allow users
to define a ruleset like this:

table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain ingress {
                type filter hook ingress device veth0 priority filter; policy accept; flags offload;
                ip dscp set cs3
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp, gre } flow add @f1
                ct state established,related accept
        }
}

This ruleset defines a policy at ingress, the offload flag tells that
this is offloaded to hardware for veth0, ie. all rule in the 'ingress'
chain will be placed in hardware in the ingress path. The IP DSCP
field is set on at the ingress (offload) hook, therefore, the host
(software) in tcpdump will see already mangled packets with IP DSCP
field set to cs3.

To achieve this, please have a look at net/netfilter/nf_tables_offload.c
for the ruleset offload infrastructure. This is called whenever the
chain comes with the offload flag set on.

struct nft_expr_ops provides an .offload and .offload_action callbacks
which you can use to populate the existing hardware offload API as
defined by include/net/flow_offload.h.

You will also have to extend the offload parser to translate the
nftables bytecode to the (hardware) flow_offload API, similar to what
nft_payload does to infer the header field you want to mangle (the
flow_offload hardware API uses the flow_dissector structure).

It is going to be a bit of work but I think this is feasible.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
                   ` (20 preceding siblings ...)
  2023-05-03 20:30 ` Pablo Neira Ayuso
@ 2023-05-03 20:41 ` Pablo Neira Ayuso
  2023-05-04  8:50   ` Boris Sukholitko
  21 siblings, 1 reply; 41+ messages in thread
From: Pablo Neira Ayuso @ 2023-05-03 20:41 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: netfilter-devel, Ilya Lifshits

On Wed, May 03, 2023 at 03:55:33PM +0300, Boris Sukholitko wrote:
> Some high level description of the patches:
> 
> * patches 1-4 fix small but annoying infelicities in nft_flowtable.sh test script

Can you post a v2 addressing Florian's comments to integrate these
upstream?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 15/19] netfilter: nft: add payload application
  2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
@ 2023-05-03 23:32   ` kernel test robot
  2023-05-04  0:44   ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2023-05-03 23:32 UTC (permalink / raw)
  To: Boris Sukholitko, netfilter-devel
  Cc: oe-kbuild-all, Ilya Lifshits, Boris Sukholitko

Hi Boris,

kernel test robot noticed the following build errors:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes linus/master v6.3 next-20230428]
[cannot apply to nf-next/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230503125552.41113-16-boris.sukholitko%40broadcom.com
patch subject: [PATCH nf-next 15/19] netfilter: nft: add payload application
config: arm-allmodconfig (https://download.01.org/0day-ci/archive/20230504/202305040734.75NgGlSH-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2dbe43b4daeda03360373d0a4f8ed72efee89a6a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
        git checkout 2dbe43b4daeda03360373d0a4f8ed72efee89a6a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305040734.75NgGlSH-lkp@intel.com/

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: net/netfilter/nft_conntrack_ext.o: in function `nf_flow_offload_apply_payload':
   nft_conntrack_ext.c:(.text+0xa0): undefined reference to `__nf_ct_ext_find'
>> arm-linux-gnueabi-ld: nft_conntrack_ext.c:(.text+0x110): undefined reference to `nft_payload_mangle'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 15/19] netfilter: nft: add payload application
  2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
  2023-05-03 23:32   ` kernel test robot
@ 2023-05-04  0:44   ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2023-05-04  0:44 UTC (permalink / raw)
  To: Boris Sukholitko, netfilter-devel
  Cc: oe-kbuild-all, Ilya Lifshits, Boris Sukholitko

Hi Boris,

kernel test robot noticed the following build errors:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes linus/master v6.3 next-20230428]
[cannot apply to nf-next/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230503125552.41113-16-boris.sukholitko%40broadcom.com
patch subject: [PATCH nf-next 15/19] netfilter: nft: add payload application
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20230504/202305040828.dS4A4XA6-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2dbe43b4daeda03360373d0a4f8ed72efee89a6a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Boris-Sukholitko/selftest-netfilter-use-proc-for-pid-checking/20230503-205838
        git checkout 2dbe43b4daeda03360373d0a4f8ed72efee89a6a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305040828.dS4A4XA6-lkp@intel.com/

All errors (new ones prefixed by >>):

   powerpc-linux-ld: net/netfilter/nft_conntrack_ext.o: in function `nf_flow_offload_apply_payload':
   nft_conntrack_ext.c:(.text+0xb4): undefined reference to `__nf_ct_ext_find'
>> powerpc-linux-ld: nft_conntrack_ext.c:(.text+0x124): undefined reference to `nft_payload_mangle'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-03 20:41 ` Pablo Neira Ayuso
@ 2023-05-04  8:50   ` Boris Sukholitko
  0 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-04  8:50 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]

On Wed, May 3, 2023 at 11:41 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Wed, May 03, 2023 at 03:55:33PM +0300, Boris Sukholitko wrote:
> > Some high level description of the patches:
> >
> > * patches 1-4 fix small but annoying infelicities in nft_flowtable.sh test script
>
> Can you post a v2 addressing Florian's comments to integrate these
> upstream?

Done.

Thanks,
Boris.

>
> Thanks.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking
  2023-05-03 18:47   ` Florian Westphal
@ 2023-05-04  8:53     ` Boris Sukholitko
  0 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-04  8:53 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

Hi,

On Wed, May 3, 2023 at 9:47 PM Florian Westphal <fw@strlen.de> wrote:
>
> Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > Some ps commands (e.g. busybox derived) have no -p option. Use /proc for
> > pid existence check.
> >
> > Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
> > ---
> >  tools/testing/selftests/netfilter/nft_flowtable.sh | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh
> > index 7060bae04ec8..4d8bc51b7a7b 100755
> > --- a/tools/testing/selftests/netfilter/nft_flowtable.sh
> > +++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
> > @@ -288,11 +288,11 @@ test_tcp_forwarding_ip()
> >
> >       sleep 3
> >
> > -     if ps -p $lpid > /dev/null;then
> > +     if test -d /proc/"$lpid"/; then
> >               kill $lpid
>
> I'd shorten both to
>
> kill $lpid 2>/dev/null

I don't like the kill here because it may fail silently should we run
with "set -e" in the future.

I took the liberty to leave the test as is in v2.

Thanks,
Boris.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-03 18:46 ` [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Florian Westphal
@ 2023-05-07 15:22   ` Boris Sukholitko
  2023-05-07 17:37     ` Florian Westphal
  0 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-07 15:22 UTC (permalink / raw)
  To: Florian Westphal, Pablo Neira Ayuso; +Cc: netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 1992 bytes --]

Hi Florian, Pablo,

CC'ing Pablo because of the non-applicability of ingress chain
solution to our customers.

On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
>
> Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
[... snip to non working offload ...]

> > table inet filter {
> >         flowtable f1 {
> >                 hook ingress priority filter
> >                 devices = { veth0, veth1 }
> >         }
> >
> >         chain forward {
> >                 type filter hook forward priority filter; policy accept;
> >                 ip dscp set cs3 offload
> >                 ip protocol { tcp, udp, gre } flow add @f1
> >                 ct state established,related accept
> >         }
> > }

[...]

>
> I wish you would have reported this before you started to work on
> this, because this is not a bug, this is expected behaviour.
>
> Once you offload, the ruleset is bypassed, this is by design.

From the rules UI perspective it seems possible to accelerate
forward chain handling with the statements such as dscp modification there.

Isn't it better to modify the packets according to the bypassed
ruleset thus making the behaviour more consistent?

> Lets not make the software offload more complex as it already is.

Could you please tell which parts of software offload are too complex?
It's not too bad from what I've seen :)

This patch series adds 56 lines of code in the new nf_conntrack.ext.c
file. 20 of them (nf_flow_offload_apply_payload) are used in
the software fast path. Is it too high of a price?

>
> If you want to apply dscp payload modification, do not use flowtable
> offload or hook those parts at netdev:ingress, it will be called before the
> software offload pipeline.
>

The problem is that our customers need to apply dscp modification in
more complex scenarios, e.g. after NAT.
Therefore I am not sure that ingress chain is enough for them.

Thanks,
Boris.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-07 15:22   ` Boris Sukholitko
@ 2023-05-07 17:37     ` Florian Westphal
  2023-05-08 13:38       ` Boris Sukholitko
  0 siblings, 1 reply; 41+ messages in thread
From: Florian Westphal @ 2023-05-07 17:37 UTC (permalink / raw)
  To: Boris Sukholitko
  Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel,
	Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> >
> > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> [... snip to non working offload ...]
> 
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > >
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 ip dscp set cs3 offload
> > >                 ip protocol { tcp, udp, gre } flow add @f1
> > >                 ct state established,related accept
> > >         }
> > > }
> 
> [...]
> 
> >
> > I wish you would have reported this before you started to work on
> > this, because this is not a bug, this is expected behaviour.
> >
> > Once you offload, the ruleset is bypassed, this is by design.
> 
> From the rules UI perspective it seems possible to accelerate
> forward chain handling with the statements such as dscp modification there.
> 
> Isn't it better to modify the packets according to the bypassed
> ruleset thus making the behaviour more consistent?

The behaviour is consistent.  Once flow is offloaded, ruleset is
bypassed.  Its easy to not offload those flows that need the ruleset.

> > Lets not make the software offload more complex as it already is.
> 
> Could you please tell which parts of software offload are too complex?
> It's not too bad from what I've seen :)
> 
> This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> file. 20 of them (nf_flow_offload_apply_payload) are used in
> the software fast path. Is it too high of a price?

56 lines of code *now*.

Next someone wants to call into sets/maps for named counters that
they need.  Then someone wants limit or quota to work.  Then they want fib
for RPF.  Then xfrm policy matching to augment acccounting.
This will go on until we get to the point where removing "fast" path
turns into a performance optimization.

Existing rule hw offload via netdev:ingress makes it clear
what rules are offloaded and to which device and it augments
flowtable feature regardless if thats handled by software fastpath,
software fallback/slowpath or by hardware offload.

> > If you want to apply dscp payload modification, do not use flowtable
> > offload or hook those parts at netdev:ingress, it will be called before the
> > software offload pipeline.
> >
> 
> The problem is that our customers need to apply dscp modification in
> more complex scenarios, e.g. after NAT.
> Therefore I am not sure that ingress chain is enough for them.

I don't understand why this would have to occur after nat, but
netdev:egress exists as well.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-07 17:37     ` Florian Westphal
@ 2023-05-08 13:38       ` Boris Sukholitko
  2023-05-08 20:07         ` Pablo Neira Ayuso
  2023-05-09  9:48         ` Florian Westphal
  0 siblings, 2 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-08 13:38 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 4045 bytes --]

On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > >
> > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > [... snip to non working offload ...]
> > 
> > > > table inet filter {
> > > >         flowtable f1 {
> > > >                 hook ingress priority filter
> > > >                 devices = { veth0, veth1 }
> > > >         }
> > > >
> > > >         chain forward {
> > > >                 type filter hook forward priority filter; policy accept;
> > > >                 ip dscp set cs3 offload
> > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > >                 ct state established,related accept
> > > >         }
> > > > }
> > 
> > [...]
> > 
> > >
> > > I wish you would have reported this before you started to work on
> > > this, because this is not a bug, this is expected behaviour.
> > >
> > > Once you offload, the ruleset is bypassed, this is by design.
> > 
> > From the rules UI perspective it seems possible to accelerate
> > forward chain handling with the statements such as dscp modification there.
> > 
> > Isn't it better to modify the packets according to the bypassed
> > ruleset thus making the behaviour more consistent?
> 
> The behaviour is consistent.  Once flow is offloaded, ruleset is
> bypassed.  Its easy to not offload those flows that need the ruleset.
> 
> > > Lets not make the software offload more complex as it already is.
> > 
> > Could you please tell which parts of software offload are too complex?
> > It's not too bad from what I've seen :)
> > 
> > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > the software fast path. Is it too high of a price?
> 
> 56 lines of code *now*.
> 
> Next someone wants to call into sets/maps for named counters that
> they need.  Then someone wants limit or quota to work.  Then they want fib
> for RPF.  Then xfrm policy matching to augment acccounting.
> This will go on until we get to the point where removing "fast" path
> turns into a performance optimization.

OK. May I assume that you are concerned with the eventual performance impact
on the software fast path (i.e. nf_flow_offload_ip_hook)?

Obviously the performance of the fast path is very important to our
customers. Otherwise they would not be requiring dscp fast path
modification. :)

One of the things we've thought about regarding the fast path
performance is rewriting nf_flow_offload_ip_hook to work with
nf_flowtable->flow_block instead of flow_offload_tuple.

We hope that iterating over flow_action_entry list similar to what the
hardware acceleration does, will be more efficient also in software.

Nice side-effect of such optimization would be that the amount of
feature bloat (such as dscp modification!) will not affect your typical
connection unless the user actually uses them.

For example, for dscp payload modification we'll generate
FLOW_ACTION_MANGLE entry. This entry will appear on flow_block's of
the only connections which require it. Others will be uneffected.

Would you be ok with such direction (with performance tests of
course)? 

Thanks,
Boris.

> 
> Existing rule hw offload via netdev:ingress makes it clear
> what rules are offloaded and to which device and it augments
> flowtable feature regardless if thats handled by software fastpath,
> software fallback/slowpath or by hardware offload.
> 
> > > If you want to apply dscp payload modification, do not use flowtable
> > > offload or hook those parts at netdev:ingress, it will be called before the
> > > software offload pipeline.
> > >
> > 
> > The problem is that our customers need to apply dscp modification in
> > more complex scenarios, e.g. after NAT.
> > Therefore I am not sure that ingress chain is enough for them.
> 
> I don't understand why this would have to occur after nat, but
> netdev:egress exists as well.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-08 13:38       ` Boris Sukholitko
@ 2023-05-08 20:07         ` Pablo Neira Ayuso
  2023-05-09 14:56           ` Boris Sukholitko
  2023-05-09  9:48         ` Florian Westphal
  1 sibling, 1 reply; 41+ messages in thread
From: Pablo Neira Ayuso @ 2023-05-08 20:07 UTC (permalink / raw)
  To: Boris Sukholitko; +Cc: Florian Westphal, netfilter-devel, Ilya Lifshits

On Mon, May 08, 2023 at 04:38:06PM +0300, Boris Sukholitko wrote:
> On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > > >
> > > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > [... snip to non working offload ...]
> > > 
> > > > > table inet filter {
> > > > >         flowtable f1 {
> > > > >                 hook ingress priority filter
> > > > >                 devices = { veth0, veth1 }
> > > > >         }
> > > > >
> > > > >         chain forward {
> > > > >                 type filter hook forward priority filter; policy accept;
> > > > >                 ip dscp set cs3 offload
> > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > >                 ct state established,related accept
> > > > >         }
> > > > > }
> > > 
> > > [...]
> > > 
> > > >
> > > > I wish you would have reported this before you started to work on
> > > > this, because this is not a bug, this is expected behaviour.
> > > >
> > > > Once you offload, the ruleset is bypassed, this is by design.
> > > 
> > > From the rules UI perspective it seems possible to accelerate
> > > forward chain handling with the statements such as dscp modification there.
> > > 
> > > Isn't it better to modify the packets according to the bypassed
> > > ruleset thus making the behaviour more consistent?
> > 
> > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > bypassed.  Its easy to not offload those flows that need the ruleset.
> > 
> > > > Lets not make the software offload more complex as it already is.
> > > 
> > > Could you please tell which parts of software offload are too complex?
> > > It's not too bad from what I've seen :)
> > > 
> > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > the software fast path. Is it too high of a price?
> > 
> > 56 lines of code *now*.
> > 
> > Next someone wants to call into sets/maps for named counters that
> > they need.  Then someone wants limit or quota to work.  Then they want fib
> > for RPF.  Then xfrm policy matching to augment acccounting.
> > This will go on until we get to the point where removing "fast" path
> > turns into a performance optimization.
> 
> OK. May I assume that you are concerned with the eventual performance impact
> on the software fast path (i.e. nf_flow_offload_ip_hook)?

I think Florian's concern is that there is better infrastructure to
handle for ruleset offloads, ie. ingress/egress ruleset hardware
offload infrastructure.

> Obviously the performance of the fast path is very important to our
> customers. Otherwise they would not be requiring dscp fast path
> modification. :)
> 
> One of the things we've thought about regarding the fast path
> performance is rewriting nf_flow_offload_ip_hook to work with
> nf_flowtable->flow_block instead of flow_offload_tuple.
> 
> We hope that iterating over flow_action_entry list similar to what the
> hardware acceleration does, will be more efficient also in software.
> 
> Nice side-effect of such optimization would be that the amount of
> feature bloat (such as dscp modification!) will not affect your typical
> connection unless the user actually uses them.
> 
> For example, for dscp payload modification we'll generate
> FLOW_ACTION_MANGLE entry. This entry will appear on flow_block's of
> the only connections which require it. Others will be uneffected.
>
> Would you be ok with such direction (with performance tests of
> course)?

I am still missing the reason why the ingress/egress ruleset hardware
offload infrastructure is not a good fit for your requirements.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-08 13:38       ` Boris Sukholitko
  2023-05-08 20:07         ` Pablo Neira Ayuso
@ 2023-05-09  9:48         ` Florian Westphal
  2023-05-10  7:49           ` Boris Sukholitko
  1 sibling, 1 reply; 41+ messages in thread
From: Florian Westphal @ 2023-05-09  9:48 UTC (permalink / raw)
  To: Boris Sukholitko
  Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel,
	Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > > >
> > > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > [... snip to non working offload ...]
> > > 
> > > > > table inet filter {
> > > > >         flowtable f1 {
> > > > >                 hook ingress priority filter
> > > > >                 devices = { veth0, veth1 }
> > > > >         }
> > > > >
> > > > >         chain forward {
> > > > >                 type filter hook forward priority filter; policy accept;
> > > > >                 ip dscp set cs3 offload
> > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > >                 ct state established,related accept
> > > > >         }
> > > > > }
> > > 
> > > [...]
> > > 
> > > >
> > > > I wish you would have reported this before you started to work on
> > > > this, because this is not a bug, this is expected behaviour.
> > > >
> > > > Once you offload, the ruleset is bypassed, this is by design.
> > > 
> > > From the rules UI perspective it seems possible to accelerate
> > > forward chain handling with the statements such as dscp modification there.
> > > 
> > > Isn't it better to modify the packets according to the bypassed
> > > ruleset thus making the behaviour more consistent?
> > 
> > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > bypassed.  Its easy to not offload those flows that need the ruleset.
> > 
> > > > Lets not make the software offload more complex as it already is.
> > > 
> > > Could you please tell which parts of software offload are too complex?
> > > It's not too bad from what I've seen :)
> > > 
> > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > the software fast path. Is it too high of a price?
> > 
> > 56 lines of code *now*.
> > 
> > Next someone wants to call into sets/maps for named counters that
> > they need.  Then someone wants limit or quota to work.  Then they want fib
> > for RPF.  Then xfrm policy matching to augment acccounting.
> > This will go on until we get to the point where removing "fast" path
> > turns into a performance optimization.
> 
> OK. May I assume that you are concerned with the eventual performance impact
> on the software fast path (i.e. nf_flow_offload_ip_hook)?

Yes, but I also dislike the concept, see below.

> Obviously the performance of the fast path is very important to our
> customers. Otherwise they would not be requiring dscp fast path
> modification. :)
> 
> One of the things we've thought about regarding the fast path
> performance is rewriting nf_flow_offload_ip_hook to work with
> nf_flowtable->flow_block instead of flow_offload_tuple.

Sorry, I should have expanded on my reservations towards this concept.

Let me explain.
Lets consider your original example first:

----------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3
                ip protocol { tcp, udp, gre } flow add
                ct state established,related accept
        }
}
----------

This has a clearly defined meaning in all possible combinations.

Software:
1. It defines a bypass for veth0 <-> veth1
2. the way this specific ruleset is defined, all of tcp/udp/gre will
   attempt to offload
3. once offload has happened, entire inet:forward may be bypassed
4. User ruleset needs to cope with packets being moved back to
   software: fragmented packets, tcp fin/rst, hw timeouts and so on.
5. User can control via 'offload' keyword if HW offload should be
   attempted or not

Hardware:
even 'nf_flow_offload_ip_hook' may be bypassed.  But nothing changes
compared to 'no hw offload' case from a conceptual point of view.

Lets now consider existing netdev:ingress/egress in this same picture:
(Example from Pablo):
------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain ingress {
                type filter hook ingress device veth0 priority filter; policy accept; flags offload;
                ip dscp set cs3
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp, gre } flow add @f1
                ct state established,related accept
        }
}

Again, this has defined meaning in all combinations:
With HW offload: veth0 will be told to mangle dscp.
This happens in all cases and for every matching packet,
regardless if a flowtable exists or not.

Same would happen for 'egress', just that it would happen at xmit time
rather at receive time.  Again, its not relevant if there is active
flowtable or not, or if data path is offloaded to hardware, to software,
handled by fallback or entirely without flowtables being present.

Its also clear that this is tied to 'veth0', other devices will
not be involved and not do such mangling.

Now lets look at your proposal:
----------------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3 offload
                ip protocol { tcp, udp, gre } flow add
                ct state established,related accept
        }
}
----------------

This means that software flowtable offload
shall do a 'ip dscp set cs3'.

What if the flowtable is offloaded to hardware
entirely, without software fallback?

What if the devices listed in the flowtable definition can handle
flow offload, but no payload mangling?

Does the 'offload' mean that the rule is only active for
software path?  Only for hardware path? both?

How can I tell if its offloaded to hardware for one device
but not for the other?  Or will that be disallowed?

What if someone adds another rule after or before 'ip dscp',
but without the 'offload' keyword?  Now ordering becomes an
issue.

Users now need to consider different control flows:

  jump exceptions
  ip dscp set cs3 offload

  chain exceptions {
    ip daddr 1.2.3.4 accept
  }

This won't work as expected, because offloaded flows will not
pass through 'forward' chain but somehow a few selected rules
will be run anyway.

TL;DR: I think that for HW offload its paramount to make it crystal
clear as to which device is responsible to handle such rules.

The existing netdev:ingress/egress hooks provide the needed
chain/rules/expression:device mapping.  User can easily
tell if HW is responsible or SW by looking for 'offload' flag
presence.

I don't think mixing software and hardware offload contexts as proposed
is a good idea, both from user frontend syntax, clarity and error reporting
(e.g. if hw rejects offload request) point of view.

I also believe that allowing payload mangling from *software* offload
path sets a precedence for essentially allowing all other expressions
again which completely negates the flowtable concept.

I still think that dscp mangling should be done via netdev:ingress/egress
hooks, I don't see why this has to be bolted into flowtable sw offload.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-08 20:07         ` Pablo Neira Ayuso
@ 2023-05-09 14:56           ` Boris Sukholitko
  0 siblings, 0 replies; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-09 14:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 4693 bytes --]

On Mon, May 08, 2023 at 10:07:40PM +0200, Pablo Neira Ayuso wrote:
> On Mon, May 08, 2023 at 04:38:06PM +0300, Boris Sukholitko wrote:
> > On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > > > >
> > > > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > > [... snip to non working offload ...]
> > > > 
> > > > > > table inet filter {
> > > > > >         flowtable f1 {
> > > > > >                 hook ingress priority filter
> > > > > >                 devices = { veth0, veth1 }
> > > > > >         }
> > > > > >
> > > > > >         chain forward {
> > > > > >                 type filter hook forward priority filter; policy accept;
> > > > > >                 ip dscp set cs3 offload
> > > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > > >                 ct state established,related accept
> > > > > >         }
> > > > > > }
> > > > 
> > > > [...]
> > > > 
> > > > >
> > > > > I wish you would have reported this before you started to work on
> > > > > this, because this is not a bug, this is expected behaviour.
> > > > >
> > > > > Once you offload, the ruleset is bypassed, this is by design.
> > > > 
> > > > From the rules UI perspective it seems possible to accelerate
> > > > forward chain handling with the statements such as dscp modification there.
> > > > 
> > > > Isn't it better to modify the packets according to the bypassed
> > > > ruleset thus making the behaviour more consistent?
> > > 
> > > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > > bypassed.  Its easy to not offload those flows that need the ruleset.
> > > 
> > > > > Lets not make the software offload more complex as it already is.
> > > > 
> > > > Could you please tell which parts of software offload are too complex?
> > > > It's not too bad from what I've seen :)
> > > > 
> > > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > > the software fast path. Is it too high of a price?
> > > 
> > > 56 lines of code *now*.
> > > 
> > > Next someone wants to call into sets/maps for named counters that
> > > they need.  Then someone wants limit or quota to work.  Then they want fib
> > > for RPF.  Then xfrm policy matching to augment acccounting.
> > > This will go on until we get to the point where removing "fast" path
> > > turns into a performance optimization.
> > 
> > OK. May I assume that you are concerned with the eventual performance impact
> > on the software fast path (i.e. nf_flow_offload_ip_hook)?
> 
> I think Florian's concern is that there is better infrastructure to
> handle for ruleset offloads, ie. ingress/egress ruleset hardware
> offload infrastructure.
> 

I fully agree that ingress and egress chains are better for hardware
offload.

But what about software only platform? Having additional ingress or
egress chains means going through nftables VM in addition to the forward
chain. With the usual "premature optimization" disclaimers, IMHO we can
get better performance by doing only forward table software fast path.

> > Obviously the performance of the fast path is very important to our
> > customers. Otherwise they would not be requiring dscp fast path
> > modification. :)
> > 
> > One of the things we've thought about regarding the fast path
> > performance is rewriting nf_flow_offload_ip_hook to work with
> > nf_flowtable->flow_block instead of flow_offload_tuple.
> > 
> > We hope that iterating over flow_action_entry list similar to what the
> > hardware acceleration does, will be more efficient also in software.
> > 
> > Nice side-effect of such optimization would be that the amount of
> > feature bloat (such as dscp modification!) will not affect your typical
> > connection unless the user actually uses them.
> > 
> > For example, for dscp payload modification we'll generate
> > FLOW_ACTION_MANGLE entry. This entry will appear on flow_block's of
> > the only connections which require it. Others will be uneffected.
> >
> > Would you be ok with such direction (with performance tests of
> > course)?
> 
> I am still missing the reason why the ingress/egress ruleset hardware
> offload infrastructure is not a good fit for your requirements.

Because our current target is maximizing performance of the software fast
path only. We need to have our customer nftables rulesets running as
fast as possible on their hardware. For this, software fast path seems
like ideal solution. Maybe with a bit of work ... :)

Thanks,
Boris.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-09  9:48         ` Florian Westphal
@ 2023-05-10  7:49           ` Boris Sukholitko
  2023-05-10 12:55             ` Florian Westphal
  0 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-10  7:49 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 10864 bytes --]

On Tue, May 09, 2023 at 11:48:27AM +0200, Florian Westphal wrote:
> Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > > > >
> > > > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > > [... snip to non working offload ...]
> > > > 
> > > > > > table inet filter {
> > > > > >         flowtable f1 {
> > > > > >                 hook ingress priority filter
> > > > > >                 devices = { veth0, veth1 }
> > > > > >         }
> > > > > >
> > > > > >         chain forward {
> > > > > >                 type filter hook forward priority filter; policy accept;
> > > > > >                 ip dscp set cs3 offload
> > > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > > >                 ct state established,related accept
> > > > > >         }
> > > > > > }
> > > > 
> > > > [...]
> > > > 
> > > > >
> > > > > I wish you would have reported this before you started to work on
> > > > > this, because this is not a bug, this is expected behaviour.
> > > > >
> > > > > Once you offload, the ruleset is bypassed, this is by design.
> > > > 
> > > > From the rules UI perspective it seems possible to accelerate
> > > > forward chain handling with the statements such as dscp modification there.
> > > > 
> > > > Isn't it better to modify the packets according to the bypassed
> > > > ruleset thus making the behaviour more consistent?
> > > 
> > > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > > bypassed.  Its easy to not offload those flows that need the ruleset.
> > > 
> > > > > Lets not make the software offload more complex as it already is.
> > > > 
> > > > Could you please tell which parts of software offload are too complex?
> > > > It's not too bad from what I've seen :)
> > > > 
> > > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > > the software fast path. Is it too high of a price?
> > > 
> > > 56 lines of code *now*.
> > > 
> > > Next someone wants to call into sets/maps for named counters that
> > > they need.  Then someone wants limit or quota to work.  Then they want fib
> > > for RPF.  Then xfrm policy matching to augment acccounting.
> > > This will go on until we get to the point where removing "fast" path
> > > turns into a performance optimization.
> > 
> > OK. May I assume that you are concerned with the eventual performance impact
> > on the software fast path (i.e. nf_flow_offload_ip_hook)?
> 
> Yes, but I also dislike the concept, see below.
> 
> > Obviously the performance of the fast path is very important to our
> > customers. Otherwise they would not be requiring dscp fast path
> > modification. :)
> > 
> > One of the things we've thought about regarding the fast path
> > performance is rewriting nf_flow_offload_ip_hook to work with
> > nf_flowtable->flow_block instead of flow_offload_tuple.
> 
> Sorry, I should have expanded on my reservations towards this concept.
> 
> Let me explain.
> Lets consider your original example first:
> 
> ----------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3
>                 ip protocol { tcp, udp, gre } flow add
>                 ct state established,related accept
>         }
> }
> ----------
> 
> This has a clearly defined meaning in all possible combinations.
> 
> Software:
> 1. It defines a bypass for veth0 <-> veth1
> 2. the way this specific ruleset is defined, all of tcp/udp/gre will
>    attempt to offload

OK.

> 3. once offload has happened, entire inet:forward may be bypassed

By bypassed, do you mean that chain forward ruleset body becomes
irrelevant? If the unfortunate answer is yes, than as in the original
report, once we are in the software fast path we do not do dscp
modification, right?

Then once we hit the NF_FLOW_TIMEOUT, some of the packets will have dscp
modified, because we've went out of software acceleration. I.e. some of
the packets will arrive with and some without dscp. As you can imagine,
this could be hard to debug. This is the scenario that the patch series
tries to fix.

> 4. User ruleset needs to cope with packets being moved back to
>    software: fragmented packets, tcp fin/rst, hw timeouts and so on.

Should we require our user to understand that some lines in their
forward table configuration may or may not be executed sporadically?
Should we give the user at least some kind of warning regarding this
during the ruleset load?

To be constructive, isn't it better to rephrase points 3 and 4 as:

3. once offload has happened, entire inet:forward will be executed with
the same semantics but with better performance. Any difference between
fast and slow path is considered a bug.

4. If something in user ruleset (such as dscp rule now) precludes fast
path optimization then either error will be given or slow path will be
taken with a warning.

> 5. User can control via 'offload' keyword if HW offload should be
>    attempted or not
> 
> Hardware:
> even 'nf_flow_offload_ip_hook' may be bypassed.  But nothing changes
> compared to 'no hw offload' case from a conceptual point of view.
> 

Agreed.

> Lets now consider existing netdev:ingress/egress in this same picture:
> (Example from Pablo):
> ------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain ingress {
>                 type filter hook ingress device veth0 priority filter; policy accept; flags offload;
>                 ip dscp set cs3
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 meta l4proto { tcp, udp, gre } flow add @f1
>                 ct state established,related accept
>         }
> }
> 
> Again, this has defined meaning in all combinations:
> With HW offload: veth0 will be told to mangle dscp.
> This happens in all cases and for every matching packet,
> regardless if a flowtable exists or not.
> 
> Same would happen for 'egress', just that it would happen at xmit time
> rather at receive time.  Again, its not relevant if there is active
> flowtable or not, or if data path is offloaded to hardware, to software,
> handled by fallback or entirely without flowtables being present.
> 
> Its also clear that this is tied to 'veth0', other devices will
> not be involved and not do such mangling.
> 

As I've mentioned in my other reply to Pablo, our focus is exclusively
on the *software* fast path of the forward chain. In this scenario
getting into additional nftables VM path in the ingress or egress seems
like pessimization which we'd like to avoid.

> Now lets look at your proposal:
> ----------------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3 offload
>                 ip protocol { tcp, udp, gre } flow add
>                 ct state established,related accept
>         }
> }
> ----------------
> 
> This means that software flowtable offload
> shall do a 'ip dscp set cs3'.
> 
> What if the flowtable is offloaded to hardware
> entirely, without software fallback?
> 
> What if the devices listed in the flowtable definition can handle
> flow offload, but no payload mangling?
> 
> Does the 'offload' mean that the rule is only active for
> software path?  Only for hardware path? both?
> 
> How can I tell if its offloaded to hardware for one device
> but not for the other?  Or will that be disallowed?
> 
> What if someone adds another rule after or before 'ip dscp',
> but without the 'offload' keyword?  Now ordering becomes an
> issue.
> 
> Users now need to consider different control flows:
> 
>   jump exceptions
>   ip dscp set cs3 offload
> 
>   chain exceptions {
>     ip daddr 1.2.3.4 accept
>   }
> 
> This won't work as expected, because offloaded flows will not
> pass through 'forward' chain but somehow a few selected rules
> will be run anyway.
> 
> TL;DR: I think that for HW offload its paramount to make it crystal
> clear as to which device is responsible to handle such rules.

Your critique of my offload flag is well deserved and fully correct,
thanks! The reason for the dscp statement offload flag was to try to be
explicit about the need for payload modification capture.

What we've really wanted to do here is to make the payload capture
dependent on the chain flowtable acceleration status. I.e. if the
forward chain is supposed to be accelerated, than the payload
modification capture should happen. Being lazy, I've went through that
ugly keyword path. Apologies for that.
 
> 
> The existing netdev:ingress/egress hooks provide the needed
> chain/rules/expression:device mapping.  User can easily
> tell if HW is responsible or SW by looking for 'offload' flag
> presence.
> 

Yes. But again hardware offload is irrelevant for the problem we have
here.

> I don't think mixing software and hardware offload contexts as proposed
> is a good idea, both from user frontend syntax, clarity and error reporting
> (e.g. if hw rejects offload request) point of view.
> 

Frontend syntax and nft userspace should not be affected once we drop
the unneeded offload flag on dscp statement.

> I also believe that allowing payload mangling from *software* offload
> path sets a precedence for essentially allowing all other expressions
> again which completely negates the flowtable concept.

IMHO, the flowtable concept means transparent acceleration of packet
processing between the specified interfaces. If something in the ruleset
precludes such acceleration warnings/errors should be given.

Do you agree?

Once the goal of significant performance gains is preserved, the
expansion of the universe of accelerated expressions is benign. And why
not? We are still being fast.

My suggestion regarding doing the software fast path through the
flow_block tries to decouple the performance of the fast path from the
size of such universe. The basic idea is that you don't pay for what you
don't use.

> 
> I still think that dscp mangling should be done via netdev:ingress/egress
> hooks, I don't see why this has to be bolted into flowtable sw offload.
> 

Because it can be made faster :)

Thanks,
Boris.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-10  7:49           ` Boris Sukholitko
@ 2023-05-10 12:55             ` Florian Westphal
  2023-05-11 15:59               ` Boris Sukholitko
  0 siblings, 1 reply; 41+ messages in thread
From: Florian Westphal @ 2023-05-10 12:55 UTC (permalink / raw)
  To: Boris Sukholitko
  Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel,
	Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > table inet filter {
> >         flowtable f1 {
> >                 hook ingress priority filter
> >                 devices = { veth0, veth1 }
> >         }
> > 
> >         chain forward {
> >                 type filter hook forward priority filter; policy accept;
> >                 ip dscp set cs3
> >                 ip protocol { tcp, udp, gre } flow add
> >                 ct state established,related accept
> >         }
> > }
> > ----------
> > 
> > This has a clearly defined meaning in all possible combinations.
> > 
> > Software:
> > 1. It defines a bypass for veth0 <-> veth1
> > 2. the way this specific ruleset is defined, all of tcp/udp/gre will
> >    attempt to offload
> 
> OK.
> 
> > 3. once offload has happened, entire inet:forward may be bypassed
> 
> By bypassed, do you mean that chain forward ruleset body becomes
> irrelevant?

Yes.  It becomes irrelevant because the entire ip forward stack
becomes irrelevant. Prerouting and postrouting hooks get skipped
too, would you restore those as well?

> If the unfortunate answer is yes, than as in the original
> report, once we are in the software fast path we do not do dscp
> modification, right?

Yes. Its expected.  Entire IP stack is bypassed, including MTU
checks, xfrm policy lookups and so on.

> Then once we hit the NF_FLOW_TIMEOUT, some of the packets will have dscp
> modified, because we've went out of software acceleration. I.e. some of
> the packets will arrive with and some without dscp. As you can imagine,
> this could be hard to debug. This is the scenario that the patch series
> tries to fix.
> 
> > 4. User ruleset needs to cope with packets being moved back to
> >    software: fragmented packets, tcp fin/rst, hw timeouts and so on.
> 
> Should we require our user to understand that some lines in their
> forward table configuration may or may not be executed sporadically?

Yes.  They have to understand the packet is handled in a *completely
different* way, the IP stack is bypassed, including ipsec/xfrm policies,
icmp checks, mtu checks, and so on.

This is also why the fastpath *must* push some packets back to
normal plane: so that packet passes through ip_forward() which does
all this extra work.

Some packets cannot be added to flowtable either even if user asks
for it, e.g. when ip options are enabled.

> Should we give the user at least some kind of warning regarding this
> during the ruleset load?

You could try to do this for the 'nft -f' case, but I doubt its worth
it, see below for an example.

> To be constructive, isn't it better to rephrase points 3 and 4 as:
> 
> 3. once offload has happened, entire inet:forward will be executed with
> the same semantics but with better performance. Any difference between
> fast and slow path is considered a bug.

Err. no.  Because its impossible to do unless you stick a
nf_hook_slow() call into the "fastpath", otherwise you will diverge
from what normal path is doing.  And its still not the same,
feature-wise, because we e.g. cache route info rather than per-packet
lookup.

Flowtable software path is the *fallback* for when we don't have offload
capable hardware, it bypasses entire ip forward plane and packets take
a different, shorter code path, iff possible.

> 4. If something in user ruleset (such as dscp rule now) precludes fast
> path optimization then either error will be given or slow path will be
> taken with a warning.

Yes, nftables userspace could be augmented so that 'nft -f' could
display a warning if there is a rule other than 'conntrack
established,related accept' or similar as first line.

But I don't think its worth doing, f.e. someone could be doing
selective offload based on src/dst networks or similar.

Updating/expanding flowtable documentation would be welcome of course.

> > Lets now consider existing netdev:ingress/egress in this same picture:
> > (Example from Pablo):
> > ------
> > table inet filter {
> >         flowtable f1 {
> >                 hook ingress priority filter
> >                 devices = { veth0, veth1 }
> >         }
> > 
> >         chain ingress {
> >                 type filter hook ingress device veth0 priority filter; policy accept; flags offload;
> >                 ip dscp set cs3
> >         }
> > 
> >         chain forward {
> >                 type filter hook forward priority filter; policy accept;
> >                 meta l4proto { tcp, udp, gre } flow add @f1
> >                 ct state established,related accept
> >         }
> > }
> > 
> > Again, this has defined meaning in all combinations:
> > With HW offload: veth0 will be told to mangle dscp.
> > This happens in all cases and for every matching packet,
> > regardless if a flowtable exists or not.
> > 
> > Same would happen for 'egress', just that it would happen at xmit time
> > rather at receive time.  Again, its not relevant if there is active
> > flowtable or not, or if data path is offloaded to hardware, to software,
> > handled by fallback or entirely without flowtables being present.
> > 
> > Its also clear that this is tied to 'veth0', other devices will
> > not be involved and not do such mangling.
> > 
> 
> As I've mentioned in my other reply to Pablo, our focus is exclusively
> on the *software* fast path of the forward chain. In this scenario
> getting into additional nftables VM path in the ingress or egress seems
> like pessimization which we'd like to avoid.

You will have to add a call to nf_hook_slow, or at very least to
nft_do_chain...

netdev:in/egress is the existing plumbing that we have that
allows for this, and I think that this is what should be used
here.

> > Now lets look at your proposal:
> > ----------------
> > table inet filter {
> >         flowtable f1 {
> >                 hook ingress priority filter
> >                 devices = { veth0, veth1 }
> >         }
> > 
> >         chain forward {
> >                 type filter hook forward priority filter; policy accept;
> >                 ip dscp set cs3 offload
> >                 ip protocol { tcp, udp, gre } flow add
> >                 ct state established,related accept
> >         }
> > }
> > ----------------
> > 
> > This means that software flowtable offload
> > shall do a 'ip dscp set cs3'.
> > 
> > What if the flowtable is offloaded to hardware
> > entirely, without software fallback?
> > 
> > What if the devices listed in the flowtable definition can handle
> > flow offload, but no payload mangling?
> > 
> > Does the 'offload' mean that the rule is only active for
> > software path?  Only for hardware path? both?
> > 
> > How can I tell if its offloaded to hardware for one device
> > but not for the other?  Or will that be disallowed?
> > 
> > What if someone adds another rule after or before 'ip dscp',
> > but without the 'offload' keyword?  Now ordering becomes an
> > issue.
> > 
> > Users now need to consider different control flows:
> > 
> >   jump exceptions
> >   ip dscp set cs3 offload
> > 
> >   chain exceptions {
> >     ip daddr 1.2.3.4 accept
> >   }
> > 
> > This won't work as expected, because offloaded flows will not
> > pass through 'forward' chain but somehow a few selected rules
> > will be run anyway.
> > 
> > TL;DR: I think that for HW offload its paramount to make it crystal
> > clear as to which device is responsible to handle such rules.
> 
> Your critique of my offload flag is well deserved and fully correct,
> thanks! The reason for the dscp statement offload flag was to try to be
> explicit about the need for payload modification capture.
> 
> What we've really wanted to do here is to make the payload capture
> dependent on the chain flowtable acceleration status. I.e. if the
> forward chain is supposed to be accelerated, than the payload
> modification capture should happen. Being lazy, I've went through that
> ugly keyword path. Apologies for that.

> Yes. But again hardware offload is irrelevant for the problem we have
> here.

Unfortunately its not, we cannot make waters murkier and just pretend
HW offloads don't exist.  Behavior for "flags offload;" presence or
absence should be identical (in absence of bugs of course).

> > I don't think mixing software and hardware offload contexts as proposed
> > is a good idea, both from user frontend syntax, clarity and error reporting
> > (e.g. if hw rejects offload request) point of view.
> > 
> 
> Frontend syntax and nft userspace should not be affected once we drop
> the unneeded offload flag on dscp statement.

So you propose to software-offload all mangle statements and
prune all other rules...?

How is that any better than what we do now?

> > I also believe that allowing payload mangling from *software* offload
> > path sets a precedence for essentially allowing all other expressions
> > again which completely negates the flowtable concept.
> 
> IMHO, the flowtable concept means transparent acceleration of packet
> processing between the specified interfaces. If something in the ruleset
> precludes such acceleration warnings/errors should be given.
> 
> Do you agree?

It would be nice to give a warning but I don't think its feasible.
Consider something like

 chain forward {
  ip dscp set cs3
  ct state established,related accept
  ip saddr @should_offload_nets flow add @ft
 }

Should this generate a warning?

Also, because of forward path bypass even if the ruleset is just doing:

ct state established,related
ip saddr @can_offload flow add @ft

Packet flow is not the same as without "flow add" for a large swath of
packets, as no prerouting or postrouting hooks are executed either.

> Once the goal of significant performance gains is preserved, the
> expansion of the universe of accelerated expressions is benign. And why
> not? We are still being fast.

I still think that my critique stands, pruning other rules and
magically considering some while completely disregarding their context
is not a good idea.

Now, theoretically, you could add this:

chain fast_fwd {
	hook flowtable @f1 prio 0
	ip dscp set cs3
}

Where this chain is hooked into the flowtable fastpath *ONLY*.

However, I don't like it either because its incompatible with
HW offloads and we can be sure that once we allow this people
will want things like sets and maps too 8-(

> > I still think that dscp mangling should be done via netdev:ingress/egress
> > hooks, I don't see why this has to be bolted into flowtable sw offload.
> > 
> Because it can be made faster :)

If you want to make software fastpath fast(er), explore a XDP program
that can handle the post-offload packet funneling via xdp_frames to get rid of
sk_buff allocation overhead.

Program should make upcalls to stack (costly) to get back to software
and otherwise appear like HW offload-capable device from flowtable point
of view.  Some parts of the flowtable infrastructure could probably be exposed
via kfuncs so that certain functionality such as NAT doesn't have to be
(re)implemented.

Requires lots of code churn in flowtable support code to get rid
of sk_buff * arguments whereever possible.

There is also pending xmit_more support (skb trains/aggregation),
perhaps it will land in 6.5 cycle.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-10 12:55             ` Florian Westphal
@ 2023-05-11 15:59               ` Boris Sukholitko
  2023-05-11 16:36                 ` Florian Westphal
  0 siblings, 1 reply; 41+ messages in thread
From: Boris Sukholitko @ 2023-05-11 15:59 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel, Ilya Lifshits

[-- Attachment #1: Type: text/plain, Size: 13993 bytes --]

On Wed, May 10, 2023 at 02:55:44PM +0200, Florian Westphal wrote:
> Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 ip dscp set cs3
> > >                 ip protocol { tcp, udp, gre } flow add
> > >                 ct state established,related accept
> > >         }
> > > }
> > > ----------
> > > 
> > > This has a clearly defined meaning in all possible combinations.
> > > 
> > > Software:
> > > 1. It defines a bypass for veth0 <-> veth1
> > > 2. the way this specific ruleset is defined, all of tcp/udp/gre will
> > >    attempt to offload
> > 
> > OK.
> > 
> > > 3. once offload has happened, entire inet:forward may be bypassed
> > 
> > By bypassed, do you mean that chain forward ruleset body becomes
> > irrelevant?
> 
> Yes.  It becomes irrelevant because the entire ip forward stack
> becomes irrelevant. Prerouting and postrouting hooks get skipped
> too, would you restore those as well?
> 
> > If the unfortunate answer is yes, than as in the original
> > report, once we are in the software fast path we do not do dscp
> > modification, right?
> 
> Yes. Its expected.  Entire IP stack is bypassed, including MTU
> checks, xfrm policy lookups and so on.
> 
> > Then once we hit the NF_FLOW_TIMEOUT, some of the packets will have dscp
> > modified, because we've went out of software acceleration. I.e. some of
> > the packets will arrive with and some without dscp. As you can imagine,
> > this could be hard to debug. This is the scenario that the patch series
> > tries to fix.
> > 
> > > 4. User ruleset needs to cope with packets being moved back to
> > >    software: fragmented packets, tcp fin/rst, hw timeouts and so on.
> > 
> > Should we require our user to understand that some lines in their
> > forward table configuration may or may not be executed sporadically?
> 
> Yes.  They have to understand the packet is handled in a *completely
> different* way, the IP stack is bypassed, including ipsec/xfrm policies,
> icmp checks, mtu checks, and so on.
> 
> This is also why the fastpath *must* push some packets back to
> normal plane: so that packet passes through ip_forward() which does
> all this extra work.
> 
> Some packets cannot be added to flowtable either even if user asks
> for it, e.g. when ip options are enabled.
> 
> > Should we give the user at least some kind of warning regarding this
> > during the ruleset load?
> 
> You could try to do this for the 'nft -f' case, but I doubt its worth
> it, see below for an example.
> 
> > To be constructive, isn't it better to rephrase points 3 and 4 as:
> > 
> > 3. once offload has happened, entire inet:forward will be executed with
> > the same semantics but with better performance. Any difference between
> > fast and slow path is considered a bug.
> 
> Err. no.  Because its impossible to do unless you stick a
> nf_hook_slow() call into the "fastpath", otherwise you will diverge
> from what normal path is doing.  And its still not the same,
> feature-wise, because we e.g. cache route info rather than per-packet
> lookup.
> 
> Flowtable software path is the *fallback* for when we don't have offload
> capable hardware, it bypasses entire ip forward plane and packets take
> a different, shorter code path, iff possible.
> 
> > 4. If something in user ruleset (such as dscp rule now) precludes fast
> > path optimization then either error will be given or slow path will be
> > taken with a warning.
> 
> Yes, nftables userspace could be augmented so that 'nft -f' could
> display a warning if there is a rule other than 'conntrack
> established,related accept' or similar as first line.
> 
> But I don't think its worth doing, f.e. someone could be doing
> selective offload based on src/dst networks or similar.
> 
> Updating/expanding flowtable documentation would be welcome of course.
> 
> > > Lets now consider existing netdev:ingress/egress in this same picture:
> > > (Example from Pablo):
> > > ------
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain ingress {
> > >                 type filter hook ingress device veth0 priority filter; policy accept; flags offload;
> > >                 ip dscp set cs3
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 meta l4proto { tcp, udp, gre } flow add @f1
> > >                 ct state established,related accept
> > >         }
> > > }
> > > 
> > > Again, this has defined meaning in all combinations:
> > > With HW offload: veth0 will be told to mangle dscp.
> > > This happens in all cases and for every matching packet,
> > > regardless if a flowtable exists or not.
> > > 
> > > Same would happen for 'egress', just that it would happen at xmit time
> > > rather at receive time.  Again, its not relevant if there is active
> > > flowtable or not, or if data path is offloaded to hardware, to software,
> > > handled by fallback or entirely without flowtables being present.
> > > 
> > > Its also clear that this is tied to 'veth0', other devices will
> > > not be involved and not do such mangling.
> > > 
> > 
> > As I've mentioned in my other reply to Pablo, our focus is exclusively
> > on the *software* fast path of the forward chain. In this scenario
> > getting into additional nftables VM path in the ingress or egress seems
> > like pessimization which we'd like to avoid.
> 
> You will have to add a call to nf_hook_slow, or at very least to
> nft_do_chain...
> 
> netdev:in/egress is the existing plumbing that we have that
> allows for this, and I think that this is what should be used
> here.
> 
> > > Now lets look at your proposal:
> > > ----------------
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 ip dscp set cs3 offload
> > >                 ip protocol { tcp, udp, gre } flow add
> > >                 ct state established,related accept
> > >         }
> > > }
> > > ----------------
> > > 
> > > This means that software flowtable offload
> > > shall do a 'ip dscp set cs3'.
> > > 
> > > What if the flowtable is offloaded to hardware
> > > entirely, without software fallback?
> > > 
> > > What if the devices listed in the flowtable definition can handle
> > > flow offload, but no payload mangling?
> > > 
> > > Does the 'offload' mean that the rule is only active for
> > > software path?  Only for hardware path? both?
> > > 
> > > How can I tell if its offloaded to hardware for one device
> > > but not for the other?  Or will that be disallowed?
> > > 
> > > What if someone adds another rule after or before 'ip dscp',
> > > but without the 'offload' keyword?  Now ordering becomes an
> > > issue.
> > > 
> > > Users now need to consider different control flows:
> > > 
> > >   jump exceptions
> > >   ip dscp set cs3 offload
> > > 
> > >   chain exceptions {
> > >     ip daddr 1.2.3.4 accept
> > >   }
> > > 
> > > This won't work as expected, because offloaded flows will not
> > > pass through 'forward' chain but somehow a few selected rules
> > > will be run anyway.
> > > 
> > > TL;DR: I think that for HW offload its paramount to make it crystal
> > > clear as to which device is responsible to handle such rules.
> > 
> > Your critique of my offload flag is well deserved and fully correct,
> > thanks! The reason for the dscp statement offload flag was to try to be
> > explicit about the need for payload modification capture.
> > 
> > What we've really wanted to do here is to make the payload capture
> > dependent on the chain flowtable acceleration status. I.e. if the
> > forward chain is supposed to be accelerated, than the payload
> > modification capture should happen. Being lazy, I've went through that
> > ugly keyword path. Apologies for that.
> 
> > Yes. But again hardware offload is irrelevant for the problem we have
> > here.
> 
> Unfortunately its not, we cannot make waters murkier and just pretend
> HW offloads don't exist.  Behavior for "flags offload;" presence or
> absence should be identical (in absence of bugs of course).
> 
> > > I don't think mixing software and hardware offload contexts as proposed
> > > is a good idea, both from user frontend syntax, clarity and error reporting
> > > (e.g. if hw rejects offload request) point of view.
> > > 
> > 
> > Frontend syntax and nft userspace should not be affected once we drop
> > the unneeded offload flag on dscp statement.
> 
> So you propose to software-offload all mangle statements and
> prune all other rules...?
> 
> How is that any better than what we do now?
> 
> > > I also believe that allowing payload mangling from *software* offload
> > > path sets a precedence for essentially allowing all other expressions
> > > again which completely negates the flowtable concept.
> > 
> > IMHO, the flowtable concept means transparent acceleration of packet
> > processing between the specified interfaces. If something in the ruleset
> > precludes such acceleration warnings/errors should be given.
> > 
> > Do you agree?
> 
> It would be nice to give a warning but I don't think its feasible.
> Consider something like
> 
>  chain forward {
>   ip dscp set cs3
>   ct state established,related accept
>   ip saddr @should_offload_nets flow add @ft
>  }
> 
> Should this generate a warning?
> 
> Also, because of forward path bypass even if the ruleset is just doing:
> 
> ct state established,related
> ip saddr @can_offload flow add @ft
> 
> Packet flow is not the same as without "flow add" for a large swath of
> packets, as no prerouting or postrouting hooks are executed either.
> 
> > Once the goal of significant performance gains is preserved, the
> > expansion of the universe of accelerated expressions is benign. And why
> > not? We are still being fast.
> 
> I still think that my critique stands, pruning other rules and
> magically considering some while completely disregarding their context
> is not a good idea.
> 

I think I finally understand your reasoning. May I summarise it as the
following:

nftables chain forward having a flow add clause becomes a request from
the user to skip parts of Linux network stack. The affected flows will
become special and unaffected by most of the rules of the "slowpath"
chain forward. This is a sharp tool and user gets to keep both pieces
if something breaks :)

I agree that this is a well-defined and defensible position. I concur.

> Now, theoretically, you could add this:
> 
> chain fast_fwd {
> 	hook flowtable @f1 prio 0
> 	ip dscp set cs3
> }
> 

Yes, I really like that. Here is what such chain will do:

1. On the slow path it will behave identical to the forward chain.
2. The only processing done on fast_fwd fast path is interpretation
   of struct flow_offload_entry list (.
3. Such fast path is done between devices defined in flowtable f1
4. Apart from the interpretation of flow offload entries no other
   processing will be done.
5. (4) means that no Linux IP stack is involved in the forwarding.
6. However (4) allows concatenation of other flow_offload_entry
   producers (e.g. TC, ingress, egress nft chains).
7. flow_offload_entry lists may be connection dependent.
8. Similar to chain forward now, flow_offload_entry lists will be passed
   to devices for hardware acceleration.
9. IOW, flow_offload_entry lists become connection specific programs.
   Therefore such lists may be compiled to EBPF and accelerated on XDP.
10. flow_action_entry interpreters should be prepared to deal with IP
    fragments and other strangeness that ensues on our networks.

> Where this chain is hooked into the flowtable fastpath *ONLY*.

I don't fully understand the ONLY part, but do my points above address
this?

> 
> However, I don't like it either because its incompatible with
> HW offloads and we can be sure that once we allow this people
> will want things like sets and maps too 8-(

I think that due to point (8) above the potential for hardware
acceleration is higher. The hardware (e.g. switch) is free to pass
the packets between flowtable ports and not involve Linux stack at all.
It may do such forwarding because of the promise (4) above.

sets and maps are welcome in chain fast_fwd :) EBPF and XDP already have
them. Once (9) becomes reality we'll be able to suport them, somehow :)

> 
> > > I still think that dscp mangling should be done via netdev:ingress/egress
> > > hooks, I don't see why this has to be bolted into flowtable sw offload.
> > > 
> > Because it can be made faster :)
> 
> If you want to make software fastpath fast(er), explore a XDP program
> that can handle the post-offload packet funneling via xdp_frames to get rid of
> sk_buff allocation overhead.
> 
> Program should make upcalls to stack (costly) to get back to software
> and otherwise appear like HW offload-capable device from flowtable point
> of view.  Some parts of the flowtable infrastructure could probably be exposed
> via kfuncs so that certain functionality such as NAT doesn't have to be
> (re)implemented.
> 
> Requires lots of code churn in flowtable support code to get rid
> of sk_buff * arguments whereever possible.
> 
> There is also pending xmit_more support (skb trains/aggregation),
> perhaps it will land in 6.5 cycle.

I've incorporated XDP into my suggestion above. It seems to fit.

What do you think? Is going chain fast_fwd direction is feasible and
desirable?

Thanks,
Boris.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4221 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
  2023-05-11 15:59               ` Boris Sukholitko
@ 2023-05-11 16:36                 ` Florian Westphal
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Westphal @ 2023-05-11 16:36 UTC (permalink / raw)
  To: Boris Sukholitko
  Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel,
	Ilya Lifshits

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> On Wed, May 10, 2023 at 02:55:44PM +0200, Florian Westphal wrote:
> I think I finally understand your reasoning. May I summarise it as the
> following:
> 
> nftables chain forward having a flow add clause becomes a request from
> the user to skip parts of Linux network stack. The affected flows will
> become special and unaffected by most of the rules of the "slowpath"
> chain forward. This is a sharp tool and user gets to keep both pieces
> if something breaks :)

Yes.

> > Now, theoretically, you could add this:
> > 
> > chain fast_fwd {
> > 	hook flowtable @f1 prio 0
> > 	ip dscp set cs3
> > }
> > 
> 
> Yes, I really like that. Here is what such chain will do:

NOOOOOOOOOOOOOOOOOOOOO!

> 1. On the slow path it will behave identical to the forward chain.

So one extra interpreter trip?

> 2. The only processing done on fast_fwd fast path is interpretation
>    of struct flow_offload_entry list (.

list iteration? What?  But netdev:ingress can't be used because
its too slow?!

I'm going to stop responding, sorry.

Netfilter already has byzantine technical debt, I don't want
to maintain any more 8-(

> 3. Such fast path is done between devices defined in flowtable f1
> 4. Apart from the interpretation of flow offload entries no other
>    processing will be done.
> 5. (4) means that no Linux IP stack is involved in the forwarding.
> 6. However (4) allows concatenation of other flow_offload_entry
>    producers (e.g. TC, ingress, egress nft chains).

Ugh.  This is already problematic.  Pipeline/processing ordering matters.

> 7. flow_offload_entry lists may be connection dependent.

Thanks for reminding me.  This is also bad.
Flowtable offload is tied to conntrack, yes.

But rule offload SHOULD NOT be tied to connection tracking.
What you are proposing is the ability to attach rules to a conntrack
entry.

> 8. Similar to chain forward now, flow_offload_entry lists will be passed
>    to devices for hardware acceleration.

Wnich devices?  Error handling?

> 9. IOW, flow_offload_entry lists become connection specific programs.
>    Therefore such lists may be compiled to EBPF and accelerated on XDP.

By whom? How?

> 10. flow_action_entry interpreters should be prepared to deal with IP
>     fragments and other strangeness that ensues on our networks.
> 
> > Where this chain is hooked into the flowtable fastpath *ONLY*.
> 
> I don't fully understand the ONLY part, but do my points above address
> this?

Only == not called for slowpath.

I don't understand you, you reject netdev:ingress/egress
but want a new conntrack extension that iterates flow_offload entries in
software?

> > However, I don't like it either because its incompatible with
> > HW offloads and we can be sure that once we allow this people
> > will want things like sets and maps too 8-(
> 
> I think that due to point (8) above the potential for hardware
> acceleration is higher. The hardware (e.g. switch) is free to pass
> the packets between flowtable ports and not involve Linux stack at all.
> It may do such forwarding because of the promise (4) above.
>
> sets and maps are welcome in chain fast_fwd :) EBPF and XDP already have
> them. Once (9) becomes reality we'll be able to suport them, somehow :)

No, XDP *DOES NOT* have them.  nftables sets and ebpf sets are
completely different entities.  'nft add element inet filter bla { 1.2.3,4 }

will not magically alter some ebpf set.

They also have different scoping rules.

> What do you think? Is going chain fast_fwd direction is feasible and
> desirable?

I think you should use netdev:ingress/egress hook points.

Or use an xdp program and don't use netfilter at all.

If you want to use nftables sets with ebpf, then you might investigate
adding kfuncs for ebpf so nftables sets can be used from bpf programs,
that might actually be useful for some people, but I'm not sure how to
make this work at this time due to nature of set/map scoping in
nftables.  We have to be mindful to not crash kernel when table/set/map
is going away on netfilter side.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-05-11 16:36 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
2023-05-03 18:47   ` Florian Westphal
2023-05-04  8:53     ` Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option Boris Sukholitko
2023-05-03 18:53   ` Florian Westphal
2023-05-03 12:55 ` [PATCH nf-next 03/19] selftest: netfilter: wait for specific nc pids Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes Boris Sukholitko
2023-05-03 18:54   ` Florian Westphal
2023-05-03 12:55 ` [PATCH nf-next 05/19] netfilter: nft_payload: refactor mangle operation Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 06/19] netfilter: nft_payload: publish nft_payload_set Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 07/19] netfilter: nft_payload: export mangle Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 08/19] netfilter: nft_payload: use flag for checksum need Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 09/19] netfilter: nft_payload: add offload flag define Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 10/19] netfilter: nft_payload: allow offload in the netlink Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 11/19] netfilter: conntrack: nft extension Kconfig Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 12/19] netfilter: nft: empty nft conntrack extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 13/19] netfilter: conntrack: register nft extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 14/19] netfilter: nft: add payload context into extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
2023-05-03 23:32   ` kernel test robot
2023-05-04  0:44   ` kernel test robot
2023-05-03 12:55 ` [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle Boris Sukholitko
2023-05-03 15:41   ` kernel test robot
2023-05-03 12:55 ` [PATCH nf-next 17/19] netfilter: nftables: payload save mechanism Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 18/19] netfilter: nft_payload: save payload if needed Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 19/19] selftests: netfilter: dscp offload test Boris Sukholitko
2023-05-03 18:46 ` [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Florian Westphal
2023-05-07 15:22   ` Boris Sukholitko
2023-05-07 17:37     ` Florian Westphal
2023-05-08 13:38       ` Boris Sukholitko
2023-05-08 20:07         ` Pablo Neira Ayuso
2023-05-09 14:56           ` Boris Sukholitko
2023-05-09  9:48         ` Florian Westphal
2023-05-10  7:49           ` Boris Sukholitko
2023-05-10 12:55             ` Florian Westphal
2023-05-11 15:59               ` Boris Sukholitko
2023-05-11 16:36                 ` Florian Westphal
2023-05-03 20:30 ` Pablo Neira Ayuso
2023-05-03 20:41 ` Pablo Neira Ayuso
2023-05-04  8:50   ` Boris Sukholitko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.