* [PATCH] netfilter: nfnetlink: always ACK batch end if requested
@ 2025-10-01 21:15 Nikolaos Gkarlis
2025-10-02 9:48 ` Fernando Fernandez Mancera
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Nikolaos Gkarlis @ 2025-10-01 21:15 UTC (permalink / raw)
To: netfilter-devel; +Cc: pablo, fw, Nikolaos Gkarlis
Before ACKs were introduced for batch begin and batch end messages,
userspace expected to receive the same number of ACKs as it sent,
unless a fatal error occurred.
To preserve this deterministic behavior, send an ACK for batch end
messages even when an error happens in the middle of the batch,
similar to how ACKs are handled for command messages.
Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com>
---
Hi,
I recently came across the issue introduced by bf2ac490d28c and
while trying to find a way to handle it by adding an ACK on batch
begin, end messages, I spotted what looks like an inconsistency?
I have tested this change with my userspace application and it
seems to resolve the "problem". However, I am not sure if there
is a suitable place to add a regression test, since AFAIK nft
userspace does not currently use this feature. I would be happy
to contribute a test if you could point me to the right place.
I may be missing some context, so feedback on whether this is the
right approach would be very welcome.
net/netfilter/nfnetlink.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 811d02b4c4f7..0342087ead06 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -600,6 +600,11 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
status |= NFNL_BATCH_FAILURE;
goto replay_abort;
}
+
+ if (nlh->nlmsg_flags & NLM_F_ACK) {
+ memset(&extack, 0, sizeof(extack));
+ nfnl_err_add(&err_list, nlh, 0, &extack);
+ }
}
nfnl_err_deliver(&err_list, oskb);
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-01 21:15 [PATCH] netfilter: nfnetlink: always ACK batch end if requested Nikolaos Gkarlis @ 2025-10-02 9:48 ` Fernando Fernandez Mancera 2025-10-02 10:41 ` Nikolaos Gkarlis 2025-10-02 10:10 ` [PATCH] netfilter: nfnetlink: " Florian Westphal 2025-10-07 20:33 ` Pablo Neira Ayuso 2 siblings, 1 reply; 27+ messages in thread From: Fernando Fernandez Mancera @ 2025-10-02 9:48 UTC (permalink / raw) To: Nikolaos Gkarlis, netfilter-devel; +Cc: pablo, fw On 10/1/25 11:15 PM, Nikolaos Gkarlis wrote: > Before ACKs were introduced for batch begin and batch end messages, > userspace expected to receive the same number of ACKs as it sent, > unless a fatal error occurred. > > To preserve this deterministic behavior, send an ACK for batch end > messages even when an error happens in the middle of the batch, > similar to how ACKs are handled for command messages. > > Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com> > --- > Hi, > > I recently came across the issue introduced by bf2ac490d28c and > while trying to find a way to handle it by adding an ACK on batch > begin, end messages, I spotted what looks like an inconsistency? > > I have tested this change with my userspace application and it > seems to resolve the "problem". However, I am not sure if there > is a suitable place to add a regression test, since AFAIK nft > userspace does not currently use this feature. I would be happy > to contribute a test if you could point me to the right place. > > I may be missing some context, so feedback on whether this is the > right approach would be very welcome. > > net/netfilter/nfnetlink.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c > index 811d02b4c4f7..0342087ead06 100644 > --- a/net/netfilter/nfnetlink.c > +++ b/net/netfilter/nfnetlink.c > @@ -600,6 +600,11 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, > status |= NFNL_BATCH_FAILURE; > goto replay_abort; > } > + > + if (nlh->nlmsg_flags & NLM_F_ACK) { > + memset(&extack, 0, sizeof(extack)); > + nfnl_err_add(&err_list, nlh, 0, &extack); > + } > } Right, if BATCH_END message is reached and has the NLM_F_ACK, nfnetlink should send the corresponding ACK. Currently if BATCH_END message is missing, this would send an extra wrong ACK if the previous message was using NLM_F_ACK. e.g for a batch formatted like (BATCH_BEGIN|NFT_MSG_NEWRULE + NLM_F_ACK) - nfnetlink would send two ACKs while it should be only one. Granted it won't configure anything but it would be still misleading. What about this? diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index b8d0fad1ed10..ecf85346d883 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -601,7 +601,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, goto replay_abort; } - if (nlh->nlmsg_flags & NLM_F_ACK) { + if (nlh->nlmsg_flags & NLM_F_ACK && status & NFNL_BATCH_DONE) { memset(&extack, 0, sizeof(extack)); nfnl_err_add(&err_list, nlh, 0, &extack); } > > nfnl_err_deliver(&err_list, oskb); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-02 9:48 ` Fernando Fernandez Mancera @ 2025-10-02 10:41 ` Nikolaos Gkarlis 2025-10-02 11:03 ` Fernando Fernandez Mancera 0 siblings, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-02 10:41 UTC (permalink / raw) To: Fernando Fernandez Mancera; +Cc: netfilter-devel, pablo, fw Fernando Fernandez Mancera <fmancera@suse.de> wrote: > e.g for a batch formatted like (BATCH_BEGIN|NFT_MSG_NEWRULE + NLM_F_ACK) > - nfnetlink would send two ACKs while it should be only one. Granted it > won't configure anything but it would be still misleading. > > What about this? > Thanks ! I was a bit unsure on whether the status should also be checked. This seems to work with my test. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-02 10:41 ` Nikolaos Gkarlis @ 2025-10-02 11:03 ` Fernando Fernandez Mancera 2025-10-04 9:26 ` [PATCH v2 0/2] " Nikolaos Gkarlis 2025-10-04 9:38 ` [PATCH v2 0/2] always ACK batch end if requested Nikolaos Gkarlis 0 siblings, 2 replies; 27+ messages in thread From: Fernando Fernandez Mancera @ 2025-10-02 11:03 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fw On 10/2/25 12:41 PM, Nikolaos Gkarlis wrote: > Fernando Fernandez Mancera <fmancera@suse.de> wrote: > >> e.g for a batch formatted like (BATCH_BEGIN|NFT_MSG_NEWRULE + NLM_F_ACK) >> - nfnetlink would send two ACKs while it should be only one. Granted it >> won't configure anything but it would be still misleading. >> >> What about this? >> > > Thanks ! I was a bit unsure on whether the status should also be checked. > This seems to work with my test. Just a nit, the commit should also have a Fixes tag IMHO. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 0/2] always ACK batch end if requested 2025-10-02 11:03 ` Fernando Fernandez Mancera @ 2025-10-04 9:26 ` Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 1/2] netfilter: nfnetlink: " Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests Nikolaos Gkarlis 2025-10-04 9:38 ` [PATCH v2 0/2] always ACK batch end if requested Nikolaos Gkarlis 1 sibling, 2 replies; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-04 9:26 UTC (permalink / raw) To: netfilter-devel; +Cc: pablo, fw, fmancera, Nikolaos Gkarlis Hi again, Apologies for the delay. I have now added an nfnetlink selftest as suggested by Florian, and committed Fernando's suggestion to also check for the status. I have confirmed that the test fails on kernels > 6.10 that do not include Fernando's 09efbac953f6 fix. The test could be made more extensive, but I think this is a good starting point. Let me know what you think. Nikolaos Gkarlis (2): netfilter: nfnetlink: always ACK batch end if requested selftests: netfilter: add nfnetlink ACK handling tests net/netfilter/nfnetlink.c | 5 + .../testing/selftests/net/netfilter/Makefile | 3 + .../selftests/net/netfilter/nfnetlink.c | 424 ++++++++++++++++++ .../selftests/net/netfilter/nfnetlink.sh | 11 + 4 files changed, 443 insertions(+) create mode 100644 tools/testing/selftests/net/netfilter/nfnetlink.c create mode 100755 tools/testing/selftests/net/netfilter/nfnetlink.sh -- 2.34.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 1/2] netfilter: nfnetlink: always ACK batch end if requested 2025-10-04 9:26 ` [PATCH v2 0/2] " Nikolaos Gkarlis @ 2025-10-04 9:26 ` Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests Nikolaos Gkarlis 1 sibling, 0 replies; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-04 9:26 UTC (permalink / raw) To: netfilter-devel; +Cc: pablo, fw, fmancera, Nikolaos Gkarlis Before ACKs were introduced for batch begin and batch end messages, userspace expected to receive the same number of ACKs as it sent, unless a fatal error occurred. To preserve this deterministic behavior, send an ACK for batch end messages even when an error happens in the middle of the batch, similar to how ACKs are handled for command messages. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com> --- net/netfilter/nfnetlink.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 811d02b4c4f7..33acc1b94a0e 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -600,6 +600,11 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, status |= NFNL_BATCH_FAILURE; goto replay_abort; } + + if (nlh->nlmsg_flags & NLM_F_ACK && status & NFNL_BATCH_DONE) { + memset(&extack, 0, sizeof(extack)); + nfnl_err_add(&err_list, nlh, 0, &extack); + } } nfnl_err_deliver(&err_list, oskb); -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-04 9:26 ` [PATCH v2 0/2] " Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 1/2] netfilter: nfnetlink: " Nikolaos Gkarlis @ 2025-10-04 9:26 ` Nikolaos Gkarlis 2025-10-04 10:46 ` Florian Westphal 1 sibling, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-04 9:26 UTC (permalink / raw) To: netfilter-devel; +Cc: pablo, fw, fmancera, Nikolaos Gkarlis Add nfnetlink selftests to validate the ACKs sent after a batch message. These tests verify that: - ACKs are always received in order. - Module loading does not affect the responses. - The number of ACKs matches the number of requests, unless a fatal error occurs. Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com> --- .../testing/selftests/net/netfilter/Makefile | 3 + .../selftests/net/netfilter/nfnetlink.c | 424 ++++++++++++++++++ .../selftests/net/netfilter/nfnetlink.sh | 11 + 3 files changed, 438 insertions(+) create mode 100644 tools/testing/selftests/net/netfilter/nfnetlink.c create mode 100755 tools/testing/selftests/net/netfilter/nfnetlink.sh diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile index a98ed892f55f..cb38c8a9b2cc 100644 --- a/tools/testing/selftests/net/netfilter/Makefile +++ b/tools/testing/selftests/net/netfilter/Makefile @@ -37,6 +37,7 @@ TEST_PROGS += nft_zones_many.sh TEST_PROGS += rpath.sh TEST_PROGS += vxlan_mtu_frag.sh TEST_PROGS += xt_string.sh +TEST_PROGS += nfnetlink.sh TEST_PROGS_EXTENDED = nft_concat_range_perf.sh @@ -46,6 +47,7 @@ TEST_GEN_FILES += conntrack_dump_flush TEST_GEN_FILES += conntrack_reverse_clash TEST_GEN_FILES += sctp_collision TEST_GEN_FILES += udpclash +TEST_GEN_FILES += nfnetlink include ../../lib.mk @@ -54,6 +56,7 @@ $(OUTPUT)/nf_queue: LDLIBS += $(MNL_LDLIBS) $(OUTPUT)/conntrack_dump_flush: CFLAGS += $(MNL_CFLAGS) $(OUTPUT)/conntrack_dump_flush: LDLIBS += $(MNL_LDLIBS) +$(OUTPUT)/nfnetlink: LDLIBS += $(MNL_LDLIBS) $(OUTPUT)/udpclash: LDLIBS += -lpthread TEST_FILES := lib.sh diff --git a/tools/testing/selftests/net/netfilter/nfnetlink.c b/tools/testing/selftests/net/netfilter/nfnetlink.c new file mode 100644 index 000000000000..c1cb7484873a --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nfnetlink.c @@ -0,0 +1,424 @@ +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <time.h> +#include <arpa/inet.h> +#include <libmnl/libmnl.h> +#include <linux/netfilter/nfnetlink.h> +#include <linux/netfilter/nf_tables.h> +#include <linux/netfilter/nf_conntrack_common.h> +#include <linux/netfilter.h> +#include <sys/types.h> + +#include "../../kselftest_harness.h" + +#define BATCH_PAGE_SIZE 8192 + +static bool batch_begin(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = NFNL_MSG_BATCH_BEGIN; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_UNSPEC; + nfh->version = NFNETLINK_V0; + nfh->res_id = NFNL_SUBSYS_NFTABLES; + + return mnl_nlmsg_batch_next(batch); +} + +static bool batch_end(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = NFNL_MSG_BATCH_END; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_UNSPEC; + nfh->version = NFNETLINK_V0; + nfh->res_id = NFNL_SUBSYS_NFTABLES; + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_table(struct mnl_nlmsg_batch *batch, const char *name, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWTABLE; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_TABLE_NAME, name); + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_chain(struct mnl_nlmsg_batch *batch, const char *table, + const char *chain, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWCHAIN; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_CHAIN_TABLE, table); + mnl_attr_put_strz(nlh, NFTA_CHAIN_NAME, chain); + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_rule_with_ct(struct mnl_nlmsg_batch *batch, const char *table, + const char *chain, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + struct nlattr *nest1, *nest2, *nest3, *nest4; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWRULE; + nlh->nlmsg_flags = + NLM_F_REQUEST | NLM_F_CREATE | NLM_F_APPEND | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_RULE_TABLE, table); + mnl_attr_put_strz(nlh, NFTA_RULE_CHAIN, chain); + + /* Create expressions list */ + nest1 = mnl_attr_nest_start(nlh, NFTA_RULE_EXPRESSIONS); + + /* CT expression to load ct state */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "ct"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_CT_KEY, htonl(NFT_CT_STATE)); + mnl_attr_put_u32(nlh, NFTA_CT_DREG, htonl(NFT_REG_1)); + mnl_attr_nest_end(nlh, nest3); + + mnl_attr_nest_end(nlh, nest2); + + /* Bitwise expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "bitwise"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_BITWISE_SREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_BITWISE_DREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_BITWISE_LEN, htonl(4)); + + uint32_t mask = + NF_CT_STATE_BIT(IP_CT_ESTABLISHED) | NF_CT_STATE_BIT(IP_CT_RELATED); + nest4 = mnl_attr_nest_start(nlh, NFTA_BITWISE_MASK); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(mask), &mask); + mnl_attr_nest_end(nlh, nest4); + + uint32_t val = 0x00000000; + nest4 = mnl_attr_nest_start(nlh, NFTA_BITWISE_XOR); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(val), &val); + mnl_attr_nest_end(nlh, nest4); + + mnl_attr_nest_end(nlh, nest3); + mnl_attr_nest_end(nlh, nest2); + + /* Cmp expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "cmp"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_CMP_SREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_CMP_OP, htonl(NFT_CMP_NEQ)); + + uint32_t cmp_data = 0x00000000; + nest4 = mnl_attr_nest_start(nlh, NFTA_CMP_DATA); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(cmp_data), &cmp_data); + mnl_attr_nest_end(nlh, nest4); + + mnl_attr_nest_end(nlh, nest3); + mnl_attr_nest_end(nlh, nest2); + + /* Counter expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "counter"); + + mnl_attr_nest_end(nlh, nest2); + + mnl_attr_nest_end(nlh, nest1); + + return mnl_nlmsg_batch_next(batch); +} + +static bool create_invalid_table(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWTABLE; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + /* Intentionally omit table name attribute to cause error */ + + return mnl_nlmsg_batch_next(batch); +} + +static void validate_res(struct mnl_socket *nl, uint32_t first_seq, + uint32_t expected_acks, + struct __test_metadata *_metadata) +{ + char buf[BATCH_PAGE_SIZE]; + int ret, acks_received = 0; + uint32_t last_seq = 0; + bool out_of_order = false; + + while (1) { + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + if (ret == -1) { + if (errno == EAGAIN || errno == EWOULDBLOCK) { + break; + } + + TH_LOG("Unexpected error on recv: %s", strerror(errno)); + ASSERT_TRUE(false); + return; + } + + struct nlmsghdr *nlh = (struct nlmsghdr *)buf; + while (mnl_nlmsg_ok(nlh, ret)) { + if (nlh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *err = + mnl_nlmsg_get_payload(nlh); + acks_received++; + + if (err->error != 0) { + TH_LOG("[seq=%u] Error: %s", + nlh->nlmsg_seq, + strerror(-err->error)); + } else { + TH_LOG("[seq=%u] ACK", nlh->nlmsg_seq); + } + + if (nlh->nlmsg_seq <= last_seq) { + out_of_order = true; + TH_LOG + ("Out of order ack: seq %u after %u", + nlh->nlmsg_seq, last_seq); + } + last_seq = nlh->nlmsg_seq; + } + nlh = mnl_nlmsg_next(nlh, &ret); + } + } + + ASSERT_FALSE(out_of_order); + + ASSERT_EQ(acks_received, expected_acks); +} + +FIXTURE(nfnetlink_batch) +{ + struct mnl_socket *nl; +}; + +FIXTURE_SETUP(nfnetlink_batch) +{ + struct timeval tv = {.tv_sec = 1,.tv_usec = 0 }; + + self->nl = mnl_socket_open(NETLINK_NETFILTER); + ASSERT_NE(self->nl, NULL); + + ASSERT_EQ(mnl_socket_bind(self->nl, 0, MNL_SOCKET_AUTOPID), 0); + + ASSERT_EQ(setsockopt + (mnl_socket_get_fd(self->nl), SOL_SOCKET, SO_RCVTIMEO, &tv, + sizeof(tv)), 0); +} + +FIXTURE_TEARDOWN(nfnetlink_batch) +{ + if (self->nl) { + mnl_socket_close(self->nl); + } +} + +TEST_F(nfnetlink_batch, simple_batch) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL); + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + ASSERT_TRUE(batch_begin(batch, seq++)); + create_table(batch, "test_table1", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 3 acks: batch_begin, create_table, batch_end + validate_res(self->nl, seq - 3, 3, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_F(nfnetlink_batch, module_load) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 1000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table2", seq++); + create_chain(batch, "test_table2", "test_chain", seq++); + create_rule_with_ct(batch, "test_table2", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 5 acks: batch_begin, table, chain, rule, batch_end + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +// Repeat the CT test to verify module loading behavior +TEST_F(nfnetlink_batch, post_module_load) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 2000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table6", seq++); + create_chain(batch, "test_table6", "test_chain", seq++); + create_rule_with_ct(batch, "test_table6", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_F(nfnetlink_batch, invalid_batch) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 3000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table3", seq++); + create_invalid_table(batch, seq++); + create_chain(batch, "test_table3", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 5 acks: batch_begin, table, invalid_table(error), chain, batch_end + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_F(nfnetlink_batch, batch_with_fatal_error) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 4000; + uid_t uid; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table4", seq++); + create_table(batch, "test_table5", seq++); + batch_end(batch, seq++); + + // Drop privileges to trigger EPERM + uid = geteuid(); + if (uid == 0) { + ASSERT_EQ(seteuid(65534), 0); + } + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + + // Restore privileges + if (uid == 0) { + seteuid(uid); + } + + ASSERT_GT(ret, 0); + + // EPERM is fatal and should abort batch, expect only 1 ACK + validate_res(self->nl, seq - 4, 1, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/net/netfilter/nfnetlink.sh b/tools/testing/selftests/net/netfilter/nfnetlink.sh new file mode 100755 index 000000000000..4456796732a8 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nfnetlink.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +exec unshare -n bash -c ' + nft flush ruleset + + rmmod nft_ct + + ./nfnetlink + + nft flush ruleset +' \ No newline at end of file -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-04 9:26 ` [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests Nikolaos Gkarlis @ 2025-10-04 10:46 ` Florian Westphal 2025-10-04 11:08 ` Nikolaos Gkarlis 0 siblings, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-04 10:46 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fmancera Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > Add nfnetlink selftests to validate the ACKs sent after a batch > message. These tests verify that: > > - ACKs are always received in order. > - Module loading does not affect the responses. > - The number of ACKs matches the number of requests, unless a > fatal error occurs. Thanks for the tests! This looks good to me, justs one minor nit. Can you drop the shell wrapper and just call unshare(CLONE_NEWNET) from the fixture setup function? > +++ b/tools/testing/selftests/net/netfilter/Makefile > @@ -37,6 +37,7 @@ TEST_PROGS += nft_zones_many.sh > TEST_PROGS += rpath.sh > TEST_PROGS += vxlan_mtu_frag.sh > TEST_PROGS += xt_string.sh > +TEST_PROGS += nfnetlink.sh > TEST_PROGS_EXTENDED = nft_concat_range_perf.sh > > @@ -46,6 +47,7 @@ TEST_GEN_FILES += conntrack_dump_flush > TEST_GEN_FILES += conntrack_reverse_clash > TEST_GEN_FILES += sctp_collision > TEST_GEN_FILES += udpclash > +TEST_GEN_FILES += nfnetlink replacing this with TEST_GEN_PROGS = nfnetlink ... should make kselftests run this prog just like it does for TEST_PROGS shell scripts. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-04 10:46 ` Florian Westphal @ 2025-10-04 11:08 ` Nikolaos Gkarlis 2025-10-04 12:26 ` Florian Westphal 0 siblings, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-04 11:08 UTC (permalink / raw) To: Florian Westphal; +Cc: netfilter-devel, pablo, fmancera Florian Westphal <fw@strlen.de> wrote: > Can you drop the shell wrapper and just call unshare(CLONE_NEWNET) from the > fixture setup function? Yes, I can do that. However, I am wondering about cleanup both before and after the tests, as well as ensuring that the nft_ct module is not loaded prior to execution. Should these steps be included in the program, or do we assume that the tests will always run on a clean system with no modules already loaded? Thanks for the review! ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-04 11:08 ` Nikolaos Gkarlis @ 2025-10-04 12:26 ` Florian Westphal 2025-10-05 10:43 ` Nikolaos Gkarlis 0 siblings, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-04 12:26 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fmancera Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > Florian Westphal <fw@strlen.de> wrote: > > Can you drop the shell wrapper and just call unshare(CLONE_NEWNET) from the > > fixture setup function? > > Yes, I can do that. However, I am wondering about cleanup both > before and after the tests, as well as ensuring that the nft_ct > module is not loaded prior to execution. Is this to exercise replay path? Perhaps add a comment to the subtest that depends on this. I don't see the need for this otherwise. What happens when nft_ct is builtin? The test should not fail in that case. > Should these steps be > included in the program, or do we assume that the tests will > always run on a clean system with no modules already loaded? No, if you need nft_ct to not be loaded then the rmmod has to be used, but there is no guarantee it will work, e.g. because nft_ct is builtin or because its in use. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-04 12:26 ` Florian Westphal @ 2025-10-05 10:43 ` Nikolaos Gkarlis 2025-10-05 11:42 ` Florian Westphal 0 siblings, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-05 10:43 UTC (permalink / raw) To: Florian Westphal; +Cc: netfilter-devel, pablo, fmancera Florian Westphal <fw@strlen.de> wrote: > Is this to exercise replay path? > Perhaps add a comment to the subtest that depends on this. Yes, that is the goal. I can add a comment to the test explaining that. > What happens when nft_ct is builtin? The test should not fail > in that case. No, the test does not fail when nft_ct is builtin or missing. It only verifies the number of ACKs returned and their order, which is what bf2ac490d28c broke. I have tested both builtin and non-existent cases. > No, if you need nft_ct to not be loaded then the rmmod has to > be used, but there is no guarantee it will work, e.g. because > nft_ct is builtin or because it's in use. What do you think about something like this? +++ b/tools/testing/selftests/net/netfilter/nfnetlink.sh @@ -0,0 +1,5 @@ +#!/bin/bash + +# If nft_ct is a module and is loaded, remove it to test module auto-loading +lsmod | grep -q "^nft_ct\b" && rmmod nft_ct +./nfnetlink I can send a v3 if that looks okay to you. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-05 10:43 ` Nikolaos Gkarlis @ 2025-10-05 11:42 ` Florian Westphal 2025-10-05 12:54 ` [PATCH v3] " Nikolaos Gkarlis 0 siblings, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-05 11:42 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fmancera Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > Florian Westphal <fw@strlen.de> wrote: > > Is this to exercise replay path? > > Perhaps add a comment to the subtest that depends on this. > > Yes, that is the goal. I can add a comment to the test explaining that. Thanks! > verifies the number of ACKs returned and their order, which is what > bf2ac490d28c broke. I have tested both builtin and non-existent cases. Great. > What do you think about something like this? > > +++ b/tools/testing/selftests/net/netfilter/nfnetlink.sh > @@ -0,0 +1,5 @@ > +#!/bin/bash > + > +# If nft_ct is a module and is loaded, remove it to test module auto-loading Looks good, can you also mention that if removal doesn't work this is ok? > +lsmod | grep -q "^nft_ct\b" && rmmod nft_ct You could just "rmmod nft_ct 2>/dev/null" with above comment in place. > +./nfnetlink Since you need the shell wraper for the rmmod you could also make this unshare -n ./nfnetlink and not do it in the .c setup function. Your call. > I can send a v3 if that looks okay to you. Yes please. You can just send a v3 of the test case, no need to resend the fix. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-05 11:42 ` Florian Westphal @ 2025-10-05 12:54 ` Nikolaos Gkarlis 2025-10-08 10:26 ` Florian Westphal 0 siblings, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-05 12:54 UTC (permalink / raw) To: netfilter-devel; +Cc: pablo, fw, fmancera, Nikolaos Gkarlis Add nfnetlink selftests to validate the ACKs sent after a batch message. These tests verify that: - ACKs are always received in order. - Module loading does not affect the responses. - The number of ACKs matches the number of requests, unless a fatal error occurs. Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com> --- .../testing/selftests/net/netfilter/Makefile | 3 + .../selftests/net/netfilter/nfnetlink.c | 437 ++++++++++++++++++ .../selftests/net/netfilter/nfnetlink.sh | 6 + 3 files changed, 446 insertions(+) create mode 100644 tools/testing/selftests/net/netfilter/nfnetlink.c create mode 100755 tools/testing/selftests/net/netfilter/nfnetlink.sh diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile index a98ed892f55f..cb38c8a9b2cc 100644 --- a/tools/testing/selftests/net/netfilter/Makefile +++ b/tools/testing/selftests/net/netfilter/Makefile @@ -37,6 +37,7 @@ TEST_PROGS += nft_zones_many.sh TEST_PROGS += rpath.sh TEST_PROGS += vxlan_mtu_frag.sh TEST_PROGS += xt_string.sh +TEST_PROGS += nfnetlink.sh TEST_PROGS_EXTENDED = nft_concat_range_perf.sh @@ -46,6 +47,7 @@ TEST_GEN_FILES += conntrack_dump_flush TEST_GEN_FILES += conntrack_reverse_clash TEST_GEN_FILES += sctp_collision TEST_GEN_FILES += udpclash +TEST_GEN_FILES += nfnetlink include ../../lib.mk @@ -54,6 +56,7 @@ $(OUTPUT)/nf_queue: LDLIBS += $(MNL_LDLIBS) $(OUTPUT)/conntrack_dump_flush: CFLAGS += $(MNL_CFLAGS) $(OUTPUT)/conntrack_dump_flush: LDLIBS += $(MNL_LDLIBS) +$(OUTPUT)/nfnetlink: LDLIBS += $(MNL_LDLIBS) $(OUTPUT)/udpclash: LDLIBS += -lpthread TEST_FILES := lib.sh diff --git a/tools/testing/selftests/net/netfilter/nfnetlink.c b/tools/testing/selftests/net/netfilter/nfnetlink.c new file mode 100644 index 000000000000..d6a2895be1f0 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nfnetlink.c @@ -0,0 +1,437 @@ +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <sched.h> +#include <time.h> +#include <arpa/inet.h> +#include <libmnl/libmnl.h> +#include <linux/netfilter/nfnetlink.h> +#include <linux/netfilter/nf_tables.h> +#include <linux/netfilter/nf_conntrack_common.h> +#include <linux/netfilter.h> +#include <sys/types.h> + +#include "../../kselftest_harness.h" + +#define BATCH_PAGE_SIZE 8192 + +static bool batch_begin(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = NFNL_MSG_BATCH_BEGIN; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_UNSPEC; + nfh->version = NFNETLINK_V0; + nfh->res_id = NFNL_SUBSYS_NFTABLES; + + return mnl_nlmsg_batch_next(batch); +} + +static bool batch_end(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = NFNL_MSG_BATCH_END; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_UNSPEC; + nfh->version = NFNETLINK_V0; + nfh->res_id = NFNL_SUBSYS_NFTABLES; + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_table(struct mnl_nlmsg_batch *batch, const char *name, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWTABLE; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_TABLE_NAME, name); + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_chain(struct mnl_nlmsg_batch *batch, const char *table, + const char *chain, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWCHAIN; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_CHAIN_TABLE, table); + mnl_attr_put_strz(nlh, NFTA_CHAIN_NAME, chain); + + return mnl_nlmsg_batch_next(batch); +} + +static bool +create_rule_with_ct(struct mnl_nlmsg_batch *batch, const char *table, + const char *chain, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + struct nlattr *nest1, *nest2, *nest3, *nest4; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWRULE; + nlh->nlmsg_flags = + NLM_F_REQUEST | NLM_F_CREATE | NLM_F_APPEND | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + mnl_attr_put_strz(nlh, NFTA_RULE_TABLE, table); + mnl_attr_put_strz(nlh, NFTA_RULE_CHAIN, chain); + + /* Create expressions list */ + nest1 = mnl_attr_nest_start(nlh, NFTA_RULE_EXPRESSIONS); + + /* CT expression to load ct state */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "ct"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_CT_KEY, htonl(NFT_CT_STATE)); + mnl_attr_put_u32(nlh, NFTA_CT_DREG, htonl(NFT_REG_1)); + mnl_attr_nest_end(nlh, nest3); + + mnl_attr_nest_end(nlh, nest2); + + /* Bitwise expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "bitwise"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_BITWISE_SREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_BITWISE_DREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_BITWISE_LEN, htonl(4)); + + uint32_t mask = + NF_CT_STATE_BIT(IP_CT_ESTABLISHED) | NF_CT_STATE_BIT(IP_CT_RELATED); + nest4 = mnl_attr_nest_start(nlh, NFTA_BITWISE_MASK); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(mask), &mask); + mnl_attr_nest_end(nlh, nest4); + + uint32_t val = 0x00000000; + nest4 = mnl_attr_nest_start(nlh, NFTA_BITWISE_XOR); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(val), &val); + mnl_attr_nest_end(nlh, nest4); + + mnl_attr_nest_end(nlh, nest3); + mnl_attr_nest_end(nlh, nest2); + + /* Cmp expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "cmp"); + + nest3 = mnl_attr_nest_start(nlh, NFTA_EXPR_DATA); + mnl_attr_put_u32(nlh, NFTA_CMP_SREG, htonl(NFT_REG_1)); + mnl_attr_put_u32(nlh, NFTA_CMP_OP, htonl(NFT_CMP_NEQ)); + + uint32_t cmp_data = 0x00000000; + nest4 = mnl_attr_nest_start(nlh, NFTA_CMP_DATA); + mnl_attr_put(nlh, NFTA_DATA_VALUE, sizeof(cmp_data), &cmp_data); + mnl_attr_nest_end(nlh, nest4); + + mnl_attr_nest_end(nlh, nest3); + mnl_attr_nest_end(nlh, nest2); + + /* Counter expression */ + nest2 = mnl_attr_nest_start(nlh, NFTA_LIST_ELEM); + mnl_attr_put_strz(nlh, NFTA_EXPR_NAME, "counter"); + + mnl_attr_nest_end(nlh, nest2); + + mnl_attr_nest_end(nlh, nest1); + + return mnl_nlmsg_batch_next(batch); +} + +static bool create_invalid_table(struct mnl_nlmsg_batch *batch, uint32_t seq) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfh; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWTABLE; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = seq; + + nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg)); + nfh->nfgen_family = NFPROTO_INET; + nfh->version = NFNETLINK_V0; + nfh->res_id = 0; + + /* Intentionally omit table name attribute to cause error */ + + return mnl_nlmsg_batch_next(batch); +} + +static void validate_res(struct mnl_socket *nl, uint32_t first_seq, + uint32_t expected_acks, + struct __test_metadata *_metadata) +{ + char buf[BATCH_PAGE_SIZE]; + int ret, acks_received = 0; + uint32_t last_seq = 0; + bool out_of_order = false; + + while (1) { + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + if (ret == -1) { + if (errno == EAGAIN || errno == EWOULDBLOCK) { + break; + } + + TH_LOG("Unexpected error on recv: %s", strerror(errno)); + ASSERT_TRUE(false); + return; + } + + struct nlmsghdr *nlh = (struct nlmsghdr *)buf; + while (mnl_nlmsg_ok(nlh, ret)) { + if (nlh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *err = + mnl_nlmsg_get_payload(nlh); + acks_received++; + + if (err->error != 0) { + TH_LOG("[seq=%u] Error: %s", + nlh->nlmsg_seq, + strerror(-err->error)); + } else { + TH_LOG("[seq=%u] ACK", nlh->nlmsg_seq); + } + + if (nlh->nlmsg_seq <= last_seq) { + out_of_order = true; + TH_LOG + ("Out of order ack: seq %u after %u", + nlh->nlmsg_seq, last_seq); + } + last_seq = nlh->nlmsg_seq; + } + nlh = mnl_nlmsg_next(nlh, &ret); + } + } + + ASSERT_FALSE(out_of_order); + + ASSERT_EQ(acks_received, expected_acks); +} + +FIXTURE(nfnetlink_batch) +{ + struct mnl_socket *nl; +}; + +FIXTURE_SETUP(nfnetlink_batch) +{ + struct timeval tv = {.tv_sec = 1,.tv_usec = 0 }; + + ASSERT_EQ(unshare(CLONE_NEWNET), 0); + + self->nl = mnl_socket_open(NETLINK_NETFILTER); + ASSERT_NE(self->nl, NULL); + + ASSERT_EQ(mnl_socket_bind(self->nl, 0, MNL_SOCKET_AUTOPID), 0); + + ASSERT_EQ(setsockopt + (mnl_socket_get_fd(self->nl), SOL_SOCKET, SO_RCVTIMEO, &tv, + sizeof(tv)), 0); +} + +FIXTURE_TEARDOWN(nfnetlink_batch) +{ + if (self->nl) { + mnl_socket_close(self->nl); + } +} + +TEST_F(nfnetlink_batch, simple_batch) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL); + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + ASSERT_TRUE(batch_begin(batch, seq++)); + create_table(batch, "test_table1", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 3 acks: batch_begin, create_table, batch_end + validate_res(self->nl, seq - 3, 3, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +/* + * This test simulates the replay path by executing a batch that depends + * on the nft_ct module. The purpose is to verify that the same number of + * ACKs is received regardless of whether the module gets loaded automatically + * or not. If the module is built in or not present, the test will pass anyway. + */ +TEST_F(nfnetlink_batch, module_load) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 1000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table2", seq++); + create_chain(batch, "test_table2", "test_chain", seq++); + create_rule_with_ct(batch, "test_table2", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 5 acks: batch_begin, table, chain, rule, batch_end + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +/* + * Repeat the previous test, but this time the nft_ct module should have been + * loaded by the previous test (if it exists and is not built-in). In any case, + * the same number of ACKs as before is expected. + */ +TEST_F(nfnetlink_batch, post_module_load) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 2000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table6", seq++); + create_chain(batch, "test_table6", "test_chain", seq++); + create_rule_with_ct(batch, "test_table6", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_F(nfnetlink_batch, invalid_batch) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 3000; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table3", seq++); + create_invalid_table(batch, seq++); + create_chain(batch, "test_table3", "test_chain", seq++); + batch_end(batch, seq++); + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + ASSERT_GT(ret, 0); + + // Expect 5 acks: batch_begin, table, invalid_table(error), chain, batch_end + validate_res(self->nl, seq - 5, 5, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_F(nfnetlink_batch, batch_with_fatal_error) +{ + struct mnl_nlmsg_batch *batch; + char buf[BATCH_PAGE_SIZE]; + uint32_t seq = time(NULL) + 4000; + uid_t uid; + int ret; + + batch = mnl_nlmsg_batch_start(buf, sizeof(buf)); + ASSERT_NE(batch, NULL); + + batch_begin(batch, seq++); + create_table(batch, "test_table4", seq++); + create_table(batch, "test_table5", seq++); + batch_end(batch, seq++); + + // Drop privileges to trigger EPERM + uid = geteuid(); + if (uid == 0) { + ASSERT_EQ(seteuid(65534), 0); + } + + ret = mnl_socket_sendto(self->nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + + // Restore privileges + if (uid == 0) { + seteuid(uid); + } + + ASSERT_GT(ret, 0); + + // EPERM is fatal and should abort batch, expect only 1 ACK + validate_res(self->nl, seq - 4, 1, _metadata); + + mnl_nlmsg_batch_stop(batch); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/net/netfilter/nfnetlink.sh b/tools/testing/selftests/net/netfilter/nfnetlink.sh new file mode 100755 index 000000000000..3d264a4bde6f --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nfnetlink.sh @@ -0,0 +1,6 @@ +#!/bin/bash + +# If nft_ct is a module and is loaded, remove it to test module auto-loading. +# If removal fails, continue anyway since it won't affect the test result. +rmmod nft_ct 2>/dev/null +./nfnetlink \ No newline at end of file -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v3] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-05 12:54 ` [PATCH v3] " Nikolaos Gkarlis @ 2025-10-08 10:26 ` Florian Westphal 2025-10-08 10:37 ` Nikolaos Gkarlis 0 siblings, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-08 10:26 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fmancera Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > Add nfnetlink selftests to validate the ACKs sent after a batch > message. These tests verify that: > > - ACKs are always received in order. > - Module loading does not affect the responses. > - The number of ACKs matches the number of requests, unless a > fatal error occurs. I will keep this on hold while the discussion wrt. ACK flag on batches is ongoing. Also, checkpatch.pl reports: WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 #59: FILE: tools/testing/selftests/net/netfilter/nfnetlink.c:1: ... and a few other nits that should be resolved too. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-08 10:26 ` Florian Westphal @ 2025-10-08 10:37 ` Nikolaos Gkarlis 2025-10-08 10:39 ` Florian Westphal 0 siblings, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-08 10:37 UTC (permalink / raw) To: Florian Westphal; +Cc: netfilter-devel, pablo, fmancera Florian Westphal <fw@strlen.de> wrote: > I will keep this on hold while the discussion wrt. ACK flag on > batches is ongoing. Thanks, I appreciate you taking the time for this. > Also, checkpatch.pl reports: > > WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 > #59: FILE: tools/testing/selftests/net/netfilter/nfnetlink.c:1: > > ... and a few other nits that should be resolved too. Apologies, I will submit a new version ASAP. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3] selftests: netfilter: add nfnetlink ACK handling tests 2025-10-08 10:37 ` Nikolaos Gkarlis @ 2025-10-08 10:39 ` Florian Westphal 0 siblings, 0 replies; 27+ messages in thread From: Florian Westphal @ 2025-10-08 10:39 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo, fmancera Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > > ... and a few other nits that should be resolved too. > > Apologies, I will submit a new version ASAP. No need, please wait until the discussion has completed so we know if the selftest can be taken as-is or if it has to be adjusted (no-batch-end-ack-on-error). ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 0/2] always ACK batch end if requested 2025-10-02 11:03 ` Fernando Fernandez Mancera 2025-10-04 9:26 ` [PATCH v2 0/2] " Nikolaos Gkarlis @ 2025-10-04 9:38 ` Nikolaos Gkarlis 1 sibling, 0 replies; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-04 9:38 UTC (permalink / raw) To: netfilter-devel; +Cc: pablo, fw, fmancera, Nikolaos Gkarlis Hi again, Apologies for the delay. I have now added an nfnetlink selftest as suggested by Florian, and committed Fernando's suggestion to also check for the status. I have confirmed that the test fails on kernels > 6.10 that do not include Fernando's 09efbac953f6 fix. The test could be made more extensive, but I think this is a good starting point. Let me know what you think. This is a resend to fix threading. Sorry about that. Nikolaos Gkarlis (2): netfilter: nfnetlink: always ACK batch end if requested selftests: netfilter: add nfnetlink ACK handling tests net/netfilter/nfnetlink.c | 5 + .../testing/selftests/net/netfilter/Makefile | 3 + .../selftests/net/netfilter/nfnetlink.c | 424 ++++++++++++++++++ .../selftests/net/netfilter/nfnetlink.sh | 11 + 4 files changed, 443 insertions(+) create mode 100644 tools/testing/selftests/net/netfilter/nfnetlink.c create mode 100755 tools/testing/selftests/net/netfilter/nfnetlink.sh -- 2.34.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-01 21:15 [PATCH] netfilter: nfnetlink: always ACK batch end if requested Nikolaos Gkarlis 2025-10-02 9:48 ` Fernando Fernandez Mancera @ 2025-10-02 10:10 ` Florian Westphal 2025-10-02 10:46 ` Nikolaos Gkarlis 2025-10-07 20:33 ` Pablo Neira Ayuso 2 siblings, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-02 10:10 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, pablo Nikolaos Gkarlis <nickgarlis@gmail.com> wrote: > I have tested this change with my userspace application and it > seems to resolve the "problem". However, I am not sure if there > is a suitable place to add a regression test, since AFAIK nft > userspace does not currently use this feature. I would be happy > to contribute a test if you could point me to the right place. You could add a nfnetlink selftest to: tools/testing/selftests/net/netfilter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-02 10:10 ` [PATCH] netfilter: nfnetlink: " Florian Westphal @ 2025-10-02 10:46 ` Nikolaos Gkarlis 0 siblings, 0 replies; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-02 10:46 UTC (permalink / raw) To: Florian Westphal; +Cc: netfilter-devel, pablo, Fernando Fernandez Mancera Florian Westphal <fw@strlen.de> wrote: > You could add a nfnetlink selftest to: > tools/testing/selftests/net/netfilter Thanks! I will add the test and include it in the patch. I should have some time to look at it this evening. Since there seem to be many corner cases, I suggest we wait for the test before merging, unless there is a particular rush to get this out. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-01 21:15 [PATCH] netfilter: nfnetlink: always ACK batch end if requested Nikolaos Gkarlis 2025-10-02 9:48 ` Fernando Fernandez Mancera 2025-10-02 10:10 ` [PATCH] netfilter: nfnetlink: " Florian Westphal @ 2025-10-07 20:33 ` Pablo Neira Ayuso 2025-10-08 7:28 ` Florian Westphal 2025-10-08 8:41 ` Nikolaos Gkarlis 2 siblings, 2 replies; 27+ messages in thread From: Pablo Neira Ayuso @ 2025-10-07 20:33 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, fw Hi Nikolaos, On Wed, Oct 01, 2025 at 11:15:03PM +0200, Nikolaos Gkarlis wrote: > Before ACKs were introduced for batch begin and batch end messages, > userspace expected to receive the same number of ACKs as it sent, > unless a fatal error occurred. Regarding bf2ac490d28c, I don't understand why one needs an ack for _BEGIN message. Maybe, an ack for END message might make sense when BATCH_DONE is reached so you get a confirmation that the batch has been fully processed, however... > To preserve this deterministic behavior, send an ACK for batch end > messages even when an error happens in the middle of the batch, > similar to how ACKs are handled for command messages. I suspect the author of bf2ac490d28c is making wrong assumptions on the number of acknowledgements that are going to be received by userspace. Let's just forget about this bf2ac490d28c for a moment, a quick summary: #1 If you don't set NLM_F_ACK in your netlink messages in the batch (this is what netfilter's userspace does): then errors result in acknowledgement. But ENOBUFS is still possible: this means your batch has resulted in too many acknowledment messages (errors) filling up the userspace netlink socket buffer. #2 If you set NLM_F_ACK in your netlink messages in the batch: You get one acknowledgement for each message in the batch, with a sufficiently large batch, this may overrun the userspace socket buffer (ENOBUFS), then maybe the kernel was successful to fully process the transaction but some of those acks get lost. That is, 1bf2ac490d28c allows you to set on NLM_F_ACK for BATCH_END, but that acknowledgement can get lost in case of netlink socket overrun. ENOBUFS scenarios break assumptions on the number of messages that you are going to receive from the kernel. Netlink is a unreliable transport protocol, there are mechanisms to make it "more reliable" but message loss (particularly in the kernel -> userspace direction) is still possible. In this particular case, where batching several netlink messages in one single send(), userspace will not process the acknowledments messages in the userspace socket buffer until the batch is complete. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-07 20:33 ` Pablo Neira Ayuso @ 2025-10-08 7:28 ` Florian Westphal 2025-10-08 11:33 ` Pablo Neira Ayuso 2025-10-08 8:41 ` Nikolaos Gkarlis 1 sibling, 1 reply; 27+ messages in thread From: Florian Westphal @ 2025-10-08 7:28 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: Nikolaos Gkarlis, netfilter-devel, donald.hunter [ Cc Donald Hunter ] Pablo Neira Ayuso <pablo@netfilter.org> wrote: > On Wed, Oct 01, 2025 at 11:15:03PM +0200, Nikolaos Gkarlis wrote: > > Before ACKs were introduced for batch begin and batch end messages, > > userspace expected to receive the same number of ACKs as it sent, > > unless a fatal error occurred. > > Regarding bf2ac490d28c, I don't understand why one needs an ack for > _BEGIN message. Maybe, an ack for END message might make sense when > BATCH_DONE is reached so you get a confirmation that the batch has > been fully processed, however... ... which (BATCH_DONE reached) would be made moot by this proposed patch, as we would ACK it even if its not reached anymore. > I suspect the author of bf2ac490d28c is making wrong assumptions on > the number of acknowledgements that are going to be received by > userspace. > > Let's just forget about this bf2ac490d28c for a moment, a quick summary: > > #1 If you don't set NLM_F_ACK in your netlink messages in the batch > (this is what netfilter's userspace does): then errors result in > acknowledgement. But ENOBUFS is still possible: this means your batch > has resulted in too many acknowledment messages (errors) filling up > the userspace netlink socket buffer. > #2 If you set NLM_F_ACK in your netlink messages in the batch: > You get one acknowledgement for each message in the batch, with a > sufficiently large batch, this may overrun the userspace socket > buffer (ENOBUFS), then maybe the kernel was successful to fully > process the transaction but some of those acks get lost. Right, 1:1 relationship between messages and ACKs is only there for theoretical infinite receive buffer which makes this feature rather limited for batched case. > In this particular case, where batching several netlink messages in > one single send(), userspace will not process the acknowledments > messages in the userspace socket buffer until the batch is complete. OK, from what I gather you'd like for "netfilter: nfnetlink: always ACK batch end if requested" to not be applied. I would still like to apply the nfnetlink selftest however (even if it has to be trimmed/modified), because it does catch the issue fixed by Fernando [ 09efbac953f6 ("netfilter: nfnetlink: reset nlh pointer during batch replay") ]: ok 1 nfnetlink_batch.simple_batch # RUN nfnetlink_batch.module_load ... # nfnetlink.c:239:module_load:[seq=1759907514] ACK # nfnetlink.c:239:module_load:[seq=1759907512] ACK # nfnetlink.c:244:module_load:Out of order ack: seq 1759907512 after 1759907514 # nfnetlink.c:239:module_load:[seq=1759907513] ACK # nfnetlink.c:239:module_load:[seq=1759907514] ACK # nfnetlink.c:239:module_load:[seq=1759907515] ACK # nfnetlink.c:254:module_load:Expected 0 (0) == out_of_order (1) # module_load: Test terminated by assertion # FAIL nfnetlink_batch.module_load If the decision is that there should NOT be an ACK for the BATCH_END if there was an error, then the test only needs minor adjustment: - // Expect 5 acks: batch_begin, table, invalid_table(error), chain, batch_end - validate_res(self->nl, seq - 5, 5, _metadata); + // Expect 4 acks: batch_begin, table, invalid_table(error), chain + validate_res(self->nl, seq - 4, 4, _metadata); So, what is the 'more useful' behaviour? Choices are all equally bad: 1. If we want to always include it, it might not be there due to -ENOBUFS, which will always happen if the batch was large (triggers too many acks). 2. If we only include it on success, it might not be there for the same reason, so absence doesn't imply failure. HOWEVER, if the batch was small enough then 2) gives slightly more useable feedback in the sense that the entire batch was processed. So I am leaning towards not applying the nfnetlink patch but applying the (adjusted) test case. Other takes? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-08 7:28 ` Florian Westphal @ 2025-10-08 11:33 ` Pablo Neira Ayuso 2025-10-08 13:35 ` Donald Hunter 0 siblings, 1 reply; 27+ messages in thread From: Pablo Neira Ayuso @ 2025-10-08 11:33 UTC (permalink / raw) To: Florian Westphal; +Cc: Nikolaos Gkarlis, netfilter-devel, donald.hunter On Wed, Oct 08, 2025 at 09:28:26AM +0200, Florian Westphal wrote: > [ Cc Donald Hunter ] > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > On Wed, Oct 01, 2025 at 11:15:03PM +0200, Nikolaos Gkarlis wrote: > > > Before ACKs were introduced for batch begin and batch end messages, > > > userspace expected to receive the same number of ACKs as it sent, > > > unless a fatal error occurred. > > > > Regarding bf2ac490d28c, I don't understand why one needs an ack for > > _BEGIN message. Maybe, an ack for END message might make sense when > > BATCH_DONE is reached so you get a confirmation that the batch has > > been fully processed, however... > > ... which (BATCH_DONE reached) would be made moot by this proposed > patch, as we would ACK it even if its not reached anymore. Yes, I am inclined not to add more features to bf2ac490d28c (and follow up fixes patches that came with it). > > I suspect the author of bf2ac490d28c is making wrong assumptions on > > the number of acknowledgements that are going to be received by > > userspace. > > > > Let's just forget about this bf2ac490d28c for a moment, a quick summary: > > > > #1 If you don't set NLM_F_ACK in your netlink messages in the batch > > (this is what netfilter's userspace does): then errors result in > > acknowledgement. But ENOBUFS is still possible: this means your batch > > has resulted in too many acknowledment messages (errors) filling up > > the userspace netlink socket buffer. > > #2 If you set NLM_F_ACK in your netlink messages in the batch: > > You get one acknowledgement for each message in the batch, with a > > sufficiently large batch, this may overrun the userspace socket > > buffer (ENOBUFS), then maybe the kernel was successful to fully > > process the transaction but some of those acks get lost. > > Right, 1:1 relationship between messages and ACKs is only there for > theoretical infinite receive buffer which makes this feature rather limited > for batched case. Exactly. > > In this particular case, where batching several netlink messages in > > one single send(), userspace will not process the acknowledments > > messages in the userspace socket buffer until the batch is complete. > > OK, from what I gather you'd like for > "netfilter: nfnetlink: always ACK batch end if requested" > to not be applied. I think this at least needs more discussion, I think we are now understanding the implications of bf2ac490d28c. > I would still like to apply the nfnetlink selftest however (even > if it has to be trimmed/modified), because it does catch the issue > fixed by Fernando > [ 09efbac953f6 ("netfilter: nfnetlink: reset nlh pointer during batch replay") ]: > > ok 1 nfnetlink_batch.simple_batch > # RUN nfnetlink_batch.module_load ... > # nfnetlink.c:239:module_load:[seq=1759907514] ACK > # nfnetlink.c:239:module_load:[seq=1759907512] ACK > # nfnetlink.c:244:module_load:Out of order ack: seq 1759907512 after 1759907514 > # nfnetlink.c:239:module_load:[seq=1759907513] ACK > # nfnetlink.c:239:module_load:[seq=1759907514] ACK > # nfnetlink.c:239:module_load:[seq=1759907515] ACK > # nfnetlink.c:254:module_load:Expected 0 (0) == out_of_order (1) > # module_load: Test terminated by assertion > # FAIL nfnetlink_batch.module_load > > If the decision is that there should NOT be an ACK for the BATCH_END if > there was an error, then the test only needs minor adjustment: > > - // Expect 5 acks: batch_begin, table, invalid_table(error), chain, batch_end > - validate_res(self->nl, seq - 5, 5, _metadata); > + // Expect 4 acks: batch_begin, table, invalid_table(error), chain > + validate_res(self->nl, seq - 4, 4, _metadata); > > So, what is the 'more useful' behaviour? Choices are all equally bad: > > 1. If we want to always include it, it might not be there due to > -ENOBUFS, which will always happen if the batch was large (triggers > too many acks). Yes. > 2. If we only include it on success, it might not be there for the > same reason, so absence doesn't imply failure. Yes. > HOWEVER, if the batch was small enough then 2) gives slightly more > useable feedback in the sense that the entire batch was processed. Yes. I think Nikolaos pointed out that _BEGIN+NLM_F_ACK could actually provide an indication, with the assumption that the netlink userspace queue is going to be empty because it will be the first acknowledgement... > So I am leaning towards not applying the nfnetlink patch but applying > the (adjusted) test case. > > Other takes? Yes, I would start with this approach you propose, then keep discussing if it makes sense to keep extending bf2ac490d28c or leave it as is. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-08 11:33 ` Pablo Neira Ayuso @ 2025-10-08 13:35 ` Donald Hunter 2025-10-08 14:50 ` Florian Westphal 0 siblings, 1 reply; 27+ messages in thread From: Donald Hunter @ 2025-10-08 13:35 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: Florian Westphal, Nikolaos Gkarlis, netfilter-devel On Wed, 8 Oct 2025 at 12:33, Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > On Wed, Oct 08, 2025 at 09:28:26AM +0200, Florian Westphal wrote: > > [ Cc Donald Hunter ] Thanks for the cc. I'm having difficulty catching up with the thread of conversation, so forgive me if I miss something. > Yes, I am inclined not to add more features to bf2ac490d28c (and > follow up fixes patches that came with it). Is there a problem with bf2ac490d28c or just with things that have come after it? > > > I suspect the author of bf2ac490d28c is making wrong assumptions on > > > the number of acknowledgements that are going to be received by > > > userspace. Hopefully not. In the success case, ask for an ack, get an ack. Without that guarantee to userspace, we'd need to extend the YNL spec to say which messages don't honour acks. > > > Let's just forget about this bf2ac490d28c for a moment, a quick summary: > > > > > > #1 If you don't set NLM_F_ACK in your netlink messages in the batch > > > (this is what netfilter's userspace does): then errors result in > > > acknowledgement. But ENOBUFS is still possible: this means your batch > > > has resulted in too many acknowledment messages (errors) filling up > > > the userspace netlink socket buffer. > > > #2 If you set NLM_F_ACK in your netlink messages in the batch: > > > You get one acknowledgement for each message in the batch, with a > > > sufficiently large batch, this may overrun the userspace socket > > > buffer (ENOBUFS), then maybe the kernel was successful to fully > > > process the transaction but some of those acks get lost. Are people reporting ENOBUFS because of ACKs in practice or is this theoretical? > > Right, 1:1 relationship between messages and ACKs is only there for > > theoretical infinite receive buffer which makes this feature rather limited > > for batched case. > > Exactly. > > > > In this particular case, where batching several netlink messages in > > > one single send(), userspace will not process the acknowledments > > > messages in the userspace socket buffer until the batch is complete. > > > > OK, from what I gather you'd like for > > "netfilter: nfnetlink: always ACK batch end if requested" > > to not be applied. > > I think this at least needs more discussion, I think we are now > understanding the implications of bf2ac490d28c. Is the problem due to interplay between ACKs and ERRORs ? I don't think anyone has successfully specified the expected behaviour when errors get reported, namely how many acks versus how many errors could be expected. I know that the YNL python library reports the first error received and terminates, regardless of how many ACKs were received. My guess is that it should be possible to report as many errors as were received within the receive message length before exiting but I have not tried to implement that. > > So, what is the 'more useful' behaviour? Choices are all equally bad: > > > > 1. If we want to always include it, it might not be there due to > > -ENOBUFS, which will always happen if the batch was large (triggers > > too many acks). > > Yes. > > > 2. If we only include it on success, it might not be there for the > > same reason, so absence doesn't imply failure. > > Yes. > > > HOWEVER, if the batch was small enough then 2) gives slightly more > > useable feedback in the sense that the entire batch was processed. > > Yes. > > I think Nikolaos pointed out that _BEGIN+NLM_F_ACK could actually > provide an indication, with the assumption that the netlink userspace > queue is going to be empty because it will be the first > acknowledgement... > > > So I am leaning towards not applying the nfnetlink patch but applying > > the (adjusted) test case. > > > > Other takes? I think we should try to specify the behaviour and see if it makes sense before layering more functionality onto what is there. I can describe how the YNL python code sees the world, when it asks for ACKs: 1. An ack for BEGIN, cmd, cmd, ... , END in the success scenario. 2. An ack for BEGIN, cmd, cmd, ... up to the first ERROR in a failure scenario. > Yes, I would start with this approach you propose, then keep > discussing if it makes sense to keep extending bf2ac490d28c or leave > it as is. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-08 13:35 ` Donald Hunter @ 2025-10-08 14:50 ` Florian Westphal 0 siblings, 0 replies; 27+ messages in thread From: Florian Westphal @ 2025-10-08 14:50 UTC (permalink / raw) To: Donald Hunter; +Cc: Pablo Neira Ayuso, Nikolaos Gkarlis, netfilter-devel Donald Hunter <donald.hunter@gmail.com> wrote: > Thanks for the cc. I'm having difficulty catching up with the thread > of conversation, so forgive me if I miss something. > > > Yes, I am inclined not to add more features to bf2ac490d28c (and > > follow up fixes patches that came with it). > > Is there a problem with bf2ac490d28c or just with things that have > come after it? it broke golang library for nftables: https://github.com/google/nftables/issues/329 Fix is: https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next.git/commit/?id=09efbac953f6f076a07735f9ba885148d4796235 Current discussion centers more on what bf2ac490d28c wants, what it does, what netlink can provide in practice and what userspace would expect/need. > > > > I suspect the author of bf2ac490d28c is making wrong assumptions on > > > > the number of acknowledgements that are going to be received by > > > > userspace. > > Hopefully not. In the success case, ask for an ack, get an ack. > Without that guarantee to userspace, we'd need to extend the YNL spec > to say which messages don't honour acks. Its not about the types, its more about available buffer space for acks. > > > > #2 If you set NLM_F_ACK in your netlink messages in the batch: > > > > You get one acknowledgement for each message in the batch, with a > > > > sufficiently large batch, this may overrun the userspace socket > > > > buffer (ENOBUFS), then maybe the kernel was successful to fully > > > > process the transaction but some of those acks get lost. > > Are people reporting ENOBUFS because of ACKs in practice or is this theoretical? It happens in practice, e.g.: https://lore.kernel.org/netfilter/aKrK6h2zYdqj2unR@lilyboudoir.localdomain/ (every set element triggers another error message). Note, this is without NLM_F_ACK, just due to error messages triggered during batch processing. > I think we should try to specify the behaviour and see if it makes > sense before layering more functionality onto what is there. I think we can all agree on that :-) > I can describe how the YNL python code sees the world, when it asks for ACKs: > > 1. An ack for BEGIN, cmd, cmd, ... , END in the success scenario. > 2. An ack for BEGIN, cmd, cmd, ... up to the first ERROR in a failure scenario. Thanks for clarifying. This should work, IFF the batch is small enough so receive buffer can hold all acks. nftables doesn't bail out on first error (transaction will fail and no permanent changes are made, but it will move to next contained message. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-07 20:33 ` Pablo Neira Ayuso 2025-10-08 7:28 ` Florian Westphal @ 2025-10-08 8:41 ` Nikolaos Gkarlis 2025-10-08 11:09 ` Pablo Neira Ayuso 1 sibling, 1 reply; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-08 8:41 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: netfilter-devel, fw Pablo Neira Ayuso <pablo@netfilter.org> wrote: > Regarding bf2ac490d28c, I don't understand why one needs an ack for > _BEGIN message. Maybe, an ack for END message might make sense when > BATCH_DONE is reached so you get a confirmation that the batch has > been fully processed, however... _BEGIN might be excessive, but as you said, I do think _END could be useful in the way you describe. My assumption is that the author of 1bf2ac490d28c aims to standardize the behavior while also allowing some flexibility in what flags are sent. If someone tried to use those flags in a creative way that deviates from what nft userspace expects, they might run into difficulties handling the responses correctly. > I suspect the author of bf2ac490d28c is making wrong assumptions on > the number of acknowledgements that are going to be received by > userspace. That could very well be the case. As you said, you’re not always guaranteed to receive the same number of ACKs. I’m aware of the ENOBUFS error. Personally, I see it as a “fatal” or “delivery” error, which should tell userspace that no more messages are coming. Similar to EPERM which I have a test case for. It might not be the best approach, since one could argue such errors might also occur for individual batch commands. Still, now that I think about it, not receiving a _BEGIN message could indicate that the error is indeed fatal. Receiving an error about an invalid command isn’t necessarily a delivery failure (unlike ENOBUFS), and I’d still expect to get the entire message back, including the ACK. Otherwise, how would userspace know that it has read all messages and drained the buffer? You could argue that userspace should bail on the first error it receives, but if I’m not mistaken, the kernel will still send an error for any subsequent invalid command, meaning the buffer isn’t being drained again. > Netlink is a unreliable transport protocol, there are mechanisms to > make it "more reliable" but message loss (particularly in the > kernel -> userspace direction) is still possible. Is it unreliable mainly because of those corner cases, or are there other factors to consider as well? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-08 8:41 ` Nikolaos Gkarlis @ 2025-10-08 11:09 ` Pablo Neira Ayuso 2025-10-08 14:50 ` Nikolaos Gkarlis 0 siblings, 1 reply; 27+ messages in thread From: Pablo Neira Ayuso @ 2025-10-08 11:09 UTC (permalink / raw) To: Nikolaos Gkarlis; +Cc: netfilter-devel, fw Hi, On Wed, Oct 08, 2025 at 10:41:05AM +0200, Nikolaos Gkarlis wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > Regarding bf2ac490d28c, I don't understand why one needs an ack for > > _BEGIN message. Maybe, an ack for END message might make sense when > > BATCH_DONE is reached so you get a confirmation that the batch has > > been fully processed, however... > > _BEGIN might be excessive, but as you said, I do think _END could be > useful in the way you describe. > > My assumption is that the author of 1bf2ac490d28c aims to standardize > the behavior while also allowing some flexibility in what flags are > sent. If someone tried to use those flags in a creative way that > deviates from what nft userspace expects, they might run into > difficulties handling the responses correctly. I think the author of 1bf2ac490d28c is using it for a testing tool that sends very small batches (only few commands at a time). In that case, considering the default socket buffer size, the acknowledment is going to fit into the userspace netlink socket buffer. > > I suspect the author of bf2ac490d28c is making wrong assumptions on > > the number of acknowledgements that are going to be received by > > userspace. > > That could very well be the case. As you said, you’re not always > guaranteed to receive the same number of ACKs. > > I’m aware of the ENOBUFS error. Personally, I see it as a “fatal” or > “delivery” error, which should tell userspace that no more messages > are coming. In netlink, ENOBUFS is not "fatal", it means messages got lost, but there are still messages in the netlink socket buffer to be processed, ie. the netlink messages before the overrun are still in place, but the messages that could not fit in into the socket buffer has been dropped. nfnetlink handles a batch in two stages: 1) Process every netlink message in the batch, if either netlink message triggers an error or NLM_F_ACK is set on, then enqueue an error to the list. 2) If batch was successfully processed, iterate over the list of errors and create the netlink acknowledgement messages that is stores in the userspace netlink socket buffer. Since this is a batch of netlink messages, acknowledgement either triggered by explicit NLM_F_ACK or by errors may overrun the netlink socket buffer. > Similar to EPERM which I have a test case for. EPERM is indeed fatal. > It might not be the best approach, since one could argue such errors > might also occur for individual batch commands. Still, now that I > think about it, not receiving a _BEGIN message could indicate that > the error is indeed fatal. I think I see your point. Acknowledgement for _BEGIN will be likely in the netlink socket buffer, because it is the first message to be acknowledged, but _END is the last one to be processed, so it could get lost if many acknowledgements before have been queued to the userspace netlink socket buffer (leading to overrun). It seems with 1bf2ac490d28c, an acknowledgement with _BEGIN can be an indication of successfully handling a batch in the way you describe. > Receiving an error about an invalid command isn’t necessarily a > delivery failure (unlike ENOBUFS), and I’d still expect to get the > entire message back, including the ACK. Otherwise, how would userspace > know that it has read all messages and drained the buffer? For this nfnetlink batching, use select() to poll for pending messages to process, and make no assumptions on how many messages you receive. > You could argue that userspace should bail on the first error it > receives, but if I’m not mistaken, the kernel will still send an error > for any subsequent invalid command, meaning the buffer isn’t being > drained again. If you open, send then batch, process response, then close. Lazy approach that consist of bailing out on the first error is OK. The close call on the socket implicitly cleans up the ignored pending error messages in the socket buffer. But if you keep the socket open for several batches, with the approach you describe, then unprocessed netlink error messages will pile up on the socket buffer. If you do not do any sort of sequence tracking, then you application process old pending errors as new, libmnl handles this with EILSEQ. All these netlink subtle details are not easy to follow :). > > Netlink is a unreliable transport protocol, there are mechanisms to > > make it "more reliable" but message loss (particularly in the > > kernel -> userspace direction) is still possible. > > Is it unreliable mainly because of those corner cases, or are there > other factors to consider as well? As for netlink batching, which is supported in other classic netlink subsystems, this acknowledgement overrun issue exists, I am referring to the scenario where you add several netlink messages to the buffer and send() them to the kernel. As for nfnetlink, it is a bit special in that it has begin and end messages because this is needed for the transaction semantics (to implement a dryrun to test if incremental ruleset update is OK). 1bf2ac490d28c added the handling for NLM_F_ACK which I left it unspecified at the time. Netlink can be also used for event delivery to userspace, ENOBUFS can also happen in that case, but that is a different scenario. TBH, I am trying to remember the details, I don't talk about Netlink very often. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] netfilter: nfnetlink: always ACK batch end if requested 2025-10-08 11:09 ` Pablo Neira Ayuso @ 2025-10-08 14:50 ` Nikolaos Gkarlis 0 siblings, 0 replies; 27+ messages in thread From: Nikolaos Gkarlis @ 2025-10-08 14:50 UTC (permalink / raw) To: Pablo Neira Ayuso; +Cc: netfilter-devel, fw Hi again, Thanks for the detailed explanation. For the record, I am not strongly for or against one or the other approach but I do think that my approach is something to consider if batch ACKs are to be supported in the long run. > In netlink, ENOBUFS is not "fatal", it means messages got lost, but > there are still messages in the netlink socket buffer to be > processed, ie. the netlink messages before the overrun are still in > place, but the messages that could not fit in into the socket buffer > has been dropped. Yes perhaps in this instance "delivery" error would be a better description but I understand it's not an error in applying the batch. > For this nfnetlink batching, use select() to poll for pending > messages to process, and make no assumptions on how many messages > you receive. That's an option indeed although the library that I use https://github.com/google/nftables is unfortunately limited to blocking recvmsg calls and this is why it has to keep track of the sent acks and validate them after. I wonder if there are more cases like that ? In fact, that's the reason why I stumbled upon the duplicate ACK bug and attempted using ACKs on batch messages to work around the issue. Additionally, this library will dynamically adjust its buffer based on the expected response size. I appreciate you probably don't want to make architectural decisions based on unofficial library limitations but I thought it might be a good thing to mention that. To make the discussion clearer, I think we are referring to three different scenarios. I've outlined them below for clarity. Scenario 1: Success User sends: Command1 Command2 Command3 Kernel replies: Command1 - 0 Command2 - 0 Command3 - 0 Result: The batch is processed successfully. Scenario 2: Failure User sends: Command1 Command2 (invalid) Command3 (invalid) Kernel replies: Command1 - 0 Command2 - EINVAL Command3 - EINVAL Note: The reply for Command3 is still received even though Command2 failed. Scenario 3: Fatal / Permission Error User sends: Command1 Command2 Command3 Kernel replies: Command1 - EPERM Result: Indicates a fatal error. Processing stops immediately. Now let's add BATCH_BEGIN and BATCH_END into the mix. Scenario 1: Success with Batch User sends: BATCH_BEGIN Command1 Command2 Command3 BATCH_END Kernel replies: BATCH_BEGIN Command1 Command2 Command3 BATCH_END Result: Consistent behavior. Full batch completed successfully. Scenario 2: Failure with Batch User sends: BATCH_BEGIN Command1 Command2 (invalid) Command3 (invalid) BATCH_END Kernel replies: BATCH_BEGIN Command1 - 0 Command2 - EINVAL Command3 - EINVAL Observation: This seems inconsistent with earlier behavior. I would expect a BATCH_END reply here as well to indicate that the batch processing is complete, even if errors occurred. Scenario 3: Fatal Error with Batch User sends: BATCH_BEGIN Command1 Command2 Command3 BATCH_END Kernel replies: Command1 - EPERM Result: Matches the previous "fatal" case which I believe is what you're pointing out does not make sense and would require further changes which you don't think is a good idea to make. I’m not trying to push this change, since I don’t rely on batch ACKs. Just want to make sure we’re on the same page. Let me know how you’d like to proceed and whether I should update the test patch. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2025-10-08 14:50 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-01 21:15 [PATCH] netfilter: nfnetlink: always ACK batch end if requested Nikolaos Gkarlis 2025-10-02 9:48 ` Fernando Fernandez Mancera 2025-10-02 10:41 ` Nikolaos Gkarlis 2025-10-02 11:03 ` Fernando Fernandez Mancera 2025-10-04 9:26 ` [PATCH v2 0/2] " Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 1/2] netfilter: nfnetlink: " Nikolaos Gkarlis 2025-10-04 9:26 ` [PATCH v2 2/2] selftests: netfilter: add nfnetlink ACK handling tests Nikolaos Gkarlis 2025-10-04 10:46 ` Florian Westphal 2025-10-04 11:08 ` Nikolaos Gkarlis 2025-10-04 12:26 ` Florian Westphal 2025-10-05 10:43 ` Nikolaos Gkarlis 2025-10-05 11:42 ` Florian Westphal 2025-10-05 12:54 ` [PATCH v3] " Nikolaos Gkarlis 2025-10-08 10:26 ` Florian Westphal 2025-10-08 10:37 ` Nikolaos Gkarlis 2025-10-08 10:39 ` Florian Westphal 2025-10-04 9:38 ` [PATCH v2 0/2] always ACK batch end if requested Nikolaos Gkarlis 2025-10-02 10:10 ` [PATCH] netfilter: nfnetlink: " Florian Westphal 2025-10-02 10:46 ` Nikolaos Gkarlis 2025-10-07 20:33 ` Pablo Neira Ayuso 2025-10-08 7:28 ` Florian Westphal 2025-10-08 11:33 ` Pablo Neira Ayuso 2025-10-08 13:35 ` Donald Hunter 2025-10-08 14:50 ` Florian Westphal 2025-10-08 8:41 ` Nikolaos Gkarlis 2025-10-08 11:09 ` Pablo Neira Ayuso 2025-10-08 14:50 ` Nikolaos Gkarlis
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).