* nftables: Writers starve readers
@ 2023-06-01 9:37 Phil Sutter
2023-06-01 15:11 ` Florian Westphal
0 siblings, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2023-06-01 9:37 UTC (permalink / raw)
To: Pablo Neira Ayuso, Florian Westphal; +Cc: netfilter-devel
Hi!
I'm currently triaging a case where 'nft list ruleset' happens to take
more than 60s which causes trouble in the calling code. It is not
entirely clear what happens on the system that leads to this, so I'm
checking "suspicious" cases. One of them is "too many ruleset updates",
and indeed the following script is problematic:
| # init
| iptables-nft -N foo
| (
| echo "*filter";
| for ((i = 0; i < 100000; i++)); do
| echo "-A foo -m comment --comment \"rule $i\" -j ACCEPT"
| done
| echo "COMMIT"
| ) | iptables-nft-restore --noflush
|
| # flood
| while true; do
| iptables-nft -A foo -j ACCEPT
| iptables-nft -D foo -j ACCEPT
| done
A call to 'nft list ruleset' in a second terminal hangs without output.
It apparently hangs in nft_cache_update() because rule_cache_dump()
returns EINTR. On kernel side, I guess it stems from
nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
checked, but the generation counter likely increases while dumping the
100k rules.
One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
into the while-loop to unblock 'nft list ruleset' again. A new rule
every 4s especially in such a large ruleset is not that unrealistic IMO.
I wonder if we can provide some fairness to readers? Ideally a reader
would just see the ruleset as it was when it started dumping, but
keeping a copy of the large ruleset is probably not feasible.
Cheers, Phil
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-01 9:37 nftables: Writers starve readers Phil Sutter
@ 2023-06-01 15:11 ` Florian Westphal
2023-06-01 16:42 ` Phil Sutter
2023-06-01 20:06 ` Pablo Neira Ayuso
0 siblings, 2 replies; 7+ messages in thread
From: Florian Westphal @ 2023-06-01 15:11 UTC (permalink / raw)
To: Phil Sutter, Pablo Neira Ayuso, Florian Westphal, netfilter-devel
Phil Sutter <phil@nwl.cc> wrote:
> A call to 'nft list ruleset' in a second terminal hangs without output.
> It apparently hangs in nft_cache_update() because rule_cache_dump()
> returns EINTR. On kernel side, I guess it stems from
> nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> checked, but the generation counter likely increases while dumping the
> 100k rules.
Yes.
> One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> into the while-loop to unblock 'nft list ruleset' again. A new rule
> every 4s especially in such a large ruleset is not that unrealistic IMO.
Several seconds is very strange indeed, how is the data that needs
to be transferred to userspace and how large is the buffer provided
during dumps? strace would help here.
If thats rather small, then dumping a chain with 10k rules may
have to re-iterate the existig list for long time before it finds
the starting point on where to resume the dump.
> I wonder if we can provide some fairness to readers? Ideally a reader
> would just see the ruleset as it was when it started dumping, but
> keeping a copy of the large ruleset is probably not feasible.
I can't think of a good solution. We could add a
"--allow-inconsistent-dump" flag to nftables that disables the restart
logic for -EINTR case, but we can't make that the default unfortunately.
Or we could experiment with serializing the remaining rules into a
private kernel-side kmalloc'd buffer once the userspace buffer is
full, then copy from that buffer on resume without the inconsistency check.
I don't think that we can solve this, slowing down writers when there
are dumpers will load to the same issue, just in the oppostite direction.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-01 15:11 ` Florian Westphal
@ 2023-06-01 16:42 ` Phil Sutter
2023-06-01 20:06 ` Pablo Neira Ayuso
1 sibling, 0 replies; 7+ messages in thread
From: Phil Sutter @ 2023-06-01 16:42 UTC (permalink / raw)
To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel
On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> Phil Sutter <phil@nwl.cc> wrote:
> > A call to 'nft list ruleset' in a second terminal hangs without output.
> > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > returns EINTR. On kernel side, I guess it stems from
> > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > checked, but the generation counter likely increases while dumping the
> > 100k rules.
>
> Yes.
>
> > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > every 4s especially in such a large ruleset is not that unrealistic IMO.
>
> Several seconds is very strange indeed, how is the data that needs
> to be transferred to userspace and how large is the buffer provided
> during dumps? strace would help here.
Each recvmsg() call returns 32KB, grepping for NFT_MSG_NEWRULE returns
4290 lines.
| # time ./src/nft list ruleset | wc -l
| # Warning: table ip filter is managed by iptables-nft, do not touch!
| # Warning: table ip nat is managed by iptables-nft, do not touch!
| # Warning: table ip mangle is managed by iptables-nft, do not touch!
| 100190
|
| real 0m5.572s
| user 0m1.014s
| sys 0m4.885s
> If thats rather small, then dumping a chain with 10k rules may
> have to re-iterate the existig list for long time before it finds
> the starting point on where to resume the dump.
To my surprise, the mnl_nft_rule_dump() code-path does not call
mnl_set_rcvbuffer(). Though explicitly calling it from nft_mnl_talk() passing
1<<24 as buffer size does not lead to different behaviour. I seem to recall the
32k was a kernel-side limit in netlink?
> > I wonder if we can provide some fairness to readers? Ideally a reader
> > would just see the ruleset as it was when it started dumping, but
> > keeping a copy of the large ruleset is probably not feasible.
>
> I can't think of a good solution. We could add a
> "--allow-inconsistent-dump" flag to nftables that disables the restart
> logic for -EINTR case, but we can't make that the default unfortunately.
>
> Or we could experiment with serializing the remaining rules into a
> private kernel-side kmalloc'd buffer once the userspace buffer is
> full, then copy from that buffer on resume without the inconsistency check.
>
> I don't think that we can solve this, slowing down writers when there
> are dumpers will load to the same issue, just in the oppostite direction.
You're probably right, thanks for spending cycles on it.
Cheers, Phil
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-01 15:11 ` Florian Westphal
2023-06-01 16:42 ` Phil Sutter
@ 2023-06-01 20:06 ` Pablo Neira Ayuso
2023-06-02 12:23 ` Phil Sutter
1 sibling, 1 reply; 7+ messages in thread
From: Pablo Neira Ayuso @ 2023-06-01 20:06 UTC (permalink / raw)
To: Florian Westphal; +Cc: Phil Sutter, netfilter-devel
On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> Phil Sutter <phil@nwl.cc> wrote:
> > A call to 'nft list ruleset' in a second terminal hangs without output.
> > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > returns EINTR. On kernel side, I guess it stems from
> > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > checked, but the generation counter likely increases while dumping the
> > 100k rules.
>
> Yes.
>
> > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > every 4s especially in such a large ruleset is not that unrealistic IMO.
>
> Several seconds is very strange indeed, how is the data that needs
> to be transferred to userspace and how large is the buffer provided
> during dumps? strace would help here.
>
> If thats rather small, then dumping a chain with 10k rules may
> have to re-iterate the existig list for long time before it finds
> the starting point on where to resume the dump.
>
> > I wonder if we can provide some fairness to readers? Ideally a reader
> > would just see the ruleset as it was when it started dumping, but
> > keeping a copy of the large ruleset is probably not feasible.
>
> I can't think of a good solution. We could add a
> "--allow-inconsistent-dump" flag to nftables that disables the restart
> logic for -EINTR case, but we can't make that the default unfortunately.
>
> Or we could experiment with serializing the remaining rules into a
> private kernel-side kmalloc'd buffer once the userspace buffer is
> full, then copy from that buffer on resume without the inconsistency check.
>
> I don't think that we can solve this, slowing down writers when there
> are dumpers will load to the same issue, just in the oppostite direction.
There are currently two pending issues that, if addressed, could
improve things:
NLM_F_INTR is set on in case writer infers with a reader, currently
forcing userspace to read all of the remaining messages to leave
things in consistent state, otherwise next dump request hits EILSEQ in
libmnl. Before 6d085b22a8b5 ("table: support for the table owner
flag"), the socket was closed and reopen to workaround this issue.
There should be a way to discard the ongoing netlink dump without
having to read the remaining messages, that should also improve
things.
It should be possible to add generation counters per object type, so
userspace does not have to ditch all what it has in its cache, only
what it has changed. Currently the generation counter is global.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-01 20:06 ` Pablo Neira Ayuso
@ 2023-06-02 12:23 ` Phil Sutter
2023-06-02 22:11 ` Pablo Neira Ayuso
0 siblings, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2023-06-02 12:23 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel
On Thu, Jun 01, 2023 at 10:06:24PM +0200, Pablo Neira Ayuso wrote:
> On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> > Phil Sutter <phil@nwl.cc> wrote:
> > > A call to 'nft list ruleset' in a second terminal hangs without output.
> > > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > > returns EINTR. On kernel side, I guess it stems from
> > > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > > checked, but the generation counter likely increases while dumping the
> > > 100k rules.
> >
> > Yes.
> >
> > > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > > every 4s especially in such a large ruleset is not that unrealistic IMO.
> >
> > Several seconds is very strange indeed, how is the data that needs
> > to be transferred to userspace and how large is the buffer provided
> > during dumps? strace would help here.
> >
> > If thats rather small, then dumping a chain with 10k rules may
> > have to re-iterate the existig list for long time before it finds
> > the starting point on where to resume the dump.
> >
> > > I wonder if we can provide some fairness to readers? Ideally a reader
> > > would just see the ruleset as it was when it started dumping, but
> > > keeping a copy of the large ruleset is probably not feasible.
> >
> > I can't think of a good solution. We could add a
> > "--allow-inconsistent-dump" flag to nftables that disables the restart
> > logic for -EINTR case, but we can't make that the default unfortunately.
> >
> > Or we could experiment with serializing the remaining rules into a
> > private kernel-side kmalloc'd buffer once the userspace buffer is
> > full, then copy from that buffer on resume without the inconsistency check.
> >
> > I don't think that we can solve this, slowing down writers when there
> > are dumpers will load to the same issue, just in the oppostite direction.
>
> There are currently two pending issues that, if addressed, could
> improve things:
>
> NLM_F_INTR is set on in case writer infers with a reader, currently
> forcing userspace to read all of the remaining messages to leave
> things in consistent state, otherwise next dump request hits EILSEQ in
> libmnl. Before 6d085b22a8b5 ("table: support for the table owner
> flag"), the socket was closed and reopen to workaround this issue.
> There should be a way to discard the ongoing netlink dump without
> having to read the remaining messages, that should also improve
> things.
I tried restoring the immediate return from nft_mnl_recv() adding socket
close and open calls to sanitize things. Assuming my changes are
correct, they don't have a noticeable effect: The same test-case still
allows for a 4s delay in the rule add'n'delete loop to starve 'nft list
ruleset'.
> It should be possible to add generation counters per object type, so
> userspace does not have to ditch all what it has in its cache, only
> what it has changed. Currently the generation counter is global.
I guess the added complexity is probably not worth it. Kernel-side could
be pretty simple, but user space could no longer rely upon
nft_cache::genid but had to fetch each object's genid to check if the
local cache is outdated, plus I have no idea how one would detect that a
new table was added.
Cheers, Phil
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-02 12:23 ` Phil Sutter
@ 2023-06-02 22:11 ` Pablo Neira Ayuso
2023-06-02 22:54 ` Pablo Neira Ayuso
0 siblings, 1 reply; 7+ messages in thread
From: Pablo Neira Ayuso @ 2023-06-02 22:11 UTC (permalink / raw)
To: Phil Sutter, Florian Westphal, netfilter-devel
[-- Attachment #1: Type: text/plain, Size: 4121 bytes --]
On Fri, Jun 02, 2023 at 02:23:53PM +0200, Phil Sutter wrote:
> On Thu, Jun 01, 2023 at 10:06:24PM +0200, Pablo Neira Ayuso wrote:
> > On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> > > Phil Sutter <phil@nwl.cc> wrote:
> > > > A call to 'nft list ruleset' in a second terminal hangs without output.
> > > > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > > > returns EINTR. On kernel side, I guess it stems from
> > > > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > > > checked, but the generation counter likely increases while dumping the
> > > > 100k rules.
> > >
> > > Yes.
> > >
> > > > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > > > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > > > every 4s especially in such a large ruleset is not that unrealistic IMO.
> > >
> > > Several seconds is very strange indeed, how is the data that needs
> > > to be transferred to userspace and how large is the buffer provided
> > > during dumps? strace would help here.
> > >
> > > If thats rather small, then dumping a chain with 10k rules may
> > > have to re-iterate the existig list for long time before it finds
> > > the starting point on where to resume the dump.
> > >
> > > > I wonder if we can provide some fairness to readers? Ideally a reader
> > > > would just see the ruleset as it was when it started dumping, but
> > > > keeping a copy of the large ruleset is probably not feasible.
> > >
> > > I can't think of a good solution. We could add a
> > > "--allow-inconsistent-dump" flag to nftables that disables the restart
> > > logic for -EINTR case, but we can't make that the default unfortunately.
> > >
> > > Or we could experiment with serializing the remaining rules into a
> > > private kernel-side kmalloc'd buffer once the userspace buffer is
> > > full, then copy from that buffer on resume without the inconsistency check.
> > >
> > > I don't think that we can solve this, slowing down writers when there
> > > are dumpers will load to the same issue, just in the oppostite direction.
> >
> > There are currently two pending issues that, if addressed, could
> > improve things:
> >
> > NLM_F_INTR is set on in case writer infers with a reader, currently
> > forcing userspace to read all of the remaining messages to leave
> > things in consistent state, otherwise next dump request hits EILSEQ in
> > libmnl. Before 6d085b22a8b5 ("table: support for the table owner
> > flag"), the socket was closed and reopen to workaround this issue.
> > There should be a way to discard the ongoing netlink dump without
> > having to read the remaining messages, that should also improve
> > things.
>
> I tried restoring the immediate return from nft_mnl_recv() adding socket
> close and open calls to sanitize things. Assuming my changes are
> correct, they don't have a noticeable effect: The same test-case still
> allows for a 4s delay in the rule add'n'delete loop to starve 'nft list
> ruleset'.
Not for this particular torture test, but for some other usercases, it
might be useful.
> > It should be possible to add generation counters per object type, so
> > userspace does not have to ditch all what it has in its cache, only
> > what it has changed. Currently the generation counter is global.
>
> I guess the added complexity is probably not worth it. Kernel-side could
> be pretty simple, but user space could no longer rely upon
> nft_cache::genid but had to fetch each object's genid to check if the
> local cache is outdated, plus I have no idea how one would detect that a
> new table was added.
I made a patch to add a fall back, it displays a warning:
# Warning: ruleset has been updated while listing
it falls back to this mode if it takes more than 5 seconds to fetch a
consistent ruleset.
I think I still need a new function to make consistency check, in case
rule refering to unexisting chain, ie. top-level object in the
hierarchy is missing, to avoid crashes. Such code would only exercised
in case this fallback mode is enabled.
See attachment.
[-- Attachment #2: x.patch --]
[-- Type: text/x-diff, Size: 5036 bytes --]
diff --git a/include/cache.h b/include/cache.h
index 934c3a74fa95..c5c585bacaf7 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -136,6 +136,7 @@ struct nft_cache {
struct cache table_cache;
uint32_t seqnum;
uint32_t flags;
+ bool eintr;
};
void nft_chain_cache_update(struct netlink_ctx *ctx, struct table *table,
diff --git a/include/netlink.h b/include/netlink.h
index d52434c72bc2..d3e828f270e1 100644
--- a/include/netlink.h
+++ b/include/netlink.h
@@ -85,6 +85,7 @@ struct netlink_ctx {
uint32_t seqnum;
struct nftnl_batch *batch;
int maybe_emsgsize;
+ bool ignore_eintr;
};
extern struct nftnl_expr *alloc_nft_expr(const char *name);
diff --git a/src/cache.c b/src/cache.c
index 95adee7f8ac1..9237397d2d3e 100644
--- a/src/cache.c
+++ b/src/cache.c
@@ -15,6 +15,7 @@
#include <netlink.h>
#include <mnl.h>
#include <libnftnl/chain.h>
+#include <sys/time.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nf_tables.h>
@@ -1175,10 +1176,14 @@ bool nft_cache_needs_update(struct nft_cache *cache)
return cache->flags & NFT_CACHE_UPDATE;
}
+/* display inconsistent ruleset after 5 seconds of retries. */
+#define MAX_RETRY_TIME 5
+
int nft_cache_update(struct nft_ctx *nft, unsigned int flags,
struct list_head *msgs,
const struct nft_cache_filter *filter)
{
+ struct timeval tv_start, tv_last, tv_diff;
struct netlink_ctx ctx = {
.list = LIST_HEAD_INIT(ctx.list),
.nft = nft,
@@ -1187,7 +1192,14 @@ int nft_cache_update(struct nft_ctx *nft, unsigned int flags,
struct nft_cache *cache = &nft->cache;
uint32_t genid, genid_stop, oldflags;
int ret;
+
+ gettimeofday(&tv_start, NULL);
replay:
+ gettimeofday(&tv_last, NULL);
+ timersub(&tv_diff, &tv_last, &tv_start);
+ if (tv_diff.tv_sec > MAX_RETRY_TIME)
+ ctx.ignore_eintr = true;
+
ctx.seqnum = cache->seqnum++;
genid = mnl_genid_get(&ctx);
if (!nft_cache_needs_refresh(cache) &&
@@ -1223,12 +1235,16 @@ replay:
genid_stop = mnl_genid_get(&ctx);
if (genid != genid_stop) {
- nft_cache_release(cache);
- goto replay;
+ if (!ctx.ignore_eintr) {
+ nft_cache_release(cache);
+ goto replay;
+ }
+ cache->eintr = true;
}
skip:
cache->genid = genid;
cache->flags = flags;
+
return 0;
}
diff --git a/src/mnl.c b/src/mnl.c
index 91775c41b246..f985e90c5926 100644
--- a/src/mnl.c
+++ b/src/mnl.c
@@ -706,8 +706,13 @@ struct nftnl_rule_list *mnl_nft_rule_dump(struct netlink_ctx *ctx, int family,
}
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, rule_cb, nlr_list);
- if (ret < 0)
- goto err;
+ if (ret < 0) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nlr_list;
err:
@@ -1029,8 +1034,13 @@ struct nftnl_chain_list *mnl_nft_chain_dump(struct netlink_ctx *ctx,
}
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, chain_cb, nlc_list);
- if (ret < 0 && errno != ENOENT)
- goto err;
+ if (ret < 0 && errno != ENOENT) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nlc_list;
err:
@@ -1174,8 +1184,13 @@ struct nftnl_table_list *mnl_nft_table_dump(struct netlink_ctx *ctx,
}
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, table_cb, nlt_list);
- if (ret < 0 && errno != ENOENT)
- goto err;
+ if (ret < 0 && errno != ENOENT) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nlt_list;
err:
@@ -1428,8 +1443,13 @@ mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
memory_allocation_error();
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, nls_list);
- if (ret < 0 && errno != ENOENT)
- goto err;
+ if (ret < 0 && errno != ENOENT) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nls_list;
err:
@@ -1653,8 +1673,13 @@ mnl_nft_obj_dump(struct netlink_ctx *ctx, int family,
memory_allocation_error();
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, obj_cb, nln_list);
- if (ret < 0)
- goto err;
+ if (ret < 0) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nln_list;
err:
@@ -1967,8 +1992,13 @@ mnl_nft_flowtable_dump(struct netlink_ctx *ctx, int family,
memory_allocation_error();
ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, flowtable_cb, nln_list);
- if (ret < 0 && errno != ENOENT)
- goto err;
+ if (ret < 0 && errno != ENOENT) {
+ if (errno != EINTR)
+ goto err;
+
+ if (!ctx->ignore_eintr)
+ goto err;
+ }
return nln_list;
err:
diff --git a/src/rule.c b/src/rule.c
index 633a5a12486d..3f984ea1fec5 100644
--- a/src/rule.c
+++ b/src/rule.c
@@ -2192,6 +2192,9 @@ static int do_list_ruleset(struct netlink_ctx *ctx, struct cmd *cmd)
unsigned int family = cmd->handle.family;
struct table *table;
+ if (ctx->nft->cache.eintr)
+ nft_print(&ctx->nft->output, "# Warning: ruleset has been updated while listing\n");
+
list_for_each_entry(table, &ctx->nft->cache.table_cache.list, cache.list) {
if (family != NFPROTO_UNSPEC &&
table->handle.family != family)
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: nftables: Writers starve readers
2023-06-02 22:11 ` Pablo Neira Ayuso
@ 2023-06-02 22:54 ` Pablo Neira Ayuso
0 siblings, 0 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2023-06-02 22:54 UTC (permalink / raw)
To: Phil Sutter, Florian Westphal, netfilter-devel
On Sat, Jun 03, 2023 at 12:11:34AM +0200, Pablo Neira Ayuso wrote:
> On Fri, Jun 02, 2023 at 02:23:53PM +0200, Phil Sutter wrote:
> > On Thu, Jun 01, 2023 at 10:06:24PM +0200, Pablo Neira Ayuso wrote:
> > > On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> > > > Phil Sutter <phil@nwl.cc> wrote:
> > > > > A call to 'nft list ruleset' in a second terminal hangs without output.
> > > > > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > > > > returns EINTR. On kernel side, I guess it stems from
> > > > > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > > > > checked, but the generation counter likely increases while dumping the
> > > > > 100k rules.
> > > >
> > > > Yes.
> > > >
> > > > > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > > > > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > > > > every 4s especially in such a large ruleset is not that unrealistic IMO.
> > > >
> > > > Several seconds is very strange indeed, how is the data that needs
> > > > to be transferred to userspace and how large is the buffer provided
> > > > during dumps? strace would help here.
> > > >
> > > > If thats rather small, then dumping a chain with 10k rules may
> > > > have to re-iterate the existig list for long time before it finds
> > > > the starting point on where to resume the dump.
> > > >
> > > > > I wonder if we can provide some fairness to readers? Ideally a reader
> > > > > would just see the ruleset as it was when it started dumping, but
> > > > > keeping a copy of the large ruleset is probably not feasible.
> > > >
> > > > I can't think of a good solution. We could add a
> > > > "--allow-inconsistent-dump" flag to nftables that disables the restart
> > > > logic for -EINTR case, but we can't make that the default unfortunately.
> > > >
> > > > Or we could experiment with serializing the remaining rules into a
> > > > private kernel-side kmalloc'd buffer once the userspace buffer is
> > > > full, then copy from that buffer on resume without the inconsistency check.
> > > >
> > > > I don't think that we can solve this, slowing down writers when there
> > > > are dumpers will load to the same issue, just in the oppostite direction.
> > >
> > > There are currently two pending issues that, if addressed, could
> > > improve things:
> > >
> > > NLM_F_INTR is set on in case writer infers with a reader, currently
> > > forcing userspace to read all of the remaining messages to leave
> > > things in consistent state, otherwise next dump request hits EILSEQ in
> > > libmnl. Before 6d085b22a8b5 ("table: support for the table owner
> > > flag"), the socket was closed and reopen to workaround this issue.
> > > There should be a way to discard the ongoing netlink dump without
> > > having to read the remaining messages, that should also improve
> > > things.
> >
> > I tried restoring the immediate return from nft_mnl_recv() adding socket
> > close and open calls to sanitize things. Assuming my changes are
> > correct, they don't have a noticeable effect: The same test-case still
> > allows for a 4s delay in the rule add'n'delete loop to starve 'nft list
> > ruleset'.
>
> Not for this particular torture test, but for some other usercases, it
> might be useful.
>
> > > It should be possible to add generation counters per object type, so
> > > userspace does not have to ditch all what it has in its cache, only
> > > what it has changed. Currently the generation counter is global.
> >
> > I guess the added complexity is probably not worth it. Kernel-side could
> > be pretty simple, but user space could no longer rely upon
> > nft_cache::genid but had to fetch each object's genid to check if the
> > local cache is outdated, plus I have no idea how one would detect that a
> > new table was added.
>
> I made a patch to add a fall back, it displays a warning:
>
> # Warning: ruleset has been updated while listing
>
> it falls back to this mode if it takes more than 5 seconds to fetch a
> consistent ruleset.
>
> I think I still need a new function to make consistency check, in case
> rule refering to unexisting chain, ie. top-level object in the
> hierarchy is missing, to avoid crashes. Such code would only exercised
> in case this fallback mode is enabled.
At least this chunk to discard rules that are orphaned, is needed:
@@ -986,8 +987,13 @@ static int rule_init_cache(struct netlink_ctx *ctx, struct table *table,
if (!chain)
chain = chain_binding_lookup(table,
rule->handle.chain.name);
- if (!chain)
- goto err_ctx_list;
+ if (!chain) {
+ if (!ctx->ignore_eintr)
+ goto err_ctx_list;
+
+ list_del(&rule->list);
+ rule_free(rule);
+ }
list_move_tail(&rule->list, &chain->rules);
}
I remember we used to have a script to check for crashes when writers
race with readers, I cannot find it in tests/shell.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-06-02 22:54 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-01 9:37 nftables: Writers starve readers Phil Sutter
2023-06-01 15:11 ` Florian Westphal
2023-06-01 16:42 ` Phil Sutter
2023-06-01 20:06 ` Pablo Neira Ayuso
2023-06-02 12:23 ` Phil Sutter
2023-06-02 22:11 ` Pablo Neira Ayuso
2023-06-02 22:54 ` Pablo Neira Ayuso
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.