From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Phil Sutter <phil@nwl.cc>, Florian Westphal <fw@strlen.de>,
netfilter-devel@vger.kernel.org
Subject: Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
Date: Wed, 26 Aug 2020 17:32:19 +0200 [thread overview]
Message-ID: <20200826153219.GA2640@salvia> (raw)
In-Reply-To: <20200824131104.GC23632@orbyte.nwl.cc>
Hi Phil,
On Mon, Aug 24, 2020 at 03:11:04PM +0200, Phil Sutter wrote:
[...]
> On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote:
> > Hi Phil,
> >
> > On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> > > Hi,
> > >
> > > Starting firewalld with two active zones in an lxc container provokes a
> > > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > > time.
> > >
> > > I identified netlink_attachskb() as the originator for the above error
> > > code. The conditional leading to it looks like this:
> > >
> > > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > > | test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > > | [...]
> > > | if (!*timeo) {
> > >
> > > *timeo is zero, so this seems to be a non-blocking socket. Both
> > > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > > sk->sk_rcvbuf.
> > >
> > > From user space side, firewalld seems to simply call sendto() and the
> > > call never returns.
> > >
> > > How to solve that? I tried to find other code which does the same, but I
> > > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > > backend?
> >
> > It's a bug in the netlink frontend, which erroneously reports -EAGAIN
> > to the nfnetlink when the socket buffer is full, see:
> >
> > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/
>
> Obviously this avoids the lockup. As correctly assumed by Florian,
> firewalld startup fails instead. (The daemon keeps running, but an error
> message is printed indicating that initial ruleset setup failed.)
Thanks for confirming, I'll apply this patch to nf.git.
> [...]
> > > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > > if this is related and how.
> >
> > Next problem is to track why socket buffer is getting full with
> > GET_GENID.
> >
> > firewalld heavily uses NLM_F_ECHO, there I can see how it can easily
> > reach the default socket buffer size, but with GET_GENID I'm not sure
> > yet, probably the problem is elsewhere but it manifests in GET_GENID
> > because it's the first thing that is done when sending a batch (maybe
> > there are unread messages in the socket buffer, you might check
> > /proc/net/netlink to see if the socket buffer keeps growing as
> > firewalld moves on).
>
> Yes, it happens only for echo mode. With your fix in place, I also see
> what firewalld is trying to do: The JSON input leading to the error is
> huge (~72k characters). I suspect that GET_GENID just happens to be the
> last straw. Or my debugging was faulty somehow and netlink_attachskb()
> really got called via a different code-path.
>
> > Is this easy to reproduce? Or does this happens after some time of
> > firewalld execution?
>
> The necessary lxd setup aside, it's pretty trivial: launch an instance
> of images:centos/8/amd64, install firewalld therein, add two zone files
> and start firewalld. It happens immediately, so two active zones already
> make firewalld generate enough rules to exceed the buffer space.
>
> On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote:
> > Frontend callback reports EAGAIN to nfnetlink to retry a command, this
> > is used to signal that module autoloading is required. Unfortunately,
> > nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
> > full, so it enters a busy-loop.
> >
> > This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
> > to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
> > since this is always MSG_DONTWAIT in the existing code which is exactly
> > what nlmsg_unicast() passes to netlink_unicast() as parameter.
> >
> > Fixes: 96518518cc41 ("netfilter: add nftables")
> > Reported-by: Phil Sutter <phil@nwl.cc>
> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>
> This indeed "fixes" the problem. Or rather, exposes the actual problem
> in echo-related code, namely the tendency to exhaust socket buffers.
>
> So the problem we're facing is that while user space still waits for
> sendmsg() to complete, receive buffer fills up. Is it possible to buffer
> the data in kernel somewhere else so user space has a chance to call
> recvmsg()?
There several possibilities, just a few that can be explored:
* I made a quick patch to batch several netlink messages coming as
reply to the NLM_F_ECHO request into one single skbuff. If you look
at the _notify() functions in the kernel, this is currently taking
one single skbuff for one message which adds a bit of overhead (the
skbuff truesize is used for the socket buffer accounting). I'm
measuring here on x86_64 that each command takes 768 bytes. With a
quick patch I'm batching several reply netlink messages into one
single skbuff, now each commands takes ~120 bytes (well, size
depends on how many expressions you use actually). This means this
can handle ~3550 commands vs. the existing ~555 commands (assuming
very small sk_rmem_alloc 426240 as the one you're reporting).
Even if this does not fix your problem (you refer to 72k chars, not
sure how many commands this is), this is probably good to have
anyway, this decreasing memory consumption by 85%. This will also
make event reporting (monitor mode) more reliable through netlink
(it's the same codepath).
Note that userspace requires no changes to support batching mode:
libmnl's mnl_cb_run() keeps iterating over the buffer that was
received until all netlink messages are consumed.
The quick patch is incomplete, I just want to prove the saving in
terms of memory. I'll give it another spin and submit this for
review.
* Probably recover the cookie idea: firewalld specifies a cookie
that identifies the rule from userspace, so there is no need for
NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie,
this should be possible by looking up for the handle from the
cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA
so this is only meaningful to userspace. No kernel changes are
required (this is supported for a bit of time already).
Note: The cookie could be duplicated. Since the cookie is allocated
by userspace, it's up to userspace to ensure uniqueness. In case
cookies are duplicated, if you delete a rule by cookie then
Rule deletion by cookie requires dumping the whole ruleset though
(slow). However, I think firewalld keeps a rule cache, so it uses
the rule handle to delete the rule instead (no need to dump the
cache then). Therefore, the cookie is only used to fetch the rule
handle.
With the rule cookie in place, firewalld can send the batch, then
make a NLM_F_DUMP request after sending the batch to retrieve the
handle from the cookie instead of using NLM_F_ECHO. The batch send +
NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
dump is not atomic though.
Because NLM_F_ECHO is only used to get back the rule handle, right?
next prev parent reply other threads:[~2020-08-26 15:32 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-21 23:06 nfnetlink: Busy-loop in nfnetlink_rcv_msg() Phil Sutter
2020-08-22 18:46 ` Florian Westphal
2020-08-24 10:47 ` Pablo Neira Ayuso
2020-08-24 12:39 ` Florian Westphal
2020-08-24 13:11 ` Phil Sutter
2020-08-26 15:32 ` Pablo Neira Ayuso [this message]
2020-08-26 18:54 ` Eric Garver
2020-08-27 14:23 ` Phil Sutter
2020-08-27 17:50 ` Pablo Neira Ayuso
2020-08-23 12:04 ` Pablo Neira Ayuso
-- strict thread matches above, loose matches on Subject: below --
2020-08-23 11:55 [PATCH nf] netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS Pablo Neira Ayuso
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200826153219.GA2640@salvia \
--to=pablo@netfilter.org \
--cc=fw@strlen.de \
--cc=netfilter-devel@vger.kernel.org \
--cc=phil@nwl.cc \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).