From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Dmitry Safonov <dima@arista.com>
Cc: David Ahern <dsahern@gmail.com>,
linux-kernel@vger.kernel.org,
Alexander Duyck <alexander.h.duyck@linux.intel.com>,
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
Ido Schimmel <idosch@mellanox.com>,
netdev@vger.kernel.org
Subject: Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
Date: Tue, 26 Mar 2019 20:33:44 -0700 [thread overview]
Message-ID: <20190327033344.GW4102@linux.ibm.com> (raw)
In-Reply-To: <d77c86d7-23da-c301-4443-08e9988ac801@arista.com>

On Tue, Mar 26, 2019 at 11:14:43PM +0000, Dmitry Safonov wrote:
> On 3/26/19 3:39 PM, David Ahern wrote:
> > On 3/26/19 9:30 AM, Dmitry Safonov wrote:
> >> The fib trie has a hard-coded sync_pages limit for calling
> >> synchronize_rcu(). The limit is 128 pages, i.e. 512KB in the
> >> common case of 4KB pages.
> >>
> >> Unfortunately, at Arista we have use cases with full-view software
> >> forwarding. At the scale of 100K and more routes, even on 2-core
> >> boxes the hard-coded limit starts shooting us in the foot: the
> >> lockup detector notices that rtnl_lock is held for seconds.
> >> The first reason is the previously broken MAX_WORK, which didn't
> >> limit pending balancing work. While fixing it, I noticed that the
> >> bottleneck is actually the number of synchronize_rcu() calls.
> >>
> >> I've tried to fix it with a patch that decrements the number of
> >> tnodes in an rcu callback, but it didn't affect performance much.
> >>
> >> One possible way to "fix" it is to provide another sysctl to
> >> control sync_pages, but in my view that's nasty: it exposes
> >> another implementation detail to user-space.
> >
> > well, that was accepted last week. ;-)
> >
> > commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> > Author: David Ahern <dsahern@gmail.com>
> > Date: Wed Mar 20 09:18:59 2019 -0700
> >
> > ipv4: Allow amount of dirty memory from fib resizing to be controllable
> >
> >
> > Can you see how that change (should backport easily) affects your test
> > case? From my perspective 16MB was the sweet spot.
>
> FWIW, I would like to +Cc Paul here.
>
> TL;DR: David and I are looking into ways to improve the hard-coded
> tnode_free_size limit in net/ipv4/fib_trie.c: currently it's way too
> low (512KB). David created a patch that provides a sysctl to control
> the limit, which would solve the problem for both of us. In parallel,
> I thought that exposing this to userspace is not much fun and added a
> shrinker with synchronize_rcu(). I'm not at all sure the latter is
> actually a sane solution.
> Is there any guarantee that memory to-be freed by call_rcu() will get
> freed in OOM conditions? Might there be a chance that we don't need any
> limit here at all?

Yes, unless whatever is causing the OOM is also stalling a CPU or task
that RCU is waiting on. The extreme case is of course when the OOM is
in fact being caused by the fact that RCU is waiting on a stalled CPU
or task. Of course, the fact that the CPU or task is being stalled is
a bug in its own right.

So, in the absence of bugs, yes, the memory that was passed to call_rcu()
should be freed within a reasonable length of time, even under OOM
conditions.

> Worth mentioning that I'm not arguing against David's patch; as I
> pointed out, it would (will) solve the problem for both of us. But
> with good intentions I'm wondering whether we can do something here
> other than a new sysctl knob.

An intermediate position would be to use a reasonably high default
setting, so that the sysctl knob would almost never need to be adjusted.
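For reference, tuning the knob from David's merged commit (9ab948a91b2c,
quoted above) might look like the following. This is a hedged sketch:
the sysctl name net.ipv4.fib_sync_mem is taken from that commit's
changes as I understand them; verify it exists on your kernel before
relying on it.

```shell
# Inspect the current budget (bytes of pending fib memory before a
# forced synchronize_rcu()); name assumed from commit 9ab948a91b2c:
sysctl net.ipv4.fib_sync_mem

# Raise it to the 16MB "sweet spot" David mentions above:
sysctl -w net.ipv4.fib_sync_mem=16777216
```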

RCU used to detect OOM conditions and work harder to finish the grace
period in those cases, but this was abandoned because it was found not
to make a significant difference in production. Which might support
the position of assuming that memory passed to call_rcu() gets freed
reasonably quickly even under OOM conditions.

							Thanx, Paul