From: Eric Dumazet <edumazet@google.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
Linus Torvalds <torvalds@linuxfoundation.org>,
x86@kernel.org, Wangyang Guo <wangyang.guo@intel.com>,
Arjan van De Ven <arjan@linux.intel.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
netdev@vger.kernel.org, Will Deacon <will@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Boqun Feng <boqun.feng@gmail.com>,
Mark Rutland <mark.rutland@arm.com>,
Marc Zyngier <maz@kernel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
Date: Tue, 28 Feb 2023 16:07:01 +0100 [thread overview]
Message-ID: <CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com> (raw)
In-Reply-To: <20230228132118.978145284@linutronix.de>
On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.
We have per-cpu or per-tcp-socket dst though.
Input path is RCU and does not touch dst refcnt.
In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.
So it would be nice knowing in which case you noticed some issues,
maybe there is something wrong in some layer.
>
> This happens when there are a large amount of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwith limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
> 1) False sharing
>
> dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
> it into a seperate cacheline vs. the read mostly members located at
> the beginning of the struct.
>
> That prevents false sharing vs. the struct members in the first 64
> bytes of the structure, but there is also
>
> dst_entry::lwtstate
>
> which is located after the reference count and in the same cache
> line. This member is read after a reference count has been acquired.
>
> The other problem is struct rtable, which embeds a struct dst_entry
> at offset 0. struct dst_entry has a size of 112 bytes, which means
> that the struct members of rtable which follow the dst member share
> the same cache line as dst_entry::__refcnt. Especially
>
> rtable::rt_genid
>
> is also read by the contexts which have a reference count acquired
> already.
>
> When dst_entry:__refcnt is incremented or decremented via an atomic
> operation these read accesses stall and contribute to the performance
> problem.
In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.
>
> 2) atomic_inc_not_zero()
>
> A reference on dst_entry:__refcnt is acquired via
> atomic_inc_not_zero() and released via atomic_dec_return().
>
> atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
> which exposes O(N^2) behaviour under contention with N concurrent
> operations.
>
> Lightweight instrumentation exposed an average of 8!! retry loops per
> atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> running concurrently on 112 CPUs.
User space benchmark <> kernel space.
And we tend not using 112 cpus for kernel stack processing.
Again, concurrent dst->refcnt changes are quite unlikely.
>
> There is nothing which can be done to make atomic_inc_not_zero() more
> scalable.
>
> The following series addresses these issues:
>
> 1) Reorder and pad struct dst_entry to prevent the false sharing.
>
> 2) Implement and use a reference count implementation which avoids the
> atomic_inc_not_zero() problem.
>
> It is slightly less performant in the case of the final 1 -> 0
> transition, but the deconstruction of these objects is a low
> frequency event. get()/put() pairs are in the hotpath and that's
> what this implementation optimizes for.
>
> The algorithm of this reference count is only suitable for RCU
> managed objects. Therefore it cannot replace the refcount_t
> algorithm, which is also based on atomic_inc_not_zero(), due to a
> subtle race condition related to the 1 -> 0 transition and the final
> verdict to mark the reference count dead. See details in patch 2/3.
>
> It might be just my lack of imagination which declares this to be
> impossible and I'd be happy to be proven wrong.
>
> As a bonus the new rcuref implementation provides underflow/overflow
> detection and mitigation while being performance wise on par with
> open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
> the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!
Can you elaborate on what networking benchmark you have used,
and what actual gains you got ?
In which path access to dst->lwtstate proved to be a problem ?
>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?
Initial feeling is that we need more details on the real workload.
Is it some degenerated case with one connected UDP socket used by
multiple threads ?
To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place.
Not the other way (noticing there is an issue, enquire networking
folks about it, before designing a solution)
Thanks
>
> Thanks,
>
> tglx
> ---
> include/linux/rcuref.h | 89 +++++++++++
> include/linux/types.h | 6
> include/net/dst.h | 21 ++
> include/net/sock.h | 2
> lib/Makefile | 2
> lib/rcuref.c | 311 ++++++++++++++++++++++++++++++++++++++++
> net/bridge/br_nf_core.c | 2
> net/core/dst.c | 26 ---
> net/core/rtnetlink.c | 2
> net/ipv6/route.c | 6
> net/netfilter/ipvs/ip_vs_xmit.c | 4
> 11 files changed, 436 insertions(+), 35 deletions(-)
next prev parent reply other threads:[~2023-02-28 15:07 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
2023-02-28 15:17 ` Eric Dumazet
2023-02-28 21:31 ` Thomas Gleixner
2023-03-01 23:12 ` Thomas Gleixner
2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
2023-03-01 0:42 ` Linus Torvalds
2023-03-01 1:07 ` Linus Torvalds
2023-03-01 11:09 ` Thomas Gleixner
2023-03-02 1:05 ` Thomas Gleixner
2023-03-02 1:29 ` Randy Dunlap
2023-03-02 19:36 ` Linus Torvalds
2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
2023-02-28 15:07 ` Eric Dumazet [this message]
2023-02-28 16:38 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 16:59 ` Eric Dumazet
2023-03-01 1:00 ` Thomas Gleixner
2023-03-01 3:17 ` Jakub Kicinski
2023-03-01 10:40 ` Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com \
--to=edumazet@google.com \
--cc=arjan@linux.intel.com \
--cc=boqun.feng@gmail.com \
--cc=davem@davemloft.net \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linuxfoundation.org \
--cc=wangyang.guo@intel.com \
--cc=will@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).