netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <edumazet@google.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linuxfoundation.org>,
	x86@kernel.org, Wangyang Guo <wangyang.guo@intel.com>,
	Arjan van De Ven <arjan@linux.intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	netdev@vger.kernel.org, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
Date: Tue, 28 Feb 2023 16:07:01 +0100	[thread overview]
Message-ID: <CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com> (raw)
In-Reply-To: <20230228132118.978145284@linutronix.de>

On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.

We have per-cpu or per-tcp-socket dst though.

Input path is RCU and does not touch dst refcnt.

In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.

So it would be nice knowing in which case you noticed some issues,
maybe there is something wrong in some layer.

>
> This happens when there are a large amount of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwith limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
>    1) False sharing
>
>       dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
>       it into a seperate cacheline vs. the read mostly members located at
>       the beginning of the struct.
>
>       That prevents false sharing vs. the struct members in the first 64
>       bytes of the structure, but there is also
>
>             dst_entry::lwtstate
>
>       which is located after the reference count and in the same cache
>       line. This member is read after a reference count has been acquired.
>
>       The other problem is struct rtable, which embeds a struct dst_entry
>       at offset 0. struct dst_entry has a size of 112 bytes, which means
>       that the struct members of rtable which follow the dst member share
>       the same cache line as dst_entry::__refcnt. Especially
>
>           rtable::rt_genid
>
>       is also read by the contexts which have a reference count acquired
>       already.
>
>       When dst_entry:__refcnt is incremented or decremented via an atomic
>       operation these read accesses stall and contribute to the performance
>       problem.

In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.

>
>    2) atomic_inc_not_zero()
>
>       A reference on dst_entry:__refcnt is acquired via
>       atomic_inc_not_zero() and released via atomic_dec_return().
>
>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
>       which exposes O(N^2) behaviour under contention with N concurrent
>       operations.
>
>       Lightweight instrumentation exposed an average of 8!! retry loops per
>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>       running concurrently on 112 CPUs.

User space benchmark <> kernel space.
And we tend not using 112 cpus for kernel stack processing.

Again, concurrent dst->refcnt changes are quite unlikely.

>
>       There is nothing which can be done to make atomic_inc_not_zero() more
>       scalable.
>
> The following series addresses these issues:
>
>     1) Reorder and pad struct dst_entry to prevent the false sharing.
>
>     2) Implement and use a reference count implementation which avoids the
>        atomic_inc_not_zero() problem.
>
>        It is slightly less performant in the case of the final 1 -> 0
>        transition, but the deconstruction of these objects is a low
>        frequency event. get()/put() pairs are in the hotpath and that's
>        what this implementation optimizes for.
>
>        The algorithm of this reference count is only suitable for RCU
>        managed objects. Therefore it cannot replace the refcount_t
>        algorithm, which is also based on atomic_inc_not_zero(), due to a
>        subtle race condition related to the 1 -> 0 transition and the final
>        verdict to mark the reference count dead. See details in patch 2/3.
>
>        It might be just my lack of imagination which declares this to be
>        impossible and I'd be happy to be proven wrong.
>
>        As a bonus the new rcuref implementation provides underflow/overflow
>        detection and mitigation while being performance wise on par with
>        open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
>        the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!

Can you elaborate on what networking benchmark you have used,
and what actual gains you got ?

In which path access to dst->lwtstate proved to be a problem ?

>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?

Initial feeling is that we need more details on the real workload.

Is it some degenerated case with one connected UDP socket used by
multiple threads ?

To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place.
Not the other way (noticing there is an issue, enquire networking
folks about it, before designing a solution)

Thanks



>
> Thanks,
>
>         tglx
> ---
>  include/linux/rcuref.h          |   89 +++++++++++
>  include/linux/types.h           |    6
>  include/net/dst.h               |   21 ++
>  include/net/sock.h              |    2
>  lib/Makefile                    |    2
>  lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
>  net/bridge/br_nf_core.c         |    2
>  net/core/dst.c                  |   26 ---
>  net/core/rtnetlink.c            |    2
>  net/ipv6/route.c                |    6
>  net/netfilter/ipvs/ip_vs_xmit.c |    4
>  11 files changed, 436 insertions(+), 35 deletions(-)

  parent reply	other threads:[~2023-02-28 15:07 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
2023-02-28 15:17   ` Eric Dumazet
2023-02-28 21:31     ` Thomas Gleixner
2023-03-01 23:12       ` Thomas Gleixner
2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
2023-03-01  0:42   ` Linus Torvalds
2023-03-01  1:07     ` Linus Torvalds
2023-03-01 11:09     ` Thomas Gleixner
2023-03-02  1:05       ` Thomas Gleixner
2023-03-02  1:29         ` Randy Dunlap
2023-03-02 19:36         ` Linus Torvalds
2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
2023-02-28 15:07 ` Eric Dumazet [this message]
2023-02-28 16:38   ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 16:59     ` Eric Dumazet
2023-03-01  1:00       ` Thomas Gleixner
2023-03-01  3:17         ` Jakub Kicinski
2023-03-01 10:40           ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com \
    --to=edumazet@google.com \
    --cc=arjan@linux.intel.com \
    --cc=boqun.feng@gmail.com \
    --cc=davem@davemloft.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=maz@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linuxfoundation.org \
    --cc=wangyang.guo@intel.com \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).