From: Thomas Gleixner <tglx@linutronix.de>
To: Eric Dumazet <edumazet@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
Linus Torvalds <torvalds@linuxfoundation.org>,
x86@kernel.org, Wangyang Guo <wangyang.guo@intel.com>,
Arjan van De Ven <arjan@linux.intel.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
netdev@vger.kernel.org, Will Deacon <will@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Boqun Feng <boqun.feng@gmail.com>,
Mark Rutland <mark.rutland@arm.com>,
Marc Zyngier <maz@kernel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
Date: Tue, 28 Feb 2023 17:38:25 +0100 [thread overview]
Message-ID: <87h6v5n3su.ffs@tglx> (raw)
In-Reply-To: <CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com>
Eric!
On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Hi!
>>
>> Wangyang and Arjan reported a bottleneck in the networking code related to
>> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
>> a dst_entry increases.
>
> We have per-cpu or per-tcp-socket dst though.
>
> Input path is RCU and does not touch dst refcnt.
>
> In real workloads (200Gbit NIC and above), we do not observe
> contention on a dst refcnt.
>
> So it would be nice knowing in which case you noticed some issues,
> maybe there is something wrong in some layer.
Two lines further down I explained which benchmark was used, no?
>> This happens when there are a large amount of connections to or from the
>> same IP address. The memtier benchmark when run on the same host as
>> memcached amplifies this massively. But even over real network connections
>> this issue can be observed at an obviously smaller scale (due to the
>> network bandwith limitations in my setup, i.e. 1Gb).
>> atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
>> which exposes O(N^2) behaviour under contention with N concurrent
>> operations.
>>
>> Lightweight instrumentation exposed an average of 8!! retry loops per
>> atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>> running concurrently on 112 CPUs.
>
> User space benchmark <> kernel space.
I know that. The point was to illustrate the non-scalability.
> And we tend not using 112 cpus for kernel stack processing.
>
> Again, concurrent dst->refcnt changes are quite unlikely.
So unlikely that they stand out in that particular benchmark.
>> The overall gain of both changes for localhost memtier ranges from 1.2X to
>> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>>
>> A micro benchmark which enforces maximized concurrency shows a gain between
>> 1.2X and 4.7X!!!
>
> Can you elaborate on what networking benchmark you have used,
> and what actual gains you got ?
I'm happy to repeat here that it was memtier/memcached as I explained
more than once in the cover letter.
> In which path access to dst->lwtstate proved to be a problem ?
ip_finish_output2()
if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read
> To me, this looks like someone wanted to push a new piece of infra
> (include/linux/rcuref.h)
> and decided that dst->refcnt would be a perfect place.
>
> Not the other way (noticing there is an issue, enquire networking
> folks about it, before designing a solution)
We looked at this because the reference count operations stood out in
perf top and we analyzed it down to the false sharing _and_ the
non-scalability of atomic_inc_not_zero().
That's what made me to think about a different implementation and yes,
the case at hand, which happens to be in the network code, allowed me to
verify that it actually works and scales better.
I'm terrible sorry, that I missed to first ask network people for
permission to think. Won't happen again.
Thanks,
tglx
next prev parent reply other threads:[~2023-02-28 16:39 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
2023-02-28 15:17 ` Eric Dumazet
2023-02-28 21:31 ` Thomas Gleixner
2023-03-01 23:12 ` Thomas Gleixner
2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
2023-03-01 0:42 ` Linus Torvalds
2023-03-01 1:07 ` Linus Torvalds
2023-03-01 11:09 ` Thomas Gleixner
2023-03-02 1:05 ` Thomas Gleixner
2023-03-02 1:29 ` Randy Dunlap
2023-03-02 19:36 ` Linus Torvalds
2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
2023-02-28 15:07 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Eric Dumazet
2023-02-28 16:38 ` Thomas Gleixner [this message]
2023-02-28 16:59 ` Eric Dumazet
2023-03-01 1:00 ` Thomas Gleixner
2023-03-01 3:17 ` Jakub Kicinski
2023-03-01 10:40 ` Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h6v5n3su.ffs@tglx \
--to=tglx@linutronix.de \
--cc=arjan@linux.intel.com \
--cc=boqun.feng@gmail.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=peterz@infradead.org \
--cc=torvalds@linuxfoundation.org \
--cc=wangyang.guo@intel.com \
--cc=will@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox