From: Paolo Abeni <pabeni@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org, Soheil Hassas Yeganeh <soheil@google.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Carlos Maiolino <cmaiolino@redhat.com>,
Eric Biggers <ebiggers@kernel.org>,
Jacob Keller <jacob.e.keller@intel.com>,
Jens Axboe <axboe@kernel.dk>,
Christian Brauner <brauner@kernel.org>,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
Date: Wed, 08 Mar 2023 09:55:31 +0100 [thread overview]
Message-ID: <f049d74b59323ed2ad16a0b52de86f157ae353ce.camel@redhat.com> (raw)
In-Reply-To: <20230307133057.1904d8ffab2980f8e23ee3cc@linux-foundation.org>
On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@redhat.com> wrote:
>
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> >
> > 83.17% 0.25% nginx [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> > [...]
> > |--66.96%--__fput
> > |--60.04%--eventpoll_release_file
> > |--58.41%--__mutex_lock.isra.6
> > |--56.56%--osq_lock
> >
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> >
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> >
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> >
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
>
> "as dying"?
>
> > removing it, while EP file close() does not touch dying epitems.
>
> The need for this dying flag is somewhat unclear to me. I mean, if we
> have refcounting done correctly, why the need for this flag? Some
> additional description of the dynamics would be helpful.
>
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?
First thing first, thanks for the feedback!
Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The second must acquire the file->f_lock spinlock to
reach/access such struct pointer. Callers of __ep_remove need to
acquire first the ep->mtx, so eventpoll_release_file() must release the
spinlock after fetching the pointer and before acquiring the mutex.
Meanwhile, without the 'dying' flag, ep_clear_and_put() could kick-in,
eventually on a different CPU, drop all the ep references and free the
struct.
An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():
while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
epi = rb_entry(rbp, struct epitem, rbn);
ep_remove_safe(ep, epi);
cond_resched();
}
So that ep_clear_and_put() would not release all the ep references
anymore. That option has the downside of keeping the ep struct alive
for an unlimited time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:
https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/
Please let me know if the above is clear enough.
> > The eventpoll struct is released by whoever - among EP file close() and
> > and the monitored file close() drops its last reference.
> >
> > With all the above in place, we can drop the epmutex usage at disposal time.
> >
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> >
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup in a separate
> > ep_free() helper.
> >
> > ...
> >
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> >
> > ...
> >
> > + free_uid(ep->user);
> > + wakeup_source_unregister(ep->ws);
> > + kfree(ep);
> > +}
> > +
> > /*
> > * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> > * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
>
> This comment describes "what" the code does, which is obvious from the
> code anwyay. It's better if comments describe "why" the code does what
> it does.
What about appending the following?
"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""
(I'll keep the 'what' part to hopefully make the 'why' more clear)
> > + * Returns true if the eventpoll can be disposed.
> > */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> > {
> > struct file *file = epi->ffd.file;
> > struct epitems_head *to_free;
> >
> > ...
> >
> > /*
> > - * We don't want to get "file->f_lock" because it is not
> > - * necessary. It is not necessary because we're in the "struct file"
> > - * cleanup path, and this means that no one is using this file anymore.
> > - * So, for example, epoll_ctl() cannot hit here since if we reach this
> > - * point, the file counter already went to zero and fget() would fail.
> > - * The only hit might come from ep_free() but by holding the mutex
> > - * will correctly serialize the operation. We do need to acquire
> > - * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > - * from anywhere but ep_free().
> > - *
> > - * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > + * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
>
> s/cleat/clear/
>
> > + * touching the epitems list before eventpoll_release_file() can access
> > + * the ep->mtx.
> > */
> > - mutex_lock(&epmutex);
> > - if (unlikely(!file->f_ep)) {
> > - mutex_unlock(&epmutex);
> > - return;
> > - }
> > - hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > + spin_lock(&file->f_lock);
> > + if (file->f_ep && file->f_ep->first) {
> > + /* detach from ep tree */
>
> Comment appears to be misplaced - the following code doesn't detach
> anything?
Indeed. This is a left-over from a previous revision. Can be dropped.
I have a process question: I understand this is queued for the mm-
nonmm-unstable branch. Should I send a v5 with the above comments
changes or an incremental patch or something completely different?
Thanks!
Paolo
next prev parent reply other threads:[~2023-03-08 8:56 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-07 18:46 [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention Paolo Abeni
2023-03-07 21:17 ` Keller, Jacob E
2023-03-07 21:21 ` Soheil Hassas Yeganeh
2023-03-07 21:30 ` Andrew Morton
2023-03-08 8:55 ` Paolo Abeni [this message]
2023-03-08 18:40 ` Andrew Morton
2023-03-08 20:34 ` Paolo Abeni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f049d74b59323ed2ad16a0b52de86f157ae353ce.camel@redhat.com \
--to=pabeni@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=brauner@kernel.org \
--cc=cmaiolino@redhat.com \
--cc=ebiggers@kernel.org \
--cc=jacob.e.keller@intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=soheil@google.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).