Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Paolo Abeni <pabeni@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org, Soheil Hassas Yeganeh <soheil@google.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Carlos Maiolino <cmaiolino@redhat.com>,
	Eric Biggers <ebiggers@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Jens Axboe <axboe@kernel.dk>,
	Christian Brauner <brauner@kernel.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
Date: Wed, 08 Mar 2023 09:55:31 +0100	[thread overview]
Message-ID: <f049d74b59323ed2ad16a0b52de86f157ae353ce.camel@redhat.com> (raw)
In-Reply-To: <20230307133057.1904d8ffab2980f8e23ee3cc@linux-foundation.org>

On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue,  7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> > 
> >  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> > [...]
> >            |--66.96%--__fput
> >                       |--60.04%--eventpoll_release_file
> >                                  |--58.41%--__mutex_lock.isra.6
> >                                            |--56.56%--osq_lock
> > 
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> > 
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> > 
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> > 
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
> 
> "as dying"?
> 
> > removing it, while EP file close() does not touch dying epitems.
> 
> The need for this dying flag is somewhat unclear to me.  I mean, if we
> have refcounting done correctly, why the need for this flag?  Some
> additional description of the dynamics would be helpful.
> 
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?

First thing first, thanks for the feedback!

Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The second must acquire the file->f_lock spinlock to
reach/access such struct pointer. Callers of __ep_remove need to
acquire first the ep->mtx, so eventpoll_release_file() must release the
spinlock after fetching the pointer and before acquiring the mutex.

Meanwhile, without the 'dying' flag, ep_clear_and_put() could kick-in,
eventually on a different CPU, drop all the ep references and free the
struct. 
An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():

	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
                epi = rb_entry(rbp, struct epitem, rbn);
                ep_remove_safe(ep, epi);
                cond_resched();
        }

So that ep_clear_and_put() would not release all the ep references
anymore. That option has the downside of keeping the ep struct alive
for an unlimited time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:

https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/

Please let me know if the above is clear enough.

> > The eventpoll struct is released by whoever - among EP file close() and
> > and the monitored file close() drops its last reference.
> > 
> > With all the above in place, we can drop the epmutex usage at disposal time.
> > 
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> > 
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup in a separate
> > ep_free() helper.
> > 
> > ...
> > 
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > 
> > ...
> > 
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> > +
> >  /*
> >   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> >   * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
> 
> This comment describes "what" the code does, which is obvious from the
> code anwyay.  It's better if comments describe "why" the code does what
> it does.

What about appending the following?

"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""

(I'll keep the 'what' part to hopefully make the 'why' more clear)

> > + * Returns true if the eventpoll can be disposed.
> >   */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> >  {
> >  	struct file *file = epi->ffd.file;
> >  	struct epitems_head *to_free;
> > 
> > ...
> > 
> >  	/*
> > -	 * We don't want to get "file->f_lock" because it is not
> > -	 * necessary. It is not necessary because we're in the "struct file"
> > -	 * cleanup path, and this means that no one is using this file anymore.
> > -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> > -	 * point, the file counter already went to zero and fget() would fail.
> > -	 * The only hit might come from ep_free() but by holding the mutex
> > -	 * will correctly serialize the operation. We do need to acquire
> > -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > -	 * from anywhere but ep_free().
> > -	 *
> > -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> 
> s/cleat/clear/
> 
> > +	 * touching the epitems list before eventpoll_release_file() can access
> > +	 * the ep->mtx.
> >  	 */
> > -	mutex_lock(&epmutex);
> > -	if (unlikely(!file->f_ep)) {
> > -		mutex_unlock(&epmutex);
> > -		return;
> > -	}
> > -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > +	spin_lock(&file->f_lock);
> > +	if (file->f_ep && file->f_ep->first) {
> > +		/* detach from ep tree */
> 
> Comment appears to be misplaced - the following code doesn't detach
> anything?

Indeed. This is a left-over from a previous revision. Can be dropped.


I have a process question: I understand this is queued for the mm-
nonmm-unstable branch. Should I send a v5 with the above comments
changes or an incremental patch or something completely different?

Thanks!

Paolo

next prev parent reply	other threads:[~2023-03-08  8:56 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-07 18:46 [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention Paolo Abeni
2023-03-07 21:17 ` Keller, Jacob E
2023-03-07 21:21   ` Soheil Hassas Yeganeh
2023-03-07 21:30 ` Andrew Morton
2023-03-08  8:55   ` Paolo Abeni [this message]
2023-03-08 18:40     ` Andrew Morton
2023-03-08 20:34       ` Paolo Abeni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f049d74b59323ed2ad16a0b52de86f157ae353ce.camel@redhat.com \
    --to=pabeni@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=cmaiolino@redhat.com \
    --cc=ebiggers@kernel.org \
    --cc=jacob.e.keller@intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=soheil@google.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).