Message-ID: <819762b6eb549f74d0ebbb6663f042ae9b6cd86d.camel@redhat.com>
Subject: Re: [REPOST PATCH] epoll: use refcount to reduce ep_mutex contention
From: Paolo Abeni
To: Soheil Hassas Yeganeh
Cc: linux-fsdevel@vger.kernel.org, Al Viro, Davidlohr Bueso, Jason Baron, Roman Penyaev, netdev@vger.kernel.org, Carlos Maiolino
Date: Tue, 22 Nov 2022 17:59:18 +0100
X-Mailing-List: linux-fsdevel@vger.kernel.org

Hello,

Thank you for the prompt feedback!

On Tue, 2022-11-22 at 11:18 -0500, Soheil Hassas Yeganeh wrote:
> On Tue, Nov 22, 2022 at 10:43 AM Paolo Abeni wrote:
> >
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> >
> >  83.17%  0.25%  nginx  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
> > [...]
> >    |--66.96%--__fput
> >       |--60.04%--eventpoll_release_file
> >          |--58.41%--__mutex_lock.isra.6
> >             |--56.56%--osq_lock
> >
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> >
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> >
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time. ep_free() doesn't touch
> > anymore the event RB tree, it just unregisters the existing callbacks
> > and drops a reference to the ep struct.
> > The struct itself is freed when the reference count reaches 0. The
> > reference count updates are protected by the mtx mutex so no additional
> > atomic operations are needed.
> >
> > Since ep_free() can't compete anymore with eventpoll_release_file()
> > for epitems removal, we can drop the epmutex usage at disposal time.
> >
> > With the patched kernel, in the same connection/rate scenario, the mutex
> > operations disappear from the perf report, and the measured connections/rate
> > grows by ~60%.
>
> I locally tried this patch and I can reproduce the results. Thank you
> for the nice optimization!
>
> > Tested-by: Xiumei Mu
> > Signed-off-by: Paolo Abeni
> > ---
> > This is just a repost reaching out for more recipients,
> > as suggested by Carlos.
> >
> > Previous post at:
> >
> > https://lore.kernel.org/linux-fsdevel/20221122102726.4jremle54zpcapia@andromeda/T/#m6f98d4ccbe0a385d10c04fd4018e782b793944e6
> > ---
> >  fs/eventpoll.c | 113 ++++++++++++++++++++++++++++---------------------
> >  1 file changed, 64 insertions(+), 49 deletions(-)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index 52954d4637b5..6e415287aeb8 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -226,6 +226,12 @@ struct eventpoll {
> >  	/* tracks wakeup nests for lockdep validation */
> >  	u8 nests;
> >  #endif
> > +
> > +	/*
> > +	 * protected by mtx, used to avoid races between ep_free() and
> > +	 * ep_eventpoll_release()
> > +	 */
> > +	unsigned int refcount;
>
> nitpick: Given that napi_id and nest are both macro protected, you
> might want to pull it right after min_wait_ts.

Just to be on the same page: the above is just for an aesthetic reason,
right? Is there some functional aspect I don't see?

[...]
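
For reference, the refcounting scheme described in the changelog above can
be sketched in userspace roughly as follows. This is only an illustration,
not the patch itself: struct ep_sketch and the ep_sketch_*() helpers are
made-up names, a pthread mutex stands in for ep->mtx, and a plain counter
stands in for the RB tree of registered items.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace stand-in for struct eventpoll; all names here are hypothetical. */
struct ep_sketch {
	pthread_mutex_t mtx;	/* plays the role of ep->mtx */
	unsigned int refcount;	/* protected by mtx, so no atomics are needed */
	unsigned int nr_items;	/* stand-in for the RB tree of registered items */
};

static struct ep_sketch *ep_sketch_alloc(void)
{
	struct ep_sketch *ep = calloc(1, sizeof(*ep));

	if (!ep)
		return NULL;
	pthread_mutex_init(&ep->mtx, NULL);
	ep->refcount = 1;	/* reference held by the epoll file itself */
	return ep;
}

/* Caller must hold ep->mtx. Returns true if the last reference was dropped. */
static bool ep_sketch_put_locked(struct ep_sketch *ep)
{
	assert(ep->refcount > 0);
	return --ep->refcount == 0;
}

/* Registering an item takes a reference (the insert side). */
static void ep_sketch_insert(struct ep_sketch *ep)
{
	pthread_mutex_lock(&ep->mtx);
	ep->nr_items++;
	ep->refcount++;
	pthread_mutex_unlock(&ep->mtx);
}

/* Removing an item drops its reference; true means the caller must dispose. */
static bool ep_sketch_remove(struct ep_sketch *ep)
{
	bool last;

	pthread_mutex_lock(&ep->mtx);
	ep->nr_items--;
	last = ep_sketch_put_locked(ep);
	pthread_mutex_unlock(&ep->mtx);
	return last;
}

/* Closing the epoll fd only drops the file's own reference. */
static bool ep_sketch_release(struct ep_sketch *ep)
{
	bool last;

	pthread_mutex_lock(&ep->mtx);
	last = ep_sketch_put_locked(ep);
	pthread_mutex_unlock(&ep->mtx);
	return last;
}

/* Only called by whoever observed the refcount reaching zero. */
static void ep_sketch_dispose(struct ep_sketch *ep)
{
	pthread_mutex_destroy(&ep->mtx);
	free(ep);
}
```

The point of the sketch is that whoever drops the last reference, either
the final item removal or the epoll file release, is the one that frees
the struct, so the two paths never need to serialize on a global mutex.
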
> > @@ -2165,10 +2174,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
> >  			error = -EEXIST;
> >  		break;
> >  	case EPOLL_CTL_DEL:
> > -		if (epi)
> > -			error = ep_remove(ep, epi);
> > -		else
> > +		if (epi) {
> > +			/*
> > +			 * The eventpoll itself is still alive: the refcount
> > +			 * can't go to zero here.
> > +			 */
> > +			WARN_ON_ONCE(ep_remove(ep, epi));
>
> There are similar examples of calling ep_remove() without checking the
> return value in ep_insert().

Yes, the error paths in ep_insert(). I added a comment referring to all
of them, trying to explain that ep_dispose() is not needed there.

> I believe we should add a similar comment there, and maybe a
> WARN_ON_ONCE. I'm not sure, but it might be worth adding a new helper
> given this repeated pattern?

I like the idea of such a helper. I'll use it in the next iteration, if
there is reasonable agreement on this patch. Would 'ep_remove_safe()'
fit as the helper's name?

Thanks,

Paolo
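
As a rough idea of what such a helper could look like, here is a userspace
mock-up. It assumes ep_remove() returns true when it drops the last
reference, which is what the WARN_ON_ONCE() in the hunk above implies.
ep_mock, ep_remove_mock(), ep_remove_safe_mock() and warn_on_once() are
placeholders for illustration only, and the real kernel WARN_ON_ONCE()
fires once per call site rather than once globally as it does here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct eventpoll: only the refcount matters. */
struct ep_mock {
	unsigned int refcount;
};

/*
 * Simplified userspace approximation of WARN_ON_ONCE(): print the warning
 * the first time the condition is true and always return the condition.
 */
static bool warn_on_once(bool cond, const char *what)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Stand-in for the patched ep_remove(): drop one reference and report
 * whether it was the last one (assumed return convention).
 */
static bool ep_remove_mock(struct ep_mock *ep)
{
	assert(ep->refcount > 0);
	return --ep->refcount == 0;
}

/*
 * For callers that know the eventpoll is still alive: dropping the last
 * reference here would be a bug, so warn instead of disposing.
 */
static void ep_remove_safe_mock(struct ep_mock *ep)
{
	warn_on_once(ep_remove_mock(ep),
		     "eventpoll refcount unexpectedly reached zero");
}
```

With a helper of this shape, EPOLL_CTL_DEL and the ep_insert() error paths
could share one call instead of repeating the comment and the check.
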