From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C791DC433EF for ; Mon, 4 Oct 2021 18:43:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A0850613C8 for ; Mon, 4 Oct 2021 18:43:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238820AbhJDSoz (ORCPT ); Mon, 4 Oct 2021 14:44:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54374 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235630AbhJDSoz (ORCPT ); Mon, 4 Oct 2021 14:44:55 -0400 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C5D28C061745; Mon, 4 Oct 2021 11:43:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=M3MCGNtC32L47sTZ2qKzzynee6LhYMUb7wH0lcWQ2k0=; b=sy9wcq9CiqHPkeKE/Dc0z3FezB ZkphCwajzy21jd0HUQToVpYWTOPqnlrL50YgzfjSowWx18/ni/uKbEyeN91/WHf1Gqz0iAykFL23A zqP8JvtYKxZCVIBrxIpnxIF/yoMPfHFd1Un5LqkkoNN3cwapmI+u3UMZgTTgay/m5DtDo4v7T48xv 99Xc8VfRmnQ923fmeqISaQqmYthJkMf63qiBfgW3sygpkcmV3M4TEWTUmU7Fj0yCQRGgbD8WlvFFk 8fWUq6SKHJ1FBKXkniVVzcRh+lcp/ODJ9eXIm/ohW5nPpfilUbkBn65tcRiuJ6enlL7n8zadN+Tq9 6ftcVgwg==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mXSu8-00HCrh-Q9; Mon, 04 Oct 2021 18:41:51 +0000 Date: Mon, 4 Oct 2021 19:41:36 +0100 From: Matthew Wilcox To: Stephen Brennan Cc: Andrew Morton , Konrad Wilk , Alexander Viro , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH v6 0/1] proc: Allow pid_revalidate() during LOOKUP_RCU Message-ID: References: <20211004175629.292270-1-stephen.s.brennan@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Mon, Oct 04, 2021 at 07:20:14PM +0100, Matthew Wilcox wrote: > On Mon, Oct 04, 2021 at 10:56:28AM -0700, Stephen Brennan wrote: > > Problem Description: > > > > When running running ~128 parallel instances of "TZ=/etc/localtime ps > > -fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%, > > and perf shows the following code path as being responsible for heavy > > contention on the d_lockref spinlock: > > > > walk_component() > > lookup_fast() > > d_revalidate() > > pid_revalidate() // returns -ECHILD > > unlazy_child() > > lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention > > > > The reason is that pid_revalidate() is triggering a drop from RCU to ref > > path walk mode. All concurrent path lookups thus try to grab a reference > > to the dentry for /proc/, before re-executing pid_revalidate() and then > > stepping into the /proc/$pid directory. Thus there is huge spinlock > > contention. This patch allows pid_revalidate() to execute in RCU mode, > > meaning that the path lookup can successfully enter the /proc/$pid > > directory while still in RCU mode. Later on, the path lookup may still > > drop into ref mode, but the contention will be much reduced at this > > point. > > > > By applying this patch, %sys utilization falls to around 85% under the > > same workload, and the number of ps processes executed per unit time > > increases by 3x-4x. Although this particular workload is a bit > > contrived, we have seen some large collections of eager monitoring > > scripts which produced similarly high %sys time due to contention in the > > /proc directory. > > I think it's perhaps also worth noting that this is a performance > regression relative to ... v5.4? v4.14? I forget the details; do you > have those to hand, Stephen? > > (Yes, this is a stupid workload. Yes, a customer really does have > this workload.) OK, it's not a performance regression. My apologies; I misremembered the ticket. What happens with 4.14 is that the tasks all sleep on the directory's i_mutex. When i_mutex became i_rwsem, all the lookups would now contend on the dentry spinlock. That turns "lots of processes sleeping" into "lots of processes spinning", which looks like a regression if you're asking "Why has my system time increased a lot?"