Subject: Re: [PATCH] fs: mark lookup_slow() as noinline
From: Mateusz Guzik
Date: Wed, 26 Nov 2025 12:31:21 +0100
To: Christian Brauner
Cc: viro@zeniv.linux.org.uk, jack@suse.cz, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
In-Reply-To: <20251126-vermachen-sahne-c4f243016180@brauner>

On Wed, Nov 26, 2025 at 11:08 AM Christian Brauner wrote:
>
> On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> > I'm going to save a rant about benchmarking changes like these in the
> > current kernel for another day.
>
> Without knocking any tooling that we currently have but I don't think we
> have _meaningful_ performance testing - especially not automated.

so *this* is the day? ;)

Even if one were to pretend for a minute that an excellent benchmark
suite for the vfs exists and is being used here, it would still fail to
spot numerous pessimizations.

To give you an example, legitimize_mnt() has an smp_mb() fence which
makes it visible when profiling things like access(2) or stat(2)
(somewhere around 2% on my profiles). However, if one were to whack the
fence just to check whether it is worth writing a real patch, access(2)
perf would increase a little bit while stat(2) would remain virtually
the same.

I am not speculating here, I did it. stat() for me is just shy of 4 mln
ops/s (a toy loop to reproduce the ballpark is sketched below).
Patching the kernel with a tunable to optionally skip the smp_mb()
fence pushes legitimize_mnt() way down the profile while *not*
increasing performance -- the win is eaten by stalls elsewhere (perf
*does* increase for access(2), which is less shafted). This is why the
path walking benches I posted are all lifted from access() usage as
opposed to stat(), btw.

Or to put it differently: stat(2) is already gimped and you can keep
adding slowdowns without measurably altering anything, but that's only
because the CPU is already stalled big time while executing the
codepath.

Part of the systemic problem is the pesky 'rep movsq/rep stosq' usage
by gcc, notably emitted for stat (see vfs_getattr_nosec(); a toy
example of the kind of copy involved is below). It is my understanding
that future versions of the compiler will fix it, but that's still
years of damage even if someone updates the kernel in their distro, so
that's "nice". The good news is that clang does not do it, but it also
optimizes things differently in other ways, so it may not even be
representative of what people will see with gcc.

Rant on that front aside, I don't know what would encompass a good test
suite.
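To illustrate the rep movsq point (the struct name and size below are
made up for illustration; this is not actual kernel code, only in the
same size ballpark as struct kstat):

	/*
	 * Toy example of the kind of bulk struct copy at issue. With
	 * kernel-style flags (-O2, no SSE) gcc has tended to lower a
	 * copy like this to 'rep movsq', which has a sizable startup
	 * cost, while clang emits a plain sequence of mov instructions.
	 */
	struct kstat_sized {
		unsigned long fields[20];	/* ~160 bytes, made up */
	};

	void copy_attrs(struct kstat_sized *dst,
			const struct kstat_sized *src)
	{
		*dst = *src;	/* gcc: often 'rep movsq'; clang: moves */
	}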
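And the kind of toy loop behind ops/s numbers like the above -- a
single-threaded userspace sketch, with the path and iteration count
picked arbitrarily:

	/* stat(2) tight loop; build with: cc -O2 statloop.c -o statloop */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <time.h>

	int main(void)
	{
		struct stat st;
		struct timespec a, b;
		long i, iters = 10 * 1000 * 1000;

		clock_gettime(CLOCK_MONOTONIC, &a);
		for (i = 0; i < iters; i++)
			stat("/etc/passwd", &st);	/* arbitrary path */
		clock_gettime(CLOCK_MONOTONIC, &b);

		double secs = (b.tv_sec - a.tv_sec) +
			      (b.tv_nsec - a.tv_nsec) / 1e9;
		printf("%.2f mln ops/s\n", iters / secs / 1e6);
		return 0;
	}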
I am, however, confident it would include real-life usage lifted from
actual workloads for microbenchmarking purposes, like what I did with
access() vs gcc. A better quality bench for path lookup would involve
all the syscalls invoked by gcc which perform it, but per the above the
current state of the kernel would downplay improvements to next to
nothing.

Inspired by this little thing: https://lkml.org/lkml/2015/5/19/1009

... I was screwing around with going through *all* vfs syscalls,
ordered in a way which provides the biggest data and instruction cache
busting potential. Non-vfs code is specifically not called, so as not
to be shafted by slowdowns elsewhere. It's not ready, but definitely
worth exploring.

I know there are some big bench suites out there (AIM?) but they look
weirdly unmaintained and I never verified whether they do what they
claim. Microbenchmarks like will-it-scale are missing syscall coverage
(for example: no readlink or symlink), the syscalls which are covered
have spotty usage (for example there is a bench for parallel rw open of
a file, while opening *ro* is more common and has different
scalability), and even ignoring all that, the lookups are done against
/tmp/willitscale.XXXXXX. That's not representative of most real lookups
in that there are few path components *and* one of them is unusually
long. And so on.

That rant also aside:

1. concerning legitimize_mnt(): I strongly suspect the fence can be
avoided by guaranteeing that clearing ->mnt_ns waits out an rcu grace
period before issuing mntput(). The question is how painful that would
be to implement (a rough sketch of the idea is at the end of this
mail).

2. concerning stat(): the current code boils down to going to statx,
telling it to not fill some of the fields, getting some fields stat is
not going to look at anyway, and finally converting the result to a
userspace-compatible layout. The last bit is universal across unix
kernels afaics; curious how that happened. Anyway, my idea here is to
instead implement a ->stat inode op which would fill in 'struct stat'
(not kstat!), avoiding most of the current work. There is the obvious
concern of code duplication, which I think I can cover in an acceptable
manner by implementing generic helpers for fields the filesystem does
not want to mess with on its own (a sketch is at the end of this mail).

That legitimize_mnt() thing has been annoying me for a long time now. I
did not post any patches as the namespace code is barely readable for
me and I'm trying not to dig into it.
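For point 1, the shape of what I mean -- a rough sketch only, the
function name is made up and none of this is tested against the actual
namespace code:

	/*
	 * Rough sketch of point 1 above, not a real patch. The idea:
	 * if clearing ->mnt_ns is guaranteed to wait out an rcu grace
	 * period before the final mntput(), then a lockless walker
	 * which sees a non-NULL ->mnt_ns under rcu_read_lock() knows
	 * the mount stays alive, which may allow dropping the smp_mb()
	 * in legitimize_mnt().
	 */
	static void mnt_ns_clear_sketch(struct mount *mnt)
	{
		WRITE_ONCE(mnt->mnt_ns, NULL);	/* detach from the ns */
		synchronize_rcu();	/* wait out lockless walkers */
		mntput(&mnt->mnt);	/* only now drop the ref */
	}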
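For point 2, roughly what I have in mind. The op name, signature and
helper below are all made up, and the real thing would still need to
handle copy_to_user and compat layouts:

	/* Sketch of point 2; not a proposed signature. */
	struct inode_operations_sketch {
		/*
		 * Fill a userspace-layout struct stat directly,
		 * skipping the kstat round-trip and the fields
		 * stat(2) does not report.
		 */
		int (*stat)(const struct path *path, struct stat *st);
	};

	/*
	 * Generic helper for fields the filesystem does not want to
	 * mess with on its own, to keep code duplication in check.
	 */
	int generic_fill_stat_sketch(struct inode *inode, struct stat *st)
	{
		st->st_ino   = inode->i_ino;
		st->st_mode  = inode->i_mode;
		st->st_nlink = inode->i_nlink;
		/* and so on for the remaining common fields */
		return 0;
	}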