From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 3 Oct 2024 12:20:53 +1000
From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-bcachefs@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
References: <20241002014017.3801899-1-david@fromorbit.com>
	<20241002-lethargisch-hypnose-fd06ae7a0977@brauner>
X-Mailing-List: linux-bcachefs@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner wrote:
> > > > >
> > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > > >
> > > > > > I don't have big conceptual issues with the series otherwise.
> > > > > > The only thing that makes me a bit uneasy is that we are now
> > > > > > providing an api that may encourage filesystems to do their own
> > > > > > inode caching even if they don't really have a need for it just
> > > > > > because it's there. So really a way that would've solved this
> > > > > > issue generically would have been my preference.
> > > > >
> > > > > Well, that's the problem, isn't it? :/
> > > > >
> > > > > There really isn't a good generic solution for global list access
> > > > > and management.
> > > > > The dlist stuff kinda works, but it still has significant
> > > > > overhead and doesn't get rid of spinlock contention completely
> > > > > because of the lack of locality between list add and remove
> > > > > operations.
> > > >
> > > > I much prefer the approach taken in your patch series, to let the
> > > > filesystem own the inode list and keeping the old model as the
> > > > "default list".
> > > >
> > > > In many ways, that is how *most* of the VFS layer works - it
> > > > exposes helper functions that the filesystems can use (and most
> > > > do), but doesn't force them.
> > > >
> > > > Yes, the VFS layer does force some things - you can't avoid using
> > > > dentries, for example, because that's literally how the VFS layer
> > > > deals with filenames (and things like mounting etc). And honestly,
> > > > the VFS layer does a better job of filename caching than any
> > > > filesystem really can do, and with the whole UNIX mount model,
> > > > filenames fundamentally cross filesystem boundaries anyway.
> > > >
> > > > But clearly the VFS layer inode list handling isn't the best it
> > > > can be, and unless we can fix that in some fundamental way (and I
> > > > don't love the "let's use crazy lists instead of a simple one"
> > > > models) I do think that just letting filesystems do their own
> > > > thing if they have something better is a good model.
> > >
> > > Well, I don't love adding more indirection and callbacks.
> >
> > It's way better than open coding inode cache traversals everywhere.
>
> Eh? You had a nice iterator for dlock-list :)

Which was painful to work with because it maintained the existing
spinlock-based traversal pattern. This was necessary because the
iterator held a spinlock internally. I really didn't like that aspect
of it because it perpetuated the need to open code the iget/iput game
to allow the spinlock to be dropped across the inode operation that
needed to be performed. i.e. adding a dlist iterator didn't clean up
any of the other mess that sb->s_inodes iteration required...

> > We simply cannot do things like that without a new iteration model.
> > Abstraction is necessary to facilitate a new iteration model, and a
> > model that provides independent object callbacks allows scope for
> > concurrent processing of individual objects.
>
> Parallelized iteration is a slick possibility.
>
> My concern is that we've been trying to get away from callbacks for
> iteration - post spectre they're quite a bit more expensive than
> external iterators, and we've generally been successful with that.

So everyone keeps saying, but the old adage applies here: penny wise,
pound foolish.

Optimising away the callbacks might bring us a few percent performance
improvement for each operation (e.g. via the dlist iterator mechanisms)
in a traversal, but that iteration is still only single threaded.
Hence the maximum processing rate is determined by the performance of
a single CPU core.

However, if we change the API to allow for parallelism at the cost of
a few percent per object operation, then a single CPU core will not
process quite as many objects as before. However, the moment we allow
multiple CPU cores to process in parallel, we achieve processing rate
improvements measured in integer multiples.

Modern CPUs have concurrency to burn. Optimising APIs for minimum
per-operation overhead rather than for concurrent processing
implementations is the wrong direction to be taking....

> > > Converting the standard inode hash table to an rhashtable (or more
> > > likely, creating a new standard implementation and converting
> > > filesystems one at a time) still needs to happen, and then the
> > > "use the hash table for iteration" approach could use that without
> > > every filesystem having to specialize.
> >
> > Yes, but this still doesn't help filesystems like XFS where the
> > structure of the inode cache is highly optimised for the specific
> > on-disk and in-memory locality of inodes. We aren't going to be
> > converting XFS to a rhashtable based inode cache anytime soon
> > because it simply doesn't provide the functionality we require.
> > e.g. efficient lockless sequential inode number ordered traversal
> > in -every- inode cluster writeback operation.
>
> I was going to ask what your requirements are - I may take on the
> general purpose inode rhashtable code, although since I'm still
> pretty buried we'll see.
>
> Coincidentally, just today I'm working on an issue in bcachefs where
> we'd also prefer an ordered data structure to a hash table for the
> inode cache - in online fsck, we need to be able to check if an inode
> is still open, but checking for an inode in an interior snapshot node
> means we have to do a scan and check if any of the open inodes are
> in a descendent subvolume.
>
> Radix tree doesn't work for us, since our keys are { inum, subvol }
> - 96 bits -

Sure it does - you just need two layers of radix trees. i.e. have a
radix tree per subvol to index inodes by inum, and a per-sb radix
tree to index the subvols. With some code to propagate radix tree
bits from the inode radix tree to the subvol radix tree, they then
largely work in conjunction for filtered searches.

This is -exactly- the internal inode cache structure that XFS has. We
have a per-sb radix tree indexing the allocation groups, and a radix
tree per allocation group indexing inodes by inode number. Hence an
inode lookup involves splitting the inum into agno/agino pairs, then
doing a perag lookup with the agno, and doing a perag inode cache
lookup with the agino. All of these radix tree lookups are lockless...

-Dave.
-- 
Dave Chinner
david@fromorbit.com