From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 3 Oct 2024 12:20:53 +1000
From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-bcachefs@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
References: <20241002014017.3801899-1-david@fromorbit.com>
	<20241002-lethargisch-hypnose-fd06ae7a0977@brauner>
X-Mailing-List: linux-bcachefs@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner wrote:
> > > > >
> > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > > >
> > > > > > I don't have big conceptual issues with the series otherwise.
> > > > > > The only thing that makes me a bit uneasy is that we are now
> > > > > > providing an api that may encourage filesystems to do their own
> > > > > > inode caching even if they don't really have a need for it just
> > > > > > because it's there. So really a way that would've solved this
> > > > > > issue generically would have been my preference.
> > > > >
> > > > > Well, that's the problem, isn't it? :/
> > > > >
> > > > > There really isn't a good generic solution for global list access
> > > > > and management.
> > > > > The dlist stuff kinda works, but it still has significant
> > > > > overhead and doesn't get rid of spinlock contention completely
> > > > > because of the lack of locality between list add and remove
> > > > > operations.
> > > >
> > > > I much prefer the approach taken in your patch series, to let the
> > > > filesystem own the inode list and keeping the old model as the
> > > > "default list".
> > > >
> > > > In many ways, that is how *most* of the VFS layer works - it
> > > > exposes helper functions that the filesystems can use (and most
> > > > do), but doesn't force them.
> > > >
> > > > Yes, the VFS layer does force some things - you can't avoid using
> > > > dentries, for example, because that's literally how the VFS layer
> > > > deals with filenames (and things like mounting etc). And honestly,
> > > > the VFS layer does a better job of filename caching than any
> > > > filesystem really can do, and with the whole UNIX mount model,
> > > > filenames fundamentally cross filesystem boundaries anyway.
> > > >
> > > > But clearly the VFS layer inode list handling isn't the best it
> > > > can be, and unless we can fix that in some fundamental way (and I
> > > > don't love the "let's use crazy lists instead of a simple one"
> > > > models) I do think that just letting filesystems do their own
> > > > thing if they have something better is a good model.
> > >
> > > Well, I don't love adding more indirection and callbacks.
> >
> > It's way better than open coding inode cache traversals everywhere.
>
> Eh? You had a nice iterator for dlock-list :)

Which was painful to work with because it maintained the existing
spinlock-based traversal pattern. This was necessary because the
iterator held a spinlock internally. I really didn't like that aspect
of it because it perpetuated the need to open code the iget/iput game
to allow the spinlock to be dropped across the inode operation that
needed to be performed. i.e. adding a dlist iterator didn't clean up
any of the other mess that sb->s_inodes iteration required...

> > We simply cannot do things like that without a new iteration model.
> > Abstraction is necessary to facilitate a new iteration model, and a
> > model that provides independent object callbacks allows scope for
> > concurrent processing of individual objects.
>
> Parallelized iteration is a slick possibility.
>
> My concern is that we've been trying to get away from callbacks for
> iteration - post spectre they're quite a bit more expensive than
> external iterators, and we've generally been successful with that.

So everyone keeps saying, but the old adage applies here: penny wise,
pound foolish.

Optimising away the callbacks might bring us a few percent performance
improvement for each operation (e.g. via the dlist iterator mechanisms)
in a traversal, but that iteration is still only single threaded.
Hence the maximum processing rate is determined by the performance of
a single CPU core.

However, if we change the API to allow for parallelism at the cost of
a few percent per object operation, then a single CPU core will not
process quite as many objects as before. However, the moment we allow
multiple CPU cores to process in parallel, we achieve processing rate
improvements measured in integer multiples.

Modern CPUs have concurrency to burn. Optimising APIs for minimum
per-operation overhead rather than for concurrent processing
implementations is the wrong direction to be taking....

> > > Converting the standard inode hash table to an rhashtable (or more
> > > likely, creating a new standard implementation and converting
> > > filesystems one at a time) still needs to happen, and then the
> > > "use the hash table for iteration" approach could use that without
> > > every filesystem having to specialize.
> >
> > Yes, but this still doesn't help filesystems like XFS where the
> > structure of the inode cache is highly optimised for the specific
> > on-disk and in-memory locality of inodes. We aren't going to be
> > converting XFS to a rhashtable based inode cache anytime soon
> > because it simply doesn't provide the functionality we require.
> > e.g. efficient lockless sequential inode number ordered traversal
> > in -every- inode cluster writeback operation.
>
> I was going to ask what your requirements are - I may take on the
> general purpose inode rhashtable code, although since I'm still
> pretty buried we'll see.
>
> Coincidentally, just today I'm working on an issue in bcachefs where
> we'd also prefer an ordered data structure to a hash table for the
> inode cache - in online fsck, we need to be able to check if an inode
> is still open, but checking for an inode in an interior snapshot node
> means we have to do a scan and check if any of the open inodes are
> in a descendent subvolume.
>
> Radix tree doesn't work for us, since our keys are { inum, subvol }
> - 96 bits -

Sure it does - you just need two layers of radix trees. i.e. have a
radix tree per subvol to index inodes by inum, and a per-sb radix
tree to index the subvols. With some code to propagate radix tree
bits from the inode radix tree to the subvol radix tree, they then
largely work in conjunction for filtered searches.

This is -exactly- the internal inode cache structure that XFS has. We
have a per-sb radix tree indexing the allocation groups, and a radix
tree per allocation group indexing inodes by inode number. Hence an
inode lookup involves splitting the inum into agno/agino pairs, then
doing a perag lookup with the agno, and doing a perag inode cache
lookup with the agino. All of these radix tree lookups are lockless...

-Dave.
-- 
Dave Chinner
david@fromorbit.com