Date: Tue, 24 Sep 2024 13:34:14 +1000
From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet
Cc: Linus Torvalds, linux-bcachefs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Dave Chinner
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1
References: <6vizzdoktqzzkyyvxqupr6jgzqcd4cclc24pujgx53irxtsy4h@lzevj646ccmg>
 <74sgzrvtnry4wganaatcmxdsfwauv6r33qggxo27yvricrzxvq@77knsf6cfftl>
In-Reply-To: <74sgzrvtnry4wganaatcmxdsfwauv6r33qggxo27yvricrzxvq@77knsf6cfftl>

On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote:
> > On Mon, 23 Sept 2024 at 17:27, Dave Chinner wrote:
> > >
> > > However, the problematic workload is cold cache operations where
> > > the dentry cache repeatedly misses. This places all the operational
> > > concurrency directly on the inode hash as new inodes are inserted
> > > into the hash. Add memory reclaim and that adds contention as it
> > > removes inodes from the hash on eviction.
> >
> > Yeah, and then we spend all the time just adding the inodes to the
> > hashes, and probably fairly seldom use them. Oh well.
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right
> > now the inode hash lock is outside the inode lock, which is
> > problematic.
> >
> > So it's all a bit nasty.
> >
> > But I also assume most of the bad issues end up mainly showing up on
> > just fairly synthetic benchmarks with ramdisks, because even with a
> > good SSD I suspect the IO for the cold cache would still dominate?
>
> Not for bcachefs, because filling into the vfs inode cache doesn't
> require a disk read - they're cached in the inodes btree and much
> smaller there. We use a varint encoding so they're typically 50-100
> bytes, last I checked.
>
> Compare to ~1k, plus or minus, in the vfs inode cache.
>
> Thomas Bertshinger has been working on applications at LANL where
> avoiding pulling into the vfs inode cache seems to make a significant
> difference (file indexing in his case) - it turns out there's an
> xattr syscall that's missing, which I believe he'll be submitting a
> patch for.
>
> But stat/statx always pulls into the vfs inode cache, and that's
> likely worth fixing.

No, let's not even consider going there. Unlike most people, old-time
XFS developers have direct experience with the problems that
"uncached" inode access for stat purposes causes.

XFS has had the bulkstat API for a long, long time (i.e. since 1998 on
Irix). When it was first implemented on Irix, it was VFS cache
coherent. But in the early 2000s, that caused problems with HSMs
needing to scan billions of inodes indexing petabytes of stored data
with certain SLA guarantees (i.e. needing to scan at least a million
inodes a second). The CPU overhead of cache instantiation and teardown
was too great to meet those performance targets on 500MHz MIPS CPUs.

So we converted bulkstat to run directly out of the XFS buffer cache
(i.e. uncached from the perspective of the VFS). This reduced the CPU
overhead per inode substantially, allowing bulkstat rates to increase
by a factor of 10. However, it introduced all sorts of coherency
problems between cached inode state and what was stored in the buffer
cache. It was basically O_DIRECT for stat() and, as you'd expect from
that description, the coherency problems were horrible. Detecting
iallocated-but-not-yet-updated and unlinked-but-not-yet-freed inodes
was a particularly consistent source of issues.

The only way to fix these coherency problems was to check the inode
cache for a resident inode first, which basically defeated the entire
purpose of bypassing the VFS cache in the first place. So we went back
to using coherent lookups once we stopped caring about the old SGI HSM
SLA requirements back in 2010:

commit 7dce11dbac54fce777eea0f5fb25b2694ccd7900
Author: Christoph Hellwig
Date:   Wed Jun 23 18:11:11 2010 +1000

    xfs: always use iget in bulkstat

    The non-coherent bulkstat versions that look directly at the inode
    buffers cause various problems with performance optimizations that
    make increased use of just logging inodes. This patch makes
    bulkstat always use iget, which should be fast enough for normal
    use with the radix-tree based inode cache introduced a while ago.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

Essentially, we now always bring inodes into cache as fully
instantiated VFS inodes and mark them with I_DONTCACHE so they get
torn down immediately after we've used them if they weren't already
cached (i.e. all accesses are done in a single task context with the
inode hot in the CPU caches).
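Roughly, the per-inode pattern looks like the sketch below. This is
illustrative only, not the actual XFS bulkstat code: myfs_iget() and
struct myfs_bstat are made-up stand-ins for the filesystem's own iget
wrapper and per-inode stat record, and d_mark_dontcache() is the VFS
helper that sets I_DONTCACHE.

#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/err.h>

/* Hypothetical: the filesystem's own coherent inode lookup. */
struct inode *myfs_iget(struct super_block *sb, u64 ino);

/* Hypothetical cut-down bulkstat record. */
struct myfs_bstat {
        u64     ino;
        u64     size;
        umode_t mode;
};

static int myfs_bulkstat_one(struct super_block *sb, u64 ino,
                             struct myfs_bstat *bs)
{
        struct inode *inode;

        /* Coherent lookup through the normal VFS inode cache. */
        inode = myfs_iget(sb, ino);
        if (IS_ERR(inode))
                return PTR_ERR(inode);

        bs->ino = inode->i_ino;
        bs->size = i_size_read(inode);
        bs->mode = inode->i_mode;

        /*
         * Tear the inode down again on the final iput(). The real code
         * only marks inodes it instantiated itself (i.e. cache misses),
         * so already-cached inodes keep their normal lifetime; that
         * detail is elided here.
         */
        d_mark_dontcache(inode);
        iput(inode);
        return 0;
}

The point is that the lookup itself is always coherent; the DONTCACHE
marking just stops a one-shot scan from flooding the inode cache with
inodes nothing else is going to use.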
Without the patchset I pointed to, bulkstat maxxes out the VFS inode
cache, cycling on the sb->s_inodes_list_lock, at about 700,000
inodes/sec through the cache. With that patchset, it maxxed out the
bandwidth of the NVMe SSDs I was using at ~7GB/s read IO and cycled
about 6 million inodes/sec through the VFS cache. IOWs, we can make
the VFS inode cache scale way beyond what most users actually need
right now.

The lesson in all this? Don't hack around VFS scalability issues if
that can be avoided. If we fix the problems properly in the first
place, then most fs developers (let alone users!) won't even realise
there was a scalability problem in the VFS to begin with. Users will,
however, notice every inconsistency and/or corruption caused by fs
developers trying to work around VFS cache limitations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com