From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f45.google.com (mail-pj1-f45.google.com [209.85.216.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0030681720 for ; Tue, 21 Jan 2025 01:20:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.45 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737422429; cv=none; b=PiLek4rKYBI7RXhmPE9nWOVLsny1eBWllnePYgEeEPooxbTY6nmxuS8ac2YzyjZgxJuD/usoH6zXQkTho6zrkNun8C31Gm5L7ozK4GBjuCsxIBe/8gaLpDA1uwOkEJQTnoOBrCiPiplBXU+PNcxtQoLO+GdVkg+ySpGq8JONe5s= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737422429; c=relaxed/simple; bh=eMs1XR2aYJpBvgtaU/Q8w7i4/0/0KAkiUbBrWVTBRLM=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=jJ1e1dqhZEZpJOkSHC+XYusEcq6B8xQmc7VmHBNMwClTY2TDg4llbeTK6ir9PivKGhrwKV8xWjSYqCkHvLJlT/SEmXAyJ7GG118pXun3UbPHZeZyATAGkKpWKlelSMEW3NWvmZUHq3X2ywZ8qY2wtzHAfFclRNOU3iHJQ3l/mqY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=li71IxbC; arc=none smtp.client-ip=209.85.216.45 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="li71IxbC" Received: by mail-pj1-f45.google.com with SMTP id 98e67ed59e1d1-2ef70c7efa5so6938071a91.2 for ; Mon, 20 Jan 2025 17:20:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1737422427; x=1738027227; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=pySU+UaLycNt2Mnc9d/1arFANwTD8nZiZk8SBrDSfas=; b=li71IxbCNZyGqJihs8uyKjZNYneISREH0JAetHjxdoM/SBOdOsYl2Y7xARO3jTYHXd gY8QG/hEZIySY7Zqqj09ikQzOfMC5w2+6KloKWHQe1O05n1xJCE8tC8TQMckwGHWrlTJ 7KVYB+skL+i4IrSU7I2gTBoAGxLfn2888mwBl8sOn7PF6mWA3QYVnzPyzEKm+PhKEPce QgY1/vXouAm19CWFlaRaJbP599Zvl4ermBXexB2JRarL140+aRqghPTUvbcBvai8TYop lyKwn6mxhTPTztZeRAMupCOIemlc/ibizjmliYkt3BoWpTIS3uHjHXQJRjqnMliqgX/Z u8rw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737422427; x=1738027227; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=pySU+UaLycNt2Mnc9d/1arFANwTD8nZiZk8SBrDSfas=; b=XGMC7A2/x4iW4EFOv2CHBIYKq478NUmBkTpBUkrmZviqkvdUoQNseg/V7iPFx19GgD 0WmvnqBfSGCpbMfxXvvwSK6s+TyFNATVo38J1iKqXWhv8kI1JAs7DBalGrnVNcgh9lXR BOjStG2HooAqB+O3V0LFvS8UhPgY3D2Guh3GZaRrt1fH9m1rWOps/64CZmnnm6Z4TkFt M9HiYAF+wwvJCP6ov7qImSaK4rGWri0yESNkGbNXLSkQvx09//d3w8yGmfwamQ8Wn893 KRqq0G5T1FKj1FTxmacgC+NhYq31QtHHTYQd1DwbgIG4bfJqzaCf2tt18QJbYsEmjEK7 9Q7w== X-Forwarded-Encrypted: i=1; AJvYcCWAUPWL/914ELNEZDP/CjE1MxfHSmDEvYU5XJo597lkLL8qgMohQkTyh3s7SWOwV24DIlN5MweqwqhADVco@vger.kernel.org X-Gm-Message-State: AOJu0Yy5hLejafWaVE+mp1VLpZSD4X1ReMgDfkwROtxy1VJsiXs7b8AD jq+B/f04lFm0ssoS9nLnXGke4qvmlCSnPJKYDjFaKYx29NWxaXKy62HXuGkeMR8= X-Gm-Gg: ASbGncvF/iV5aiGeQ1JBs/EXpi0jP8qCax7QVWbFZSyUmiBuEED2FeDb4Ff/Op3XJDK BWpB1lBb2+rZb/HSBPRiO3oXCKHSqMhmnttwixxvt0rb9t5GD09nE53LZ/YOdYB+uXdoQ9gIgwG r3Cu1Zo3b3tJI6YsmGOcS5fe4L/2ViiBjih1+5AYSCPn3u0oT+T3NFt6FSUO9aIYd7+Xbn/yNhB uUDkh8F5aCYkC/tiFzflEwPWRD64dGXolbe7SwwfO5GEf6vz5MgqXqWJ0tCLU4NeWTP9tfd1eUS RyNv+gpuHY6KNC2nm3JPwM3onDvynurNosutV+NqU0IrOw== X-Google-Smtp-Source: AGHT+IEdBGu/0/k4U5s63BnQhlq32zTJkGXSwTBdNLTg/0XWbTsLvr9v7sw3E1OMTa7Le6tOBENH1g== X-Received: by 2002:a05:6a00:4085:b0:72d:8d98:c256 with SMTP id d2e1a72fcca58-72daf945d6dmr21452387b3a.6.1737422425693; Mon, 20 Jan 2025 17:20:25 -0800 (PST) Received: from dread.disaster.area (pa49-186-89-135.pa.vic.optusnet.com.au. [49.186.89.135]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-72daba4ebf3sm7653322b3a.124.2025.01.20.17.20.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 20 Jan 2025 17:20:25 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.98) (envelope-from ) id 1ta2wI-00000008TAc-04vH; Tue, 21 Jan 2025 12:20:22 +1100 Date: Tue, 21 Jan 2025 12:20:22 +1100 From: Dave Chinner To: NeilBrown Cc: Jeff Layton , lsf-pc@lists.linuxfoundation.org, Al Viro , linux-fsdevel@vger.kernel.org Subject: Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer Message-ID: References: <> <173732553757.22054.12851849131700067664@noble.neil.brown.name> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <173732553757.22054.12851849131700067664@noble.neil.brown.name> On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote: > On Mon, 20 Jan 2025, Dave Chinner wrote: > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > > > > > > My question to fs-devel is: is anyone willing to convert their fs (or > > > advice me on converting?) to use the new interface and do some testing > > > and be open to exploring any bugs that appear? > > > > tl;dr: You're asking for people to put in a *lot* of time to convert > > complex filesystems to concurrent directory modifications without > > clear indication that it will improve performance. Hence I wouldn't > > expect widespread enthusiasm to suddenly implement it... > > Thanks Dave! > Your point as detailed below seems to be that, for xfs at least, it may > be better to reduce hold times for exclusive locks rather than allow > concurrent locks. That seems entirely credible for a local fs but > doesn't apply for NFS as we cannot get a success status before the > operation is complete. How is that different from a local filesystem? A local filesystem can't return from open(O_CREAT) with a struct file referencing a newly allocated inode until the VFS inode is fully instantiated (or failed), either... i.e. this sounds like you want concurrent share-locked dirent ops so that synchronously processed operations can be issued concurrently. Could the NFS client implement asynchronous directory ops, keeping track of the operations in flight without needing to hold the parent i_rwsem across each individual operation? This basically what I've been describing for XFS to minimise parent dir lock hold times. What would VFS support for that look like? Is that of similar complexity to implementing shared locking support so that concurrent blocking directory operations can be issued? Is async processing a better model to move the directory ops towards so we can tie userspace directly into it via io_uring? > So it seems likely that different filesystems > will want different approaches. No surprise. > > There is some evidence that ext4 can be converted to concurrent > modification without a lot of work, and with measurable benefits. I > guess I should focus there for local filesystems. > > But I don't want to assume what is best for each file system which is > why I asked for input from developers of the various filesystems. > > But even for xfs, I think that to provide a successful return from mkdir > would require waiting for some IO to complete, and that other operations I don't see where IO enters the picture, to be honest. File creation does not typically require foreground IO on XFS at all (ignoring dirsync mode). How did you think we scale XFS to near a million file creates a second? :) In the case of mkdir, it does not take a direct reference to the inode being created so it potentially doesn't even need to wait for the completion of the operation. i.e. to use the new subdir it has to be open()d; that means going through the ->lookup path and which will block on I_NEW until the background creation is completed... That said, open(O_CREAT) would need to call wait_on_inode() somewhere to wait for I_NEW to clear so operations on the inode can proceed immediately via the persistent struct file reference it creates. With the right support, that waiting can be done without holding the parent directory locked, as any new lookup on that dirent/inode pair will block until I_NEW is cleared... Hence my question above about what does VFS support for async dirops actually looks like, and whether something like this: > might benefit from starting before that IO completes. > So maybe an xfs implementation of mkdir_shared would be: > - take internal exclusive lock on directory > - run fast foreground part of mkdir > - drop the lock > - wait for background stuff, which could affect error return, to > complete > - return appropriate error, or success as natively supported functionality might be a better solution to the problem.... >From the XFs perspective, we already have internal exclusive locking for dirent mods, but that is needed for serialising the physical directory mods against other (shared access) readers (e.g. lookup and readdir). We would not want to be sharing such an internal lock across the unbound fast path concurrency of lookup, create, unlink, readdir -and- multiple background processing threads. Logging a modification intent against an inode wouldn't require a new internal inode lock; AFAICT all it requires is exclusivity against lookup so that lookup can find new/unlinked dirents that have been logged but not yet added to or removed from the physical directory blocks. We can do that by inserting the VFS inode into cache in the foreground operation and leaving I_NEW set on create. We can do a similar thing with unlink (I_WILL_FREE?) to make the VFS inode invisible to new lookups before the unlink has actually been processed. In this way, we can push the actual processing into background work queues without actually changing how the namespace behaves from a user perspective... We -might- be able to do all these operations under a shared VFS lock, but it is not necessary to have a shared VFS lock to enable such a async processing model. If the performance gains for the NFS client come from allowing concurrent processing of individual synchronous operations, then we really need to consider if there are other methods of hiding synchronous operation latency that might be more applicable to more filesystems and high performance user interfaces... -Dave. -- Dave Chinner david@fromorbit.com