Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file

From: Zach Brown <zab@zabbo.net>
To: Amir Goldstein <amir73il@gmail.com>
Cc: RIc Wheeler <ricwheeler@gmail.com>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Christian Brauner <brauner@kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
Date: Thu, 6 Feb 2025 10:58:12 -0800
Message-ID: <20250206185812.GA413506@localhost.localdomain>
In-Reply-To: <CAOQ4uxjN5oedNhZ2kCJC2XLncdkSFMYJOWmSEC3=a-uGjd=w7Q@mail.gmail.com>

(Yay, back from travel!)

On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
> >
> > Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
> > a good background on this -
> > https://www.fosdem.org/2025/schedule/speaker/zach_brown/).  We are
> > looking at taking advantage of modern low latency NVME devices and
> > today's networks to implement a distributed file system that provides
> > better concurrency that high object counts need and still have the
> > bandwidth needed to support the backend archival systems we feed.
> >
> 
> I heard this talk and it was very interesting.
> Here's a direct link to slides from people who may be too lazy to
> follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
> 
> I was both very impressed by the cache coherent rename example
> and very puzzled - I do not know any filesystem where rename can be
> synchronized on a single block io, and looking up ancestors is usually
> done on in-memory dentries, so I may not have understood the example.

The meat of that talk was about how ngnfs uses its distributed block
cache as a serializing/coherence/consistency mechanism.  That specific
example was about how we can get concurrent rename between different
mounts without needing some global equivalent of the rename mutex.

The core of the mechanism is that code paths that implement operations
have a transactional object that holds on to cached block references
which have a given access mode granted over the network.  In this rename
case, the ancestor walk holds on to all the blocks for the duration of
the walk.  (Can be a lot of blocks!)  If another mount somewhere else
tried to modify those ancestor blocks, it would first need the cached
read access revoked before being granted its write access.  That
revocation would wait for the first rename to finish and release the
read refs.  This gives us
specific serialization of access to the blocks in question rather than
relying on a global serializing object over all renames.
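
To make the pattern concrete, here is a toy, single-process model of it.
This is not ngnfs code: the names BlockGrants, Transaction, try_acquire
and finish are all made up for illustration, and where a real
implementation would block until a revocation completes, this sketch
only reports whether a grant is possible right now.

```python
from collections import defaultdict

READ, WRITE = "read", "write"


class BlockGrants:
    """Hypothetical stand-in for whatever grants block access modes over the network."""

    def __init__(self):
        self.readers = defaultdict(set)  # block number -> mounts holding read refs
        self.writer = {}                 # block number -> mount holding the write ref

    def conflicts(self, mount, block, mode):
        other_writer = self.writer.get(block) not in (None, mount)
        other_readers = bool(self.readers[block] - {mount})
        # Reads conflict only with another mount's write;
        # writes conflict with any other mount's access.
        return other_writer if mode == READ else (other_writer or other_readers)

    def grant(self, mount, block, mode):
        if mode == READ:
            self.readers[block].add(mount)
        else:
            self.writer[block] = mount

    def release(self, mount, block):
        self.readers[block].discard(mount)
        if self.writer.get(block) == mount:
            del self.writer[block]


class Transaction:
    """Holds block references for the duration of one operation."""

    def __init__(self, grants, mount):
        self.grants, self.mount, self.held = grants, mount, []

    def try_acquire(self, block, mode):
        # Assumption: a real implementation would wait for revocation here
        # instead of returning False.
        if self.grants.conflicts(self.mount, block, mode):
            return False
        self.grants.grant(self.mount, block, mode)
        self.held.append(block)
        return True

    def finish(self):
        # Dropping the refs is what lets a waiting revocation proceed.
        for block in self.held:
            self.grants.release(self.mount, block)
        self.held.clear()


grants = BlockGrants()

# Mount A's rename holds read refs on every ancestor block during its walk.
rename = Transaction(grants, "mount-a")
walk_ok = all(rename.try_acquire(b, READ) for b in (10, 11, 12))

# Mount B contends only if it touches one of those specific blocks.
other = Transaction(grants, "mount-b")
blocked = other.try_acquire(11, WRITE)    # conflicts with A's read ref
unrelated = other.try_acquire(99, WRITE)  # different block, no contention

rename.finish()
after = other.try_acquire(11, WRITE)      # A released its refs, so this succeeds
```

The point of the sketch is only that contention is scoped to block 11:
the write to block 99 never waits on the rename.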

That's the idea, anyway.  I'm implementing the first bits of this now.

It's sort of a silly example, because who puts cross-directory rename in
the fast path?  (Historically some s3<->posix servers implemented
CompleteMultipartUpload by renaming from tmp dirs to visible bucket
dirs, hrmph).  But it illustrates the pattern of shrinking contention
down to the block level.

- z
