Re: Mainlining the kernel module for TernFS, a distributed filesystem

From: "Theodore Ts'o" <tytso@mit.edu>
To: Francesco Mazzoli <f@mazzo.li>
Cc: Amir Goldstein <amir73il@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	Christian Brauner <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Bernd Schubert <bernd.schubert@fastmail.fm>,
	Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: Mainlining the kernel module for TernFS, a distributed filesystem
Date: Fri, 3 Oct 2025 22:52:47 -0400
Message-ID: <20251004025247.GD386127@mit.edu>
In-Reply-To: <34918add-4215-4bd3-b51f-9e47157501a3@app.fastmail.com>

On Fri, Oct 03, 2025 at 04:01:56PM +0100, Francesco Mazzoli wrote:
> 
> > A codebase with only one major user is a red flag.
> > I am sure that you and your colleagues are very talented,
> > but if your employer decides to cut down on upstreaming budget,
> > the kernel maintainers would be left with an effectively orphaned filesystem.

I'd go further than that.  Expanding your user base is definitely a
good thing, but also see if you can expand your developer community,
so that some of your users find enough value that they are willing to
contribute to the development of your file system.  Perhaps there are
some use cases which aren't important to you, so you can't justify
pursuing them, but which would be high value for some other company
with a similar, but not identical, use case?

To do that, some recommendations:

*) Have good developer documentation; not just how to start using
   it, but how to get started understanding the code base.  That is,
   things like the layout of the code base, how to debug problems,
   etc.  I see that you have documentation on how to run regression
   tests, which is great.

*) At the moment, it looks like your primary focus for the client is
   the Ubuntu LTS kernel.  That makes sense, but if you are going
   for upstream inclusion, it might be useful to have a version of the
   codebase which is sync'ed to the upstream kernel, and then have an
   adaptation layer which allows the code to be compiled as a module on
   distribution kernels.

*) If you have a list of simple starter projects that you could hand
   off to someone who is interested, that would be useful.  (For
   example, one such starter project might be adding dkms support for
   distributions beyond Ubuntu, which might be useful for other
   potential users.  Do you have a desire for more tests?  In my
   experience, most projects could always use more testing.)

Looking at the documentation, here are some notes:

* "We don't expect new directories to be created often, and files (or
  directories) to be moved between directories often."  I *think*
  "don't expect" binds to both parts of the conjunction.  So can you
  confirm that what was meant is "... nor do we expect files
  (or directories) to be moved frequently"?

* If that's true, it means that you *do* expect that files and
  directories can be moved around.  What are the consistency
  expectations when a file is renamed/moved?  I assume that since
  clients might be scattered across the world, there is some period
  where different clients might have different views.  Is there some
  kind of guarantee about when the eventual consistency will
  definitely be resolved?

* In the description of the filesystem data or metadata, there is no
  mention of whether there are checksums at rest or not.  Given the
  requirements that there be protections against hard disk bitrot, I
  assume there would be -- but what is the granularity?  Every 4096
  bytes (as in GFS)?   Every 1M?   Every 4M?   Are the checksums verified
  on the server when the data is read?  Or by the client?   Or both?
  What is the recovery path if the checksum doesn't verify?

* Some of the above are about the protocol, and that would be good to
  document.  What, if any, authentication and authorization
  checking gets done?  Is there any cryptographic protection for
  either confidentiality or data integrity?  I've seen some companies
  who consider their LLM to be highly proprietary, to the extent that
  they want to use confidential compute VMs.  Or if you are using the
  file system for training data, the training data might have PII.
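
To make the checksum granularity question above concrete, here is a
minimal sketch of per-block checksumming.  It is not TernFS's actual
scheme: the 4096-byte block size, the choice of CRC32, and the function
names are all assumptions made purely for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical checksum granularity, not TernFS's actual value


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC32 per block_size-byte block of data."""
    return [
        zlib.crc32(data[off:off + block_size])
        for off in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, expected: list[int],
                        block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indices of blocks whose CRC32 no longer matches."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Simulate bitrot: flip one bit in the second block of a 3-block buffer.
original = bytes(range(256)) * 48          # 12288 bytes = 3 blocks of 4096
sums = checksum_blocks(original)
damaged = bytearray(original)
damaged[5000] ^= 0x01                      # byte 5000 lies in block index 1
print(find_corrupt_blocks(bytes(damaged), sums))   # prints [1]
```

A smaller block size narrows down where the damage is and reduces how
much data must be re-fetched or reconstructed, at the cost of storing
and verifying more checksums.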

> These are all good questions, and while we have not profiled the
> FUSE driver extensively...

There has been some really interesting work that Darrick Wong has
been doing using the low-level FUSE API.  The low-level FUSE API is
Linux only, but using that with fs-iomap patches, Darrick has managed
to get basically equivalent performance for direct and buffered I/O
when comparing the native ext4 file system driver with his patched
fuse2fs and low-level FUSE fs-iomap implementation.  His goal was to
provide better security for untrusted containers that want to mount
images that might be carefully, maliciously crafted, but it does
demonstrate that if you aren't particularly worried about
metadata-heavy workloads, and are primarily concerned about data plane
performance, using the low-level (Linux-only) FUSE interface might
work well for you.

> There are some specific things that would be difficult today. For
> instance FUSE does not expose `d_revalidate`, which means that
> dentries would be dropped needlessly in cases where we know they can
> be left in place.

I believe the low-level FUSE interface does expose dentry revalidation.

> parts of a file is unreadable, and in that case we'd have had to
> fall back to a non-passthrough version.

Ah, you are using erasure codes; what were the design considerations
for using RS as opposed to keeping multiple copies of data blocks?  Or
do you support both?
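
As a rough illustration of the trade-off behind that question, the
sketch below compares raw storage overhead and tolerated block losses
for plain replication versus Reed-Solomon coding.  The parameters
(3 copies, RS with 10 data and 4 parity blocks) are hypothetical
examples, not TernFS's configuration, and the sketch ignores
reconstruction cost and read latency.

```python
def replication(copies: int) -> tuple[float, int]:
    """Storage overhead and tolerated losses for `copies` full replicas."""
    return float(copies), copies - 1


def reed_solomon(data_blocks: int, parity_blocks: int) -> tuple[float, int]:
    """Storage overhead and tolerated losses for RS(data, parity)."""
    overhead = (data_blocks + parity_blocks) / data_blocks
    return overhead, parity_blocks


# Hypothetical configurations, chosen only to show the arithmetic.
print(replication(3))        # prints (3.0, 2)
print(reed_solomon(10, 4))   # prints (1.4, 4)
```

In this example RS stores 1.4x the data and survives four lost blocks,
while three-way replication stores 3x and survives two lost copies;
the price of RS is that a degraded read has to reconstruct data from
several surviving blocks.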

This would be great to document --- or you might want to consider
writing a "Design and Implementation of TernFS" paper and submitting
it to a conference like FAST.  :-)

Cheers,

						- Ted
