Re: Mainlining the kernel module for TernFS, a distributed filesystem


From: "Francesco Mazzoli" <f@mazzo.li>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Amir Goldstein" <amir73il@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	"Christian Brauner" <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Bernd Schubert" <bernd.schubert@fastmail.fm>,
	"Miklos Szeredi" <miklos@szeredi.hu>
Subject: Re: Mainlining the kernel module for TernFS, a distributed filesystem
Date: Sat, 04 Oct 2025 10:01:04 +0100
Message-ID: <a46bc09c-0af1-42b8-b134-128b93b7b5c4@app.fastmail.com>
In-Reply-To: <20251004025247.GD386127@mit.edu>

On Sat, Oct 4, 2025, at 03:52, Theodore Ts'o wrote:
> To do that, some recommendations:
> ...

Thank you, this is all very useful.

> Looking at the documentation, here are some notes:
> 
> * "We don't expect new directories to be created often, and files (or
>   directories) to be moved between directories often."  I *think*
>   "don't expect" binds to both parts of the conjunction.  So can you
>   confirm that what was meant is "... nor do we expect files
>   (or directories) to be moved frequently."

Your interpretation is correct.

> * If that's true, it means that you *do* expect that files and
>   directories can be moved around.  What are the consistency
>   expectations when a file is renamed/moved?  I assume that since
>   clients might be scattered across the world, there is some period
>   where different clients might have different views.  Is there some
>   kind of guarantee about when the eventual consistency will
>   definitely be resolved?

While TernFS is geo-replicated, metadata is geo-replicated in a master-slave
fashion: writes go through a single region, and writers in a given region
are guaranteed to read their own writes. We have plans to move this to
a master-master setup, but it hasn't been very urgent since the metadata latency
hit is usually hidden by the time it takes to write the actual files (which as
remarked tend to be pretty big).

That said, directory entries are also cached, so a rename may take up to the
cache lifetime to become visible: we use 250ms, but it's configurable.
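
As a rough illustration of what a time-bounded directory entry cache looks
like (a sketch only, not TernFS code: the class name `DentryCache`, the
`fetch` callback, and the key layout are all hypothetical; only the 250ms
default comes from the message above):

```python
import time


class DentryCache:
    """Toy TTL cache for directory entries (hypothetical, not TernFS code)."""

    def __init__(self, ttl_seconds=0.250, clock=time.monotonic):
        self.ttl = ttl_seconds          # 250ms default, configurable
        self.clock = clock              # injectable so the cache can be tested
        self._entries = {}              # (dir_id, name) -> (inode_id, expiry)

    def lookup(self, dir_id, name, fetch):
        """Return the cached inode id, or call fetch() if the entry expired."""
        key = (dir_id, name)
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]
        # Miss or stale entry: ask the (assumed) metadata service again.
        inode_id = fetch(dir_id, name)
        self._entries[key] = (inode_id, now + self.ttl)
        return inode_id
```

With a cache of this shape, a client can observe a stale name-to-inode
mapping for at most one TTL after the entry was last fetched.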

File contents on the other hand are written locally and replicated both in a
push and pull fashion. However files are immutable, which means you never have
an inconsistent view of file contents in different regions.

See also the "Going global" section of the blog post:
<https://www.xtxmarkets.com/tech/2025-ternfs/>.

> * In the description of the filesystem data or metadata, there is no
>   mention of whether there are checksums at rest or not.  Given the
>   requirements that there be protections against hard disk bitrot, I
>   assume there would be -- but what is the granularity?  Every 4092
>   bytes (as in GFS)?   Every 1M?   Every 4M?   Are the checksums verified
>   on the server when the data is read?  Or by the client?   Or both?
>   What is the recovery path if the checksum doesn't verify?

Some of this is explained in the blog post mentioned above. In short: file
contents are checksummed both at a page level and at a higher boundary
(we call these "spans"), and the CRCs at this higher boundary are cross-checked
by the metadata services and the storage nodes. I've written two blog posts
about these topics, see <https://mazzo.li/posts/mac-distributed-tx.html> and
<https://mazzo.li/posts/rs-crc.html>. The metadata is also checksummed by way
of RocksDB. Errors are recovered from using Reed-Solomon codes.
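
A minimal sketch of the two-level idea, under stated assumptions: the 4096
byte page size and the helper names are hypothetical, and `zlib.crc32` is
used only because it ships with Python; this message does not say which CRC
variant or page size TernFS uses. The page CRCs localise corruption, and the
span CRC is the value that a second party could cross-check.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def page_crcs(span, page_size=PAGE_SIZE):
    """Return one CRC per page of the span (the last page may be short)."""
    return [zlib.crc32(span[i:i + page_size])
            for i in range(0, len(span), page_size)]


def span_crc(span, page_size=PAGE_SIZE):
    """CRC of the whole span, computed by chaining over its pages."""
    crc = 0
    for i in range(0, len(span), page_size):
        # zlib.crc32 accepts a running value, so the result equals
        # zlib.crc32(span) without needing the span in one buffer.
        crc = zlib.crc32(span[i:i + page_size], crc)
    return crc


def find_bad_pages(span, expected_page_crcs, page_size=PAGE_SIZE):
    """Return indices of pages whose CRC differs from the expected one."""
    actual = page_crcs(span, page_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected_page_crcs))
            if a != e]
```

In this sketch a reader would verify page CRCs as data arrives and treat any
index returned by `find_bad_pages` as a block to reconstruct from the
remaining Reed-Solomon blocks; the reconstruction itself is not shown.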

> * Some of the above are about the protocol, and that would be good to
>   document.  What if any are the authentication and authorization
>   checking that gets done?  Is there any cryptographic protection for
>   either encryption or data integrity?  I've seen some companies who
>   consider their LLM to be highly proprietary, to the extent that they
>   want to use confidential compute VM's.  Or if you are using the file
>   system for training data, the training data might have PII.

There's no cryptographic protection or authentication in TernFS. We handle
authentication at a different layer: we have filesystem gateways that expose
only parts of the filesystem to less privileged users.

> There has been some really interesting work that Darrick Wong has
> been doing using the low-level fuse API.  ...

One clear takeaway from this thread is that FUSE performance is a topic I
don't know enough about. I'll have to explore the various novelties that
you guys have brought up to bring me up to speed.

> I believe the low-level FUSE interface does expose dentry revalidation.

It doesn't directly, but Bernd pointed out that it won't invalidate dentries
if the lookup is stable, which is good enough.

> Ah, you are using erasure codes; what were the design considerations for
> using RS as opposed to having multiple copies of data blocks?  Or do
> you support both?

We support both.

> This would be great to document --- or maybe you might want to
> consider creating a "Design and Implementation of TernFS" paper and
> submitting to a conference like FAST.  :-)

The blog post was intended to be that kind of document, but we might consider a
more detailed/academic publication!

Thanks,
Francesco
