From: "Theodore Ts'o" <tytso@mit.edu>
To: Francesco Mazzoli <f@mazzo.li>
Cc: Amir Goldstein <amir73il@gmail.com>,
linux-fsdevel@vger.kernel.org,
Christian Brauner <brauner@kernel.org>,
"Darrick J. Wong" <djwong@kernel.org>,
Bernd Schubert <bernd.schubert@fastmail.fm>,
Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: Mainlining the kernel module for TernFS, a distributed filesystem
Date: Fri, 3 Oct 2025 22:52:47 -0400
Message-ID: <20251004025247.GD386127@mit.edu>
In-Reply-To: <34918add-4215-4bd3-b51f-9e47157501a3@app.fastmail.com>
On Fri, Oct 03, 2025 at 04:01:56PM +0100, Francesco Mazzoli wrote:
>
> > A codebase code with only one major user is a red flag.
> > I am sure that you and your colleagues are very talented,
> > but if your employer decides to cut down on upstreaming budget,
> > the kernel maintainers would be left with an effectively orphaned filesystem.
I'd go further than that. Expanding your user base is definitely a
good thing, but also see if you can expand your developer community,
so that some of your users find enough value that they are willing to
contribute to the development of your file system. Perhaps there are
some use cases which aren't important to you, so it's not something
that you can justify pursuing, but perhaps it would be high value for
some other company with a similar, but not identical, use case?
To do that, some recommendations:
*) Have good developer's documentation; not just how to start using
it, but how to get started understanding the code base. That is,
things like the layout of the code base, how to debug problems,
etc. I see that you have documentation on how to run regression
tests, which is great.
*) At the moment, it looks like your primary focus for the client is
the Ubuntu LTS kernel. That makes sense, but if you are going for
upstream inclusion, it might be useful to have a version of the
codebase which is sync'ed to the upstream kernel, and then have an
adaptation layer which allows the code to be compiled as a module on
distribution kernels.
*) If you have a list of simple starter projects that you could hand
off to someone who is interested, that would be useful. (For
example, one such starter project might be adding dkms support for
other distributions beyond Ubuntu, which might be useful for other
potential users. Do you have a desire for more tests? In my
experience, most projects could always use more testing.)
Looking at the documentation, here are some notes:
* "We don't expect new directories to be created often, and files (or
directories) to be moved between directories often." I *think*
"don't expect" binds to both parts of the conjunction. So can you
confirm that what was meant is "... nor do we expect files (or
directories) to be moved frequently"?
* If that's true, it means that you *do* expect that files and
directories can be moved around. What are the consistency
expectations when a file is renamed/moved? I assume that since
clients might be scattered across the world, there is some period
where different clients might have different views. Is there some
kind of guarantee about when the eventual consistency will
definitely be resolved?
* In the description of the filesystem data or metadata, there is no
mention of whether there are checksums at rest or not. Given the
requirements that there be protections against hard disk bitrot, I
assume there would be -- but what is the granularity? Every 4092
bytes (as in GFS)? Every 1M? Every 4M? Are the checksums verified
on the server when the data is read? Or by the client? Or both?
What is the recovery path if the checksum doesn't verify?
* Some of the above are about the protocol, and that would be good to
document. What, if any, authentication and authorization checking
gets done? Is there any cryptographic protection for either
encryption or data integrity? I've seen some companies who consider
their LLM to be highly proprietary, to the extent that they want to
use confidential compute VMs. Or if you are using the file system
for training data, the training data might have PII.
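To make the granularity question concrete, here is a minimal sketch of
block-granular checksumming. The 4096-byte block size, the use of
CRC32, and the function names are all assumptions chosen for
illustration; none of this describes what TernFS actually does.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical checksum granularity, in bytes


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC32 per fixed-size block of data."""
    return [
        zlib.crc32(data[off:off + block_size])
        for off in range(0, len(data), block_size)
    ]


def find_bad_blocks(data: bytes, expected: list[int],
                    block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indices of blocks whose CRC32 does not match."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Usage: flip one bit in the third block and detect it.
payload = bytes(range(256)) * 64          # 16 KiB, i.e. four 4 KiB blocks
stored = checksum_blocks(payload)         # checksums kept "at rest"
corrupted = bytearray(payload)
corrupted[2 * BLOCK_SIZE + 10] ^= 0x01    # simulated bitrot
bad = find_bad_blocks(bytes(corrupted), stored)
```

The smaller the block, the less data has to be re-fetched or
reconstructed when a checksum fails, at the cost of more checksum
metadata. The same check could run on the server, on the client, or
on both.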
> These are all good questions, and while we have not profiled the
> FUSE driver extensively...
There has been some really interesting work that Darrick Wong has
been doing using the low-level FUSE API. The low-level FUSE API is
Linux-only, but using that with the fs-iomap patches, Darrick has
managed to get basically equivalent performance for direct and
buffered I/O when comparing the native ext4 file system driver with
his patched fuse2fs and low-level FUSE fs-iomap implementation. His
goal was to provide better security for untrusted containers that want
to mount images that might be carefully, maliciously crafted, but it
does demonstrate that if you aren't particularly worried about
metadata-heavy workloads, and are primarily concerned about data plane
performance, using the low-level (Linux-only) FUSE interface might
work well for you.
> There are some specific things that would be difficult today. For
> instance FUSE does not expose `d_revalidate`, which means that
> dentries would be dropped needlessly in cases where we know they can
> be left in place.
I believe the low-level FUSE interface does expose dentry revalidation.
> parts of a file is unreadable, and in that case we'd have had to
> fall back to a non-passthrough version.
Ah, you are using erasure codes; what were the design considerations
for using RS as opposed to having multiple copies of data blocks? Or
do you support both?
This would be great to document --- or maybe you might want to
consider creating a "Design and Implementation of TernFS" paper and
submitting to a conference like FAST. :-)
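For readers comparing the two approaches, the basic trade-off can be
sketched with a little arithmetic. The RS(10, 4) and 3-copy
parameters below are hypothetical examples, not TernFS's actual
configuration, and the function names are made up for illustration.

```python
def rs_profile(data_blocks: int, parity_blocks: int) -> tuple[float, int]:
    """Return (raw bytes stored per usable byte, block losses tolerated)
    for a Reed-Solomon layout with the given data and parity counts."""
    overhead = (data_blocks + parity_blocks) / data_blocks
    return overhead, parity_blocks


def replication_profile(copies: int) -> tuple[float, int]:
    """Return (raw bytes stored per usable byte, copy losses tolerated)
    for plain replication with the given number of copies."""
    return float(copies), copies - 1


# Hypothetical parameters, for comparison only.
rs_overhead, rs_losses = rs_profile(10, 4)
rep_overhead, rep_losses = replication_profile(3)
```

With these example numbers, Reed-Solomon stores 1.4 bytes per usable
byte and survives the loss of any 4 blocks, while 3-way replication
stores 3 bytes per usable byte and survives the loss of 2 copies. The
price of erasure coding is that a degraded read has to fetch several
surviving blocks and decode, rather than simply reading another copy.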
Cheers,
- Ted