linux-fsdevel.vger.kernel.org archive mirror
* Mainlining the kernel module for TernFS, a distributed filesystem
@ 2025-10-03 12:13 Francesco Mazzoli
  2025-10-03 14:22 ` Amir Goldstein
  0 siblings, 1 reply; 8+ messages in thread
From: Francesco Mazzoli @ 2025-10-03 12:13 UTC (permalink / raw)
  To: linux-fsdevel

My workplace (XTX Markets) has open sourced a distributed
filesystem which has been used internally for a few years, TernFS:
<https://github.com/XTXMarkets/ternfs>. The repository includes both the server
code for the filesystem but also several clients. The main client we use
is a kernel module which allows you to mount TernFS from Linux systems. The
current codebase would not be ready for upstreaming, but I wanted to gauge
if eventual upstreaming would be even possible in this case, and if yes,
what the process would be.

Obviously TernFS currently has only one user, although we run on more than
100 thousand machines, spanning relatively diverse hardware and running
fairly diverse software. And this might change if other organizations adopt
TernFS now that it is open source, naturally.

The kernel module has been fairly stable, although we need to properly adapt
it to the folio world. However, it would be much easier to maintain if it
were mainlined, and I wanted to describe the peculiarities of TernFS to
see whether it would even be possible to do so. For those interested, we also
have a blog post going into a lot more detail about the design of TernFS
(<https://www.xtxmarkets.com/tech/2025-ternfs/>), but hopefully this email
will be enough for the purposes of this discussion.

TernFS files are immutable, they're written once and then can't be modified.
Moreover, when files are created they're not actually linked into the
directory structure until they're closed. One way to think about it is that
in TernFS every file follows the semantics you'd have if you opened the file
with `O_TMPFILE` and then linked it with `linkat`. This is the most "odd"
part of the kernel module, since it runs counter to some pretty baked-in
assumptions about how the file lifecycle works.
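
To make the lifecycle concrete: the semantics mirror roughly the userspace
pattern below (a minimal sketch; the mount point and file names are made up
for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *dir = "/mnt/ternfs/data";            /* made-up mount point */
        const char *dst = "/mnt/ternfs/data/result.bin"; /* final name */
        char fdpath[64];

        /* Anonymous file in the target directory: no directory entry yet. */
        int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, "payload", 7) != 7) { perror("write"); return 1; }

        /* Only now does the file become visible under a name. */
        snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, fdpath, AT_FDCWD, dst, AT_SYMLINK_FOLLOW) < 0) {
                perror("linkat");
                return 1;
        }
        close(fd);
        return 0;
}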

TernFS also does not support many things: hard links, permissions, any sort
of extended attribute, and so on. I would imagine this is less unpleasant,
though, since it's just a matter of getting ENOTSUP out of a bunch of
syscalls.
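
For illustration, in the kernel module an unsupported operation mostly reduces
to a stub along these lines (a sketch, not the actual TernFS code):

#include <linux/fs.h>
#include <linux/dcache.h>

/* Hard links don't exist in TernFS, so link(2) just gets an error back;
 * EOPNOTSUPP in the kernel is what userspace sees as ENOTSUP. */
static int ternfs_link(struct dentry *old_dentry, struct inode *dir,
                       struct dentry *new_dentry)
{
        return -EOPNOTSUPP;
}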

Apart from that I wouldn't expect TernFS to be that different from Ceph or
other networked storage codebases inside the kernel.

Let me know what you think,
Francesco


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 12:13 Mainlining the kernel module for TernFS, a distributed filesystem Francesco Mazzoli
@ 2025-10-03 14:22 ` Amir Goldstein
  2025-10-03 15:01   ` Francesco Mazzoli
  0 siblings, 1 reply; 8+ messages in thread
From: Amir Goldstein @ 2025-10-03 14:22 UTC (permalink / raw)
  To: Francesco Mazzoli
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert,
	Miklos Szeredi

On Fri, Oct 3, 2025 at 2:15 PM Francesco Mazzoli <f@mazzo.li> wrote:
>

Hi Francesco,

> My workplace (XTX Markets) has open sourced a distributed
> filesystem which has been used internally for a few years, TernFS:
> <https://github.com/XTXMarkets/ternfs>. The repository includes both the server
> code for the filesystem but also several clients. The main client we use
> is a kernel module which allows you to mount TernFS from Linux systems. The
> current codebase would not be ready for upstreaming, but I wanted to gauge
> if eventual upstreaming would be even possible in this case, and if yes,
> what the process would be.

First of all, the project looks very impressive!

The first thing to do to understand the prospect of upstreaming is exactly
what you did - send this email :)
It is very detailed and the linked design doc is very thorough.

Unfortunately, there is no official checklist for when or whether a new
filesystem could be upstreamed, but we have a lot of Do's and Don'ts that we
have learned the hard way, so I will try to list some of them.

>
> Obviously TernFS currently has only one user, although we run on more than
> 100 thousand machines, spanning relatively diverse hardware and running
> fairly diverse software. And this might change if other organizations adopt
> TernFS now that it is open source, naturally.
>

Very good observation.

A codebase with only one major user is a red flag.
I am sure that you and your colleagues are very talented,
but if your employer decides to cut down on upstreaming budget,
the kernel maintainers would be left with an effectively orphaned filesystem.

This is especially true when the client is used in house, most likely
not on a distro running the latest upstream kernel.

So yeh, it's a bit of a chicken and egg problem,
but if you get community adoption for the server code,
it will make a big difference on the prospect of upstreaming the client code.

> The kernel module has been fairly stable, although we need to properly adapt
> it to the folio world. However, it would be much easier to maintain if it
> were mainlined, and I wanted to describe the peculiarities of TernFS to
> see whether it would even be possible to do so. For those interested, we also
> have a blog post going into a lot more detail about the design of TernFS
> (<https://www.xtxmarkets.com/tech/2025-ternfs/>), but hopefully this email
> will be enough for the purposes of this discussion.

I am very interested in this part, because that is IMO a question that
we need to ask every new filesystem upstream attempt:
"Can it be implemented in FUSE?"

Design doc says that:
:For this reason, we opted to work with Linux directly, rather than using FUSE.
:Working directly with the Linux kernel not only gave us the confidence that we
:could achieve our performance requirements but also allowed us to bend the
:POSIX API to our needs, something that would have been more difficult if we
:had used FUSE

And later on you continue to explain that you managed to work around the POSIX
API issue, so all that remains are the performance requirements.

More specifically the README says that you have a FUSE client and that it is
:slower than the kmod although still performant,
:requires a BPF program to correctly detect file closes

So my question is:
Why is the FUSE client slower?
Did you analyse the bottlenecks?
Do these bottlenecks exist when using the FUSE-iouring channel?
Mind you that FUSE-iouring was developed by DDN developers specifically
for the use case of very fast distributed filesystems in userspace.

There is another interesting project, FUSE-iomap [1], which is probably
less relevant for distributed network filesystems, but it goes to show:
if FUSE is not performant enough for your use case, you need to ask
yourself, "Can I improve FUSE?" (for the benefit of everyone)

And it's not only because upstreamed kernel filesystems need to pass muster
with a bunch of picky kernel developers.

If you manage to write a good (enough) FUSE client, it will make your
development and deployments so much easier and both you and your
users will benefit from it.

Maybe the issue that you solved with an eBPF program could be
addressed in upstream FUSE?...

[1] https://lore.kernel.org/linux-fsdevel/20250821003720.GA4194186@frogsfrogsfrogs/

>
> TernFS files are immutable, they're written once and then can't be modified.
> Moreover, when files are created they're not actually linked into the
> directory structure until they're closed. One way to think about it is that
> in TernFS every file follows the semantics you'd have if you opened the file
> with `O_TMPFILE` and then linked it with `linkat`. This is the most "odd"
> part of the kernel module, since it runs counter to some pretty baked-in
> assumptions about how the file lifecycle works.
>
> TernFS also does not support many things: hard links, permissions, any sort
> of extended attribute, and so on. I would imagine this is less unpleasant,
> though, since it's just a matter of getting ENOTSUP out of a bunch of
> syscalls.

I mean, it sounds very cool from an engineering POV that you managed to
remove unneeded constraints (a.k.a. the POSIX standard) and make a better
product due to the simplifications, but that's exactly what userspace
filesystems are for - doing whatever you want ;)

>
> Apart from that I wouldn't expect TernFS to be that different from Ceph or
> other networked storage codebases inside the kernel.
>

Except for the wide adoption of the open source Ceph server ;)

Cheers,
Amir.


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 14:22 ` Amir Goldstein
@ 2025-10-03 15:01   ` Francesco Mazzoli
  2025-10-03 17:35     ` Bernd Schubert
  2025-10-04  2:52     ` Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread
From: Francesco Mazzoli @ 2025-10-03 15:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert,
	Miklos Szeredi

On Fri, Oct 3, 2025, at 15:22, Amir Goldstein wrote:
> First of all, the project looks very impressive!
> 
> The first thing to do to understand the prospect of upstreaming is exactly
> what you did - send this email :)
> It is very detailed and the linked design doc is very thorough.

Thanks for the kind words!

> A codebase with only one major user is a red flag.
> I am sure that you and your colleagues are very talented,
> but if your employer decides to cut down on upstreaming budget,
> the kernel maintainers would be left with an effectively orphaned filesystem.
> 
> This is especially true when the client is used in house, most likely
> not on a distro running the latest upstream kernel.
> 
> So yeh, it's a bit of a chicken and egg problem,
> but if you get community adoption for the server code,
> it will make a big difference on the prospect of upstreaming the client code.

Understood, we can definitely wait and see if TernFS gains wider adoption
before making concrete plans to upstream.

> I am very interested in this part, because that is IMO a question that
> we need to ask every new filesystem upstream attempt:
> "Can it be implemented in FUSE?"

Yes, and we have done so:
<https://github.com/XTXMarkets/ternfs/blob/main/go/ternfuse/ternfuse.go>.

> So my question is:
> Why is the FUSE client slower?
> Did you analyse the bottlenecks?
> Do these bottlenecks exist when using the FUSE-iouring channel?
> Mind you that FUSE-iouring was developed by DDN developers specifically
> for the use case of very fast distributed filesystems in userspace.
> ...
> I mean it sounds very cool from an engineering POV that you managed to
> remove unneeded constraints (a.k.a POSIX standard) and make a better
> product due to the simplifications, but that's exactly what userspace
> filesystems
> are for - for doing whatever you want ;)

These are all good questions, and while we have not profiled the FUSE driver
extensively, my impression is that relying critically on FUSE would be risky.
There are some specific things that would be difficult today. For instance
FUSE does not expose `d_revalidate`, which means that dentries would be dropped
needlessly in cases where we know they can be left in place.
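
To make that concrete, this is roughly the kind of hook a native client gets to
implement (a sketch with a hypothetical helper; note that the exact
d_revalidate prototype has shifted across recent kernel releases):

#include <linux/dcache.h>
#include <linux/namei.h>

static int ternfs_d_revalidate(struct dentry *dentry, unsigned int flags)
{
        if (flags & LOOKUP_RCU)
                return -ECHILD;  /* can't sleep here, retry in ref-walk mode */
        if (ternfs_dentry_still_valid(dentry))  /* hypothetical staleness check */
                return 1;        /* keep the dentry, no round trip needed */
        return 0;                /* drop it and force a fresh lookup */
}

static const struct dentry_operations ternfs_dentry_ops = {
        .d_revalidate = ternfs_d_revalidate,
};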

There are also some more high-level FUSE design points which we were concerned
about (although I'm not up to speed with the FUSE-over-io_uring work). One
obvious concern is the fact that with FUSE it's much harder to minimize copying.
FUSE passthrough helps, but it would have made the read path significantly more
complex given the need to juggle file descriptors between user space and the
kernel. Also, TernFS uses Reed-Solomon to recover from situations where some
parts of a file are unreadable, and in that case we'd have had to fall back to
a non-passthrough version. Another possible FUSE performance pitfall is that
you're liable to be bottlenecked by the FUSE request queue, while if you work
directly within the kernel you're not.

And of course before BPF we wouldn't have been able to track the nature of
file closes to a degree where the FUSE driver can implement TernFS semantics
correctly.

This is not to say that a FUSE driver couldn't possibly work, but I think there
are good reasons for wanting to work directly with the kernel if you want to be
sure to utilize resources effectively.

> Except for the wide adoption of the open source ceph server ;)

Oh, absolutely, I was just talking about how the code would look :).

Thanks,
Francesco


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 15:01   ` Francesco Mazzoli
@ 2025-10-03 17:35     ` Bernd Schubert
  2025-10-03 18:18       ` Francesco Mazzoli
  2025-10-04  2:52     ` Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2025-10-03 17:35 UTC (permalink / raw)
  To: Francesco Mazzoli, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi



On 10/3/25 17:01, Francesco Mazzoli wrote:
> On Fri, Oct 3, 2025, at 15:22, Amir Goldstein wrote:
>> First of all, the project looks very impressive!
>>
>> The first thing to do to understand the prospect of upstreaming is exactly
>> what you did - send this email :)
>> It is very detailed and the linked design doc is very thorough.
> 
> Thanks for the kind words!
> 
>> A codebase with only one major user is a red flag.
>> I am sure that you and your colleagues are very talented,
>> but if your employer decides to cut down on upstreaming budget,
>> the kernel maintainers would be left with an effectively orphaned filesystem.
>>
>> This is especially true when the client is used in house, most likely
>> not on a distro running the latest upstream kernel.
>>
>> So yeh, it's a bit of a chicken and egg problem,
>> but if you get community adoption for the server code,
>> it will make a big difference on the prospect of upstreaming the client code.
> 
> Understood, we can definitely wait and see if TernFS gains wider adoption
> before making concrete plans to upstream.
> 
>> I am very interested in this part, because that is IMO a question that
>> we need to ask every new filesystem upstream attempt:
>> "Can it be implemented in FUSE?"
> 
> Yes, and we have done so:
> <https://github.com/XTXMarkets/ternfs/blob/main/go/ternfuse/ternfuse.go>.

Hmm, from a fuse-io-uring point of view this is not ideal, see Han-Wen's
explanation here:
https://github.com/hanwen/go-fuse/issues/560

I just posted a new queue-reduction series today, maybe that
helps a bit
https://lore.kernel.org/r/20251003-reduced-nr-ring-queues_3-v2-0-742ff1a8fc58@ddn.com

At a minimum, each implementation still should take care of NUMA affinity;
getting reasonable performance is hard if go-fuse has an issue with that.

Btw, I had seen your design a week or two ago when it was posted on Phoronix,
and it looks like you need to know in FUSE_RELEASE if the application crashed.
I think that is trivial, and we at DDN might also use it for the POSIX/S3
interface; patch follows - no need for extra steps with BPF.

> 
>> So my question is:
>> Why is the FUSE client slower?
>> Did you analyse the bottlenecks?
>> Do these bottlenecks exist when using the FUSE-iouring channel?
>> Mind you that FUSE-iouring was developed by DDN developers specifically
>> for the use case of very fast distributed filesystems in userspace.
>> ...
>> I mean it sounds very cool from an engineering POV that you managed to
>> remove unneeded constraints (a.k.a POSIX standard) and make a better
>> product due to the simplifications, but that's exactly what userspace
>> filesystems
>> are for - for doing whatever you want ;)
> 
> These are all good questions, and while we have not profiled the FUSE driver
> extensively, my impression is that relying critically on FUSE would be risky.
> There are some specific things that would be difficult today. For instance
> FUSE does not expose `d_revalidate`, which means that dentries would be dropped
> needlessly in cases where we know they can be left in place.

FUSE sends LOOKUP in fuse_dentry_revalidate()? I.e., that is then just a
userspace counter if a dentry was already looked up? For the upcoming
FUSE_LOOKUP_HANDLE we can also make sure it takes an additional flag argument.

> 
> There are also some more high level FUSE design points which we were concerned
> by (although I'm not up to speed with the FUSE over io_uring work). One obvious
> concern is the fact that with FUSE it's much harder to minimize copying.
> FUSE passthrough helps but it would have made the read path significantly more
> complex given the need to juggle file descriptors between user space and the
> kernel. Also, TernFS uses Reed-Solomon to recover from situations where some
> parts of a file are unreadable, and in that case we'd have had to fall back to
> a non-passthrough version. Another possible FUSE performance pitfall is that
> you're liable to be bottlenecked by the FUSE request queue, while if you work
> directly within the kernel you're not.

I agree on copying, but with io-uring I'm not sure about a request queue issue.
At best what's missing is a dynamic number of ring entries, which would reduce
memory usage. And yeah, zero-copy would help as well, but we at DDN access the
buffers for erasure coding, compression, etc. - maybe possible at some point
with BPF, but right now too hard.

> 
> And of course before BPF we wouldn't have been able to track the nature of
> file closes to a degree where the FUSE driver can implement TernFS semantics
> correctly.

See above, patch follows.


Thanks,
Bernd


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 17:35     ` Bernd Schubert
@ 2025-10-03 18:18       ` Francesco Mazzoli
  2025-10-03 19:01         ` Francesco Mazzoli
  0 siblings, 1 reply; 8+ messages in thread
From: Francesco Mazzoli @ 2025-10-03 18:18 UTC (permalink / raw)
  To: Bernd Schubert, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi

On Fri, Oct 3, 2025, at 18:35, Bernd Schubert wrote:
> Btw, I had seen your design a week or two ago when it was posted on Phoronix,
> and it looks like you need to know in FUSE_RELEASE if the application crashed.
> I think that is trivial, and we at DDN might also use it for the POSIX/S3
> interface; patch follows - no need for extra steps with BPF.

It's a bit more complicated than that, sadly. I'd imagine that FUSE_RELEASE
will be called when the file refcount drops to zero but this might very well be
after we actually intended to link the file. Consider the case when a process
forks, the child inherits the file descriptors (including open TernFS
files), and then the parent close()s the file, intending to link it. You won't
get FUSE_RELEASE because of the reference in the child, and the file won't be
linked as a consequence.

However you can't link the file too eagerly either for the reverse reason. What
you need is to track "intentional" closes, and you're going to end up relying
on some heuristic, unless you use something like O_TMPFILE + linkat.

In the kernel module we do that by tracking where the close came from and
whether the close is being performed as part of the process winding down. We
only link the file if the close is coming from the process that created the
file and not as part of process wind-down. This particular heuristic has worked
well for us, and empirically it has been quite user friendly.
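
For the curious, the shape of it is roughly the sketch below (simplified, with
hypothetical types and helpers; the real code has more to worry about):

#include <linux/fs.h>
#include <linux/sched.h>

struct ternfs_file_info {        /* hypothetical per-open-file state */
        pid_t creator_tgid;      /* tgid of the process that created the file */
};

static int ternfs_flush(struct file *file, fl_owner_t id)
{
        struct ternfs_file_info *fi = file->private_data;

        if (task_tgid_nr(current) != fi->creator_tgid)
                return 0;   /* close from a process that didn't create it */
        if (current->flags & PF_EXITING)
                return 0;   /* process is winding down, not intentional */
        return ternfs_link_file(file);  /* hypothetical: link it into its dir */
}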

In FUSE with BPF we do something arguably more principled: we mark a file as
"explicitly closed" if it was closed through close(), and only link it after an
explicit close has been recorded.
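
As a sketch of the idea (not our actual program, and with the tracepoint
context layout abridged), the kernel-side half can be as small as a tracepoint
program that reports explicit close(2) calls to the FUSE daemon:

/* Abridged BPF sketch: report explicit close(2) calls so the FUSE daemon
 * can tell them apart from releases caused by a process exiting. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct close_event {
        __u32 tgid;
        __s32 fd;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);
} closes SEC(".maps");

struct sys_enter_close_ctx {        /* abridged tracepoint context layout */
        __u64 pad;
        __s64 syscall_nr;
        __u64 fd;
};

SEC("tracepoint/syscalls/sys_enter_close")
int on_close(struct sys_enter_close_ctx *ctx)
{
        struct close_event *e;

        e = bpf_ringbuf_reserve(&closes, sizeof(*e), 0);
        if (!e)
                return 0;
        e->tgid = bpf_get_current_pid_tgid() >> 32;
        e->fd = ctx->fd;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";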

> Fuse sends LOOKUP in fuse_dentry_revalidate()? I.e. that is just a userspace
> counter then if a dentry was already looked up? For the upcoming
> FUSE_LOOKUP_HANDLE we can also make sure it takes an additional flag argument.

Oh, I had not realized that FUSE will return valid if the lookup is stable,
thank you. You'll still pay the price of roundtripping through userspace
though, and given how common lookups are, I'd imagine tons of spurious lookups
into the FUSE server would still be unpleasant.
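
(For reference, the low-level libfuse shape being discussed is something like
the sketch below, with a hypothetical resolver: the kernel only comes back to
userspace once entry_timeout expires, and keeps the dentry if the LOOKUP reply
still points at the same node.)

#define FUSE_USE_VERSION 35
#include <fuse_lowlevel.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical: map (parent, name) to an inode number and fill in attrs. */
fuse_ino_t tern_resolve(fuse_ino_t parent, const char *name, struct stat *attr);

static void tern_ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
{
        struct fuse_entry_param e;

        memset(&e, 0, sizeof(e));
        e.ino = tern_resolve(parent, name, &e.attr);
        e.generation = 1;        /* stable for the lifetime of the inode */
        e.entry_timeout = 30.0;  /* how long the kernel may trust the dentry */
        e.attr_timeout = 30.0;

        if (e.ino)
                fuse_reply_entry(req, &e);
        else
                fuse_reply_err(req, ENOENT);
}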

> I agree on copying, but with io-uring I'm not sure about a request queue issue.
> At best missing is a dynamic size of ring entries, which would reduce memory
> usage. And yeah, zero-copy would help as well, but we at DDN buffer access
> with erasure coding, compression, etc - maybe possible at some point with BPF, but right
> now too hard.

I'll have to take a look at FUSE + io_uring, won't comment on that until I'm
familiar with it :).

Francesco


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 18:18       ` Francesco Mazzoli
@ 2025-10-03 19:01         ` Francesco Mazzoli
  0 siblings, 0 replies; 8+ messages in thread
From: Francesco Mazzoli @ 2025-10-03 19:01 UTC (permalink / raw)
  To: Bernd Schubert, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi

On Fri, Oct 3, 2025, at 19:18, Francesco Mazzoli wrote:
> > I agree on copying, but with io-uring I'm not sure about a request queue issue.
> > At best missing is a dynamic size of ring entries, which would reduce memory
> > usage. And yeah, zero-copy would help as well, but we at DDN buffer access
> > with erasure coding, compression, etc - maybe possible at some point with BPF, but right
> > now too hard.
> 
> I'll have to take a look at FUSE + io_uring, won't comment on that until I'm
> familiar with it :).

Oh, one more point on copying: when reconstructing using Reed-Solomon, you want
to read from and write to the page cache, to fetch pages that you need for
reconstruction if you already have them, and to store the additional pages you
fetch. Again, I'd imagine this would be hard to do with FUSE in a zero-copy way.

All of this should not detract from the point that I'm sure a very performant
TernFS FUSE driver could be written, but I'm not convinced it would be the
better option all things considered.

Francesco 


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 15:01   ` Francesco Mazzoli
  2025-10-03 17:35     ` Bernd Schubert
@ 2025-10-04  2:52     ` Theodore Ts'o
  2025-10-04  9:01       ` Francesco Mazzoli
  1 sibling, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2025-10-04  2:52 UTC (permalink / raw)
  To: Francesco Mazzoli
  Cc: Amir Goldstein, linux-fsdevel, Christian Brauner, Darrick J. Wong,
	Bernd Schubert, Miklos Szeredi

On Fri, Oct 03, 2025 at 04:01:56PM +0100, Francesco Mazzoli wrote:
> 
> > A codebase with only one major user is a red flag.
> > I am sure that you and your colleagues are very talented,
> > but if your employer decides to cut down on upstreaming budget,
> > the kernel maintainers would be left with an effectively orphaned filesystem.

I'd go further than that.  Expanding your user base is definitely a
good thing, but see if you can also expand your developer community so
that some of your users are finding enough value that they are willing
to contribute to the development of your file system.  Perhaps there
are some use cases which aren't important to you, so it's not something
that you can justify pursuing, but perhaps it would be high value for
some other company with a similar, but not identical, use case?

To do that, some recommendations:

*) Have good developer's documentation; not just how to start using
   it, but how to get started understanding the code base.  That is,
   things like the layout of the code base, how to debug problems,
   etc.  I see that you have documentation on how to run regression
   tests, which is great.

*) At the moment, it looks like your primary focus for the client is
   the Ubuntu LTS kernel.  That makes sense, but if you are going
   for upstream inclusion, it might be useful to have a version of the
   codebase which is sync'ed to the upstream kernel, and then have an
   adaptation layer which allows the code to be compiled as a module on
   distribution kernels.
   
*) If you have a list of simple starter projects that you could hand
   off to someone who is interested, that would be useful.  (For
   example, one such starter project might be adding dkms support for
   other distributions beyond Ubuntu, which might be useful for other
   potential users.  Do you have a desire for more tests?  In general,
   in my experience, most projects could always use more testing.)

Looking at the documentation, here are some notes:

* "We don't expect new directories to be created often, and files (or
  directories) to be moved between directories often."  I *think*
  "don't expect" binds to both parts of the conjuction.  So can you
  confirm that whatw as meant is "... nor do we expect that files
  (or directries) to be moved frequently."

* If that's true, it means that you *do* expect that files and
  directories can be moved around.  What are the consistency
  expectations when a file is renamed/moved?  I assume that since
  clients might be scattered across the world, there is some period
  where different clients might have different views.  Is there some
  kind of guarantee about when the eventual consistency will
  definitely be resolved?

* In the description of the filesystem data or metadata, there is no
  mention of whether there are checksums at rest or not.  Given the
  requirements that there be protections against hard disk bitrot, I
  assume there would be -- but what is the granularity?  Every 4092
  bytes (as in GFS)?   Every 1M?   Every 4M?   Are the checksums verified
  on the server when the data is read?  Or by the client?   Or both?
  What is the recovery path if the checksum doesn't verify?

* Some of the above are about the protocol, and that would be good to
  document.  What, if any, authentication and authorization checking
  gets done?  Is there any cryptographic protection for either
  encryption or data integrity?  I've seen some companies who consider
  their LLMs to be highly proprietary, to the extent that they want to
  use confidential compute VMs.  Or if you are using the file system
  for training data, the training data might have PII.

> These are all good questions, and while we have not profiled the
> FUSE driver extensively...

There has been some really interesting work that Darrick Wong has been
doing using the low-level FUSE API.  The low-level FUSE API is Linux
only, but using it with the fuse-iomap patches, Darrick has managed to
get basically equivalent performance for direct and buffered I/O
comparing the native ext4 file system driver with his patched fuse2fs
and low-level FUSE fuse-iomap implementation.  His goal was to provide
better security for untrusted containers that want to mount images
that might be carefully, maliciously crafted, but it does demonstrate
that if you aren't particularly worried about metadata-heavy workloads,
and are primarily concerned about data plane performance, using the
low-level (Linux-only) FUSE interface might work well for you.

> There are some specific things that would be difficult today. For
> instance FUSE does not expose `d_revalidate`, which means that
> dentries would be dropped needlessly in cases where we know they can
> be left in place.

I believe the low-level FUSE interface does expose dentry revalidation.


> parts of a file are unreadable, and in that case we'd have had to
> fall back to a non-passthrough version.

Ah, you are using erasure codes; what were the design considerations for
using RS as opposed to having multiple copies of data blocks?  Or do
you support both?

This would be great to document --- or maybe you might want to
consider creating a "Design and Implementation of TernFS" paper and
submitting it to a conference like FAST.  :-)

Cheers,

						- Ted
						


* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-04  2:52     ` Theodore Ts'o
@ 2025-10-04  9:01       ` Francesco Mazzoli
  0 siblings, 0 replies; 8+ messages in thread
From: Francesco Mazzoli @ 2025-10-04  9:01 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, linux-fsdevel, Christian Brauner, Darrick J. Wong,
	Bernd Schubert, Miklos Szeredi

On Sat, Oct 4, 2025, at 03:52, Theodore Ts'o wrote:
> To do that, some recommendations:
> ...

Thank you, this is all very useful.

> Looking the documentation, here are some notes:
> 
> * "We don't expect new directories to be created often, and files (or
>   directories) to be moved between directories often."  I *think*
>   "don't expect" binds to both parts of the conjuction.  So can you
>   confirm that whatw as meant is "... nor do we expect that files
>   (or directries) to be moved frequently."

Your interpretation is correct.

> * If that's true, it means that you *do* expect that files and
>   directories can be moved around.  What are the consistency
>   expectations when a file is renamed/moved?  I assume that since
>   clients might be scattered across the world, there is some period
>   where different clients might have different views.  Is there some
>   kind of guarantee about when the eventual consistency will
>   definitely be resolved?

While TernFS is geo-replicated, metadata is geo-replicated in a master-slave
fashion: writes go through a single region, and writers in a given region
are guaranteed to read their own writes. We have plans to move this to a
master-master setup, but it hasn't been very urgent since the metadata latency
hit is usually hidden by the time it takes to write the actual files (which, as
remarked, tend to be pretty big).

That said, directory entries are also cached; we use 250ms, but it's
configurable.

File contents, on the other hand, are written locally and replicated in both a
push and a pull fashion. However, files are immutable, which means you never
have an inconsistent view of file contents in different regions.

See also the "Going global" section of the blog post:
<https://www.xtxmarkets.com/tech/2025-ternfs/>.

> * In the description of the filesystem data or metadata, there is no
>   mention of whether there are checksums at rest or not.  Given the
>   requirements that there be protections against hard disk bitrot, I
>   assume there would be -- but what is the granularity?  Every 4092
>   bytes (as in GFS)?   Every 1M?   Every 4M?   Are the checksums verified
>   on the server when the data is read?  Or by the client?   Or both?
>   What is the recovery path if the checksum doesn't verify?

Some of this is explained in the blog post mentioned above. In short: file
contents are checksummed both at page level and at a higher boundary
(we call these "spans"), and the CRCs at this higher boundary are cross-checked
by the metadata services and the storage nodes. I've written two blog posts
about these topics, see <https://mazzo.li/posts/mac-distributed-tx.html> and
<https://mazzo.li/posts/rs-crc.html>. The metadata is also checksummed by way
of RocksDB. Errors are recovered from using Reed-Solomon codes.
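
As a toy illustration of why this composes nicely: per-page CRCs can be
combined into a span CRC without re-reading the data, so different components
can cross-check each other cheaply. This uses zlib's plain CRC32 purely for
illustration, not the exact CRC variant or layout TernFS uses (build with -lz):

#include <stdio.h>
#include <zlib.h>

#define PAGE_SZ 4096

int main(void)
{
        static unsigned char page_a[PAGE_SZ] = { 'a' };
        static unsigned char page_b[PAGE_SZ] = { 'b' };

        uLong crc_a = crc32(0L, page_a, PAGE_SZ);
        uLong crc_b = crc32(0L, page_b, PAGE_SZ);

        /* Span CRC derived from the per-page CRCs alone... */
        uLong span_from_pages = crc32_combine(crc_a, crc_b, PAGE_SZ);

        /* ...matches the CRC of the two pages hashed back to back. */
        uLong span_direct = crc32(crc_a, page_b, PAGE_SZ);

        printf("%lx %lx\n", span_from_pages, span_direct);
        return 0;
}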

> * Some of the above are about the protocol, and that would be good to
>   document.  What if any are the authentication and authorization
>   checking that gets done?  Are there any cryptographic protection for
>   either encryption or data integrity?  I've seen some companies who
>   consider their LLM to highly proprietary, to the extent that they
>   want to use confidential compute VM's.  Or if you are using the file
>   system for training data, the training data might have PII.

There's no cryptographic protection or authentication in TernFS. We handle
authentication at a different layer: we have filesystem gateways that expose
only parts of the filesystem to less privileged users.

> There has been some really interesting work that Darrick Wong has
> been doing using the low-level fuse API.  ...

One clear takeaway from this thread is that FUSE performance is a topic I
don't know enough about. I'll have to explore the various novelties that
you guys have brought up to bring me up to speed.

> I believe the low-level FUSE interface does expose dentry revalidation.

It doesn't directly, but Bernd pointed out that it won't invalidate dentries
if the lookup is stable, which is good enough.

> Ah, you are using erasure codes; what was the design considerations of
> using RS as opposed to having multiple copies of data blocks.  Or do
> you support both?

We support both.

> This would be great to document --- or maybe you might want to
> consider creating a "Design and Implementation of TernFS" paper and
> submitting to a conference like FAST.  :-)

The blog post was intended to be that kind of document, but we might consider a
more detailed/academic publication!

Thanks,
Francesco

