* Mainlining the kernel module for TernFS, a distributed filesystem
@ 2025-10-03 12:13 Francesco Mazzoli
  2025-10-03 14:22 ` Amir Goldstein
  0 siblings, 1 reply; 8+ messages in thread

From: Francesco Mazzoli @ 2025-10-03 12:13 UTC (permalink / raw)
  To: linux-fsdevel

My workplace (XTX Markets) has open sourced a distributed filesystem which has been used internally for a few years, TernFS: <https://github.com/XTXMarkets/ternfs>. The repository includes both the server code for the filesystem and several clients. The main client we use is a kernel module which allows you to mount TernFS from Linux systems. The current codebase would not be ready for upstreaming, but I wanted to gauge whether eventual upstreaming would even be possible in this case, and if so, what the process would be.

Obviously TernFS currently has only one user, although we run it on more than 100 thousand machines, spanning relatively diverse hardware and running fairly diverse software. And this might change if other organizations adopt TernFS now that it is open source, naturally.

The kernel module has been fairly stable, although we need to properly adapt it to the folio world. However, it would be much easier to maintain if it were mainlined, and I wanted to describe the peculiarities of TernFS to see whether that would even be possible. For those interested we also have a blog post going into a lot more detail about the design of TernFS (<https://www.xtxmarkets.com/tech/2025-ternfs/>), but hopefully this email will be enough for the purposes of this discussion.

TernFS files are immutable: they're written once and then can't be modified. Moreover, when files are created they're not actually linked into the directory structure until they're closed. One way to think about it is that in TernFS every file follows the semantics you'd have if you opened it with `O_TMPFILE` and then linked it with `linkat`. This is the most "odd" part of the kernel module, since it goes counter to some pretty baked-in assumptions about how the file lifecycle works.

TernFS also does not support many things, for example hardlinks, permissions, any sort of extended attribute, and so on. This is, I would imagine, less unpleasant though, since it's just a matter of returning ENOTSUP from a bunch of syscalls.

Apart from that I wouldn't expect TernFS to be that different from Ceph or other networked storage codebases inside the kernel.

Let me know what you think,
Francesco

^ permalink raw reply [flat|nested] 8+ messages in thread
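To make the file-lifecycle semantics above concrete, here is a minimal userspace sketch of the `O_TMPFILE` + `linkat` pattern that every TernFS file effectively follows: the file can be written as soon as it is created, but only becomes visible in the directory once it is explicitly linked. The mount path and file name are made up for illustration and error handling is kept minimal.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char path[64];

        /* Anonymous file inside the target directory, not yet linked. */
        int fd = open("/mnt/ternfs/some/dir", O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, "data", 4) != 4) { perror("write"); return 1; }

        /* Only now does the file appear in the namespace. */
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, path, AT_FDCWD,
                   "/mnt/ternfs/some/dir/file", AT_SYMLINK_FOLLOW) < 0) {
            perror("linkat");
            return 1;
        }
        close(fd);
        return 0;
    }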
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 12:13 Mainlining the kernel module for TernFS, a distributed filesystem Francesco Mazzoli
@ 2025-10-03 14:22 ` Amir Goldstein
  2025-10-03 15:01   ` Francesco Mazzoli
  0 siblings, 1 reply; 8+ messages in thread

From: Amir Goldstein @ 2025-10-03 14:22 UTC (permalink / raw)
  To: Francesco Mazzoli
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert, Miklos Szeredi

On Fri, Oct 3, 2025 at 2:15 PM Francesco Mazzoli <f@mazzo.li> wrote:

Hi Francesco,

> My workplace (XTX Markets) has open sourced a distributed filesystem which has been used internally for a few years, TernFS: <https://github.com/XTXMarkets/ternfs>. The repository includes both the server code for the filesystem and several clients. The main client we use is a kernel module which allows you to mount TernFS from Linux systems. The current codebase would not be ready for upstreaming, but I wanted to gauge whether eventual upstreaming would even be possible in this case, and if so, what the process would be.

First of all, the project looks very impressive!

The first thing to do to understand the prospect of upstreaming is exactly what you did - send this email :)
It is very detailed and the linked design doc is very thorough.

Unfortunately, there is no official checklist for when or whether a new filesystem could be upstreamed, but we have a lot of Do's and Don'ts that we have learned the hard way, so I will try to list some of them.

> Obviously TernFS currently has only one user, although we run it on more than 100 thousand machines, spanning relatively diverse hardware and running fairly diverse software. And this might change if other organizations adopt TernFS now that it is open source, naturally.

Very good observation. A codebase with only one major user is a red flag. I am sure that you and your colleagues are very talented, but if your employer decides to cut down on upstreaming budget, the kernel maintainers would be left with an effectively orphaned filesystem.

This is especially true when the client is used in house, most likely not on a distro running the latest upstream kernel.

So yeh, it's a bit of a chicken and egg problem, but if you get community adoption for the server code, it will make a big difference on the prospect of upstreaming the client code.

> The kernel module has been fairly stable, although we need to properly adapt it to the folio world. However, it would be much easier to maintain if it were mainlined, and I wanted to describe the peculiarities of TernFS to see whether that would even be possible. For those interested we also have a blog post going into a lot more detail about the design of TernFS (<https://www.xtxmarkets.com/tech/2025-ternfs/>), but hopefully this email will be enough for the purposes of this discussion.

I am very interested in this part, because that is IMO a question that we need to ask every new filesystem upstream attempt: "Can it be implemented in FUSE?"

The design doc says:

:For this reason, we opted to work with Linux directly, rather than using FUSE.
:Working directly with the Linux kernel not only gave us the confidence that we could
:achieve our performance requirements but also allowed us to bend the POSIX API
:to our needs, something that would have been more difficult if we had used FUSE

and later on it goes on to explain that you managed to work around the POSIX API issue, so all that remains is the performance requirements.
More specifically, the README says that you have a FUSE client and that it is

:slower than the kmod although still performant,
:requires a BPF program to correctly detect file closes

So my question is:
Why is the FUSE client slower?
Did you analyse the bottlenecks?
Do these bottlenecks exist when using the FUSE-iouring channel?

Mind you that FUSE-iouring was developed by DDN developers specifically for the use case of very fast distributed filesystems in userspace.

There is another interesting project, FUSE-iomap [1], which is probably less relevant for distributed network filesystems, but it goes to show: if FUSE is not performant enough for your use case, you need to ask yourself, "Can I improve FUSE?" (for the benefit of everyone).

It's not only because upstreaming a kernel filesystem needs to pass muster with a bunch of picky kernel developers. If you manage to write a good (enough) FUSE client, it will make your development and deployments so much easier, and both you and your users will benefit from it.

Maybe the issue that you solved with an eBPF program could be improved in upstream FUSE?...

[1] https://lore.kernel.org/linux-fsdevel/20250821003720.GA4194186@frogsfrogsfrogs/

> TernFS files are immutable: they're written once and then can't be modified. Moreover, when files are created they're not actually linked into the directory structure until they're closed. One way to think about it is that in TernFS every file follows the semantics you'd have if you opened it with `O_TMPFILE` and then linked it with `linkat`. This is the most "odd" part of the kernel module, since it goes counter to some pretty baked-in assumptions about how the file lifecycle works.
>
> TernFS also does not support many things, for example hardlinks, permissions, any sort of extended attribute, and so on. This is, I would imagine, less unpleasant though, since it's just a matter of returning ENOTSUP from a bunch of syscalls.

I mean, it sounds very cool from an engineering POV that you managed to remove unneeded constraints (a.k.a. the POSIX standard) and make a better product due to the simplifications, but that's exactly what userspace filesystems are for - for doing whatever you want ;)

> Apart from that I wouldn't expect TernFS to be that different from Ceph or other networked storage codebases inside the kernel.

Except for the wide adoption of the open source Ceph server ;)

Cheers,
Amir.

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 14:22 ` Amir Goldstein
@ 2025-10-03 15:01   ` Francesco Mazzoli
  2025-10-03 17:35     ` Bernd Schubert
  2025-10-04  2:52     ` Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread

From: Francesco Mazzoli @ 2025-10-03 15:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert, Miklos Szeredi

On Fri, Oct 3, 2025, at 15:22, Amir Goldstein wrote:
> First of all, the project looks very impressive!
>
> The first thing to do to understand the prospect of upstreaming is exactly what you did - send this email :)
> It is very detailed and the linked design doc is very thorough.

Thanks for the kind words!

> A codebase with only one major user is a red flag. I am sure that you and your colleagues are very talented, but if your employer decides to cut down on upstreaming budget, the kernel maintainers would be left with an effectively orphaned filesystem.
>
> This is especially true when the client is used in house, most likely not on a distro running the latest upstream kernel.
>
> So yeh, it's a bit of a chicken and egg problem, but if you get community adoption for the server code, it will make a big difference on the prospect of upstreaming the client code.

Understood, we can definitely wait and see if TernFS gains wider adoption before making concrete plans to upstream.

> I am very interested in this part, because that is IMO a question that we need to ask every new filesystem upstream attempt: "Can it be implemented in FUSE?"

Yes, and we have done so: <https://github.com/XTXMarkets/ternfs/blob/main/go/ternfuse/ternfuse.go>.

> So my question is:
> Why is the FUSE client slower?
> Did you analyse the bottlenecks?
> Do these bottlenecks exist when using the FUSE-iouring channel?
> Mind you that FUSE-iouring was developed by DDN developers specifically for the use case of very fast distributed filesystems in userspace.
> ...
> I mean, it sounds very cool from an engineering POV that you managed to remove unneeded constraints (a.k.a. the POSIX standard) and make a better product due to the simplifications, but that's exactly what userspace filesystems are for - for doing whatever you want ;)

These are all good questions, and while we have not profiled the FUSE driver extensively, my impression is that relying critically on FUSE would be risky. There are some specific things that would be difficult today. For instance, FUSE does not expose `d_revalidate`, which means that dentries would be dropped needlessly in cases where we know they can be left in place.

There are also some more high-level FUSE design points which we were concerned by (although I'm not up to speed with the FUSE over io_uring work). One obvious concern is the fact that with FUSE it's much harder to minimize copying. FUSE passthrough helps, but it would have made the read path significantly more complex given the need to juggle file descriptors between user space and the kernel. Also, TernFS uses Reed-Solomon to recover from situations where some parts of a file are unreadable, and in that case we'd have had to fall back to a non-passthrough version. Another possible FUSE performance pitfall is that you're liable to be bottlenecked by the FUSE request queue, while if you work directly within the kernel you're not.

And of course before BPF we wouldn't have been able to track the nature of file closes to a degree where the FUSE driver can implement TernFS semantics correctly.
This is not to say that a FUSE driver couldn't possibly work, but I think there are good reasons for wanting to work directly with the kernel if you want to be sure to utilize resources effectively.

> Except for the wide adoption of the open source Ceph server ;)

Oh, absolutely, I was just talking about how the code would look :).

Thanks,
Francesco

^ permalink raw reply [flat|nested] 8+ messages in thread
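For context on the `d_revalidate` point discussed above, below is a rough, hypothetical sketch of the kind of dentry_operations hook a native kernel client can register. It uses the long-standing two-argument signature (very recent kernels have changed it to also pass the parent inode and name), and the ternfs_* names and the staleness check are made up for illustration; they are not taken from the actual module.

    #include <linux/dcache.h>
    #include <linux/errno.h>
    #include <linux/namei.h>
    #include <linux/types.h>

    /* Hypothetical staleness check; a real client would compare cached
     * directory state against what it knows from the server. */
    static bool ternfs_dentry_still_valid(struct dentry *dentry)
    {
            return true;
    }

    static int ternfs_d_revalidate(struct dentry *dentry, unsigned int flags)
    {
            if (flags & LOOKUP_RCU)
                    return -ECHILD;     /* punt to ref-walk mode */

            /* If the client knows the entry cannot have gone stale,
             * keep the dentry and avoid any round trip. */
            if (ternfs_dentry_still_valid(dentry))
                    return 1;           /* dentry is still valid */

            return 0;                   /* drop it, force a fresh lookup */
    }

    static const struct dentry_operations ternfs_dentry_ops = {
            .d_revalidate = ternfs_d_revalidate,
    };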
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 15:01 ` Francesco Mazzoli
@ 2025-10-03 17:35   ` Bernd Schubert
  2025-10-03 18:18     ` Francesco Mazzoli
  2025-10-04  2:52     ` Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread

From: Bernd Schubert @ 2025-10-03 17:35 UTC (permalink / raw)
  To: Francesco Mazzoli, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi

On 10/3/25 17:01, Francesco Mazzoli wrote:
> On Fri, Oct 3, 2025, at 15:22, Amir Goldstein wrote:
>> First of all, the project looks very impressive!
>>
>> The first thing to do to understand the prospect of upstreaming is exactly what you did - send this email :)
>> It is very detailed and the linked design doc is very thorough.
>
> Thanks for the kind words!
>
>> A codebase with only one major user is a red flag. I am sure that you and your colleagues are very talented, but if your employer decides to cut down on upstreaming budget, the kernel maintainers would be left with an effectively orphaned filesystem.
>>
>> This is especially true when the client is used in house, most likely not on a distro running the latest upstream kernel.
>>
>> So yeh, it's a bit of a chicken and egg problem, but if you get community adoption for the server code, it will make a big difference on the prospect of upstreaming the client code.
>
> Understood, we can definitely wait and see if TernFS gains wider adoption before making concrete plans to upstream.
>
>> I am very interested in this part, because that is IMO a question that we need to ask every new filesystem upstream attempt: "Can it be implemented in FUSE?"
>
> Yes, and we have done so: <https://github.com/XTXMarkets/ternfs/blob/main/go/ternfuse/ternfuse.go>.

Hmm, from the fuse-io-uring point of view not ideal, see Han-Wen's explanation here: https://github.com/hanwen/go-fuse/issues/560

I just posted a new queue-reduction series today, maybe that helps a bit: https://lore.kernel.org/r/20251003-reduced-nr-ring-queues_3-v2-0-742ff1a8fc58@ddn.com

At a minimum each implementation still should take care of NUMA affinity; getting reasonable performance is hard if go-fuse has an issue with that.

Btw, I had seen your design a week or two ago when it was posted on Phoronix, and it looks like you need to know in FUSE_RELEASE if the application crashed. I think that is trivial, and we at DDN might also use it for the posix/S3 interface (patch follows - no need for extra steps with BPF).

>> So my question is:
>> Why is the FUSE client slower?
>> Did you analyse the bottlenecks?
>> Do these bottlenecks exist when using the FUSE-iouring channel?
>> Mind you that FUSE-iouring was developed by DDN developers specifically for the use case of very fast distributed filesystems in userspace.
>> ...
>> I mean, it sounds very cool from an engineering POV that you managed to remove unneeded constraints (a.k.a. the POSIX standard) and make a better product due to the simplifications, but that's exactly what userspace filesystems are for - for doing whatever you want ;)
>
> These are all good questions, and while we have not profiled the FUSE driver extensively, my impression is that relying critically on FUSE would be risky. There are some specific things that would be difficult today. For instance, FUSE does not expose `d_revalidate`, which means that dentries would be dropped needlessly in cases where we know they can be left in place.
Fuse sends LOOKUP in fuse_dentry_revalidate()? I.e. that is then just a userspace counter if the dentry was already looked up? For the upcoming FUSE_LOOKUP_HANDLE we can also make sure it takes an additional flag argument.

> There are also some more high-level FUSE design points which we were concerned by (although I'm not up to speed with the FUSE over io_uring work). One obvious concern is the fact that with FUSE it's much harder to minimize copying. FUSE passthrough helps, but it would have made the read path significantly more complex given the need to juggle file descriptors between user space and the kernel. Also, TernFS uses Reed-Solomon to recover from situations where some parts of a file are unreadable, and in that case we'd have had to fall back to a non-passthrough version. Another possible FUSE performance pitfall is that you're liable to be bottlenecked by the FUSE request queue, while if you work directly within the kernel you're not.

I agree on copying, but with io-uring I'm not sure about a request queue issue. At most, what's missing is a dynamic size of ring entries, which would reduce memory usage. And yeah, zero-copy would help as well, but we at DDN buffer access with erasure coding, compression, etc. - maybe possible at some point with bpf, but right now too hard.

> And of course before BPF we wouldn't have been able to track the nature of file closes to a degree where the FUSE driver can implement TernFS semantics correctly.

See above, patch follows.

Thanks,
Bernd

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 17:35 ` Bernd Schubert
@ 2025-10-03 18:18   ` Francesco Mazzoli
  2025-10-03 19:01     ` Francesco Mazzoli
  0 siblings, 1 reply; 8+ messages in thread

From: Francesco Mazzoli @ 2025-10-03 18:18 UTC (permalink / raw)
  To: Bernd Schubert, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi

On Fri, Oct 3, 2025, at 18:35, Bernd Schubert wrote:
> Btw, I had seen your design a week or two ago when it was posted on Phoronix, and it looks like you need to know in FUSE_RELEASE if the application crashed. I think that is trivial, and we at DDN might also use it for the posix/S3 interface (patch follows - no need for extra steps with BPF).

It's a bit more complicated than that, sadly. I'd imagine that FUSE_RELEASE will be called when the file refcount drops to zero, but this might very well be after we actually intended to link the file. Consider the case where a process forks, the child inherits the file descriptors (including open TernFS files), and then the parent close()s the file, intending to link it. You won't get FUSE_RELEASE because of the reference in the child, and the file won't be linked as a consequence. However, you can't link the file too eagerly either, for the reverse reason.

What you need is to track "intentional" closes, and you're going to end up relying on some heuristic, unless you use something like O_TMPFILE + linkat. In the kernel module we do that by tracking where the close came from and whether the close is being performed as part of the process winding down. We only link the file if the close is coming from the process that created the file and not as part of process winddown. This particular heuristic has worked well for us, and empirically it has been quite user friendly. In FUSE with BPF we do something arguably more principled: we mark a file as "explicitly closed" if it was closed through close(), and only link it after an explicit close has been recorded.

> Fuse sends LOOKUP in fuse_dentry_revalidate()? I.e. that is then just a userspace counter if the dentry was already looked up? For the upcoming FUSE_LOOKUP_HANDLE we can also make sure it takes an additional flag argument.

Oh, I had not realized that FUSE will return valid if the lookup is stable, thank you. You'll still pay the price of roundtripping through userspace though, and given how common lookups are, I'd imagine tons of spurious lookups into the FUSE server would still be unpleasant.

> I agree on copying, but with io-uring I'm not sure about a request queue issue. At most, what's missing is a dynamic size of ring entries, which would reduce memory usage. And yeah, zero-copy would help as well, but we at DDN buffer access with erasure coding, compression, etc. - maybe possible at some point with bpf, but right now too hard.

I'll have to take a look at FUSE + io_uring, won't comment on that until I'm familiar with it :).

Francesco

^ permalink raw reply [flat|nested] 8+ messages in thread
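The fork scenario described above is generic VFS reference-counting behaviour and is easy to reproduce. A small sketch (the mount path is hypothetical) showing why the parent's close() alone does not drop the last reference, and hence does not trigger FUSE_RELEASE:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* The parent creates and writes the file, intending to "finish" it. */
        int fd = open("/mnt/ternfs/out.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: inherits the fd and keeps it open for a while. */
            sleep(10);
            _exit(0);   /* implicit close; the last reference drops here,
                           and only now does the release path run */
        }

        write(fd, "data", 4);
        close(fd);      /* parent's "intentional" close: no release yet,
                           the child still holds the struct file */
        waitpid(pid, NULL, 0);
        return 0;
    }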
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 18:18 ` Francesco Mazzoli
@ 2025-10-03 19:01   ` Francesco Mazzoli
  0 siblings, 0 replies; 8+ messages in thread

From: Francesco Mazzoli @ 2025-10-03 19:01 UTC (permalink / raw)
  To: Bernd Schubert, Amir Goldstein
  Cc: linux-fsdevel, Christian Brauner, Darrick J. Wong, Miklos Szeredi

On Fri, Oct 3, 2025, at 19:18, Francesco Mazzoli wrote:
> > I agree on copying, but with io-uring I'm not sure about a request queue issue. At most, what's missing is a dynamic size of ring entries, which would reduce memory usage. And yeah, zero-copy would help as well, but we at DDN buffer access with erasure coding, compression, etc. - maybe possible at some point with bpf, but right now too hard.
>
> I'll have to take a look at FUSE + io_uring, won't comment on that until I'm familiar with it :).

Oh, one more point on copying: when reconstructing using Reed-Solomon, you want to read from and write to the page cache, both to fetch the pages you need for reconstruction if you already have them, and to store the additional pages you fetch. Again, I'd imagine this to be hard to do with FUSE in a zero-copy way.

All of this should not detract from the point that I'm sure a very performant FUSE TernFS driver can be written, but I'm not convinced it would be the better option all things considered.

Francesco

^ permalink raw reply [flat|nested] 8+ messages in thread
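For readers unfamiliar with the erasure recovery being discussed, here is a toy demonstration using a single XOR parity block. It is deliberately much simpler than the Reed-Solomon codes TernFS actually uses (which tolerate multiple losses), but the reconstruction idea is the same: recompute a missing block from the surviving blocks plus parity.

    #include <stdio.h>
    #include <string.h>

    #define NDATA 4
    #define BLOCK 8

    int main(void)
    {
        unsigned char data[NDATA][BLOCK] = {
            "block-0", "block-1", "block-2", "block-3"
        };
        unsigned char parity[BLOCK] = {0};

        /* Encode: parity = XOR of all data blocks. */
        for (int i = 0; i < NDATA; i++)
            for (int j = 0; j < BLOCK; j++)
                parity[j] ^= data[i][j];

        /* Simulate losing block 2. */
        unsigned char lost[BLOCK];
        memcpy(lost, data[2], BLOCK);
        memset(data[2], 0, BLOCK);

        /* Recover: XOR parity with the surviving data blocks. */
        unsigned char rec[BLOCK];
        memcpy(rec, parity, BLOCK);
        for (int i = 0; i < NDATA; i++)
            if (i != 2)
                for (int j = 0; j < BLOCK; j++)
                    rec[j] ^= data[i][j];

        printf("recovered: %s (%s)\n", rec,
               memcmp(rec, lost, BLOCK) == 0 ? "match" : "MISMATCH");
        return 0;
    }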
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-03 15:01 ` Francesco Mazzoli
  2025-10-03 17:35   ` Bernd Schubert
@ 2025-10-04  2:52   ` Theodore Ts'o
  2025-10-04  9:01     ` Francesco Mazzoli
  1 sibling, 1 reply; 8+ messages in thread

From: Theodore Ts'o @ 2025-10-04 2:52 UTC (permalink / raw)
  To: Francesco Mazzoli
  Cc: Amir Goldstein, linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert, Miklos Szeredi

On Fri, Oct 03, 2025 at 04:01:56PM +0100, Francesco Mazzoli wrote:
> > A codebase with only one major user is a red flag. I am sure that you and your colleagues are very talented, but if your employer decides to cut down on upstreaming budget, the kernel maintainers would be left with an effectively orphaned filesystem.

I'd go further than that. Expanding your user base is definitely a good thing, but see if you can also expand your developer community, so that some of your users find enough value that they are willing to contribute to the development of your file system. Perhaps there are some use cases which aren't important to you, so it's not something that you can justify pursuing, but perhaps it would be high value for some other company with a similar, but not identical, use case?

To do that, some recommendations:

*) Have good developer's documentation; not just how to start using it, but how to get started understanding the code base. That is, things like the layout of the code base, how to debug problems, etc. I see that you have documentation on how to run regression tests, which is great.

*) At the moment, it looks like your primary focus for the client is the Ubuntu LTS kernel. That makes sense, but if you are going for upstream inclusion, it might be useful to have a version of the codebase which is sync'ed to the upstream kernel, and then have an adaptation layer which allows the code to be compiled as a module on distribution kernels.

*) If you have a list of simple starter projects that you could hand off to someone who is interested, that would be useful. (For example, one such starter project might be adding dkms support for other distributions beyond Ubuntu, which might be useful for other potential users. Do you have a desire for more tests? In general, in my experience, most projects could always use more testing.)

Looking at the documentation, here are some notes:

* "We don't expect new directories to be created often, and files (or directories) to be moved between directories often." I *think* "don't expect" binds to both parts of the conjunction. So can you confirm that what was meant is "... nor do we expect files (or directories) to be moved frequently."

* If that's true, it means that you *do* expect that files and directories can be moved around. What are the consistency expectations when a file is renamed/moved? I assume that since clients might be scattered across the world, there is some period where different clients might have different views. Is there some kind of guarantee about when the eventual consistency will definitely be resolved?

* In the description of the filesystem data or metadata, there is no mention of whether there are checksums at rest or not. Given the requirements that there be protections against hard disk bitrot, I assume there would be -- but what is the granularity? Every 4092 bytes (as in GFS)? Every 1M? Every 4M? Are the checksums verified on the server when the data is read? Or by the client? Or both? What is the recovery path if the checksum doesn't verify?
* Some of the above are about the protocol, and that would be good to document. What, if any, authentication and authorization checking gets done? Is there any cryptographic protection for either encryption or data integrity? I've seen some companies who consider their LLM to be highly proprietary, to the extent that they want to use confidential compute VM's. Or if you are using the file system for training data, the training data might have PII.

> These are all good questions, and while we have not profiled the FUSE driver extensively...

There has been some really interesting work that Darrick Wong has been doing using the low-level fuse API. The low-level FUSE is Linux only, but using that with the fs-iomap patches, Darrick has managed to get basically equivalent performance for direct and buffered I/O comparing the native ext4 file system driver with his patched fuse2fs and low-level fuse fs-iomap implementation. His goal was to provide better security for untrusted containers that want to mount images that might be carefully, maliciously crafted, but it does demonstrate that if you aren't particularly worried about metadata-heavy workloads, and are primarily concerned about data plane performance, using the low-level (Linux-only) FUSE interface might work well for you.

> There are some specific things that would be difficult today. For instance, FUSE does not expose `d_revalidate`, which means that dentries would be dropped needlessly in cases where we know they can be left in place.

I believe the low-level FUSE interface does expose dentry revalidation.

> parts of a file are unreadable, and in that case we'd have had to fall back to a non-passthrough version.

Ah, you are using erasure codes; what were the design considerations of using RS as opposed to having multiple copies of data blocks? Or do you support both?

This would be great to document --- or maybe you might want to consider creating a "Design and Implementation of TernFS" paper and submitting to a conference like FAST. :-)

Cheers,

- Ted

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Mainlining the kernel module for TernFS, a distributed filesystem
  2025-10-04  2:52 ` Theodore Ts'o
@ 2025-10-04  9:01   ` Francesco Mazzoli
  0 siblings, 0 replies; 8+ messages in thread

From: Francesco Mazzoli @ 2025-10-04 9:01 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, linux-fsdevel, Christian Brauner, Darrick J. Wong, Bernd Schubert, Miklos Szeredi

On Sat, Oct 4, 2025, at 03:52, Theodore Ts'o wrote:
> To do that, some recommendations:
> ...

Thank you, this is all very useful.

> Looking at the documentation, here are some notes:
>
> * "We don't expect new directories to be created often, and files (or directories) to be moved between directories often." I *think* "don't expect" binds to both parts of the conjunction. So can you confirm that what was meant is "... nor do we expect files (or directories) to be moved frequently."

Your interpretation is correct.

> * If that's true, it means that you *do* expect that files and directories can be moved around. What are the consistency expectations when a file is renamed/moved? I assume that since clients might be scattered across the world, there is some period where different clients might have different views. Is there some kind of guarantee about when the eventual consistency will definitely be resolved?

While TernFS is geo-replicated, metadata is geo-replicated in a master-slave fashion: writes go through a single region, and writers in a given region are guaranteed to read their own writes. We have plans to move this to a master-master setup, but it hasn't been very urgent, since the metadata latency hit is usually hidden by the time it takes to write the actual files (which, as remarked, tend to be pretty big). That said, directory entries are also cached; we use 250ms, but it's configurable.

File contents, on the other hand, are written locally and replicated in both a push and a pull fashion. However, files are immutable, which means you never have an inconsistent view of file contents in different regions. See also the "Going global" section of the blog post: <https://www.xtxmarkets.com/tech/2025-ternfs/>.

> * In the description of the filesystem data or metadata, there is no mention of whether there are checksums at rest or not. Given the requirements that there be protections against hard disk bitrot, I assume there would be -- but what is the granularity? Every 4092 bytes (as in GFS)? Every 1M? Every 4M? Are the checksums verified on the server when the data is read? Or by the client? Or both? What is the recovery path if the checksum doesn't verify?

Some of this is explained in the blog post mentioned above. In short: file contents are checksummed both at a page level and at a higher boundary (we call these "spans"), and the CRCs at this higher boundary are cross-checked by the metadata services and the storage nodes. I've written two blog posts about these topics, see <https://mazzo.li/posts/mac-distributed-tx.html> and <https://mazzo.li/posts/rs-crc.html>. The metadata is also checksummed by way of RocksDB. Errors are recovered from using Reed-Solomon codes.

> * Some of the above are about the protocol, and that would be good to document. What, if any, authentication and authorization checking gets done? Is there any cryptographic protection for either encryption or data integrity? I've seen some companies who consider their LLM to be highly proprietary, to the extent that they want to use confidential compute VM's. Or if you are using the file system for training data, the training data might have PII.
There's no cryptographic protection or authentication in TernFS. We handle authentication at a different layer: we have filesystem gateways that expose only parts of the filesystem to less privileged users.

> There has been some really interesting work that Darrick Wong has been doing using the low-level fuse API. ...

One clear takeaway from this thread is that FUSE performance is a topic I don't know enough about. I'll have to explore the various novelties that you guys have brought up to bring me up to speed.

> I believe the low-level FUSE interface does expose dentry revalidation.

It doesn't expose it directly, but Bernd pointed out that it won't invalidate dentries if the lookup is stable, which is good enough.

> Ah, you are using erasure codes; what were the design considerations of using RS as opposed to having multiple copies of data blocks? Or do you support both?

We support both.

> This would be great to document --- or maybe you might want to consider creating a "Design and Implementation of TernFS" paper and submitting to a conference like FAST. :-)

The blog post was intended to be that kind of document, but we might consider a more detailed/academic publication!

Thanks,
Francesco

^ permalink raw reply [flat|nested] 8+ messages in thread
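On the page-level versus span-level CRCs discussed above: part of the appeal of CRCs here is that they compose, so a checksum for a larger span can be cross-checked from per-page checksums without re-reading the data. Below is a small illustration using zlib's crc32()/crc32_combine(); TernFS' actual scheme (see the rs-crc post linked above) differs in detail, so treat this purely as a sketch of the idea. Build with: cc crc_span.c -lz

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char page1[] = "first page of data ";
        const char page2[] = "second page of data";
        size_t len1 = strlen(page1), len2 = strlen(page2);

        /* Per-page CRCs, as a storage node might keep them. */
        uLong crc1 = crc32(0L, (const Bytef *)page1, len1);
        uLong crc2 = crc32(0L, (const Bytef *)page2, len2);

        /* Span CRC derived from the page CRCs alone. */
        uLong span_from_pages = crc32_combine(crc1, crc2, len2);

        /* Span CRC computed over the full contents, for comparison. */
        uLong span_direct = crc32(0L, (const Bytef *)page1, len1);
        span_direct = crc32(span_direct, (const Bytef *)page2, len2);

        printf("combined=%08lx direct=%08lx (%s)\n",
               span_from_pages, span_direct,
               span_from_pages == span_direct ? "match" : "MISMATCH");
        return 0;
    }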