* [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more @ 2026-02-02 13:51 ` Miklos Szeredi 2026-02-02 16:14 ` Amir Goldstein ` (3 more replies) 0 siblings, 4 replies; 79+ messages in thread
From: Miklos Szeredi @ 2026-02-02 13:51 UTC (permalink / raw)
To: f-pc, linux-fsdevel
Cc: Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer

I propose a session where various topics of interest could be discussed, including but not limited to the list below.

New features being proposed at various stages of readiness:

- fuse4fs: exporting the iomap interface to userspace

- famfs: export distributed memory

- zero copy for fuse-io-uring

- large folios

- file handles on the userspace API

- compound requests

- BPF scripts

How do these fit into the existing codebase?

Cleaner separation of layers:

- transport layer: /dev/fuse, io-uring, virtiofs

- filesystem layer: local fs, distributed fs

Introduce new version of cleaned-up API?

- remove async INIT

- no fixed ROOT_ID

- consolidate caching rules

- who's responsible for updating which metadata?

- remove legacy and problematic flags

- get rid of splice on /dev/fuse for new API version?

Unresolved issues:

- locked / writeback folios vs. reclaim / page migration

- strictlimiting vs. large folios

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi @ 2026-02-02 16:14 ` Amir Goldstein 2026-02-03 7:55 ` Miklos Szeredi 2026-02-03 10:15 ` Luis Henriques 2026-02-03 10:36 ` Amir Goldstein ` (2 subsequent siblings) 3 siblings, 2 replies; 79+ messages in thread From: Amir Goldstein @ 2026-02-02 16:14 UTC (permalink / raw) To: Miklos Szeredi Cc: linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc [Fixed lsf-pc address typo] On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > I propose a session where various topics of interest could be > discussed including but not limited to the below list > > New features being proposed at various stages of readiness: > > - fuse4fs: exporting the iomap interface to userspace > > - famfs: export distributed memory > > - zero copy for fuse-io-uring > > - large folios > > - file handles on the userspace API > > - compound requests > > - BPF scripts > > How do these fit into the existing codebase? > > Cleaner separation of layers: > > - transport layer: /dev/fuse, io-uring, viriofs > > - filesystem layer: local fs, distributed fs > > Introduce new version of cleaned up API? > > - remove async INIT > > - no fixed ROOT_ID > > - consolidate caching rules > > - who's responsible for updating which metadata? > > - remove legacy and problematic flags > > - get rid of splice on /dev/fuse for new API version? > > Unresolved issues: > > - locked / writeback folios vs. reclaim / page migration > > - strictlimiting vs. large folios All important topics which I am sure will be discussed on a FUSE BoF. I think that at least one question of interest to the wider fs audience is Can any of the above improvements be used to help phase out some of the old under maintained fs and reduce the burden on vfs maintainers? 
Thanks, Amir. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 16:14 ` Amir Goldstein @ 2026-02-03 7:55 ` Miklos Szeredi 2026-02-03 9:19 ` [Lsf-pc] " Jan Kara 2026-02-04 9:22 ` Joanne Koong 2026-02-03 10:15 ` Luis Henriques 1 sibling, 2 replies; 79+ messages in thread From: Miklos Szeredi @ 2026-02-03 7:55 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: > All important topics which I am sure will be discussed on a FUSE BoF. I see your point. Maybe the BPF one could be useful as a cross track discussion, though I'm not sure the fuse side of the design is mature enough for that. Joanne, you did some experiments with that, no? > I think that at least one question of interest to the wider fs audience is > > Can any of the above improvements be used to help phase out some > of the old under maintained fs and reduce the burden on vfs maintainers? I think the major show stopper is that nobody is going to put a major effort into porting unmaintained kernel filesystems to a different framework. Alternatively someone could implement a "VFS emulator" library. But keeping that in sync with the kernel, together with all the old fs would be an even greater burden... Thanks, Miklos ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 7:55 ` Miklos Szeredi @ 2026-02-03 9:19 ` Jan Kara 2026-02-03 10:31 ` Amir Goldstein 0 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-02-03 9:19 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Amir Goldstein, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

On Tue 03-02-26 08:55:26, Miklos Szeredi via Lsf-pc wrote:
> On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
> > I think that at least one question of interest to the wider fs audience is
> >
> > Can any of the above improvements be used to help phase out some
> > of the old under maintained fs and reduce the burden on vfs maintainers?
>
> I think the major show stopper is that nobody is going to put a major
> effort into porting unmaintained kernel filesystems to a different
> framework.

There's some interest from people doing vfs maintenance work (as it has potential to save their work) and it is actually a reasonable task for someone wanting to get acquainted with filesystem development work. So I think there are chances of some progress. For example there was some interest in doing this for minix. Of course we'll be sure only when it happens :)

> Alternatively someone could implement a "VFS emulator" library. But
> keeping that in sync with the kernel, together with all the old fs
> would be an even greater burden...

Full VFS emulator would be too much I think. Maybe some helper library to ease some tasks would be useful, but I think the time for coming up with libraries is when someone commits to actually doing some conversion.

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 9:19 ` [Lsf-pc] " Jan Kara @ 2026-02-03 10:31 ` Amir Goldstein 0 siblings, 0 replies; 79+ messages in thread From: Amir Goldstein @ 2026-02-03 10:31 UTC (permalink / raw) To: Jan Kara Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Tue, Feb 3, 2026 at 10:19 AM Jan Kara <jack@suse.cz> wrote: > > On Tue 03-02-26 08:55:26, Miklos Szeredi via Lsf-pc wrote: > > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: > > > I think that at least one question of interest to the wider fs audience is > > > > > > Can any of the above improvements be used to help phase out some > > > of the old under maintained fs and reduce the burden on vfs maintainers? > > > > I think the major show stopper is that nobody is going to put a major > > effort into porting unmaintained kernel filesystems to a different > > framework. > > There's some interest from people doing vfs maintenance work (as it has > potential to save their work) and it is actually a reasonable task for > someone wanting to get acquainted with filesystem development work. So I > think there are chances of some progress. For example there was some > interest in doing this for minix. Of course we'll be sure only when it > happens :) > > > Alternatively someone could implement a "VFS emulator" library. But > > keeping that in sync with the kernel, together with all the old fs > > would be an even greater burden... > > Full VFS emulator would be too much I think. Maybe some helper library to > ease some tasks would be useful but I think time for comming up with > libraries is when someone commits to actually doing some conversion. > I think that the concept of a VFS emulator is wrong to apply here. A VFS emulator would be needed for running the latest uptodate fs driver. 
If we want to fork a kernel driver at a point in time and make it into a FUSE server, we need a one-time conversion from the kernel/vfs API to the userspace/lowlevel FUSE API. LLMs are very good at doing this sort of mechanical conversion, and after the first few fs have been converted by developers, an LLM would learn how to do it better for the next fs.

The main challenges I see are verification and package maintenance. The conversion needs to be tested, so there needs to be a decent test suite. If an fs already has a progs/utils package, it would be natural if FUSE server code could be added to this package, but those packages are not always maintained.

We can map the most likely candidates that have decent test suites and a fairly maintained utils package for a start.

Thanks,
Amir.

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 7:55 ` Miklos Szeredi 2026-02-03 9:19 ` [Lsf-pc] " Jan Kara @ 2026-02-04 9:22 ` Joanne Koong 2026-02-04 10:37 ` Amir Goldstein ` (3 more replies) 1 sibling, 4 replies; 79+ messages in thread From: Joanne Koong @ 2026-02-04 9:22 UTC (permalink / raw) To: Miklos Szeredi Cc: Amir Goldstein, linux-fsdevel, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: > > > All important topics which I am sure will be discussed on a FUSE BoF. Two other items I'd like to add to the potential discussion list are: * leveraging io-uring multishot for batching fuse writeback and readahead requests, ie maximizing the throughput per roundtrip context switch [1] * settling how load distribution should be done for configurable queues. We came to a bit of a standstill on Bernd's patchset [2] and it would be great to finally get this resolved and the feature landed. imo configurable queues and incremental buffer consumption are the two main features needed to make fuse-over-io-uring more feasible on large-scale systems. > > I see your point. Maybe the BPF one could be useful as a cross track > discussion, though I'm not sure the fuse side of the design is mature > enough for that. Joanne, you did some experiments with that, no? The discussion on this was started in response [3] to Darrick's iomap containerization patchset. I have a prototype based on [4] I can get into reviewable shape this month or next, if there's interest in getting something concrete before May. I did a quick check with the bpf team a few days ago and confirmed with them that struct ops is the way to go for adding the hook point for fuse. 
For attaching the bpf progs to the fuse connection, going through the bpf link interface is the modern/preferred way of doing this.

Discussion-wise, imo what would be most useful to discuss in May on the fuse side is which other interception points would be most useful in fuse, and what the API interfaces we expose for those should look like (eg should these just take the in/out request structs already defined in the uapi, or expose more state information?). imo we should take an incremental approach and add interception points conservatively rather than liberally, on a per-need basis as use cases actually come up.

>
> > I think that at least one question of interest to the wider fs audience is
> >
> > Can any of the above improvements be used to help phase out some
> > of the old under maintained fs and reduce the burden on vfs maintainers?

I think it might be helpful to know ahead of time where the main hesitation lies. Is it performance? Maybe it'd be helpful if before May there was a prototype converting a simpler filesystem (Darrick and I were musing about fat maybe being a good one) and getting a sense of what the delta is between the native kernel implementation and a fuse-based version? In the past year fuse added a lot of new capabilities that improved performance by quite a bit, so I'm curious to see where the delta now lies. Or maybe the hesitation is something else entirely, in which case that's probably a conversation better left for May.
Thanks, Joanne [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z3mTdZdfe5rTukKOnU0y5dpM8aFTCqbctBWsa-S301TQ@mail.gmail.com/ [2] https://lore.kernel.org/linux-fsdevel/20251013-reduced-nr-ring-queues_3-v3-4-6d87c8aa31ae@ddn.com/t/#u [3] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z05QZmos90qmWtnWGF+Kb7rVziJ51UpuJ0O=A+6N1vrg@mail.gmail.com/t/#u [4] https://lore.kernel.org/linux-fsdevel/176169810144.1424854.11439355400009006946.stgit@frogsfrogsfrogs/T/#m4998d92f6210d50d0bf6760490689c029bda9231 > > I think the major show stopper is that nobody is going to put a major > effort into porting unmaintained kernel filesystems to a different > framework. > > Alternatively someone could implement a "VFS emulator" library. But > keeping that in sync with the kernel, together with all the old fs > would be an even greater burden... > > Thanks, > Miklos ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 9:22 ` Joanne Koong @ 2026-02-04 10:37 ` Amir Goldstein 2026-02-04 10:43 ` [Lsf-pc] " Jan Kara ` (2 subsequent siblings) 3 siblings, 0 replies; 79+ messages in thread From: Amir Goldstein @ 2026-02-04 10:37 UTC (permalink / raw) To: Joanne Koong Cc: Miklos Szeredi, linux-fsdevel, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Wed, Feb 4, 2026 at 10:22 AM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > All important topics which I am sure will be discussed on a FUSE BoF. > > Two other items I'd like to add to the potential discussion list are: > > * leveraging io-uring multishot for batching fuse writeback and > readahead requests, ie maximizing the throughput per roundtrip context > switch [1] > > * settling how load distribution should be done for configurable > queues. We came to a bit of a standstill on Bernd's patchset [2] and > it would be great to finally get this resolved and the feature landed. > imo configurable queues and incremental buffer consumption are the two > main features needed to make fuse-over-io-uring more feasible on > large-scale systems. > > > > > I see your point. Maybe the BPF one could be useful as a cross track > > discussion, though I'm not sure the fuse side of the design is mature > > enough for that. Joanne, you did some experiments with that, no? > > The discussion on this was started in response [3] to Darrick's iomap > containerization patchset. I have a prototype based on [4] I can get > into reviewable shape this month or next, if there's interest in > getting something concrete before May. 
I did a quick check with the > bpf team a few days ago and confirmed with them that struct ops is the > way to go for adding the hook point for fuse. For attaching the bpf > progs to the fuse connection, going through the bpf link interface is > the modern/preferred way of doing this. Discussion wise, imo on the > fuse side what would be most useful to discuss in May would be what > other interception points do we think would be the most useful in fuse > and what should the API interfaces that we expose for those look like > (eg should these just take the in/out request structs already defined > in the uapi? or expose more state information?). imo, we should take > an incremental approach and add interception points more > conservatively than liberally, on a per-need basis as use cases > actually come up. > > > > > > I think that at least one question of interest to the wider fs audience is > > > > > > Can any of the above improvements be used to help phase out some > > > of the old under maintained fs and reduce the burden on vfs maintainers? > > I think it might be helpful to know ahead of time where the main > hesitation lies. Is it performance? I think that for phasing out unmaintained filesystems performance is really the last concern if a concern at all (call it a nudge). > Maybe it'd be helpful if before > May there was a prototype converting a simpler filesystem (Darrick and > I were musing about fat maybe being a good one) and getting a sense of > what the delta is between the native kernel implementation and a > fuse-based version? In the past year fuse added a lot of new > capabilities that improved performance by quite a bit so I'm curious > to see where the delta now lies. Yeh, this is a fun exercise. fat could be a good candidate. I'd do it myself if I find the time. If anyone starts doing that maybe post a message here or in FUSE thread so we can avoid working on the same fs. 
> Or maybe the hesitation is something
> else entirely, in which case that's probably a conversation better
> left for May.
>

Besides testing and maintenance, which I already mentioned, and functionality (e.g. nfs export), there could be other concerns. fuse has some unique behaviors, but maybe those could be fixed for the sake of this sort of project.

I guess we will know better once we start experimenting and let actual users try the conversion. Finding those users could be a challenge in itself.

Thanks,
Amir.

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 9:22 ` Joanne Koong 2026-02-04 10:37 ` Amir Goldstein @ 2026-02-04 10:43 ` Jan Kara 2026-02-06 6:09 ` Darrick J. Wong 2026-02-04 20:47 ` Bernd Schubert 2026-02-06 6:26 ` Darrick J. Wong 3 siblings, 1 reply; 79+ messages in thread From: Jan Kara @ 2026-02-04 10:43 UTC (permalink / raw) To: Joanne Koong Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Wed 04-02-26 01:22:02, Joanne Koong wrote: > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > I think that at least one question of interest to the wider fs audience is > > > > > > Can any of the above improvements be used to help phase out some > > > of the old under maintained fs and reduce the burden on vfs maintainers? > > I think it might be helpful to know ahead of time where the main > hesitation lies. Is it performance? Maybe it'd be helpful if before > May there was a prototype converting a simpler filesystem (Darrick and > I were musing about fat maybe being a good one) and getting a sense of > what the delta is between the native kernel implementation and a > fuse-based version? In the past year fuse added a lot of new > capabilities that improved performance by quite a bit so I'm curious > to see where the delta now lies. Or maybe the hesitation is something > else entirely, in which case that's probably a conversation better > left for May. I'm not sure which filesystems Amir had exactly in mind but in my opinion FAT is used widely enough to not be a primary target of this effort. It would be rather filesystems like (random selection) bfs, adfs, vboxfs, minix, efs, freevxfs, etc. 
The user base of these is very small, testing is minimal if possible at all, and thus the value of keeping these in the kernel vs the effort they add to infrastructure changes (like folio conversions, iomap conversion, ...) is not very favorable. For these the biggest problem IMO is actually finding someone willing to invest into doing (and testing) the conversion. I don't think there are severe technical obstacles for most of them. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 10:43 ` [Lsf-pc] " Jan Kara @ 2026-02-06 6:09 ` Darrick J. Wong 2026-02-21 6:07 ` Demi Marie Obenour 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-06 6:09 UTC (permalink / raw) To: Jan Kara Cc: Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote: > On Wed 04-02-26 01:22:02, Joanne Koong wrote: > > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > I think that at least one question of interest to the wider fs audience is > > > > > > > > Can any of the above improvements be used to help phase out some > > > > of the old under maintained fs and reduce the burden on vfs maintainers? > > > > I think it might be helpful to know ahead of time where the main > > hesitation lies. Is it performance? Maybe it'd be helpful if before > > May there was a prototype converting a simpler filesystem (Darrick and > > I were musing about fat maybe being a good one) and getting a sense of > > what the delta is between the native kernel implementation and a > > fuse-based version? In the past year fuse added a lot of new > > capabilities that improved performance by quite a bit so I'm curious > > to see where the delta now lies. Or maybe the hesitation is something > > else entirely, in which case that's probably a conversation better > > left for May. > > I'm not sure which filesystems Amir had exactly in mind but in my opinion > FAT is used widely enough to not be a primary target of this effort. It OTOH the ESP and USB sticks needn't be high performance. <shrug> > would be rather filesystems like (random selection) bfs, adfs, vboxfs, > minix, efs, freevxfs, etc. 
The user base of these is very small, testing is > minimal if possible at all, and thus the value of keeping these in the > kernel vs the effort they add to infrastructure changes (like folio > conversions, iomap conversion, ...) is not very favorable. But yeah, these ones in the long tail are probably good targets. Though I think willy pointed out that the biggest barrier in his fs folio conversions was that many of them aren't testable (e.g. lack mkfs or fsck tools) which makes a legacy pivot that much harder. > For these the biggest problem IMO is actually finding someone willing to > invest into doing (and testing) the conversion. I don't think there are > severe technical obstacles for most of them. Yep, that's the biggest hurdle -- convincing managers to pay for a bunch of really old filesystems that are no longer mainstream. --D > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-06 6:09 ` Darrick J. Wong @ 2026-02-21 6:07 ` Demi Marie Obenour 2026-02-21 7:07 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-02-21 6:07 UTC (permalink / raw) To: Darrick J. Wong, Jan Kara Cc: Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 3153 bytes --] On 2/6/26 01:09, Darrick J. Wong wrote: > On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote: >> On Wed 04-02-26 01:22:02, Joanne Koong wrote: >>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: >>>>> I think that at least one question of interest to the wider fs audience is >>>>> >>>>> Can any of the above improvements be used to help phase out some >>>>> of the old under maintained fs and reduce the burden on vfs maintainers? >>> >>> I think it might be helpful to know ahead of time where the main >>> hesitation lies. Is it performance? Maybe it'd be helpful if before >>> May there was a prototype converting a simpler filesystem (Darrick and >>> I were musing about fat maybe being a good one) and getting a sense of >>> what the delta is between the native kernel implementation and a >>> fuse-based version? In the past year fuse added a lot of new >>> capabilities that improved performance by quite a bit so I'm curious >>> to see where the delta now lies. Or maybe the hesitation is something >>> else entirely, in which case that's probably a conversation better >>> left for May. >> >> I'm not sure which filesystems Amir had exactly in mind but in my opinion >> FAT is used widely enough to not be a primary target of this effort. It > > OTOH the ESP and USB sticks needn't be high performance. <shrug> Yup. Also USB sticks are not trusted. 
>> would be rather filesystems like (random selection) bfs, adfs, vboxfs, >> minix, efs, freevxfs, etc. The user base of these is very small, testing is >> minimal if possible at all, and thus the value of keeping these in the >> kernel vs the effort they add to infrastructure changes (like folio >> conversions, iomap conversion, ...) is not very favorable. > > But yeah, these ones in the long tail are probably good targets. Though > I think willy pointed out that the biggest barrier in his fs folio > conversions was that many of them aren't testable (e.g. lack mkfs or > fsck tools) which makes a legacy pivot that much harder. Does it make sense to keep these filesystems around? If all one cares about is getting the data off of the filesystem, libguestfs with an old kernel is sufficient. If the VFS changes introduced bugs, an old kernel might even be more reliable. If there is a way to make sure the FUSE port works, that would be great. However, if there is no way to test them, then maybe they should just be dropped. >> For these the biggest problem IMO is actually finding someone willing to >> invest into doing (and testing) the conversion. I don't think there are >> severe technical obstacles for most of them. > > Yep, that's the biggest hurdle -- convincing managers to pay for a bunch > of really old filesystems that are no longer mainstream. Could libguestfs with old guest kernels be a sufficient replacement? It's not going to be fast, but it's enough for data preservation. libguestfs supports "fixed appliances", which allow using whatever kernel one wants. They even provide some as precompiled binaries. -- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-21 6:07 ` Demi Marie Obenour @ 2026-02-21 7:07 ` Darrick J. Wong 2026-02-21 22:16 ` Demi Marie Obenour 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-21 7:07 UTC (permalink / raw) To: Demi Marie Obenour Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote: > On 2/6/26 01:09, Darrick J. Wong wrote: > > On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote: > >> On Wed 04-02-26 01:22:02, Joanne Koong wrote: > >>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > >>>>> I think that at least one question of interest to the wider fs audience is > >>>>> > >>>>> Can any of the above improvements be used to help phase out some > >>>>> of the old under maintained fs and reduce the burden on vfs maintainers? > >>> > >>> I think it might be helpful to know ahead of time where the main > >>> hesitation lies. Is it performance? Maybe it'd be helpful if before > >>> May there was a prototype converting a simpler filesystem (Darrick and > >>> I were musing about fat maybe being a good one) and getting a sense of > >>> what the delta is between the native kernel implementation and a > >>> fuse-based version? In the past year fuse added a lot of new > >>> capabilities that improved performance by quite a bit so I'm curious > >>> to see where the delta now lies. Or maybe the hesitation is something > >>> else entirely, in which case that's probably a conversation better > >>> left for May. > >> > >> I'm not sure which filesystems Amir had exactly in mind but in my opinion > >> FAT is used widely enough to not be a primary target of this effort. It > > > > OTOH the ESP and USB sticks needn't be high performance. <shrug> > > Yup. Also USB sticks are not trusted. 
> > >> would be rather filesystems like (random selection) bfs, adfs, vboxfs, > >> minix, efs, freevxfs, etc. The user base of these is very small, testing is > >> minimal if possible at all, and thus the value of keeping these in the > >> kernel vs the effort they add to infrastructure changes (like folio > >> conversions, iomap conversion, ...) is not very favorable. > > > > But yeah, these ones in the long tail are probably good targets. Though > > I think willy pointed out that the biggest barrier in his fs folio > > conversions was that many of them aren't testable (e.g. lack mkfs or > > fsck tools) which makes a legacy pivot that much harder. > > Does it make sense to keep these filesystems around? If all one cares > about is getting the data off of the filesystem, libguestfs with an > old kernel is sufficient. If the VFS changes introduced bugs, an old > kernel might even be more reliable. If there is a way to make sure > the FUSE port works, that would be great. However, if there is no > way to test them, then maybe they should just be dropped. > > >> For these the biggest problem IMO is actually finding someone willing to > >> invest into doing (and testing) the conversion. I don't think there are > >> severe technical obstacles for most of them. > > > > Yep, that's the biggest hurdle -- convincing managers to pay for a bunch > > of really old filesystems that are no longer mainstream. > > Could libguestfs with old guest kernels be a sufficient replacement? > It's not going to be fast, but it's enough for data preservation. In principle it might work, though I have questions about the quality of whatever's internally driving guestmount. Do you know how exactly libguestfs/guestmount accesses (say) an XFS filesystem? I'm curious because libxfs isn't a shared library, so either it would have to manipulate xfs_db (ugh!), run the kernel in a VM layer, or ... do they have their own implementation ala grub? 
--D > libguestfs supports "fixed appliances", which allow using whatever > kernel one wants. They even provide some as precompiled binaries. > -- > Sincerely, > Demi Marie Obenour (she/her/hers) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-21 7:07 ` Darrick J. Wong @ 2026-02-21 22:16 ` Demi Marie Obenour 2026-02-23 21:58 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-02-21 22:16 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 3850 bytes --] On 2/21/26 02:07, Darrick J. Wong wrote: > On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote: >> On 2/6/26 01:09, Darrick J. Wong wrote: >>> On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote: >>>> On Wed 04-02-26 01:22:02, Joanne Koong wrote: >>>>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: >>>>>>> I think that at least one question of interest to the wider fs audience is >>>>>>> >>>>>>> Can any of the above improvements be used to help phase out some >>>>>>> of the old under maintained fs and reduce the burden on vfs maintainers? >>>>> >>>>> I think it might be helpful to know ahead of time where the main >>>>> hesitation lies. Is it performance? Maybe it'd be helpful if before >>>>> May there was a prototype converting a simpler filesystem (Darrick and >>>>> I were musing about fat maybe being a good one) and getting a sense of >>>>> what the delta is between the native kernel implementation and a >>>>> fuse-based version? In the past year fuse added a lot of new >>>>> capabilities that improved performance by quite a bit so I'm curious >>>>> to see where the delta now lies. Or maybe the hesitation is something >>>>> else entirely, in which case that's probably a conversation better >>>>> left for May. >>>> >>>> I'm not sure which filesystems Amir had exactly in mind but in my opinion >>>> FAT is used widely enough to not be a primary target of this effort. 
It >>> >>> OTOH the ESP and USB sticks needn't be high performance. <shrug> >> >> Yup. Also USB sticks are not trusted. >> >>>> would be rather filesystems like (random selection) bfs, adfs, vboxfs, >>>> minix, efs, freevxfs, etc. The user base of these is very small, testing is >>>> minimal if possible at all, and thus the value of keeping these in the >>>> kernel vs the effort they add to infrastructure changes (like folio >>>> conversions, iomap conversion, ...) is not very favorable. >>> >>> But yeah, these ones in the long tail are probably good targets. Though >>> I think willy pointed out that the biggest barrier in his fs folio >>> conversions was that many of them aren't testable (e.g. lack mkfs or >>> fsck tools) which makes a legacy pivot that much harder. >> >> Does it make sense to keep these filesystems around? If all one cares >> about is getting the data off of the filesystem, libguestfs with an >> old kernel is sufficient. If the VFS changes introduced bugs, an old >> kernel might even be more reliable. If there is a way to make sure >> the FUSE port works, that would be great. However, if there is no >> way to test them, then maybe they should just be dropped. >> >>>> For these the biggest problem IMO is actually finding someone willing to >>>> invest into doing (and testing) the conversion. I don't think there are >>>> severe technical obstacles for most of them. >>> >>> Yep, that's the biggest hurdle -- convincing managers to pay for a bunch >>> of really old filesystems that are no longer mainstream. >> >> Could libguestfs with old guest kernels be a sufficient replacement? >> It's not going to be fast, but it's enough for data preservation. > > In principle it might work, though I have questions about the quality of > whatever's internally driving guestmount. > > Do you know how exactly libguestfs/guestmount accesses (say) an XFS > filesystem? 
I'm curious because libxfs isn't a shared library, so > either it would have to manipulate xfs_db (ugh!), run the kernel in a VM > layer, or ... do they have their own implementation ala grub? They run Linux in a VM. Using an old Linux would allow working with old filesystems that have since been removed. If KVM is available, the VM is (or at least should be) strongly sandboxed. -- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-21 22:16 ` Demi Marie Obenour @ 2026-02-23 21:58 ` Darrick J. Wong 0 siblings, 0 replies; 79+ messages in thread From: Darrick J. Wong @ 2026-02-23 21:58 UTC (permalink / raw) To: Demi Marie Obenour Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Sat, Feb 21, 2026 at 05:16:25PM -0500, Demi Marie Obenour wrote: > On 2/21/26 02:07, Darrick J. Wong wrote: > > On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote: > >> On 2/6/26 01:09, Darrick J. Wong wrote: > >>> On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote: > >>>> On Wed 04-02-26 01:22:02, Joanne Koong wrote: > >>>>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > >>>>>>> I think that at least one question of interest to the wider fs audience is > >>>>>>> > >>>>>>> Can any of the above improvements be used to help phase out some > >>>>>>> of the old under maintained fs and reduce the burden on vfs maintainers? > >>>>> > >>>>> I think it might be helpful to know ahead of time where the main > >>>>> hesitation lies. Is it performance? Maybe it'd be helpful if before > >>>>> May there was a prototype converting a simpler filesystem (Darrick and > >>>>> I were musing about fat maybe being a good one) and getting a sense of > >>>>> what the delta is between the native kernel implementation and a > >>>>> fuse-based version? In the past year fuse added a lot of new > >>>>> capabilities that improved performance by quite a bit so I'm curious > >>>>> to see where the delta now lies. Or maybe the hesitation is something > >>>>> else entirely, in which case that's probably a conversation better > >>>>> left for May. > >>>> > >>>> I'm not sure which filesystems Amir had exactly in mind but in my opinion > >>>> FAT is used widely enough to not be a primary target of this effort. 
It > >>> > >>> OTOH the ESP and USB sticks needn't be high performance. <shrug> > >> > >> Yup. Also USB sticks are not trusted. > >> > >>>> would be rather filesystems like (random selection) bfs, adfs, vboxfs, > >>>> minix, efs, freevxfs, etc. The user base of these is very small, testing is > >>>> minimal if possible at all, and thus the value of keeping these in the > >>>> kernel vs the effort they add to infrastructure changes (like folio > >>>> conversions, iomap conversion, ...) is not very favorable. > >>> > >>> But yeah, these ones in the long tail are probably good targets. Though > >>> I think willy pointed out that the biggest barrier in his fs folio > >>> conversions was that many of them aren't testable (e.g. lack mkfs or > >>> fsck tools) which makes a legacy pivot that much harder. > >> > >> Does it make sense to keep these filesystems around? If all one cares > >> about is getting the data off of the filesystem, libguestfs with an > >> old kernel is sufficient. If the VFS changes introduced bugs, an old > >> kernel might even be more reliable. If there is a way to make sure > >> the FUSE port works, that would be great. However, if there is no > >> way to test them, then maybe they should just be dropped. > >> > >>>> For these the biggest problem IMO is actually finding someone willing to > >>>> invest into doing (and testing) the conversion. I don't think there are > >>>> severe technical obstacles for most of them. > >>> > >>> Yep, that's the biggest hurdle -- convincing managers to pay for a bunch > >>> of really old filesystems that are no longer mainstream. > >> > >> Could libguestfs with old guest kernels be a sufficient replacement? > >> It's not going to be fast, but it's enough for data preservation. > > > > In principle it might work, though I have questions about the quality of > > whatever's internally driving guestmount. > > > > Do you know how exactly libguestfs/guestmount accesses (say) an XFS > > filesystem? 
I'm curious because libxfs isn't a shared library, so > > either it would have to manipulate xfs_db (ugh!), run the kernel in a VM > > layer, or ... do they have their own implementation ala grub? > > They run Linux in a VM. Using an old Linux would allow working with > old filesystems that have since been removed. If KVM is available, > the VM is (or at least should be) strongly sandboxed. /me tries out libguestfs and ... wow it's slow to start. It does seem to provide the isolation of the fs parsing code that I want, but the overheads are quite high. 500MB memory to mount a totally empty XFS filesystem, and 350MB of disk space to create a rootfs, ouch. --D > -- > Sincerely, > Demi Marie Obenour (she/her/hers) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 9:22 ` Joanne Koong 2026-02-04 10:37 ` Amir Goldstein 2026-02-04 10:43 ` [Lsf-pc] " Jan Kara @ 2026-02-04 20:47 ` Bernd Schubert 2026-02-06 6:26 ` Darrick J. Wong 3 siblings, 0 replies; 79+ messages in thread From: Bernd Schubert @ 2026-02-04 20:47 UTC (permalink / raw) To: Joanne Koong, Miklos Szeredi Cc: Amir Goldstein, linux-fsdevel, Darrick J . Wong, John Groves, Luis Henriques, Horst Birthelmer, lsf-pc On 2/4/26 10:22, Joanne Koong wrote: > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: >> >>> All important topics which I am sure will be discussed on a FUSE BoF. > > Two other items I'd like to add to the potential discussion list are: > > * leveraging io-uring multishot for batching fuse writeback and > readahead requests, ie maximizing the throughput per roundtrip context > switch [1] > > * settling how load distribution should be done for configurable > queues. We came to a bit of a standstill on Bernd's patchset [2] and > it would be great to finally get this resolved and the feature landed. > imo configurable queues and incremental buffer consumption are the two > main features needed to make fuse-over-io-uring more feasible on > large-scale systems. Coincidentally I looked into this today because we had totally imbalanced queues when this was activated - a slip-through in the queue assignment, which should be fixed in our branches. v4 basically has all your comments addressed; our branch(es) have 3 bug fixes now on top of what I thought would be v4 - unless I get pulled into other things again (which is unfortunately likely), v4 will come tomorrow. Thanks, Bernd ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 9:22 ` Joanne Koong ` (2 preceding siblings ...) 2026-02-04 20:47 ` Bernd Schubert @ 2026-02-06 6:26 ` Darrick J. Wong 3 siblings, 0 replies; 79+ messages in thread From: Darrick J. Wong @ 2026-02-06 6:26 UTC (permalink / raw) To: Joanne Koong Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc On Wed, Feb 04, 2026 at 01:22:02AM -0800, Joanne Koong wrote: > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > All important topics which I am sure will be discussed on a FUSE BoF. > > Two other items I'd like to add to the potential discussion list are: > > * leveraging io-uring multishot for batching fuse writeback and > readahead requests, ie maximizing the throughput per roundtrip context > switch [1] > > * settling how load distribution should be done for configurable > queues. We came to a bit of a standstill on Bernd's patchset [2] and > it would be great to finally get this resolved and the feature landed. > imo configurable queues and incremental buffer consumption are the two > main features needed to make fuse-over-io-uring more feasible on > large-scale systems. > > > > > I see your point. Maybe the BPF one could be useful as a cross track > > discussion, though I'm not sure the fuse side of the design is mature > > enough for that. Joanne, you did some experiments with that, no? > > The discussion on this was started in response [3] to Darrick's iomap > containerization patchset. I have a prototype based on [4] I can get > into reviewable shape this month or next, if there's interest in > getting something concrete before May. 
(Which we're working on :D) > I did a quick check with the > bpf team a few days ago and confirmed with them that struct ops is the > way to go for adding the hook point for fuse. For attaching the bpf > progs to the fuse connection, going through the bpf link interface is > the modern/preferred way of doing this. Yes. That conversion turned out not to be too difficult, but the resulting uapi is a little awkward because you have to pass the /dev/fuse fd in one of the structs that you pass to the bpf syscall, and then the bpf functions have to go find the struct file and use that to get back to the fuse_conn. > Discussion wise, imo on the > fuse side what would be most useful to discuss in May would be what > other interception points do we think would be the most useful in fuse > and what should the API interfaces that we expose for those look like > (eg should these just take the in/out request structs already defined > in the uapi? or expose more state information?). imo, we should take > an incremental approach and add interception points more > conservatively than liberally, on a per-need basis as use cases > actually come up. I would start by only allowing iomap_{begin,end,ioend} bpf functions, and only let them access the same in-arguments and outarg struct as the fuse server upcall would have. (I don't have any opinions on the fuse-bpf filtering stuff that was discussed at lsfmm 2023) > > > I think that at least one question of interest to the wider fs audience is > > > > > > Can any of the above improvements be used to help phase out some > > > of the old under maintained fs and reduce the burden on vfs maintainers? > > I think it might be helpful to know ahead of time where the main > hesitation lies. Is it performance? 
Maybe it'd be helpful if before > May there was a prototype converting a simpler filesystem (Darrick and > I were musing about fat maybe being a good one) and getting a sense of > what the delta is between the native kernel implementation and a > fuse-based version? In the past year fuse added a lot of new > capabilities that improved performance by quite a bit so I'm curious > to see where the delta now lies. Or maybe the hesitation is something > else entirely, in which case that's probably a conversation better > left for May. TBH I think it's mostly inertia because the current solutions aren't so bad that our managers are screaming at us to get 'er done now. :P That and conversion is a lot of work. --D > Thanks, > Joanne > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z3mTdZdfe5rTukKOnU0y5dpM8aFTCqbctBWsa-S301TQ@mail.gmail.com/ > > [2] https://lore.kernel.org/linux-fsdevel/20251013-reduced-nr-ring-queues_3-v3-4-6d87c8aa31ae@ddn.com/t/#u > > [3] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z05QZmos90qmWtnWGF+Kb7rVziJ51UpuJ0O=A+6N1vrg@mail.gmail.com/t/#u > > [4] https://lore.kernel.org/linux-fsdevel/176169810144.1424854.11439355400009006946.stgit@frogsfrogsfrogs/T/#m4998d92f6210d50d0bf6760490689c029bda9231 > > > > > I think the major show stopper is that nobody is going to put a major > > effort into porting unmaintained kernel filesystems to a different > > framework. > > > > Alternatively someone could implement a "VFS emulator" library. But > > keeping that in sync with the kernel, together with all the old fs > > would be an even greater burden... > > > > Thanks, > > Miklos > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 16:14 ` Amir Goldstein 2026-02-03 7:55 ` Miklos Szeredi @ 2026-02-03 10:15 ` Luis Henriques 2026-02-03 10:20 ` Amir Goldstein 1 sibling, 1 reply; 79+ messages in thread From: Luis Henriques @ 2026-02-03 10:15 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc On Mon, Feb 02 2026, Amir Goldstein wrote: > [Fixed lsf-pc address typo] > > On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> I propose a session where various topics of interest could be >> discussed including but not limited to the below list >> >> New features being proposed at various stages of readiness: >> >> - fuse4fs: exporting the iomap interface to userspace >> >> - famfs: export distributed memory >> >> - zero copy for fuse-io-uring >> >> - large folios >> >> - file handles on the userspace API >> >> - compound requests >> >> - BPF scripts >> >> How do these fit into the existing codebase? >> >> Cleaner separation of layers: >> >> - transport layer: /dev/fuse, io-uring, viriofs >> >> - filesystem layer: local fs, distributed fs >> >> Introduce new version of cleaned up API? >> >> - remove async INIT >> >> - no fixed ROOT_ID >> >> - consolidate caching rules >> >> - who's responsible for updating which metadata? >> >> - remove legacy and problematic flags >> >> - get rid of splice on /dev/fuse for new API version? >> >> Unresolved issues: >> >> - locked / writeback folios vs. reclaim / page migration >> >> - strictlimiting vs. large folios > > All important topics which I am sure will be discussed on a FUSE BoF. I wonder if the topic I proposed separately (on restarting FUSE servers) should also be merged into this list. It's already a very comprehensive list, so I'm not sure it's worth having a separate topic if most of it will (likely) be touched here already. 
What do you think? Cheers, -- Luís > I think that at least one question of interest to the wider fs audience is > > Can any of the above improvements be used to help phase out some > of the old under maintained fs and reduce the burden on vfs maintainers? > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 10:15 ` Luis Henriques @ 2026-02-03 10:20 ` Amir Goldstein 2026-02-03 10:38 ` Luis Henriques 2026-02-03 14:20 ` Christian Brauner 0 siblings, 2 replies; 79+ messages in thread From: Amir Goldstein @ 2026-02-03 10:20 UTC (permalink / raw) To: Luis Henriques Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc On Tue, Feb 3, 2026 at 11:15 AM Luis Henriques <luis@igalia.com> wrote: > > On Mon, Feb 02 2026, Amir Goldstein wrote: > > > [Fixed lsf-pc address typo] > > > > On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > >> > >> I propose a session where various topics of interest could be > >> discussed including but not limited to the below list > >> > >> New features being proposed at various stages of readiness: > >> > >> - fuse4fs: exporting the iomap interface to userspace > >> > >> - famfs: export distributed memory > >> > >> - zero copy for fuse-io-uring > >> > >> - large folios > >> > >> - file handles on the userspace API > >> > >> - compound requests > >> > >> - BPF scripts > >> > >> How do these fit into the existing codebase? > >> > >> Cleaner separation of layers: > >> > >> - transport layer: /dev/fuse, io-uring, viriofs > >> > >> - filesystem layer: local fs, distributed fs > >> > >> Introduce new version of cleaned up API? > >> > >> - remove async INIT > >> > >> - no fixed ROOT_ID > >> > >> - consolidate caching rules > >> > >> - who's responsible for updating which metadata? > >> > >> - remove legacy and problematic flags > >> > >> - get rid of splice on /dev/fuse for new API version? > >> > >> Unresolved issues: > >> > >> - locked / writeback folios vs. reclaim / page migration > >> > >> - strictlimiting vs. large folios > > > > All important topics which I am sure will be discussed on a FUSE BoF. 
> > I wonder if the topic I proposed separately (on restarting FUSE servers) > should also be merged into this list. It's already a very comprehensive > list, so I'm not sure it's worth having a separate topic if most of it > will (likely) be touched here already. > > What do you think? We are likely going to do a FUSE BoF, likely Wed afternoon, so we can have an internal schedule for that. Restartability and stable FUSE handles is one of the requirements to replace an existing fs if that fs is NFS exportable. Thanks, Amir. ^ permalink raw reply [flat|nested] 79+ messages in thread
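[Editorial note: a concrete illustration of what "stable handles" means at the syscall level. On an NFS-exportable local filesystem, name_to_handle_at(2) returns an opaque handle that stays valid across remounts; a restartable FUSE server would need to offer the same guarantee. The ctypes sketch below calls the real glibc wrapper, and the FileHandle layout mirrors struct file_handle from <fcntl.h>; how a revised FUSE protocol would let servers supply such handles is exactly the open question in this thread.]

```python
import ctypes
import os

# Sketch: fetch a persistent file handle the way NFS export code does.
# name_to_handle_at(2) is Linux-only; on other systems, or on
# filesystems without export support, this degrades to None.
libc = ctypes.CDLL(None, use_errno=True)
name_to_handle_at = getattr(libc, "name_to_handle_at", None)

AT_FDCWD = -100          # from <fcntl.h>
MAX_HANDLE_SZ = 128      # from <fcntl.h>

class FileHandle(ctypes.Structure):
    # mirrors struct file_handle from <fcntl.h>
    _fields_ = [
        ("handle_bytes", ctypes.c_uint32),
        ("handle_type", ctypes.c_int32),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]

def get_handle(path):
    """Return the opaque handle bytes for path, or None if unsupported."""
    if name_to_handle_at is None:
        return None
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    ret = name_to_handle_at(AT_FDCWD, path.encode(), ctypes.byref(fh),
                            ctypes.byref(mount_id), 0)
    if ret != 0:
        return None  # e.g. EOPNOTSUPP: fs has no export_operations
    return bytes(fh.f_handle[:fh.handle_bytes])

h = get_handle("/")
print("handle:", h.hex() if h else "not supported here")
```

A FUSE server that wants NFS export today has to guarantee the same property by itself: the handle it encodes must still resolve after the server restarts, which is what an explicit server-side declaration of export support would make negotiable.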
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 10:20 ` Amir Goldstein @ 2026-02-03 10:38 ` Luis Henriques 2026-02-03 14:20 ` Christian Brauner 1 sibling, 0 replies; 79+ messages in thread From: Luis Henriques @ 2026-02-03 10:38 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc On Tue, Feb 03 2026, Amir Goldstein wrote: > On Tue, Feb 3, 2026 at 11:15 AM Luis Henriques <luis@igalia.com> wrote: >> >> On Mon, Feb 02 2026, Amir Goldstein wrote: >> >> > [Fixed lsf-pc address typo] >> > >> > On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> >> >> I propose a session where various topics of interest could be >> >> discussed including but not limited to the below list >> >> >> >> New features being proposed at various stages of readiness: >> >> >> >> - fuse4fs: exporting the iomap interface to userspace >> >> >> >> - famfs: export distributed memory >> >> >> >> - zero copy for fuse-io-uring >> >> >> >> - large folios >> >> >> >> - file handles on the userspace API >> >> >> >> - compound requests >> >> >> >> - BPF scripts >> >> >> >> How do these fit into the existing codebase? >> >> >> >> Cleaner separation of layers: >> >> >> >> - transport layer: /dev/fuse, io-uring, viriofs >> >> >> >> - filesystem layer: local fs, distributed fs >> >> >> >> Introduce new version of cleaned up API? >> >> >> >> - remove async INIT >> >> >> >> - no fixed ROOT_ID >> >> >> >> - consolidate caching rules >> >> >> >> - who's responsible for updating which metadata? >> >> >> >> - remove legacy and problematic flags >> >> >> >> - get rid of splice on /dev/fuse for new API version? >> >> >> >> Unresolved issues: >> >> >> >> - locked / writeback folios vs. reclaim / page migration >> >> >> >> - strictlimiting vs. large folios >> > >> > All important topics which I am sure will be discussed on a FUSE BoF. 
>> >> I wonder if the topic I proposed separately (on restarting FUSE servers) >> should also be merged into this list. It's already a very comprehensive >> list, so I'm not sure it's worth having a separate topic if most of it >> will (likely) be touched here already. >> >> What do you think? > > We are likely going to do a FUSE BoF, likely Wed afternoon, > so we can have an internal schedule for that. > > Restartability and stable FUSE handles is one of the requirements > to replace an existing fs if that fs is NFS exportrable. Great, thanks Amir. So I'll just assume these topics will be pushed into the BoF. It looks like it will be a very interesting afternoon! ;-) Cheers, -- Luís ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-03 10:20 ` Amir Goldstein 2026-02-03 10:38 ` Luis Henriques @ 2026-02-03 14:20 ` Christian Brauner 1 sibling, 0 replies; 79+ messages in thread From: Christian Brauner @ 2026-02-03 14:20 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Henriques, Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc > Restartability and stable FUSE handles is one of the requirements I'd be interested in this. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi 2026-02-02 16:14 ` Amir Goldstein @ 2026-02-03 10:36 ` Amir Goldstein 2026-02-03 17:13 ` John Groves 2026-02-04 19:06 ` Darrick J. Wong 3 siblings, 0 replies; 79+ messages in thread From: Amir Goldstein @ 2026-02-03 10:36 UTC (permalink / raw) To: Miklos Szeredi Cc: f-pc, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > I propose a session where various topics of interest could be > discussed including but not limited to the below list > > New features being proposed at various stages of readiness: > > - fuse4fs: exporting the iomap interface to userspace > > - famfs: export distributed memory > > - zero copy for fuse-io-uring > > - large folios > > - file handles on the userspace API > > - compound requests > > - BPF scripts > > How do these fit into the existing codebase? > > Cleaner separation of layers: > > - transport layer: /dev/fuse, io-uring, viriofs > > - filesystem layer: local fs, distributed fs > > Introduce new version of cleaned up API? > > - remove async INIT > > - no fixed ROOT_ID > > - consolidate caching rules > > - who's responsible for updating which metadata? > > - remove legacy and problematic flags > - Let server explicitly declare NFS export support We could couple that with LOOKUP_HANDLE, because I think there are very few servers out there that truly provide stable NFS handles with current FUSE protocol. Thanks, Amir. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi 2026-02-02 16:14 ` Amir Goldstein 2026-02-03 10:36 ` Amir Goldstein @ 2026-02-03 17:13 ` John Groves 2026-02-04 19:06 ` Darrick J. Wong 3 siblings, 0 replies; 79+ messages in thread From: John Groves @ 2026-02-03 17:13 UTC (permalink / raw) To: Miklos Szeredi Cc: f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On 26/02/02 02:51PM, Miklos Szeredi wrote: > I propose a session where various topics of interest could be > discussed including but not limited to the below list > > New features being proposed at various stages of readiness: > > - fuse4fs: exporting the iomap interface to userspace > > - famfs: export distributed memory I plan to attend, and have been on the fence about whether a proper famfs session is needed. I'm open to ideas on that, but would certainly participate in this sort of overview session too. JG > > - zero copy for fuse-io-uring > > - large folios > > - file handles on the userspace API > > - compound requests > > - BPF scripts > > How do these fit into the existing codebase? > > Cleaner separation of layers: > > - transport layer: /dev/fuse, io-uring, viriofs > > - filesystem layer: local fs, distributed fs > > Introduce new version of cleaned up API? > > - remove async INIT > > - no fixed ROOT_ID > > - consolidate caching rules > > - who's responsible for updating which metadata? > > - remove legacy and problematic flags > > - get rid of splice on /dev/fuse for new API version? > > Unresolved issues: > > - locked / writeback folios vs. reclaim / page migration > > - strictlimiting vs. large folios ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi ` (2 preceding siblings ...) 2026-02-03 17:13 ` John Groves @ 2026-02-04 19:06 ` Darrick J. Wong 2026-02-04 19:38 ` Horst Birthelmer ` (4 more replies) 3 siblings, 5 replies; 79+ messages in thread From: Darrick J. Wong @ 2026-02-04 19:06 UTC (permalink / raw) To: Miklos Szeredi Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: > I propose a session where various topics of interest could be > discussed including but not limited to the below list > > New features being proposed at various stages of readiness: > > - fuse4fs: exporting the iomap interface to userspace FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy working on xfs_healer, but I was planning to repost the patchbomb with many many cleanups and reorganizations (thanks Joanne!) as soon as possible after Linus tags 7.0-rc1. I don't think LSFMM is a good venue for discussing a gigantic pile of code, because (IMO) LSF is better spent either (a) retrying in person to reach consensus on things that we couldn't do online; or (b) discussing roadmaps and/or people problems. In other words, I'd rather use in-person time to go through broader topics that affect multiple people, and the mailing lists for detailed examination of a large body of text. However -- do you have questions about the design? That could be a good topic for email /and/ for a face to face meeting. Though I strongly suspect that there are so many other sub-topics that fuse-iomap could eat up an entire afternoon at LSFMM: 0 How do we convince $managers to spend money on porting filesystems to fuse? Even if they use the regular slow mode? 1 What's the process for merging all the code changes into libfuse? 
The iomap parts are pretty straightforward because libfuse passes the request/reply straight through to fuse server, but... 2 ...the fuse service container part involves a bunch of architecture shifts to libfuse. First you need a new mount helper to connect to a unix socket to start the service, pass some resources (fds and mount options) through the unix socket to the service. Obviously that requires new library code for a fuse server to see the unix socket and request those resources. After that you also need to define a systemd service file that stands up the appropriate sandboxing. I've not written examples, but that needs to be in the final product. 3 What tooling changes do we need to make to /sbin/mount so that it can discover fuse-service-container support and the caller's preferences in using the f-s-c vs. the kernel and whatnot? Do we add another weird x-foo-bar "mount option" so that preferences may be specified explicitly? 4 For default situations, where do we make policy about when to use f-s-c and when do we allow use of the kernel driver? I would guess that anything in /etc/fstab could use the kernel driver, and everything else should use a fuse container if possible. For unprivileged non-root-ns mounts I think we'd only allow the container? <shrug> If we made progress on merging the kernel code in the next three months, does that clear the way for discussions of 2-4 at LSF? Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's in Vancouver BC, which could be a good venue to talk to the DE people. > - famfs: export distributed memory This has been, uh, hanging out for an extraordinarily long time. > - zero copy for fuse-io-uring > > - large folios > > - file handles on the userspace API (also all that restart stuff, but I think that was already proposed) > - compound requests > > - BPF scripts Is this an extension of the fuse-bpf filtering discussion that happened in 2023? 
(I wondered why you wouldn't just do bpf hooks in the vfs itself, but maybe hch already NAKed that?) As for fuse-iomap -- this week Joanne and I have been working on making it so that fuse servers can upload ->iomap_{begin,end,ioend} functions into the kernel as BPF programs to avoid server upcalls. This might be a better way to handle the repeating-pattern-iomapping pattern that seems to exist in famfs than hardcoding things in yet another "upload iomap mappings" fuse request. (Yes I see you FUSE_SETUPMAPPING...) > How do these fit into the existing codebase? > > Cleaner separation of layers: > > - transport layer: /dev/fuse, io-uring, viriofs I've noticed that each thread in the libfuse uring backend collects a pile of CQEs and processes them linearly. So if it receives 5 CQEs and the first request takes 30 seconds, the other four just get stuck in line...? > - filesystem layer: local fs, distributed fs <nod> > Introduce new version of cleaned up API? > > - remove async INIT > > - no fixed ROOT_ID Can we just merge this? https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/ > - consolidate caching rules > > - who's responsible for updating which metadata? These two seem like a good combined session -- "who owns what file metadata?" > - remove legacy and problematic flags > > - get rid of splice on /dev/fuse for new API version? > > Unresolved issues: > > - locked / writeback folios vs. reclaim / page migration > > - strictlimiting vs. large folios /me has no idea about these last four. --D ^ permalink raw reply [flat|nested] 79+ messages in thread
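[Editorial note: the resource handoff Darrick describes in item 2 is, at bottom, SCM_RIGHTS fd passing over a unix socket. Below is a toy sketch of just that mechanism; the option string and "message format" are invented for illustration, since no such libfuse protocol exists yet, while socket.send_fds()/recv_fds() are the real Python 3.9+ wrappers around sendmsg/recvmsg with SCM_RIGHTS.]

```python
import os
import socket

# One end plays the "mount helper", the other the fuse server;
# a socketpair stands in for the named unix socket a real helper
# would connect to.
helper, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Mount-helper side: open a resource fd and send it with the options.
r, w = os.pipe()                     # stand-in for e.g. an opened device
os.write(w, b"hello from the mount helper")
os.close(w)
socket.send_fds(helper, [b"opts=rw,noatime"], [r])
os.close(r)  # kernel already holds a reference for the in-flight message

# Fuse-server side: receive the options plus a duplicated fd.
msg, fds, _flags, _addr = socket.recv_fds(server, 1024, maxfds=1)
data = os.read(fds[0], 1024)
print(msg.decode(), "/", data.decode())
# -> opts=rw,noatime / hello from the mount helper

os.close(fds[0])
helper.close()
server.close()
```

The architectural work is everything around this primitive: naming the socket, defining which resources a server may request, and the systemd unit that sandboxes the receiving side.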
* Re: Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 19:06 ` Darrick J. Wong @ 2026-02-04 19:38 ` Horst Birthelmer 2026-02-04 20:58 ` Bernd Schubert ` (3 subsequent siblings) 4 siblings, 0 replies; 79+ messages in thread From: Horst Birthelmer @ 2026-02-04 19:38 UTC (permalink / raw) To: Darrick J. Wong Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques On Wed, Feb 04, 2026 at 11:06:49AM -0800, Darrick J. Wong wrote: > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: > > I propose a session where various topics of interest could be > > discussed including but not limited to the below list > > > > New features being proposed at various stages of readiness: > > > > - fuse4fs: exporting the iomap interface to userspace > > FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy > working on xfs_healer, but I was planning to repost the patchbomb with > many many cleanups and reorganizations (thanks Joanne!) as soon as > possible after Linus tags 7.0-rc1. > > I don't think LSFMM is a good venue for discussing a gigantic pile of > code, because (IMO) LSF is better spent either (a) retrying in person to > reach consensus on things that we couldn't do online; or (b) discussing > roadmaps and/or people problems. In other words, I'd rather use > in-person time to go through broader topics that affect multiple people, > and the mailing lists for detailed examination of a large body of text. > > However -- do you have questions about the design? That could be a good > topic for email /and/ for a face to face meeting. Though I strongly > suspect that there are so many other sub-topics that fuse-iomap could > eat up an entire afternoon at LSFMM: > > 0 How do we convince $managers to spend money on porting filesystems > to fuse? Even if they use the regular slow mode? > > 1 What's the process for merging all the code changes into libfuse? 
> The iomap parts are pretty straightforward because libfuse passes > the request/reply straight through to fuse server, but... > Just convince Bernd ... ;-) > 2 ...the fuse service container part involves a bunch of architecture > shifts to libfuse. First you need a new mount helper to connect to > a unix socket to start the service, pass some resources (fds and > mount options) through the unix socket to the service. Obviously > that requires new library code for a fuse server to see the unix > socket and request those resources. After that you also need to > define a systemd service file that stands up the appropriate > sandboxing. I've not written examples, but that needs to be in the > final product. > This really sounds like a good topic for an afternoon; in person, the bandwidth for passing ideas is higher. I'd be really interested in what those architectural shifts are. It is clearly a lot more than the passage above covers. > > --D Looking forward to seeing you there. Horst ^ permalink raw reply [flat|nested] 79+ messages in thread
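The resource handoff sketched in item 2 above — a mount helper passing open fds plus mount options to the fuse server over a unix socket — is standard SCM_RIGHTS ancillary-data passing. A minimal sketch of the idea, assuming nothing about the eventual libfuse API (the function names and option format here are purely illustrative):

```python
import json
import os
import socket

def send_mount_resources(sock, fds, options):
    # Helper side: pass open fds (e.g. the /dev/fuse fd, a mountpoint
    # dirfd) plus a mount-option blob in one message; the fds travel
    # as SCM_RIGHTS ancillary data.
    payload = json.dumps(options).encode()
    socket.send_fds(sock, [payload], fds)

def recv_mount_resources(sock):
    # Server side: receive the duplicated fds and the option blob.
    msg, fds, _flags, _addr = socket.recv_fds(sock, 4096, maxfds=8)
    return fds, json.loads(msg)

if __name__ == "__main__":
    helper, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()  # stand-in for the real /dev/fuse fd
    send_mount_resources(helper, [r], {"subtype": "demofs", "ro": True})
    fds, opts = recv_mount_resources(server)
    os.write(w, b"hello")
    print(os.read(fds[0], 5), opts["subtype"])  # the received fd is live
```

(socket.send_fds/recv_fds need Python 3.9+; the C equivalent is the usual sendmsg/recvmsg with an SCM_RIGHTS cmsg.)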
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 19:06 ` Darrick J. Wong 2026-02-04 19:38 ` Horst Birthelmer @ 2026-02-04 20:58 ` Bernd Schubert 2026-02-06 5:47 ` Darrick J. Wong 2026-02-04 22:50 ` Gao Xiang ` (2 subsequent siblings) 4 siblings, 1 reply; 79+ messages in thread From: Bernd Schubert @ 2026-02-04 20:58 UTC (permalink / raw) To: Darrick J. Wong, Miklos Szeredi Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Amir Goldstein, Luis Henriques, Horst Birthelmer On 2/4/26 20:06, Darrick J. Wong wrote: > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: >> I propose a session where various topics of interest could be >> discussed including but not limited to the below list >> >> New features being proposed at various stages of readiness: >> >> - fuse4fs: exporting the iomap interface to userspace > > FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy > working on xfs_healer, but I was planning to repost the patchbomb with > many many cleanups and reorganizations (thanks Joanne!) as soon as > possible after Linus tags 7.0-rc1. > > I don't think LSFMM is a good venue for discussing a gigantic pile of > code, because (IMO) LSF is better spent either (a) retrying in person to > reach consensus on things that we couldn't do online; or (b) discussing > roadmaps and/or people problems. In other words, I'd rather use > in-person time to go through broader topics that affect multiple people, > and the mailing lists for detailed examination of a large body of text. > > However -- do you have questions about the design? That could be a good > topic for email /and/ for a face to face meeting. Though I strongly > suspect that there are so many other sub-topics that fuse-iomap could > eat up an entire afternoon at LSFMM: > > 0 How do we convince $managers to spend money on porting filesystems > to fuse? Even if they use the regular slow mode? 
> > 1 What's the process for merging all the code changes into libfuse? > The iomap parts are pretty straightforward because libfuse passes > the request/reply straight through to fuse server, but... To be honest, I'm rather lost with your patch bomb - in which order do I need to review what? And what can be merged without the rest? Regarding libfuse patches - certainly helpful if you also post them here, but I don't want to create PRs out of your series, which then might fail the PR tests and I would have to fix them on my own ;) So the right order is to create libfuse PRs, let the tests run, let everyone review here or via PR and then it gets merged. > > 2 ...the fuse service container part involves a bunch of architecture > shifts to libfuse. First you need a new mount helper to connect to > a unix socket to start the service, pass some resources (fds and > mount options) through the unix socket to the service. Obviously > that requires new library code for a fuse server to see the unix > socket and request those resources. After that you also need to > define a systemd service file that stands up the appropriate > sandboxing. I've not written examples, but that needs to be in the > final product. > > 3 What tooling changes to we need to make to /sbin/mount so that it > can discover fuse-service-container support and the caller's > preferences in using the f-s-c vs. the kernel and whatnot? Do we > add another weird x-foo-bar "mount option" so that preferences may > be specified explicitly? > > 4 For defaults situations, where do we make policy about when to use > f-s-c and when do we allow use of the kernel driver? I would guess > that anything in /etc/fstab could use the kernel driver, and > everything else should use a fuse container if possible. For > unprivileged non-root-ns mounts I think we'd only allow the > container? > > <shrug> If we made progress on merging the kernel code in the next three > months, does that clear the way for discussions of 2-4 at LSF?
> > Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's > in Vancouver BC, which could be a good venu to talk to the DE people. > >> - famfs: export distributed memory > > This has been, uh, hanging out for an extraordinarily long time. > >> - zero copy for fuse-io-uring >> >> - large folios >> >> - file handles on the userspace API > > (also all that restart stuff, but I think that was already proposed) > >> - compound requests >> >> - BPF scripts > > Is this an extension of the fuse-bpf filtering discussion that happened > in 2023? (I wondered why you wouldn't just do bpf hooks in the vfs > itself, but maybe hch already NAKed that?) > > As for fuse-iomap -- this week Joanne and I have been working on making > it so that fuse servers can upload ->iomap_{begin,end,ioend} functions > into the kernel as BPF programs to avoid server upcalls. This might be > a better way to handle the repeating-pattern-iomapping pattern that > seems to exist in famfs than hardcoding things in yet another "upload > iomap mappings" fuse request. > > (Yes I see you FUSE_SETUPMAPPING...) > >> How do these fit into the existing codebase? >> >> Cleaner separation of layers: >> >> - transport layer: /dev/fuse, io-uring, viriofs > > I've noticed that each thread in the libfuse uring backend collects a > pile of CQEs and processes them linearly. So if it receives 5 CQEs and > the first request takes 30 seconds, the other four just get stuck in > line...? I'm certainly open for suggestions and patches :) At DDN the queues are polled from reactors (co-routine line); that additional libfuse API will never go public, but I definitely want to finish it and, if possible, implement a new API before I leave (less than 2 months left). We had a bit of discussion with Stefan Hajnoczi about that around last March, but I never even came close to that task the whole year. > >> - filesystem layer: local fs, distributed fs > > <nod> > >> Introduce new version of cleaned up API?
>> >> - remove async INIT >> >> - no fixed ROOT_ID > > Can we just merge this? > https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/ Could you create a libfuse PR please? Thanks, Bernd ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 20:58 ` Bernd Schubert @ 2026-02-06 5:47 ` Darrick J. Wong 0 siblings, 0 replies; 79+ messages in thread From: Darrick J. Wong @ 2026-02-06 5:47 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves, Amir Goldstein, Luis Henriques, Horst Birthelmer On Wed, Feb 04, 2026 at 09:58:51PM +0100, Bernd Schubert wrote: > > > On 2/4/26 20:06, Darrick J. Wong wrote: > > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: > >> I propose a session where various topics of interest could be > >> discussed including but not limited to the below list > >> > >> New features being proposed at various stages of readiness: > >> > >> - fuse4fs: exporting the iomap interface to userspace > > > > FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy > > working on xfs_healer, but I was planning to repost the patchbomb with > > many many cleanups and reorganizations (thanks Joanne!) as soon as > > possible after Linus tags 7.0-rc1. > > > > I don't think LSFMM is a good venue for discussing a gigantic pile of > > code, because (IMO) LSF is better spent either (a) retrying in person to > > reach consensus on things that we couldn't do online; or (b) discussing > > roadmaps and/or people problems. In other words, I'd rather use > > in-person time to go through broader topics that affect multiple people, > > and the mailing lists for detailed examination of a large body of text. > > > > However -- do you have questions about the design? That could be a good > > topic for email /and/ for a face to face meeting. Though I strongly > > suspect that there are so many other sub-topics that fuse-iomap could > > eat up an entire afternoon at LSFMM: > > > > 0 How do we convince $managers to spend money on porting filesystems > > to fuse? Even if they use the regular slow mode? 
> > > > 1 What's the process for merging all the code changes into libfuse? > > The iomap parts are pretty straightforward because libfuse passes > > the request/reply straight through to fuse server, but... > > To be honest, I'm rather lost with your patch bomb - in which order do I > need to review what? And what can be merged without? If there are any fixes they're usually at the beginning. At the moment you actually /have/ merged everything that can be. :) The rest relies on kernel patches that aren't upstream. > Regarding libfuse patches - certainly helpful if you also post them > here, but I don't want to create PRs out of your series, which then > might fail the PR tests and I would have to fix it on my own ;) > So the right order is to create libfuse PRs, let the test run, let > everyone review here or via PR and then it gets merged. I can generate pull requests for the libfuse things, no problem. The hard question is, can your CI system build a kernel with the relevant patches or do we have to wait until Miklos merges them into upstream? > > 2 ...the fuse service container part involves a bunch of architecture > > shifts to libfuse. First you need a new mount helper to connect to > > a unix socket to start the service, pass some resources (fds and > > mount options) through the unix socket to the service. Obviously > > that requires new library code for a fuse server to see the unix > > socket and request those resources. After that you also need to > > define a systemd service file that stands up the appropriate > > sandboxing. I've not written examples, but that needs to be in the > > final product. > > > > 3 What tooling changes to we need to make to /sbin/mount so that it > > can discover fuse-service-container support and the caller's > > preferences in using the f-s-c vs. the kernel and whatnot? Do we > > add another weird x-foo-bar "mount option" so that preferences may > > be specified explicitly? 
> > > > 4 For defaults situations, where do we make policy about when to use > > f-s-c and when do we allow use of the kernel driver? I would guess > > that anything in /etc/fstab could use the kernel driver, and > > everything else should use a fuse container if possible. For > > unprivileged non-root-ns mounts I think we'd only allow the > > container? > > > > <shrug> If we made progress on merging the kernel code in the next three > > months, does that clear the way for discussions of 2-4 at LSF? > > > > Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's > > in Vancouver BC, which could be a good venu to talk to the DE people. > > > >> - famfs: export distributed memory > > > > This has been, uh, hanging out for an extraordinarily long time. > > > >> - zero copy for fuse-io-uring > >> > >> - large folios > >> > >> - file handles on the userspace API > > > > (also all that restart stuff, but I think that was already proposed) > > > >> - compound requests > >> > >> - BPF scripts > > > > Is this an extension of the fuse-bpf filtering discussion that happened > > in 2023? (I wondered why you wouldn't just do bpf hooks in the vfs > > itself, but maybe hch already NAKed that?) > > > > As for fuse-iomap -- this week Joanne and I have been working on making > > it so that fuse servers can upload ->iomap_{begin,end,ioend} functions > > into the kernel as BPF programs to avoid server upcalls. This might be > > a better way to handle the repeating-pattern-iomapping pattern that > > seems to exist in famfs than hardcoding things in yet another "upload > > iomap mappings" fuse request. > > > > (Yes I see you FUSE_SETUPMAPPING...) > > > >> How do these fit into the existing codebase? > >> > >> Cleaner separation of layers: > >> > >> - transport layer: /dev/fuse, io-uring, viriofs > > > > I've noticed that each thread in the libfuse uring backend collects a > > pile of CQEs and processes them linearly. 
So if it receives 5 CQEs and > > the first request takes 30 seconds, the other four just get stuck in > > line...? > > I'm certainly open for suggestions and patches :) The only things I can think of are (a) a pool of threads pinned to the same CPU as the CQE reader, but I don't think that's going to be good for low latency; (b) as long as the request is still in libfuse, maybe it can decide "I'm taking too long" and spawn a pthread to hand the request to; or (c) can other threads steal a CQE to work on if they go idle? That might only work for FUSE_DESTROY though, since there won't be new requests issued after that. For the particular problems I was seeing with FUSE_DESTROY I picked (b). https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/commit/?h=djwong-wtf&id=e2784aaa0bc0d396fe1c75b826fc140366f576bc But that also only happens if your kernel has https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-fixes&id=a9df193a5913e747d8c2830197c4f36d56f42e4c so there's no action to be taken for libfuse right now. > At DDN the queues are polled from reactors (co-routine line), that > additional libfuse API will never go public, but I definitely want to > finish and if possible implement a new API before I leave (less than 2 > months left). We had a bit of discussion with Stefan Hajnoczi about that > around last March, but I never came even close that task the whole year. <nod> > > > >> - filesystem layer: local fs, distributed fs > > > > <nod> > > > >> Introduce new version of cleaned up API? > >> > >> - remove async INIT > >> > >> - no fixed ROOT_ID > > > > Can we just merge this? > > https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/ > > Could you create a libfuse PR please? Well we'd have to get the kernel patch merged first, and (AFAIK) it's not queued up for Linux 7.0. --D > > Thanks, > Bernd > ^ permalink raw reply [flat|nested] 79+ messages in thread
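Darrick's option (b) above — let the library notice a request it knows will block and push it onto its own thread so the rest of the batch keeps flowing — can be sketched in a few lines. This is purely an illustration of the dispatch pattern (in Python, not libfuse's actual C queueing; `is_slow` stands in for whatever "I'm taking too long" heuristic the library would use):

```python
import queue
import threading
import time

def dispatch(requests, is_slow, handle, done):
    # Process a batch of completions in arrival order, but hand any
    # request known to block (e.g. a FUSE_DESTROY that must flush) to
    # its own thread so the remaining entries don't queue behind it.
    workers = []
    for req in requests:
        if is_slow(req):
            t = threading.Thread(target=lambda r=req: done.put(handle(r)))
            t.start()
            workers.append(t)
        else:
            done.put(handle(req))
    for t in workers:
        t.join()

if __name__ == "__main__":
    results = queue.Queue()

    def handle(req):
        if req == "slow":
            time.sleep(0.2)  # simulate a long-running request
        return req

    dispatch(["slow", "a", "b", "c"], lambda r: r == "slow", handle, results)
    # "a", "b", "c" complete while "slow" is still sleeping
    print([results.get() for _ in range(4)])
```

A work-stealing pool (option (c)) would replace the per-request thread with idle workers pulling from a shared queue; the trade-off is extra cross-CPU traffic versus bounded thread creation.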
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 19:06 ` Darrick J. Wong 2026-02-04 19:38 ` Horst Birthelmer 2026-02-04 20:58 ` Bernd Schubert @ 2026-02-04 22:50 ` Gao Xiang 2026-02-06 5:38 ` Darrick J. Wong 2026-02-04 23:19 ` Gao Xiang 2026-02-05 3:33 ` John Groves 4 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-02-04 22:50 UTC (permalink / raw) To: Darrick J. Wong, Miklos Szeredi Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On 2026/2/5 03:06, Darrick J. Wong wrote: > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: ... > > 4 For defaults situations, where do we make policy about when to use > f-s-c and when do we allow use of the kernel driver? I would guess > that anything in /etc/fstab could use the kernel driver, and > everything else should use a fuse container if possible. For > unprivileged non-root-ns mounts I think we'd only allow the > container? Just a side note: as a filesystem for containers, I have to say here again that one of the goals of EROFS is to allow unprivileged non-root-ns mounts for container users, because again I've seen no on-disk layout security risk, especially for the uncompressed layout format, and container users have already requested this. But, as Christoph said, I will finish the security model first before I post any code for purely untrusted images, and allow dm-verity/fs-verity signed images as the first step. On the other side, my objective view is that FUSE is becoming complex, both in its protocol and its implementations (even judging from the TODO lists here), and it lacks a security design too; it's hard to say which has the better attack surface, and the Linux kernel was never designed as a microkernel. In order to phase out "legacy and problematic flags", FUSE has to wait until all current users stop using them.
I really think it should be a per-filesystem policy rather than the current arbitrary policy built out of fragmentary discussion, but I will prepare more materials and bring this up for a more formal discussion once the whole goal is finished. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 22:50 ` Gao Xiang @ 2026-02-06 5:38 ` Darrick J. Wong 2026-02-06 6:15 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-06 5:38 UTC (permalink / raw) To: Gao Xiang Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote: > > > On 2026/2/5 03:06, Darrick J. Wong wrote: > > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: > > ... > > > > > 4 For defaults situations, where do we make policy about when to use > > f-s-c and when do we allow use of the kernel driver? I would guess > > that anything in /etc/fstab could use the kernel driver, and > > everything else should use a fuse container if possible. For > > unprivileged non-root-ns mounts I think we'd only allow the > > container? > > Just a side note: As a filesystem for containers, I have to say here > again one of the goal of EROFS is to allow unprivileged non-root-ns > mounts for container users because again I've seen no on-disk layout > security risk especially for the uncompressed layout format and > container users have already request this, but as Christoph said, > I will finish security model first before I post some code for pure > untrusted images. But first allow dm-verity/fs-verity signed images > as the first step. <nod> I haven't forgotten. For readonly root fses erofs is probably the best we're going to get, and it's less clunky than fuse. There's less of a firewall due to !microkernel but I'd wager that most immutable distros will find erofs a good enough balance between performance and isolation. Fuse, otoh, is for all the other weird users -- you found an old cupboard full of wide scsi disks; or management decided that letting container customers bring their own prepopulated data partitions(!) 
is a good idea; or the default when someone plugs in a device that the system knows nothing about. > On the other side, my objective thought of that is FUSE is becoming > complex either from its protocol and implementations (even from the It already is. > TODO lists here) and leak of security design too, it's hard to say > from the attack surface which is better and Linux kernel is never > regarded as a microkernel model. In order to phase out "legacy and > problematic flags", FUSE have to wait until all current users don't > use them anymore. > > I really think it should be a per-filesystem policy rather than the > current arbitary policy just out of fragment words, but I will > prepare more materials and bring this for more formal discussion > until the whole goal is finished. Well yes, the transition from kernel to kernel-or-fuse would be decided on a per-filesystem basis. When the fuse driver reaches par with the kernel driver on functionality and stability then it becomes a candidate for secure container usage. Not before. --D > Thanks, > Gao Xiang > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-06 5:38 ` Darrick J. Wong @ 2026-02-06 6:15 ` Gao Xiang 2026-02-21 0:47 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-02-06 6:15 UTC (permalink / raw) To: Darrick J. Wong Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer Hi Darrick, On 2026/2/6 13:38, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote: >> >> >> On 2026/2/5 03:06, Darrick J. Wong wrote: >>> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: >> >> ... >> >>> >>> 4 For defaults situations, where do we make policy about when to use >>> f-s-c and when do we allow use of the kernel driver? I would guess >>> that anything in /etc/fstab could use the kernel driver, and >>> everything else should use a fuse container if possible. For >>> unprivileged non-root-ns mounts I think we'd only allow the >>> container? >> >> Just a side note: As a filesystem for containers, I have to say here >> again one of the goal of EROFS is to allow unprivileged non-root-ns >> mounts for container users because again I've seen no on-disk layout >> security risk especially for the uncompressed layout format and >> container users have already request this, but as Christoph said, >> I will finish security model first before I post some code for pure >> untrusted images. But first allow dm-verity/fs-verity signed images >> as the first step. > > <nod> I haven't forgotten. For readonly root fses erofs is probably the > best we're going to get, and it's less clunky than fuse. There's less > of a firewall due to !microkernel but I'd wager that most immutable > distros will find erofs a good enough balance between performance and > isolation. Thanks, but I can't make decisions for every individual end user. 
However, in my view, this approach is valuable for all container users if they don't mind trying it (I'm building these capabilities with several communities and people): they can achieve nearly native performance on read-write workloads with a trusted fs, while the remote data source is kept completely isolated behind an immutable, secure filesystem. I will make signed images work first, but as the next step, I'll definitely work on defining a clear on-disk boundary (very likely excluding per-inode compression layouts in the beginning) to enable most users to leverage untrusted data directly in a totally isolated user/mount namespace. > > Fuse, otoh, is for all the other weird users -- you found an old > cupboard full of wide scsi disks; or management decided that letting > container customers bring their own prepopulated data partitions(!) is a > good idea; or the default when someone plugs in a device that the system > knows nothing about. Honestly, I've checked what Ted, Dave, and you said previously. For generic COW filesystems, it's surely hard to guarantee filesystem consistency at all times, mainly because of those on-disk formats by design (lots of duplicated metadata for different purposes, which can cause extra inconsistency compared to archive fses). Of course, it's not entirely impossible, but as Ted pointed out, it becomes a matter of 1) human resources; 2) enforcing such strict consistency checks harms performance in general use cases, like databases, which just use a trusted filesystem / media directly.
I'm not against further FUSE improvements, because they are separate stories, and I do think those items are useful for new Linux innovation. But as for the topic of allowing "root" in a non-root user ns to mount, I still insist that it should be a per-filesystem policy, because filesystems are designed for different targeted use cases: - either you face and address the issue (by design or by engineering), or - find another alternative way to serve users. But I do hope we won't force some arbitrary policy without any technical reason; the feature is indeed useful for container users. > >> On the other side, my objective thought of that is FUSE is becoming >> complex either from its protocol and implementations (even from the > > It already is. > >> TODO lists here) and leak of security design too, it's hard to say >> from the attack surface which is better and Linux kernel is never >> regarded as a microkernel model. In order to phase out "legacy and >> problematic flags", FUSE have to wait until all current users don't >> use them anymore. >> >> I really think it should be a per-filesystem policy rather than the >> current arbitary policy just out of fragment words, but I will >> prepare more materials and bring this for more formal discussion >> until the whole goal is finished. > > Well yes, the transition from kernel to kernel-or-fuse would be > decided on a per-filesystem basis.
I respect this path, but just from my own perspective, malicious userspace is usually much harder to defend against since the trust boundary is weaker. In order to allow unprivileged daemons, you have to watch whether the page cache, any metadata cache, or any potential/undiscovered deadlock vector can be abused by those malicious daemons, so you naturally have to find ways to harden against such abuse; since you can never trust those unprivileged daemons (which run arbitrary executable code rather than a vetted binary), that hardening works against performance in principle, absent detailed analysis. Just my two cents. Thanks, Gao Xiang > > --D > >> Thanks, >> Gao Xiang >> >> ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-06 6:15 ` Gao Xiang @ 2026-02-21 0:47 ` Darrick J. Wong 2026-03-17 4:17 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-21 0:47 UTC (permalink / raw) To: Gao Xiang Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote: > Hi Darrick, > > On 2026/2/6 13:38, Darrick J. Wong wrote: > > On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote: > > > > > > > > > On 2026/2/5 03:06, Darrick J. Wong wrote: > > > > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: > > > > > > ... > > > > > > > > > > > 4 For defaults situations, where do we make policy about when to use > > > > f-s-c and when do we allow use of the kernel driver? I would guess > > > > that anything in /etc/fstab could use the kernel driver, and > > > > everything else should use a fuse container if possible. For > > > > unprivileged non-root-ns mounts I think we'd only allow the > > > > container? > > > > > > Just a side note: As a filesystem for containers, I have to say here > > > again one of the goal of EROFS is to allow unprivileged non-root-ns > > > mounts for container users because again I've seen no on-disk layout > > > security risk especially for the uncompressed layout format and > > > container users have already request this, but as Christoph said, > > > I will finish security model first before I post some code for pure > > > untrusted images. But first allow dm-verity/fs-verity signed images > > > as the first step. > > > > <nod> I haven't forgotten. For readonly root fses erofs is probably the > > best we're going to get, and it's less clunky than fuse. 
There's less > > of a firewall due to !microkernel but I'd wager that most immutable > > distros will find erofs a good enough balance between performance and > > isolation. > > Thanks, but I can't make decisions for every individual end user. > However, in my view, this approach is valuable for all container > users if they don't mind to try this approach (I'm building this > capabilities with several communities and people): they can achieve > nearly native performance on read-write workloads with a trusted > fs as well as the remote data source is completely isolated using > an immutable secure filesystem. > > I will make signed images work first, but as the next step, I'll > definitely work on defining a clear on-disk boundary (very > likely excluding per-inode compression layouts in the beginning) > to enable most users to leverage untrusted data directly in > a totally isolated user/mount namespace. <nod> I hope you succeed! > > > > Fuse, otoh, is for all the other weird users -- you found an old > > cupboard full of wide scsi disks; or management decided that letting > > container customers bring their own prepopulated data partitions(!) is a > > good idea; or the default when someone plugs in a device that the system > > knows nothing about. > > Honestly, I've checked what Ted, Dave, and you said previously. > For generic COW filesystems, it's surely hard to guarantee > filesystem consistency at all times, mainly because of those > on-disk formats by design (lots of duplicated metadata for > different purposes, which can cause extra inconsistency compared > to archive fses.) Of course, it's not entirely impossible, but > as Ted pointed out, it becomes a matter of > > 1) human resources; > 2) enforcing such strict consistency checks harms performance > in general use cases which just use trusted filesystem / > media directly like databases. 
> > I'm not against FUSE further improvements because they are seperated > stories, I do think those items are useful for new Linux innovation, > but as for the topic of allowing "root" in non-root-user-ns to mount, > I still insist that it should be a per-filesystem policy, because > filesystems are designed for different targeted use cases: > > - either you face and address the issue (by design or by > enginneering), or > - find another alternative way to serve users. > > But I do hope we shouldn't force some arbitary policy without any > technical reason, the feature is indeed useful for container users. Oh yes, the policy question is a very large one; for a specific given filesystem, you need to trust: A> whatever user is asking to do the mount B> the quality of the kernel or userspace drivers C> the provenance of the filesystem image This is a hugely personal (or institutional) question, all we can do is provide mechanisms for kernel and userspace drivers, a sensible default policy, and a reasonable way to relate all three properties to action. Or just go with IT policy, which is deny, delete, destroy. :P > > > > > On the other side, my objective thought of that is FUSE is becoming > > > complex either from its protocol and implementations (even from the > > > > It already is. > > > > > TODO lists here) and leak of security design too, it's hard to say > > > from the attack surface which is better and Linux kernel is never > > > regarded as a microkernel model. In order to phase out "legacy and > > > problematic flags", FUSE have to wait until all current users don't > > > use them anymore. > > > > > > I really think it should be a per-filesystem policy rather than the > > > current arbitary policy just out of fragment words, but I will > > > prepare more materials and bring this for more formal discussion > > > until the whole goal is finished. > > > > Well yes, the transition from kernel to kernel-or-fuse would be > > decided on a per-filesystem basis. 
When the fuse driver reaches par > > with the kernel driver on functionality and stability then it becomes a > > candidate for secure container usage. Not before. > > I respect this path, but just from my own perspective, userspace > malicious problems are usually much harder to defence since the > trusted boundary is weaker, in order to allow unpriviledged > daemons, you have to monitor if page cache or any metadata cache > or any potential/undiscovered deadlock vectors can be abused > by those malicious daemons, so that you have to find more harden > ways to limit such abused usage naturally since you never trust > those unpriviledged daemons (which is arbitary executable code > rather than a binary source) instead, which is opposed to > performance cases in principle without detailed analysis. I'm well aware that going to userspace opens a whole floodgate of weird dynamic behavior possibilities. Though obviously my experiences with kernel XFS have shown me that those challenges exist there too. :/ The kernel does have the nice property that you can set NOFS and ignore SIGSTOP/KILL if necessary to get things done. --D > Just my two cents. > > Thanks, > Gao Xiang > > > > > --D > > > > > Thanks, > > > Gao Xiang > > > > > > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-21 0:47 ` Darrick J. Wong @ 2026-03-17 4:17 ` Gao Xiang 2026-03-18 21:51 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-17 4:17 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc

Hi Darrick,

On 2026/2/21 08:47, Darrick J. Wong wrote:
> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:

...

>>> Fuse, otoh, is for all the other weird users -- you found an old
>>> cupboard full of wide scsi disks; or management decided that letting
>>> container customers bring their own prepopulated data partitions(!) is a
>>> good idea; or the default when someone plugs in a device that the system
>>> knows nothing about.

I brainstormed some more thoughts:

End users would like to mount a filesystem, but it's unknown whether the filesystem is consistent or not; especially for filesystems intended to be mounted "rw", it's very hard to know whether the filesystem metadata is fully consistent without a full fsck scan in advance.
Consider the following metadata inconsistency (note that block 0x123 is referenced by the inconsistent metadata, rather than by a normal filesystem reflink with correct metadata):

inode A (with high permission)
  extent [0~4k) maps to block 0x123

random inode B (with low permission)
  extent [0~4k) maps to block 0x123 too

So there exist at least three attack vectors:

1) Normal users will record sensitive information to inode A (since it's not a normal COW, block 0x123 will be updated in place), but normal users don't know the malicious inode B exists, so the sensitive information can be fetched via inode B illegally;

2) Attackers can write to inode B with low permission at the proper timing to change inode A and compromise the computer system;

3) Of course, two such inodes can cause double-free issues.

I think the normal copy-on-write (including OverlayFS) mechanism doesn't have this issue (because all changes will just get another copy). Of course, hardlinking won't have the same issue either, because there is only one inode for all hardlinks.

I don't think FUSE-implemented userspace drivers will resolve such issues (I think users can only get a response like: "that is not a case we will handle with userspace FUSE drivers, because the metadata is seriously broken"); the only ways to resolve such attack vectors are to run

 the full-scan fsck consistency check and then mount "rw"

or

 use an immutable filesystem like EROFS (so that there will not be such inconsistency issues by design) and isolate the entire write traffic with a full copy-on-write mechanism, with OverlayFS for example (IOWs, make all writes copy-on-write into another trusted local filesystem).

I hope it's a valid case, and it can indeed happen if an arbitrary generic filesystem can be mounted "rw".
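The double-mapped extent above can be sketched as a toy model (hypothetical names and structures, not real filesystem code): two crafted inode mappings point at the same physical block, so an in-place write through the high-permission inode becomes readable through the low-permission one.

```python
# Toy model of the double-mapped-extent attack: crafted metadata makes
# two inodes map the same physical block without any COW accounting.

blocks = {0x123: b""}  # the shared physical block

# Inconsistent metadata a fsck would flag: both extents map block 0x123.
inode_a = {"mode": 0o600, "owner": "root", "extents": {0: 0x123}}
inode_b = {"mode": 0o644, "owner": "attacker", "extents": {0: 0x123}}

def write(inode, offset, data):
    # In-place update: no copy-on-write, because the metadata claims
    # the block belongs exclusively to this inode.
    blocks[inode["extents"][offset]] = data

def read(inode, offset):
    return blocks[inode["extents"][offset]]

# root stores a secret via the high-permission inode A ...
write(inode_a, 0, b"root's secret")
# ... and the attacker reads it back through the world-readable inode B.
leaked = read(inode_b, 0)
print(leaked)  # the secret leaks through inode B
```

The same shared block also demonstrates attack vectors 2) and 3): a write through inode B silently changes inode A's contents, and freeing either inode frees a block the other still references.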
And my immutable image filesystem idea can help mitigate this too (just because the immutable image won't be changed in any way, and all writes are always copy-up) Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-17 4:17 ` Gao Xiang @ 2026-03-18 21:51 ` Darrick J. Wong 2026-03-19 8:05 ` Gao Xiang 2026-03-22 3:25 ` Demi Marie Obenour 0 siblings, 2 replies; 79+ messages in thread From: Darrick J. Wong @ 2026-03-18 21:51 UTC (permalink / raw) To: Gao Xiang Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote: > Hi Darrick, > > On 2026/2/21 08:47, Darrick J. Wong wrote: > > On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote: > > ... > > > > > > > > > > > Fuse, otoh, is for all the other weird users -- you found an old > > > > cupboard full of wide scsi disks; or management decided that letting > > > > container customers bring their own prepopulated data partitions(!) is a > > > > good idea; or the default when someone plugs in a device that the system > > > > knows nothing about. > > I brainstormed some more thoughts: > > End users would like to mount a filesystem, but it's unknown that > the filesystem is consistent or not, especially for filesystems > are intended to be mounted as "rw", it's very hard to know if the > filesystem metadata is fully consistent without a full fsck scan > in advance. 
> > Considering the following metadata inconsistent case (note that > block 0x123 is referenced by the inconsistent metadata, rather > than normal filesystem reflink with correct metadata): > > inode A (with high permission) > extent [0~4k) maps to block 0x123 > > random inode B (with low permission) > extent [0~4k) maps to block 0x123 too > > So there will exist at least three attack ways: > > 1) Normal users will record the sensitive information to inode > A (since it's not the normal COW, the block 0x123 will be > updated in place), but normal users don't know there exists > the malicious inode B, so the sensitive information can be > fetched via inode B illegally; > > 2) Attackers can write inode B with low permission in the proper > timing to change the inode A to compromise the computer > system; > > 3) Of course, such two inodes can cause double freeing issues. > > I think the normal copy-on-write (including OverlayFS) mechanism > doesn't have the issue (because all changes will just have another > copy). Of course, hardlinking won't have the same issue either, > because there is only one inode for all hardlinks. Yes, though you can screw with the link counts to cause other mayhem ;) > I don't think FUSE-implemented userspace drivers will resolve > such issues (I think users can only get the following usage reclaim: Filesystem implementations /can/ detect these sorts of problems, but most of them have no means to do that quickly. As you and Demi Marie have noted, the only reasonable way to guard against these things is pre-mount fsck. And even then, attackers still have a window to screw with the fs metadata after fsck exits but before mount(2) takes the block device. I guess you'd have to inject the fsck run after the O_EXCL opening. Technically speaking fuse4fs could just invoke e2fsck -fn before it starts up the rest of the libfuse initialization but who knows if that's an acceptable risk. Also unclear if you actually want -fy for that. 
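The fsck-then-mount window described above can be sketched as a toy time-of-check/time-of-use model (all names hypothetical; `fsck()` merely stands in for an `e2fsck -fn` exit status): a clean verdict proves nothing if the attacker can still write to the device before mount(2) takes it, whereas taking the device exclusively first closes the window.

```python
# Toy TOCTOU model of the fsck-before-mount race: the check is only
# meaningful if the checker already holds the device exclusively.

class Image:
    def __init__(self):
        self.consistent = True
        self.exclusive = False   # models O_EXCL ownership of the bdev

    def corrupt(self):
        if not self.exclusive:   # attacker needs write access to the device
            self.consistent = False

def fsck(img):
    # Stand-in for a full consistency scan (e.g. an `e2fsck -fn` run).
    return img.consistent

# Naive ordering: fsck, attacker races, then mount.
img = Image()
clean = fsck(img)
img.corrupt()                        # window between fsck and mount(2)
assert clean and not img.consistent  # a corrupt fs gets mounted anyway

# Safer ordering: take the device O_EXCL first, then fsck.
img2 = Image()
img2.exclusive = True                # open the bdev exclusively first
img2.corrupt()                       # attacker can no longer write
assert fsck(img2)                    # verdict matches what gets mounted
```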
> "that is not the case that we will handle with userspace FUSE > drivers, because the metadata is serious broken"), the only way to > resolve such attack vectors is to run > > the full-scan fsck consistency check and then mount "rw" > > or > > using the immutable filesystem like EROFS (so that there will not > be such inconsisteny issues by design) and isolate the entire write > traffic with a full copy-on-write mechanism with OverlayFS for > example (IOWs, to make all write copy-on-write into another trusted > local filesystem). (Yeah, that's probably the only way to go for prepopulated images like root filesystems and container packages) > I hope it's a valid case, and that can indeed happen if the arbitary > generic filesystem can be mounted in "rw". And my immutable image > filesystem idea can help mitigate this too (just because the immutable > image won't be changed in any way, and all writes are always copy-up) That, we agree on :) --D > Thanks, > Gao Xiang > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-18 21:51 ` Darrick J. Wong @ 2026-03-19 8:05 ` Gao Xiang 2026-03-22 3:25 ` Demi Marie Obenour 1 sibling, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-03-19 8:05 UTC (permalink / raw) To: Darrick J. Wong Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Darrick, On 2026/3/19 05:51, Darrick J. Wong wrote: > On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote: >> Hi Darrick, >> >> On 2026/2/21 08:47, Darrick J. Wong wrote: >>> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote: >> >> ... >> >>> >>>>> >>>>> Fuse, otoh, is for all the other weird users -- you found an old >>>>> cupboard full of wide scsi disks; or management decided that letting >>>>> container customers bring their own prepopulated data partitions(!) is a >>>>> good idea; or the default when someone plugs in a device that the system >>>>> knows nothing about. >> >> I brainstormed some more thoughts: >> >> End users would like to mount a filesystem, but it's unknown that >> the filesystem is consistent or not, especially for filesystems >> are intended to be mounted as "rw", it's very hard to know if the >> filesystem metadata is fully consistent without a full fsck scan >> in advance. 
>> Considering the following metadata inconsistent case (note that
>> block 0x123 is referenced by the inconsistent metadata, rather
>> than normal filesystem reflink with correct metadata):
>>
>> inode A (with high permission)
>>   extent [0~4k) maps to block 0x123
>>
>> random inode B (with low permission)
>>   extent [0~4k) maps to block 0x123 too
>>
>> So there will exist at least three attack ways:
>>
>> 1) Normal users will record the sensitive information to inode
>> A (since it's not the normal COW, the block 0x123 will be
>> updated in place), but normal users don't know there exists
>> the malicious inode B, so the sensitive information can be
>> fetched via inode B illegally;
>>
>> 2) Attackers can write inode B with low permission in the proper
>> timing to change the inode A to compromise the computer
>> system;
>>
>> 3) Of course, such two inodes can cause double freeing issues.
>>
>> I think the normal copy-on-write (including OverlayFS) mechanism
>> doesn't have the issue (because all changes will just have another
>> copy). Of course, hardlinking won't have the same issue either,
>> because there is only one inode for all hardlinks.
>
> Yes, though you can screw with the link counts to cause other mayhem ;)

Yes, for generic writable filesystems, incorrect nlink values can also be another potential attack vector.

However, for strictly immutable filesystems, we never actually leverage nlink for anything writable except getattr(), which is used only to display archived stat information in the image to users. This is similar to how FUSE getattr simply returns nlink to userspace, so corrupted nlink values for immutable fses don't result in anything serious (again, just as a ro FUSE can return arbitrary nlink to userspace).

Since the filesystem is strictly immutable, any write operation triggers a copy-up (copy-on-write) to another trusted filesystem via OverlayFS.
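The immutable-lower + copy-up split described above can be sketched in a few lines (a toy model of the semantics, not the OverlayFS implementation; paths and names are hypothetical): reads fall through to the immutable layer, while the first write copies the file up into the trusted writable layer, so the image itself is never modified.

```python
# Minimal sketch of copy-up semantics: the lower (immutable) layer is
# never written; every write lands in the upper (trusted, rw) layer.

from types import MappingProxyType

lower = MappingProxyType({"/etc/conf": b"shipped"})  # immutable image
upper = {}                                           # trusted rw layer

def ovl_read(path):
    # Upper layer shadows the lower one, like an overlay lookup.
    return upper.get(path, lower.get(path))

def ovl_write(path, data):
    if path not in upper and path in lower:
        upper[path] = lower[path]    # copy-up from the immutable layer
    upper[path] = data               # then modify only the upper copy

ovl_write("/etc/conf", b"modified")
assert ovl_read("/etc/conf") == b"modified"
assert lower["/etc/conf"] == b"shipped"   # the image stays pristine
```

The `MappingProxyType` wrapper enforces the immutability invariant in the model: any accidental write to `lower` raises instead of corrupting the image, which mirrors why inconsistencies in a strictly read-only format cannot propagate.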
I admit that hardlinking is no longer valid in this context; however, since we are already in the containerization era, almost all applications work well with the new OverlayFS semantics.

>> I don't think FUSE-implemented userspace drivers will resolve
>> such issues (I think users can only get the following usage reclaim:
>
> Filesystem implementations /can/ detect these sorts of problems, but
> most of them have no means to do that quickly. As you and Demi Marie
> have noted, the only reasonable way to guard against these things is
> pre-mount fsck.
>
> And even then, attackers still have a window to screw with the fs
> metadata after fsck exits but before mount(2) takes the block device.
> I guess you'd have to inject the fsck run after the O_EXCL opening.

Let's not talk about attacks like malicious block devices; the typical real use case is that the container runtime fetches a filesystem image from remote and then mounts it.

Considering such a typical scenario, I still think a full fsck should be run before mounting, especially for "rw"; otherwise FUSE won't help against serious metadata inconsistency attacks.

> Technically speaking fuse4fs could just invoke e2fsck -fn before it
> starts up the rest of the libfuse initialization but who knows if that's
> an acceptable risk. Also unclear if you actually want -fy for that.

But if `e2fsck -fn` is run, and we scan the image and finally find no metadata inconsistency, why not just mount in the kernel then? ;-) I guess the main purpose of FUSE here was to avoid the impact of serious malicious inconsistency?

I agree that this approach will almost never crash the kernel, but as I said, the security risk is still there, and it doesn't require a malicious block device either: just fetch untrusted remote writable filesystems locally, and mount.
Off topic: some of our Alibaba Cloud serverless businesses are still mounting untrusted rw filesystems from arbitrary publishers in the kernel without any fsck in advance. I tried to persuade them "don't do that" many, many times, but who knows? :-)

>> "that is not the case that we will handle with userspace FUSE
>> drivers, because the metadata is serious broken"), the only way to
>> resolve such attack vectors is to run
>>
>> the full-scan fsck consistency check and then mount "rw"
>>
>> or
>>
>> using the immutable filesystem like EROFS (so that there will not
>> be such inconsisteny issues by design) and isolate the entire write
>> traffic with a full copy-on-write mechanism with OverlayFS for
>> example (IOWs, to make all write copy-on-write into another trusted
>> local filesystem).
>
> (Yeah, that's probably the only way to go for prepopulated images like
> root filesystems and container packages)
>
>> I hope it's a valid case, and that can indeed happen if the arbitary
>> generic filesystem can be mounted in "rw". And my immutable image
>> filesystem idea can help mitigate this too (just because the immutable
>> image won't be changed in any way, and all writes are always copy-up)
>
> That, we agree on :)

:)

Thanks,
Gao Xiang

>
> --D
>
>> Thanks,
>> Gao Xiang
>>

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-18 21:51 ` Darrick J. Wong 2026-03-19 8:05 ` Gao Xiang @ 2026-03-22 3:25 ` Demi Marie Obenour 2026-03-22 3:52 ` Gao Xiang ` (2 more replies) 1 sibling, 3 replies; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-22 3:25 UTC (permalink / raw) To: Darrick J. Wong, Gao Xiang Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 6015 bytes --] On 3/18/26 17:51, Darrick J. Wong wrote: > On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote: >> Hi Darrick, >> >> On 2026/2/21 08:47, Darrick J. Wong wrote: >>> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote: >> >> ... >> >>> >>>>> >>>>> Fuse, otoh, is for all the other weird users -- you found an old >>>>> cupboard full of wide scsi disks; or management decided that letting >>>>> container customers bring their own prepopulated data partitions(!) is a >>>>> good idea; or the default when someone plugs in a device that the system >>>>> knows nothing about. >> >> I brainstormed some more thoughts: >> >> End users would like to mount a filesystem, but it's unknown that >> the filesystem is consistent or not, especially for filesystems >> are intended to be mounted as "rw", it's very hard to know if the >> filesystem metadata is fully consistent without a full fsck scan >> in advance. 
>> >> Considering the following metadata inconsistent case (note that >> block 0x123 is referenced by the inconsistent metadata, rather >> than normal filesystem reflink with correct metadata): >> >> inode A (with high permission) >> extent [0~4k) maps to block 0x123 >> >> random inode B (with low permission) >> extent [0~4k) maps to block 0x123 too >> >> So there will exist at least three attack ways: >> >> 1) Normal users will record the sensitive information to inode >> A (since it's not the normal COW, the block 0x123 will be >> updated in place), but normal users don't know there exists >> the malicious inode B, so the sensitive information can be >> fetched via inode B illegally; >> >> 2) Attackers can write inode B with low permission in the proper >> timing to change the inode A to compromise the computer >> system; >> >> 3) Of course, such two inodes can cause double freeing issues. >> >> I think the normal copy-on-write (including OverlayFS) mechanism >> doesn't have the issue (because all changes will just have another >> copy). Of course, hardlinking won't have the same issue either, >> because there is only one inode for all hardlinks. > > Yes, though you can screw with the link counts to cause other mayhem ;) > >> I don't think FUSE-implemented userspace drivers will resolve >> such issues (I think users can only get the following usage reclaim: > > Filesystem implementations /can/ detect these sorts of problems, but > most of them have no means to do that quickly. As you and Demi Marie > have noted, the only reasonable way to guard against these things is > pre-mount fsck. > > And even then, attackers still have a window to screw with the fs > metadata after fsck exits but before mount(2) takes the block device. > I guess you'd have to inject the fsck run after the O_EXCL opening. > > Technically speaking fuse4fs could just invoke e2fsck -fn before it > starts up the rest of the libfuse initialization but who knows if that's > an acceptable risk. 
Also unclear if you actually want -fy for that. To me, the attacks mentioned above are all either user error, or vulnerabilities in software accessing the filesystem. If one doesn't trust a filesystem image, then any data from the filesystem can't be trusted either. The only exception is if one can verify the data cryptographically, which is what fsverity is for. If the filesystem is mounted r/o and the image doesn't change, one could guarantee that accessing the filesystem will at least return deterministic results even for corrupted images. That's something that would need to be guaranteed by individual filesystem implementations, though. See the end of this email for a long note about what can and cannot be guaranteed in the face of corrupt or malicious filesystem images. >> "that is not the case that we will handle with userspace FUSE >> drivers, because the metadata is serious broken"), the only way to >> resolve such attack vectors is to run >> >> the full-scan fsck consistency check and then mount "rw" >> >> or >> >> using the immutable filesystem like EROFS (so that there will not >> be such inconsisteny issues by design) and isolate the entire write >> traffic with a full copy-on-write mechanism with OverlayFS for >> example (IOWs, to make all write copy-on-write into another trusted >> local filesystem). > > (Yeah, that's probably the only way to go for prepopulated images like > root filesystems and container packages) Even an immutable filesystem can still be corrupt. >> I hope it's a valid case, and that can indeed happen if the arbitary >> generic filesystem can be mounted in "rw". And my immutable image >> filesystem idea can help mitigate this too (just because the immutable >> image won't be changed in any way, and all writes are always copy-up) > > That, we agree on :) Indeed, expecting writes to a corrupt filesystem to behave reasonably is very foolish. 
Long note starts here: There is no *fundamental* reason that a crafted filesystem image must be able to cause crashes, memory corruption, etc. This applies even if the filesystem image may be written to while mounted. It is always *possible* to write a filesystem such that it never trusts anything it reads from disk and assumes each read could return arbitrarily malicious results. Right now, many filesystem maintainers do not consider this to be a priority. Even if they did, I don't think *anyone* (myself included) could write a filesystem implementation in C that didn't have memory corruption flaws. The only exceptions are if the filesystem is incredibly simple or formal methods are used, and neither is the case for existing filesystems in the Linux kernel. By sandboxing a filesystem, one ensures that an attacker who compromises a filesystem implementation needs to find *another* exploit to compromise the whole system. -- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 3:25 ` Demi Marie Obenour @ 2026-03-22 3:52 ` Gao Xiang 2026-03-22 4:51 ` Gao Xiang 2026-03-22 5:14 ` Gao Xiang 2 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22 3:52 UTC (permalink / raw)
To: Demi Marie Obenour, Darrick J. Wong
Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc

Hi Demi,

On 2026/3/22 11:25, Demi Marie Obenour wrote:

...

>>> "that is not the case that we will handle with userspace FUSE
>>> drivers, because the metadata is serious broken"), the only way to
>>> resolve such attack vectors is to run
>>>
>>> the full-scan fsck consistency check and then mount "rw"
>>>
>>> or
>>>
>>> using the immutable filesystem like EROFS (so that there will not
>>> be such inconsisteny issues by design) and isolate the entire write
>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>> example (IOWs, to make all write copy-on-write into another trusted
>>> local filesystem).
>>
>> (Yeah, that's probably the only way to go for prepopulated images like
>> root filesystems and container packages)
>
> Even an immutable filesystem can still be corrupt.

I disagree with you here; I think we need to define what kind of corruption is really harmful to systems. I can definitely say that if an immutable filesystem is well-defined, it cannot bring any harmful behavior to the system.

Taking one example: nlink can still be mismatched for immutable filesystems, but does it have any real impact?

1) you can write an unprivileged FUSE daemon that returns arbitrary nlink all the time, so getattr results don't really matter;

2) OverlayFS and some other fses I don't remember now return nlink = 1 all the time.
As long as the mount/user namespaces are totally isolated (of course you shouldn't mix them with the other namespaces), I cannot think of a real practical attack path to attack users __just out of a well-designed immutable filesystem__.

Taking the EROFS on-disk format for example, some fields can of course still be corrupted, but so what? That cannot bring any harmful behavior, unlike the other generic writable filesystems, which rely heavily on the allocation metadata, nlink, etc. being absolutely correct; otherwise the write paths are highly vulnerable.

To put it another way: in many situations you still need to download archive files (like zip, tar, etc.) from the internet, possibly without any verification hash. Sometimes we face random corruption in these archive files, but so what? Such archives can be extracted with garbage data or garbage metadata, but if the namespaces are isolated, what's the real impact on the computer system or its users?

That is all I want to say; if you find any real impact, let's just write down the real attack paths, but those are all my ideas in mind.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 3:25 ` Demi Marie Obenour 2026-03-22 3:52 ` Gao Xiang @ 2026-03-22 4:51 ` Gao Xiang 2026-03-22 5:13 ` Demi Marie Obenour 2026-03-23 9:54 ` [Lsf-pc] " Jan Kara 2026-03-22 5:14 ` Gao Xiang 2 siblings, 2 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22 4:51 UTC (permalink / raw)
To: Demi Marie Obenour, Darrick J. Wong
Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc

On 2026/3/22 11:25, Demi Marie Obenour wrote:

...

>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>> starts up the rest of the libfuse initialization but who knows if that's
>> an acceptable risk. Also unclear if you actually want -fy for that.

Let me try to reply to the remaining part:

> To me, the attacks mentioned above are all either user error,
> or vulnerabilities in software accessing the filesystem. If one

There are many consequences if users try to use potentially inconsistent writable filesystems directly (without a full fsck), including but not limited to what I can think of:

- data loss (considering data block double-free issues);
- data theft (for example, users keep sensitive information in the workload in a high-permission inode, but it can be read via a low-permission malicious inode later);
- data tampering (the same principle).

All the vulnerabilities above happen after users try to write to the inconsistent filesystem, which is hard to prevent by on-disk design.

But if users write with copy-on-write to another local consistent filesystem, none of the vulnerabilities above exist.

> doesn't trust a filesystem image, then any data from the filesystem
> can't be trusted either. The only exception is if one can verify

I don't think trustworthiness is the core part of this whole topic, because Linux namespace & cgroup concepts were totally _invented_ for untrusted or isolated workloads.
If you distrust some workload, fine, isolate it into another namespace: you cannot strictly trust anything. The kernel always has bugs, but is that the real main reason you never run untrusted workloads? I don't think so.

> the data cryptographically, which is what fsverity is for.
> If the filesystem is mounted r/o and the image doesn't change, one
> could guarantee that accessing the filesystem will at least return
> deterministic results even for corrupted images. That's something that
> would need to be guaranteed by individual filesystem implementations,
> though.

I just want to say that the real problem with generic writable filesystems is that their on-disk design makes it difficult to prevent or detect harmful inconsistencies.

First, the on-disk format includes redundant metadata and can even include malicious journal metadata (as I mentioned in previous emails). This makes it hard to determine whether the filesystem is inconsistent without performing a full disk scan, which takes a long time.

Of course, you could mount severely inconsistent writable filesystems in read-only (RO) mode. However, they are still inconsistent by definition according to their formal on-disk specifications. Furthermore, the runtime kernel implementation mixes read-write and read-only logic within the same codebase, which complicates the practical consequences.

Due to immutable filesystem designs, almost all typical severe inconsistencies either cannot happen by design or cannot be regarded as harmful. I believe the core issue is not trustworthiness; even with an untrusted workload, you should be able to audit it easily.
>
>>> "that is not the case that we will handle with userspace FUSE
>>> drivers, because the metadata is serious broken"), the only way to
>>> resolve such attack vectors is to run
>>>
>>> the full-scan fsck consistency check and then mount "rw"
>>>
>>> or
>>>
>>> using the immutable filesystem like EROFS (so that there will not
>>> be such inconsisteny issues by design) and isolate the entire write
>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>> example (IOWs, to make all write copy-on-write into another trusted
>>> local filesystem).
>>
>> (Yeah, that's probably the only way to go for prepopulated images like
>> root filesystems and container packages)
>
> Even an immutable filesystem can still be corrupt.
>
>>> I hope it's a valid case, and that can indeed happen if the arbitary
>>> generic filesystem can be mounted in "rw". And my immutable image
>>> filesystem idea can help mitigate this too (just because the immutable
>>> image won't be changed in any way, and all writes are always copy-up)
>>
>> That, we agree on :)
>
> Indeed, expecting writes to a corrupt filesystem to behave reasonably
> is very foolish.
>
> Long note starts here: There is no *fundamental* reason that a crafted
> filesystem image must be able to cause crashes, memory corruption, etc.

I still think those kinds of security risks arising just from implementation bugs are the easiest part of the whole issue.

Many Linux kernel bugs can cause crashes or memory corruption; why do crafted filesystems need to be specially considered?

> This applies even if the filesystem image may be written to while
> mounted. It is always *possible* to write a filesystem such that
> it never trusts anything it reads from disk and assumes each read
> could return arbitrarily malicious results.
Linux namespaces were invented for that kind of usage: broken archive images can return garbage data, or archive images can even change randomly at runtime, but what's the real impact if they are isolated by the namespaces?

> Right now, many filesystem maintainers do not consider this to be a
> priority. Even if they did, I don't think *anyone* (myself included)
> could write a filesystem implementation in C that didn't have memory
> corruption flaws. The only exceptions are if the filesystem is

I think this still falls into the category of implementation bugs; my question is simply: "why is the filesystem special in this area? There are many other kernel subsystems in C which receive untrusted data, like the TCP/IP stack" — why are filesystems singled out for memory corruption flaws?

I really think different aspects are often mixed up when this topic is mentioned, which makes the discussion get more and more divergent.

If we talk about implementation bugs, I think filesystems are not special, but as I said, I think the main issue is the writable filesystems' on-disk format design; due to that design, there are many severe consequences from inconsistent filesystems.

> incredibly simple or formal methods are used, and neither is the
> case for existing filesystems in the Linux kernel. By sandboxing a
> filesystem, one ensures that an attacker who compromises a filesystem
> implementation needs to find *another* exploit to compromise the
> whole system.

Yes, yet sandboxing is only one part; of course VM sandboxing is better than Linux namespace isolation, but VMs cost much more.

Besides sandboxing, I think auditability is important too, especially when users provide sensitive data to new workloads. Of course, dealing only with trusted workloads is best, no question. But in the real world we cannot always expect completely trusted workloads. For untrusted workloads, we need to find reliable ways to audit them until they become trusted.
Just like in the real world: accumulate credit, undergo audits, and eventually earn trust.

Sorry about my English, but I hope I've expressed my whole idea.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 4:51 ` Gao Xiang @ 2026-03-22 5:13 ` Demi Marie Obenour 2026-03-22 5:30 ` Gao Xiang 2026-03-23 9:54 ` [Lsf-pc] " Jan Kara 1 sibling, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-22 5:13 UTC (permalink / raw) To: Gao Xiang, Darrick J. Wong Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 9023 bytes --] On 3/22/26 00:51, Gao Xiang wrote: > > > On 2026/3/22 11:25, Demi Marie Obenour wrote: > > ... > >>> >>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>> starts up the rest of the libfuse initialization but who knows if that's >>> an acceptable risk. Also unclear if you actually want -fy for that. >> > > Let me try to reply the remaining part: > >> To me, the attacks mentioned above are all either user error, >> or vulnerabilities in software accessing the filesystem. If one > > There are many consequences if users try to use potential inconsistent > writable filesystems directly (without full fsck), what I can think > out including but not limited to: > > - data loss (considering data block double free issue); > - data theft (for example, users keep sensitive information in the > workload in a high permission inode but it can be read with > low permission malicious inode later); > - data tamper (the same principle). > > All vulnerabilities above happen after users try to write the > inconsistent filesystem, which is hard to prevent by on-disk > design. > > But if users write with copy-on-write to another local consistent > filesystem, all the vulnerabilities above won't exist. That makes sense! Is this because the reads are at least deterministic? >> doesn't trust a filesystem image, then any data from the filesystem >> can't be trusted either. 
>> The only exception is if one can verify
>
> I don't think trustworthiness is the core part of this whole topic,
> because the Linux namespace & cgroup concepts were _invented_
> for untrusted or isolated workloads.
>
> If you distrust some workload, fine, isolate it in another
> namespace: you cannot strictly trust anything.
>
> The kernel always has bugs, but is that really the main reason
> you never run untrusted workloads?

I don't think so. I always use VMs for untrusted workloads.

>> the data cryptographically, which is what fsverity is for.
>> If the filesystem is mounted r/o and the image doesn't change, one
>> could guarantee that accessing the filesystem will at least return
>> deterministic results even for corrupted images. That's something that
>> would need to be guaranteed by individual filesystem implementations,
>> though.
>
> I just want to say that the real problem with generic writable
> filesystems is that their on-disk design makes it difficult to
> prevent or detect harmful inconsistencies.
>
> First, the on-disk format includes redundant metadata and even
> malicious journal metadata (as I mentioned in previous emails).
> This makes it hard to determine whether the filesystem is
> inconsistent without performing a full disk scan, which takes
> a very long time.
>
> Of course, you could mount severely inconsistent writable
> filesystems in read-only (RO) mode. However, they are still
> inconsistent by definition according to their formal on-disk
> specifications. Furthermore, the runtime kernel implementation
> mixes read-write and read-only logic within the same
> codebase, which complicates the practical consequences.
>
> Due to immutable filesystem design, almost all typical severe
> inconsistencies cannot happen by design or be regarded as harmful.
> I believe the core issue is not trustworthiness; even with
> an untrusted workload, you should be able to audit it easily.
> However, severely inconsistent writable filesystems make such
> auditability much harder.

That actually makes a lot of sense. I had not considered the journal, which means one must modify the disk image just to mount it.

>> See the end of this email for a long note about what can and cannot
>> be guaranteed in the face of corrupt or malicious filesystem images.
>>
>>>> "that is not the case that we will handle with userspace FUSE
>>>> drivers, because the metadata is seriously broken"), the only way to
>>>> resolve such attack vectors is to run
>>>>
>>>> the full-scan fsck consistency check and then mount "rw"
>>>>
>>>> or
>>>>
>>>> use an immutable filesystem like EROFS (so that there will not
>>>> be such inconsistency issues by design) and isolate the entire write
>>>> traffic with a full copy-on-write mechanism, with OverlayFS for
>>>> example (IOWs, make all writes copy-on-write into another trusted
>>>> local filesystem).
>>>
>>> (Yeah, that's probably the only way to go for prepopulated images like
>>> root filesystems and container packages)
>>
>> Even an immutable filesystem can still be corrupt.
>>
>>>> I hope it's a valid case, and that can indeed happen if an arbitrary
>>>> generic filesystem can be mounted "rw". And my immutable image
>>>> filesystem idea can help mitigate this too (just because the immutable
>>>> image won't be changed in any way, and all writes are always copied up)
>>>
>>> That, we agree on :)
>>
>> Indeed, expecting writes to a corrupt filesystem to behave reasonably
>> is very foolish.
>>
>> Long note starts here: There is no *fundamental* reason that a crafted
>> filesystem image must be able to cause crashes, memory corruption, etc.
>
> I still think those kinds of security risks, arising just from
> implementation bugs, are the easiest part of the whole issue.
>
> Many Linux kernel bugs can cause crashes or memory corruption, so why
> do crafted filesystems need to be specially considered?
In the past, filesystem implementations have often not focused on this. The Linux Kernel CVE team does not issue CVEs for such bugs.

>> This applies even if the filesystem image may be written to while
>> mounted. It is always *possible* to write a filesystem such that
>> it never trusts anything it reads from disk and assumes each read
>> could return arbitrarily malicious results.
>
> Linux namespaces were invented for that kind of usage: broken
> archive images return garbage data, or archive images can even be
> changed randomly at runtime -- what are the real impacts if they are
> isolated by namespaces?

None! Regardless of whether one considers namespaces sufficient for isolating malicious code, they can definitely isolate filesystem operations very well.

>> Right now, many filesystem maintainers do not consider this to be a
>> priority. Even if they did, I don't think *anyone* (myself included)
>> could write a filesystem implementation in C that didn't have memory
>> corruption flaws. The only exceptions are if the filesystem is
>
> I think this still falls under implementation bugs; my question is
> simply: "why are filesystems special in this area? There are many
> other kernel subsystems in C that can receive untrusted data, like
> the TCP/IP stack" -- why are filesystems singled out for memory
> corruption flaws?

See above - the difference is that filesystems have historically not been written with untrusted input in mind. This, of course, can be changed.

> I really think different aspects are often mixed together when this
> topic is mentioned, which makes the discussion more and more
> divergent.

I agree.

> If we talk about implementation bugs, I think filesystems are not
> special, but as I said, the main issue is the writable filesystem's
> on-disk format design; due to that design, there are many severe
> consequences of inconsistent filesystems.
It definitely makes things much harder, and dramatically increases the attack surface. Most uses I have (notably backups) have a hard requirement for writable storage, and when they don't need it they can use dm-verity.

>> incredibly simple or formal methods are used, and neither is the
>> case for existing filesystems in the Linux kernel. By sandboxing a
>> filesystem, one ensures that an attacker who compromises a filesystem
>> implementation needs to find *another* exploit to compromise the
>> whole system.
>
> Yes, yet sandboxing is only one part; of course VM sandboxing
> is better than Linux namespace isolation, but VMs cost a lot.

I use a lot of VMs, but they indeed use significant resources. I hope that at some point this can largely be solved with copy-on-write VM forking.

> Other than sandboxing, I think auditability is important too,
> especially when users provide sensitive data to new workloads.
>
> Of course, only dealing with trusted workloads is best, without
> question. But in the real world, we cannot always expect completely
> trusted workloads. For untrusted workloads, we need to find reliable
> ways to audit them until they become trusted.
>
> Just like in the real world: accumulate credit, undergo
> audits, and eventually earn trust.
>
> Sorry about my English, but I hope I have expressed my whole idea.
>
> Thanks,
> Gao Xiang

Don't worry about your English. It is completely understandable and more than capable of getting your (very informative) points across.

--
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply [flat|nested] 79+ messages in thread
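The `e2fsck -fn` pre-check quoted at the top of this exchange can be sketched as a mount gate. This is a hypothetical helper, not fuse4fs code; the command tuple is parameterized purely so the gating policy can be exercised without a real image. The one fact relied on is that e2fsck exits with status 0 only when the forced, read-only pass found no errors:

```python
import subprocess

def image_is_consistent(image_path, fsck_cmd=("e2fsck", "-f", "-n")):
    """Run a forced, read-only fsck pass over the image.

    With -n the checker never modifies the image, and exit status 0
    means no errors were found; any nonzero status means "do not
    mount".  fsck_cmd is a parameter only so the policy can be tested
    with stand-in commands instead of a real filesystem image.
    """
    result = subprocess.run([*fsck_cmd, image_path], capture_output=True)
    return result.returncode == 0
```

A FUSE server following the thread's "full fsck in advance" advice could call this before attaching the image and refuse the mount (or fall back to read-only) when it returns False.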
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 5:13 ` Demi Marie Obenour @ 2026-03-22 5:30 ` Gao Xiang 0 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22 5:30 UTC (permalink / raw)
To: Demi Marie Obenour, Darrick J. Wong
Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc

On 2026/3/22 13:13, Demi Marie Obenour wrote:
> On 3/22/26 00:51, Gao Xiang wrote:
>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>> ...
>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>> an acceptable risk. Also unclear if you actually want -fy for that.
>>>
>> Let me try to reply to the remaining part:
>>
>>> To me, the attacks mentioned above are all either user error,
>>> or vulnerabilities in software accessing the filesystem. If one
>>
>> There are many consequences if users try to use potentially inconsistent
>> writable filesystems directly (without a full fsck); what I can think
>> of includes, but is not limited to:
>>
>> - data loss (consider a data block double-free issue);
>> - data theft (for example, users keep sensitive information in the
>>   workload in a high-permission inode, but it can later be read
>>   through a low-permission malicious inode);
>> - data tampering (by the same principle).
>>
>> All the vulnerabilities above happen after users try to write to the
>> inconsistent filesystem, which is hard to prevent through on-disk
>> design.
>>
>> But if users write with copy-on-write to another local consistent
>> filesystem, none of the vulnerabilities above exist.
>
> That makes sense! Is this because the reads are at least
> deterministic?

I read the remaining parts; I think only this one needs to be clarified. As I said in the latest reply, I don't think __simple__ is a suitable descriptive word.
The fact is that a filesystem needs to provide users enough information about files and the filesystem hierarchy; otherwise it shouldn't be called a filesystem.

That is the only thing an immutable filesystem does: provide filesystem information to the VFS, that is all. It is no simpler than that; it's the minimal feature set of a filesystem (and comparing slight on-disk differences means nothing unless the on-disk design is too bad).

And I think that should answer your question.

Thanks,
Gao Xiang

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 4:51 ` Gao Xiang 2026-03-22 5:13 ` Demi Marie Obenour @ 2026-03-23 9:54 ` Jan Kara 2026-03-23 10:19 ` Gao Xiang 1 sibling, 1 reply; 79+ messages in thread From: Jan Kara @ 2026-03-23 9:54 UTC (permalink / raw) To: Gao Xiang Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Sun 22-03-26 12:51:57, Gao Xiang wrote: > On 2026/3/22 11:25, Demi Marie Obenour wrote: > > > Technically speaking fuse4fs could just invoke e2fsck -fn before it > > > starts up the rest of the libfuse initialization but who knows if that's > > > an acceptable risk. Also unclear if you actually want -fy for that. > > > > Let me try to reply the remaining part: > > > To me, the attacks mentioned above are all either user error, > > or vulnerabilities in software accessing the filesystem. If one > > There are many consequences if users try to use potential inconsistent > writable filesystems directly (without full fsck), what I can think > out including but not limited to: > > - data loss (considering data block double free issue); > - data theft (for example, users keep sensitive information in the > workload in a high permission inode but it can be read with > low permission malicious inode later); > - data tamper (the same principle). > > All vulnerabilities above happen after users try to write the > inconsistent filesystem, which is hard to prevent by on-disk > design. > > But if users write with copy-on-write to another local consistent > filesystem, all the vulnerabilities above won't exist. OK, so if I understand correctly you are advocating that untrusted initial data should be provided on immutable filesystem and any needed modification would be handled by overlayfs (or some similar layer) and stored on (initially empty) writeable filesystem. 
That's a sensible design for use cases like containers, but what started this thread about FUSE drivers for filesystems were use cases like accessing filesystems on drives attached to the USB port of your laptop. There it isn't really practical to use your design. You need a standard writeable filesystem for that, but at the same time you cannot quite trust the content of everything that gets attached to your USB port...

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 9:54 ` [Lsf-pc] " Jan Kara @ 2026-03-23 10:19 ` Gao Xiang 2026-03-23 11:14 ` Jan Kara 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 10:19 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Jan, On 2026/3/23 17:54, Jan Kara wrote: > On Sun 22-03-26 12:51:57, Gao Xiang wrote: >> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>> starts up the rest of the libfuse initialization but who knows if that's >>>> an acceptable risk. Also unclear if you actually want -fy for that. >>> >> >> Let me try to reply the remaining part: >> >>> To me, the attacks mentioned above are all either user error, >>> or vulnerabilities in software accessing the filesystem. If one >> >> There are many consequences if users try to use potential inconsistent >> writable filesystems directly (without full fsck), what I can think >> out including but not limited to: >> >> - data loss (considering data block double free issue); >> - data theft (for example, users keep sensitive information in the >> workload in a high permission inode but it can be read with >> low permission malicious inode later); >> - data tamper (the same principle). >> >> All vulnerabilities above happen after users try to write the >> inconsistent filesystem, which is hard to prevent by on-disk >> design. >> >> But if users write with copy-on-write to another local consistent >> filesystem, all the vulnerabilities above won't exist. 
>
> OK, so if I understand correctly you are advocating that untrusted initial data
> should be provided on an immutable filesystem and any needed modification
> would be handled by overlayfs (or some similar layer) and stored on an
> (initially empty) writeable filesystem.
>
> That's a sensible design for use cases like containers, but what started this
> thread about FUSE drivers for filesystems were use cases like accessing
> filesystems on drives attached to the USB port of your laptop. There it isn't
> really practical to use your design. You need a standard writeable
> filesystem for that, but at the same time you cannot quite trust the content
> of everything that gets attached to your USB port...

Yes, that is my proposal and my overall interest now. I know your interest, but here I just would like to say:

Without a full-scan fsck, even with FUSE, the system is still vulnerable if the FUSE approach is used.

I could give a detailed example:

There are password files `/etc/passwd` and `/etc/shadow` with proper permissions (for example, you could audit the file permissions with e2fsprogs/xfsprogs without a full fsck scan) in the inconsistent remote filesystem, but there are some other malicious files called "foo" and "bar" somewhere with low permissions that illegally share the same blocks, which is disallowed by the filesystem's on-disk format (because it violates copy-on-write semantics by design); also see my previous reply:
https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com

The initial data of `/etc/passwd` and `/etc/shadow` in the filesystem image doesn't matter, but users could later keep very sensitive information there, on top of the inconsistent filesystem, which could cause the "data theft" above.

So I think an inconsistent filesystem harms users no matter whether the implementation uses FUSE or not.
You could claim it's not a case we care about, but I think most users should care, and they should run a full fsck in advance; yet if an image has been fscked and is consistent, I'm afraid it can then be handled in the kernel directly.

Thanks,
Gao Xiang

>
> Honza

^ permalink raw reply [flat|nested] 79+ messages in thread
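The cross-linked-block scenario described above is exactly the kind of inconsistency a full-scan fsck detects: two inodes both claiming ownership of the same data block. The sketch below is a toy model of that check, not any real on-disk format; the extent-tuple layout is an assumption for illustration only:

```python
def find_shared_blocks(extents):
    """Detect blocks claimed by more than one inode.

    extents: iterable of (inode_number, start_block, length) tuples,
    a toy stand-in for on-disk block mappings.  Returns the set of
    cross-linked blocks -- the inconsistency that would let a
    low-permission file alias the data blocks of a high-permission one.
    """
    owner = {}      # block number -> first inode seen claiming it
    shared = set()
    for ino, start, length in extents:
        for blk in range(start, start + length):
            if blk in owner and owner[blk] != ino:
                shared.add(blk)
            owner.setdefault(blk, ino)
    return shared
```

For example, `find_shared_blocks([(11, 100, 4), (99, 102, 4)])` flags blocks 102 and 103, mirroring the `/etc/shadow` vs. "foo"/"bar" scenario: writes through one inode become readable through the other. Only a scan over *all* mappings can rule this out, which is why spot-checking individual files' permissions is not enough.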
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 10:19 ` Gao Xiang @ 2026-03-23 11:14 ` Jan Kara 2026-03-23 11:42 ` Gao Xiang 2026-03-23 12:08 ` Demi Marie Obenour 0 siblings, 2 replies; 79+ messages in thread From: Jan Kara @ 2026-03-23 11:14 UTC (permalink / raw) To: Gao Xiang Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Gao! On Mon 23-03-26 18:19:16, Gao Xiang wrote: > On 2026/3/23 17:54, Jan Kara wrote: > > On Sun 22-03-26 12:51:57, Gao Xiang wrote: > > > On 2026/3/22 11:25, Demi Marie Obenour wrote: > > > > > Technically speaking fuse4fs could just invoke e2fsck -fn before it > > > > > starts up the rest of the libfuse initialization but who knows if that's > > > > > an acceptable risk. Also unclear if you actually want -fy for that. > > > > > > > > > > Let me try to reply the remaining part: > > > > > > > To me, the attacks mentioned above are all either user error, > > > > or vulnerabilities in software accessing the filesystem. If one > > > > > > There are many consequences if users try to use potential inconsistent > > > writable filesystems directly (without full fsck), what I can think > > > out including but not limited to: > > > > > > - data loss (considering data block double free issue); > > > - data theft (for example, users keep sensitive information in the > > > workload in a high permission inode but it can be read with > > > low permission malicious inode later); > > > - data tamper (the same principle). > > > > > > All vulnerabilities above happen after users try to write the > > > inconsistent filesystem, which is hard to prevent by on-disk > > > design. > > > > > > But if users write with copy-on-write to another local consistent > > > filesystem, all the vulnerabilities above won't exist. 
> > > > OK, so if I understand correctly you are advocating that untrusted initial data > > should be provided on immutable filesystem and any needed modification > > would be handled by overlayfs (or some similar layer) and stored on > > (initially empty) writeable filesystem. > > > > That's a sensible design for usecase like containers but what started this > > thread about FUSE drivers for filesystems were usecases like access to > > filesystems on drives attached at USB port of your laptop. There it isn't > > really practical to use your design. You need a standard writeable > > filesystem for that but at the same time you cannot quite trust the content > > of everything that gets attached to your USB port... > > Yes, that is my proposal and my overall interest now. I know > your interest but I'm here I just would like to say: > > Without full scan fsck, even with FUSE, the system is still > vulnerable if the FUSE approch is used. > > I could give a detailed example, for example: > > There are passwd files `/etc/passwd` and `/etc/shadow` with > proper permissions (for example, you could audit the file > permission with e2fsprogs/xfsprogs without a full fsck scan) in > the inconsistent remote filesystems, but there are some other > malicious files called "foo" and "bar" somewhere with low > permissions but sharing the same blocks which is disallowed > by filesystem on-disk formats illegally (because they violate > copy-on-write semantics by design), also see my previous > reply: > https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com > > The initial data of `/etc/passwd` and `/etc/shadow` in the > filesystem image doesn't matter, but users could then keep > very sensitive information later just out of the > inconsistent filesystems, which could cause "data theft" > above. Yes, I've seen you mentioning this case earlier in this thread. But let me say I consider it rather contrived :). 
For the container use case, if you are fetching, say, a root fs image and don't trust the content of the image, then how do you know it doesn't contain malicious code that sends all the sensitive data to some third party? So I believe the owner of the container has to trust the content of the image; otherwise you've already lost.

The container environment *provider* doesn't necessarily trust either the container owner or the image, so they need to make sure their infrastructure isn't compromised by malicious actions from these - and for that, either your immutable image scheme or FUSE mounting works.

Similarly with the USB drive content. Either some malicious actor plugs a USB drive into a laptop, it gets automounted, and that must not crash the kernel or give the attacker more privilege - but that's all - no data is stored on the drive. Or I myself plug some not-so-trusted USB drive into my laptop to read some content from it or possibly put some data there for a friend - again, that must not compromise my machine, but I'd be really dumb and have already lost the security game if I put any sensitive data on such a drive. And for this USB drive case, FUSE mounting solves these problems nicely.

So in my opinion, for practical use cases the FUSE solution addresses the real security concerns.

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 11:14 ` Jan Kara @ 2026-03-23 11:42 ` Gao Xiang 2026-03-23 12:01 ` Gao Xiang 2026-03-23 12:08 ` Demi Marie Obenour 1 sibling, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 11:42 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Jan! On 2026/3/23 19:14, Jan Kara wrote: > Hi Gao! > > On Mon 23-03-26 18:19:16, Gao Xiang wrote: >> On 2026/3/23 17:54, Jan Kara wrote: >>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>> >>>> >>>> Let me try to reply the remaining part: >>>> >>>>> To me, the attacks mentioned above are all either user error, >>>>> or vulnerabilities in software accessing the filesystem. If one >>>> >>>> There are many consequences if users try to use potential inconsistent >>>> writable filesystems directly (without full fsck), what I can think >>>> out including but not limited to: >>>> >>>> - data loss (considering data block double free issue); >>>> - data theft (for example, users keep sensitive information in the >>>> workload in a high permission inode but it can be read with >>>> low permission malicious inode later); >>>> - data tamper (the same principle). >>>> >>>> All vulnerabilities above happen after users try to write the >>>> inconsistent filesystem, which is hard to prevent by on-disk >>>> design. >>>> >>>> But if users write with copy-on-write to another local consistent >>>> filesystem, all the vulnerabilities above won't exist. 
>>> >>> OK, so if I understand correctly you are advocating that untrusted initial data >>> should be provided on immutable filesystem and any needed modification >>> would be handled by overlayfs (or some similar layer) and stored on >>> (initially empty) writeable filesystem. >>> >>> That's a sensible design for usecase like containers but what started this >>> thread about FUSE drivers for filesystems were usecases like access to >>> filesystems on drives attached at USB port of your laptop. There it isn't >>> really practical to use your design. You need a standard writeable >>> filesystem for that but at the same time you cannot quite trust the content >>> of everything that gets attached to your USB port... >> >> Yes, that is my proposal and my overall interest now. I know >> your interest but I'm here I just would like to say: >> >> Without full scan fsck, even with FUSE, the system is still >> vulnerable if the FUSE approch is used. >> >> I could give a detailed example, for example: >> >> There are passwd files `/etc/passwd` and `/etc/shadow` with >> proper permissions (for example, you could audit the file >> permission with e2fsprogs/xfsprogs without a full fsck scan) in >> the inconsistent remote filesystems, but there are some other >> malicious files called "foo" and "bar" somewhere with low >> permissions but sharing the same blocks which is disallowed >> by filesystem on-disk formats illegally (because they violate >> copy-on-write semantics by design), also see my previous >> reply: >> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >> >> The initial data of `/etc/passwd` and `/etc/shadow` in the >> filesystem image doesn't matter, but users could then keep >> very sensitive information later just out of the >> inconsistent filesystems, which could cause "data theft" >> above. > > Yes, I've seen you mentioning this case earlier in this thread. But let me > say I consider it rather contrived :). 
> For the container use case, if you are
> fetching, say, a root fs image and don't trust the content of the image, then
> how do you know it doesn't contain malicious code that sends all the
> sensitive data to some third party? So I believe the owner of the container
> has to trust the content of the image, otherwise you've already lost.

The fact is that many cloud vendors have malicious content scanners, much like virus scanners.

They just scan the filesystem tree and all contents, but I think severe filesystem metadata inconsistency is not something they have previously cared about. But of course, you could ask them to fsck too.

Also see below.

>
> The container environment *provider* doesn't necessarily trust either the
> container owner or the image, so they need to make sure their infrastructure
> isn't compromised by malicious actions from these - and for that, either
> your immutable image scheme or FUSE mounting works.
>
> Similarly with the USB drive content. Either some malicious actor plugs a USB
> drive into a laptop, it gets automounted, and that must not crash the
> kernel or give the attacker more privilege - but that's all - no data is
> stored on the drive. Or I myself plug some not-so-trusted USB drive into my
> laptop to read some content from it or possibly put some data there for a
> friend - again, that must not compromise my machine, but I'd be really dumb
> and have already lost the security game if I put any sensitive data on such
> a drive. And for this USB drive case, FUSE mounting solves these problems
> nicely.
>
> So in my opinion, for practical use cases the FUSE solution addresses the
> real security concerns.

Sorry, I shouldn't speak that way; as you said, the security concept depends on the context, the limitations, and how you think about it.

First, I need to rephrase a bit of the above in case there is some divergent discussion:

> for example, you could audit the file permissions with e2fsprogs/xfsprogs without a full fsck scan.
In order to be best-effort, userspace programs should open for write and fstat the permission bits before writing sensitive information; that avoids TOCTOU attacks as far as a userspace program can.

Container users use namespaces, of course; a namespace can only provide isolation, and that is the only security guarantee a namespace can provide, no question about that.

Strictly speaking, as you mentioned, both ways ensure isolation (if namespaces are used) and kernel stability (let's not nitpick about this). And let's not talk about malicious block devices or the like, because that's not a typical setup (maybe it could be a typical setup in some cases, but that should be another system-wide security design) and should be clarified by system admins, for example.

What I just want to say is that the FUSE mount approach _might_ suggest stronger security guarantees than users actually get: beyond avoiding system crashes etc., many users will expect that they can use a generic writable filesystem directly with FUSE, without a full-scan fsck in advance, and keep their sensitive data on it directly. I don't think that is a corner case if you don't state the limitations of the FUSE approach.

If no one expects that, that is absolutely fine; as I said, it provides strong isolation and stability. But I really suspect this approach could be abused to mount totally untrusted remote filesystems. (Actually, as I said, some business of ours already did: fetching ext4 filesystems of unknown status and mounting them without fscking, which is really disappointing.)

Thanks,
Gao Xiang

>
> Honza

^ permalink raw reply [flat|nested] 79+ messages in thread
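The open-then-fstat pattern described above can be sketched like this. This is a hypothetical helper, not an API from any FUSE project, and the exact permission policy (regular file, owned by the caller, no group/other access) is an assumption chosen for illustration:

```python
import os
import stat

def write_sensitive(path, data):
    """Open first, then check the already-open fd with fstat.

    Checking the fd rather than the path means the permission test and
    the write refer to the same inode, closing the classic
    stat-then-open TOCTOU window.  The policy enforced below is just an
    example of what a best-effort userspace check might require.
    """
    fd = os.open(path, os.O_WRONLY | os.O_NOFOLLOW)
    try:
        st = os.fstat(fd)
        if not stat.S_ISREG(st.st_mode):
            raise PermissionError(f"{path}: not a regular file")
        if st.st_uid != os.getuid():
            raise PermissionError(f"{path}: not owned by the caller")
        if st.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise PermissionError(f"{path}: accessible to group/other")
        os.write(fd, data)
    finally:
        os.close(fd)
```

Note that, per the shared-blocks example earlier in the thread, this check is only best-effort: it cannot defend against block-level aliasing on an inconsistent filesystem, which is why the full fsck remains necessary.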
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 11:42 ` Gao Xiang @ 2026-03-23 12:01 ` Gao Xiang 2026-03-23 14:13 ` Jan Kara 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 12:01 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 19:42, Gao Xiang wrote: > Hi Jan! > > On 2026/3/23 19:14, Jan Kara wrote: >> Hi Gao! >> >> On Mon 23-03-26 18:19:16, Gao Xiang wrote: >>> On 2026/3/23 17:54, Jan Kara wrote: >>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>>> >>>>> >>>>> Let me try to reply the remaining part: >>>>> >>>>>> To me, the attacks mentioned above are all either user error, >>>>>> or vulnerabilities in software accessing the filesystem. If one >>>>> >>>>> There are many consequences if users try to use potential inconsistent >>>>> writable filesystems directly (without full fsck), what I can think >>>>> out including but not limited to: >>>>> >>>>> - data loss (considering data block double free issue); >>>>> - data theft (for example, users keep sensitive information in the >>>>> workload in a high permission inode but it can be read with >>>>> low permission malicious inode later); >>>>> - data tamper (the same principle). >>>>> >>>>> All vulnerabilities above happen after users try to write the >>>>> inconsistent filesystem, which is hard to prevent by on-disk >>>>> design. >>>>> >>>>> But if users write with copy-on-write to another local consistent >>>>> filesystem, all the vulnerabilities above won't exist. 
>>>> >>>> OK, so if I understand correctly you are advocating that untrusted initial data >>>> should be provided on immutable filesystem and any needed modification >>>> would be handled by overlayfs (or some similar layer) and stored on >>>> (initially empty) writeable filesystem. >>>> >>>> That's a sensible design for usecase like containers but what started this >>>> thread about FUSE drivers for filesystems were usecases like access to >>>> filesystems on drives attached at USB port of your laptop. There it isn't >>>> really practical to use your design. You need a standard writeable >>>> filesystem for that but at the same time you cannot quite trust the content >>>> of everything that gets attached to your USB port... >>> >>> Yes, that is my proposal and my overall interest now. I know >>> your interest but I'm here I just would like to say: >>> >>> Without full scan fsck, even with FUSE, the system is still >>> vulnerable if the FUSE approch is used. >>> >>> I could give a detailed example, for example: >>> >>> There are passwd files `/etc/passwd` and `/etc/shadow` with >>> proper permissions (for example, you could audit the file >>> permission with e2fsprogs/xfsprogs without a full fsck scan) in >>> the inconsistent remote filesystems, but there are some other >>> malicious files called "foo" and "bar" somewhere with low >>> permissions but sharing the same blocks which is disallowed >>> by filesystem on-disk formats illegally (because they violate >>> copy-on-write semantics by design), also see my previous >>> reply: >>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >>> >>> The initial data of `/etc/passwd` and `/etc/shadow` in the >>> filesystem image doesn't matter, but users could then keep >>> very sensitive information later just out of the >>> inconsistent filesystems, which could cause "data theft" >>> above. >> >> Yes, I've seen you mentioning this case earlier in this thread. 
But let me >> say I consider it rather contrived :). For the container usecase if you are >> fetching say a root fs image and don't trust the content of the image, then >> how do you know it doesn't contain a malicious code that sends all the >> sensitive data to some third party? So I believe the owner of the container >> has to trust the content of the image, otherwise you've already lost. > > The fact is that many cloud vendors have malicious content > scanners, much like virus scanners. > > They just scan the filesystem tree and all contents, but I > think the severe filesystem metadata consistency is not > what they previously care about. But of course, you could > ask them to fsck too. > > Also see below. > >> >> The container environment *provider* doesn't necessarily trust either the >> container owner or the image so they need to make sure their infrastructure >> isn't compromised by malicious actions from these - and for that either >> your immutable image scheme or FUSE mounting works. >> >> Similarly with the USB drive content. Either some malicious actor plugs USB >> drive into a laptop, it gets automounted, and that must not crash the >> kernel or give attacker more priviledge - but that's all - no data is >> stored on the drive. Or I myself plug some not-so-trusted USB drive to my >> laptop to read some content from it or possibly put there some data for a >> friend - again that must not compromise my machine but I'd be really dumb >> and already lost the security game if I'd put any sensitive data to such >> drive. And for this USB drive case FUSE mounting solves these problems >> nicely. >> >> So in my opinion for practical usecases the FUSE solution addresses the >> real security concerns. > > Sorry, I shouldn't speak in that way, as you said, the > security concepts depends on the context, limitation > and how do you think. 
> > First, I need to rephrase a bit above in case there > could be some divergent discussion: > >> for example, you could audit the file permission with > e2fsprogs/xfsprogs without a full fsck scan. > > In order to make the userspace programs best-effort, they > should open for write and fstat the permission bits > before writing sensitive informations, it avoids TOCTOU > attacks as much as possible as userspace programs. > > > Container users use namespaces of course, namespace can > only provide isolations, that is the only security > guarantees namespace can provide, no question of that. > > Let's just strictly speaking, as you mentioned, both ways > ensure the isolation (if namespaces are used) and kernel > stability (let's not nitpick about this). And let's not > talk about malicious block devices or likewise, because > it's not a typical setup (maybe it could be a typical > setup for some cases, but it should be another system-wide > security design) and should be clarified by system admins > for example. > > What I just want to say is that: FUSE mount approach _might_ > give more incorrect security guarantees than the real users > expect: I think other than avoiding system crashes etc, many > users should expect that they could use the generic writable > filesystem directly with FUSE without full-scan fsck > in advance and keep their sensitive data directly, I don't Even if you think those are still corner cases that users anticipate incorrectly: double-free issues, for example, can lose any useful data written to an inconsistent filesystem -- and that may be totally unrelated to security. What I want to say is that users' interest in the new FUSE approach is "no full fsck"; otherwise, if a full fsck is run anyway, why wouldn't they just mount in the kernel (I do think kernel filesystems should fix all bugs arising from normal, consistent usage)?
However, "no fsck" plus FUSE mounts brings many incorrect assumptions that users can never anticipate: the filesystem is still unreliable and may fail to keep any useful data on that storage. Hopefully that explains my idea. > think that is the corner cases if you don't claim the > limitation of FUSE approaches. > > If none expects that, that is absolute be fine, as I said, > it provides strong isolation and stability, but I really > suspect this approach could be abused to mount totally > untrusted remote filesystems (Actually as I said, some > business of ours already did: fetching EXT4 filesystems > with unknown status and mount without fscking, that is > really disappointing.) > > Thanks, > Gao Xiang > > >> >> Honza > ^ permalink raw reply [flat|nested] 79+ messages in thread
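The open-then-fstat pattern described in the quoted text above can be sketched as follows. This is a minimal illustration, not code from the thread; the specific permission policy (regular file, owned by us, not group/other-writable) is a hypothetical example of what a best-effort userspace check might look like:

```python
import os
import stat
import tempfile

def write_secret_safely(path: str, data: bytes) -> None:
    """Open first, then validate the file descriptor we actually hold.

    Checking the path with stat() *before* open() would be a TOCTOU
    race: the file could be swapped for a symlink or a more permissive
    file in between.  fstat() on the already-open fd cannot race that
    way.
    """
    fd = os.open(path, os.O_WRONLY | os.O_NOFOLLOW)
    try:
        st = os.fstat(fd)
        # Hypothetical policy for this sketch: only write to a regular
        # file that we own and that nobody else can write to.
        if not stat.S_ISREG(st.st_mode):
            raise PermissionError("not a regular file")
        if st.st_uid != os.getuid():
            raise PermissionError("unexpected owner")
        if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            raise PermissionError("group/other writable")
        os.write(fd, data)
    finally:
        os.close(fd)

# Tiny demo on a private temp file.
tmpdir = tempfile.mkdtemp()
target = os.path.join(tmpdir, "secret")
open(target, "wb").close()
os.chmod(target, 0o600)
write_secret_safely(target, b"sensitive bytes")
```

The key point is that the checks run on the file descriptor actually being written, so the file cannot be substituted between the check and the write.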
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 12:01 ` Gao Xiang @ 2026-03-23 14:13 ` Jan Kara 2026-03-23 14:36 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Jan Kara @ 2026-03-23 14:13 UTC (permalink / raw) To: Gao Xiang Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Mon 23-03-26 20:01:32, Gao Xiang wrote: > On 2026/3/23 19:42, Gao Xiang wrote: > > > for example, you could audit the file permission with > > e2fsprogs/xfsprogs without a full fsck scan. > > > > In order to make the userspace programs best-effort, they > > should open for write and fstat the permission bits > > before writing sensitive informations, it avoids TOCTOU > > attacks as much as possible as userspace programs. > > > > > > Container users use namespaces of course, namespace can > > only provide isolations, that is the only security > > guarantees namespace can provide, no question of that. > > > > Let's just strictly speaking, as you mentioned, both ways > > ensure the isolation (if namespaces are used) and kernel > > stability (let's not nitpick about this). And let's not > > talk about malicious block devices or likewise, because > > it's not a typical setup (maybe it could be a typical > > setup for some cases, but it should be another system-wide > > security design) and should be clarified by system admins > > for example. 
> > > > What I just want to say is that: FUSE mount approach _might_ > > give more incorrect security guarantees than the real users > > expect: I think other than avoiding system crashes etc, many > > users should expect that they could use the generic writable > > filesystem directly with FUSE without full-scan fsck > > in advance and keep their sensitive data directly, I don't > > > If you think that is still the corner cases that users expect > incorrectly, For example, I think double freeing issues can > make any useful write stuffs lost just out of inconsistent > filesystem -- that may be totally unrelated to the security. > > What I want to say is that, the users' interest of new FUSE > approch is "no full fsck"; Otherwise, if full fsck is used, > why not they mount in the kernel then (I do think kernel > filesystems should fix all bugs out of normal consistent > usage)? > > However, "no fsck" and FUSE mounts bring many incorrect > assumption that users can never expect: it's still unreliable, > maybe cannot keep any useful data in that storage. > > Hopefully I explain my idea. I see, and I agree that for some cases FUSE access to the untrusted filesystem needn't be enough. > > think that is the corner cases if you don't claim the > > limitation of FUSE approaches. > > > > If none expects that, that is absolute be fine, as I said, > > it provides strong isolation and stability, but I really > > suspect this approach could be abused to mount totally > > untrusted remote filesystems (Actually as I said, some > > business of ours already did: fetching EXT4 filesystems > > with unknown status and mount without fscking, that is > > really disappointing.) Yes, someone downloading an untrusted ext4 image, mounting it read-write and using it for a sensitive application falls into the "insane" category for me :) We agree on that.
And I agree that, depending on the application, using FUSE to access such a filesystem needn't be safe enough, and an immutable fs + overlayfs writable layer may provide better guarantees about fs behavior. I would still consider such a design highly suspicious, but without more detailed knowledge about the application I cannot say it's outright broken :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 14:13 ` Jan Kara @ 2026-03-23 14:36 ` Gao Xiang 2026-03-23 14:47 ` Jan Kara 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 14:36 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 22:13, Jan Kara wrote: > On Mon 23-03-26 20:01:32, Gao Xiang wrote: >> On 2026/3/23 19:42, Gao Xiang wrote: >>>> for example, you could audit the file permission with >>> e2fsprogs/xfsprogs without a full fsck scan. >>> >>> In order to make the userspace programs best-effort, they >>> should open for write and fstat the permission bits >>> before writing sensitive informations, it avoids TOCTOU >>> attacks as much as possible as userspace programs. >>> >>> >>> Container users use namespaces of course, namespace can >>> only provide isolations, that is the only security >>> guarantees namespace can provide, no question of that. >>> >>> Let's just strictly speaking, as you mentioned, both ways >>> ensure the isolation (if namespaces are used) and kernel >>> stability (let's not nitpick about this). And let's not >>> talk about malicious block devices or likewise, because >>> it's not a typical setup (maybe it could be a typical >>> setup for some cases, but it should be another system-wide >>> security design) and should be clarified by system admins >>> for example. 
>>> >>> What I just want to say is that: FUSE mount approach _might_ >>> give more incorrect security guarantees than the real users >>> expect: I think other than avoiding system crashes etc, many >>> users should expect that they could use the generic writable >>> filesystem directly with FUSE without full-scan fsck >>> in advance and keep their sensitive data directly, I don't >> >> >> If you think that is still the corner cases that users expect >> incorrectly, For example, I think double freeing issues can >> make any useful write stuffs lost just out of inconsistent >> filesystem -- that may be totally unrelated to the security. >> >> What I want to say is that, the users' interest of new FUSE >> approch is "no full fsck"; Otherwise, if full fsck is used, >> why not they mount in the kernel then (I do think kernel >> filesystems should fix all bugs out of normal consistent >> usage)? >> >> However, "no fsck" and FUSE mounts bring many incorrect >> assumption that users can never expect: it's still unreliable, >> maybe cannot keep any useful data in that storage. >> >> Hopefully I explain my idea. > > I see and I agree that for some cases FUSE access to the untrusted > filesystem needn't be enough Yes, my opinion is that FUSE approaches are fine, as long as we clearly document the limitations and restrictions. For writable filesystems, a full-scan fsck is clearly needed to keep the filesystem consistent, at least to avoid potential data loss. Otherwise, I'm pretty sure some aggressive users will abuse this feature with "no fsck"... > >>> think that is the corner cases if you don't claim the >>> limitation of FUSE approaches.
>>> >>> If none expects that, that is absolute be fine, as I said, >>> it provides strong isolation and stability, but I really >>> suspect this approach could be abused to mount totally >>> untrusted remote filesystems (Actually as I said, some >>> business of ours already did: fetching EXT4 filesystems >>> with unknown status and mount without fscking, that is >>> really disappointing.) > > Yes, someone downloading untrusted ext4 image, mounting in read-write and > using it for sensitive application, that falls to "insane" category for me > :) We agree on that. And I agree that depending on the application using > FUSE to access such filesystem needn't be safe enough and immutable fs + > overlayfs writeable layer may provide better guarantees about fs behavior. That is my overall goal; I just want to make clear the difference beyond write isolation. But of course, "secure" or not is relative and depends on the system design. If isolation and system stability are enough for a system to be called "secure", then yes, the two approaches are the same in those respects. > I would still consider such design highly suspicious but without more > detailed knowledge about the application I cannot say it's outright broken > :). What do you mean by "such design"? "Mounting writable untrusted remote EXT4 images on the host"? Really, we have had such container applications for many years, but I don't want to name them here; I'm totally exhausted by such usage (since I have explained it many, many times, and they never even bother with LWN.net) and by the internal team. Thanks, Gao Xiang > > Honza > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 14:36 ` Gao Xiang @ 2026-03-23 14:47 ` Jan Kara 2026-03-23 14:57 ` Gao Xiang 2026-03-24 8:48 ` Christian Brauner 0 siblings, 2 replies; 79+ messages in thread From: Jan Kara @ 2026-03-23 14:47 UTC (permalink / raw) To: Gao Xiang Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Mon 23-03-26 22:36:46, Gao Xiang wrote: > On 2026/3/23 22:13, Jan Kara wrote: > > > > think that is the corner cases if you don't claim the > > > > limitation of FUSE approaches. > > > > > > > > If none expects that, that is absolute be fine, as I said, > > > > it provides strong isolation and stability, but I really > > > > suspect this approach could be abused to mount totally > > > > untrusted remote filesystems (Actually as I said, some > > > > business of ours already did: fetching EXT4 filesystems > > > > with unknown status and mount without fscking, that is > > > > really disappointing.) > > > > Yes, someone downloading untrusted ext4 image, mounting in read-write and > > using it for sensitive application, that falls to "insane" category for me > > :) We agree on that. And I agree that depending on the application using > > FUSE to access such filesystem needn't be safe enough and immutable fs + > > overlayfs writeable layer may provide better guarantees about fs behavior. > > That is my overall goal, I just want to make it clear > the difference out of write isolation, but of course, > "secure" or not is relative, and according to the > system design. > > If isolation and system stability are enough for > a system and can be called "secure", yes, they are > both the same in such aspects. 
> > > I would still consider such design highly suspicious but without more > > detailed knowledge about the application I cannot say it's outright broken > > :). > > What do you mean "such design"? "Writable untrusted > remote EXT4 images mounting on the host"? Really, we have > such applications for containers for many years but I don't > want to name it here, but I'm totally exhaused by such > usage (since I explained many many times, and they even > never bother with LWN.net) and the internal team. By "such design" I meant the general concept of fetching filesystem images (whether ext4 or some other type) from an untrusted source. Unless you do cryptographic verification of the data, you never know what kind of garbage your application is processing, which is always an invitation for nasty exploits and bugs... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 14:47 ` Jan Kara @ 2026-03-23 14:57 ` Gao Xiang 2026-03-24 8:48 ` Christian Brauner 1 sibling, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-03-23 14:57 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 22:47, Jan Kara wrote: > On Mon 23-03-26 22:36:46, Gao Xiang wrote: >> On 2026/3/23 22:13, Jan Kara wrote: >>>>> think that is the corner cases if you don't claim the >>>>> limitation of FUSE approaches. >>>>> >>>>> If none expects that, that is absolute be fine, as I said, >>>>> it provides strong isolation and stability, but I really >>>>> suspect this approach could be abused to mount totally >>>>> untrusted remote filesystems (Actually as I said, some >>>>> business of ours already did: fetching EXT4 filesystems >>>>> with unknown status and mount without fscking, that is >>>>> really disappointing.) >>> >>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>> using it for sensitive application, that falls to "insane" category for me >>> :) We agree on that. And I agree that depending on the application using >>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>> overlayfs writeable layer may provide better guarantees about fs behavior. >> >> That is my overall goal, I just want to make it clear >> the difference out of write isolation, but of course, >> "secure" or not is relative, and according to the >> system design. >> >> If isolation and system stability are enough for >> a system and can be called "secure", yes, they are >> both the same in such aspects. >> >>> I would still consider such design highly suspicious but without more >>> detailed knowledge about the application I cannot say it's outright broken >>> :). 
>> What do you mean "such design"? "Writable untrusted >> remote EXT4 images mounting on the host"? Really, we have >> such applications for containers for many years but I don't >> want to name it here, but I'm totally exhaused by such >> usage (since I explained many many times, and they even >> never bother with LWN.net) and the internal team. > > By "such design" I meant generally the concept that you fetch filesystem > images (regardless whether ext4 or some other type) from untrusted source. > Unless you do cryptographical verification of the data, you never know what > kind of garbage your application is processing which is always invitation > for nasty exploits and bugs... That is very common for Docker images, for example (although I admit Docker images now use TAR archives, which can be seen as an immutable system too). For example, Docker Hub keeps TAR images with sha256 digests, but only sha256, without any cryptographic signature. Of course, you could rely on your image scanner to audit malicious contents if you want (e.g. if you keep sensitive information), rely on 3rd-party applications to scan for you, or never scan, since you don't keep any sensitive information there --- that works because Docker images are extracted into another trusted local filesystem, like the model I mentioned before. IOWs, the image sha256 only ensures that "the tar images you downloaded" are what "the publisher once uploaded", no more than that. And that model is also our interest, since the core EROFS format should fit such a model too. Thanks, Gao Xiang > > Honza ^ permalink raw reply [flat|nested] 79+ messages in thread
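The digest-versus-signature distinction above can be illustrated with a short sketch (the payload here is made up for illustration): a content digest proves only that the downloaded bytes match what the publisher once uploaded, not who the publisher is.

```python
import hashlib

# Publisher side: upload bytes and advertise their sha256 -- this is
# all a plain content digest records.
layer = b"tar archive bytes of an image layer"  # made-up payload
advertised = hashlib.sha256(layer).hexdigest()

# Fetcher side: integrity check -- the download matches the digest.
fetched = layer  # pretend this arrived over the network
assert hashlib.sha256(fetched).hexdigest() == advertised

# A corrupted or tampered download is caught:
assert hashlib.sha256(fetched + b"\x00").hexdigest() != advertised

# But nothing above says *who* produced `advertised`: authenticating
# the publisher requires a signature over the digest by a trusted key,
# which a bare sha256 does not provide.
```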
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 14:47 ` Jan Kara 2026-03-23 14:57 ` Gao Xiang @ 2026-03-24 8:48 ` Christian Brauner 2026-03-24 9:30 ` Gao Xiang 2026-03-24 11:58 ` Demi Marie Obenour 1 sibling, 2 replies; 79+ messages in thread From: Christian Brauner @ 2026-03-24 8:48 UTC (permalink / raw) To: Jan Kara Cc: Gao Xiang, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: > On Mon 23-03-26 22:36:46, Gao Xiang wrote: > > On 2026/3/23 22:13, Jan Kara wrote: > > > > > think that is the corner cases if you don't claim the > > > > > limitation of FUSE approaches. > > > > > > > > > > If none expects that, that is absolute be fine, as I said, > > > > > it provides strong isolation and stability, but I really > > > > > suspect this approach could be abused to mount totally > > > > > untrusted remote filesystems (Actually as I said, some > > > > > business of ours already did: fetching EXT4 filesystems > > > > > with unknown status and mount without fscking, that is > > > > > really disappointing.) > > > > > > Yes, someone downloading untrusted ext4 image, mounting in read-write and > > > using it for sensitive application, that falls to "insane" category for me > > > :) We agree on that. And I agree that depending on the application using > > > FUSE to access such filesystem needn't be safe enough and immutable fs + > > > overlayfs writeable layer may provide better guarantees about fs behavior. > > > > That is my overall goal, I just want to make it clear > > the difference out of write isolation, but of course, > > "secure" or not is relative, and according to the > > system design. 
> > > > If isolation and system stability are enough for > > a system and can be called "secure", yes, they are > > both the same in such aspects. > > > > > I would still consider such design highly suspicious but without more > > > detailed knowledge about the application I cannot say it's outright broken > > > :). > > > > What do you mean "such design"? "Writable untrusted > > remote EXT4 images mounting on the host"? Really, we have > > such applications for containers for many years but I don't > > want to name it here, but I'm totally exhaused by such > > usage (since I explained many many times, and they even > > never bother with LWN.net) and the internal team. > > By "such design" I meant generally the concept that you fetch filesystem > images (regardless whether ext4 or some other type) from untrusted source. > Unless you do cryptographical verification of the data, you never know what > kind of garbage your application is processing which is always invitation > for nasty exploits and bugs... If this is another 500-mail discussion about FS_USERNS_MOUNT on block-backed filesystems, then my verdict still stands: the only condition under which I will let the VFS allow this is that the underlying device is signed and dm-verity protected. The kernel will continue to refuse unprivileged policy in general, and specifically policy based on the quality or implementation of the underlying filesystem driver. ^ permalink raw reply [flat|nested] 79+ messages in thread
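For context on the dm-verity condition, its guarantee can be sketched conceptually: data blocks are hashed, the hashes form a tree, and only the root hash needs to be trusted (e.g. provisioned or signed out of band). The toy Python below illustrates the trust model only; it is not dm-verity's actual on-disk format, and the block size and payload are made up:

```python
import hashlib

BLOCK = 4  # toy block size; real dm-verity typically uses 4096 bytes

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def split_blocks(data: bytes) -> list:
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def root_hash(data: bytes) -> bytes:
    # One-level tree: hash each block, then hash the concatenation of
    # block hashes.  Real dm-verity builds a multi-level tree, but the
    # trust model is the same: everything is authenticated by the root.
    return h(b"".join(h(b) for b in split_blocks(data)))

image = b"filesystem image contents"
trusted_root = root_hash(image)  # signed / provisioned out of band

# Reading back: any bit-flip in the image changes the root hash.
assert root_hash(image) == trusted_root
assert root_hash(b"filesystem image c0ntents") != trusted_root
```

This is why a signed root hash lets the kernel treat the whole device as tamper-evident without trusting the image's origin.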
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 8:48 ` Christian Brauner @ 2026-03-24 9:30 ` Gao Xiang 2026-03-24 9:49 ` Demi Marie Obenour 2026-03-24 11:58 ` Demi Marie Obenour 1 sibling, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-24 9:30 UTC (permalink / raw) To: Christian Brauner, Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Christian, On 2026/3/24 16:48, Christian Brauner wrote: > On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>> think that is the corner cases if you don't claim the >>>>>> limitation of FUSE approaches. >>>>>> >>>>>> If none expects that, that is absolute be fine, as I said, >>>>>> it provides strong isolation and stability, but I really >>>>>> suspect this approach could be abused to mount totally >>>>>> untrusted remote filesystems (Actually as I said, some >>>>>> business of ours already did: fetching EXT4 filesystems >>>>>> with unknown status and mount without fscking, that is >>>>>> really disappointing.) >>>> >>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>> using it for sensitive application, that falls to "insane" category for me >>>> :) We agree on that. And I agree that depending on the application using >>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>> >>> That is my overall goal, I just want to make it clear >>> the difference out of write isolation, but of course, >>> "secure" or not is relative, and according to the >>> system design. 
>>> >>> If isolation and system stability are enough for >>> a system and can be called "secure", yes, they are >>> both the same in such aspects. >>> >>>> I would still consider such design highly suspicious but without more >>>> detailed knowledge about the application I cannot say it's outright broken >>>> :). >>> >>> What do you mean "such design"? "Writable untrusted >>> remote EXT4 images mounting on the host"? Really, we have >>> such applications for containers for many years but I don't >>> want to name it here, but I'm totally exhaused by such >>> usage (since I explained many many times, and they even >>> never bother with LWN.net) and the internal team. >> >> By "such design" I meant generally the concept that you fetch filesystem >> images (regardless whether ext4 or some other type) from untrusted source. >> Unless you do cryptographical verification of the data, you never know what >> kind of garbage your application is processing which is always invitation >> for nasty exploits and bugs... > > If this is another 500 mail discussion about FS_USERNS_MOUNT on > block-backed filesystems then my verdict still stands that the only > condition under which I will let the VFS allow this if the underlying > device is signed and dm-verity protected. The kernel will continue to > refuse unprivileged policy in general and specifically based on quality > or implementation of the underlying filesystem driver. First, if block devices are your concern, fine: how about allowing it for EROFS file-backed mounts with S_IMMUTABLE set on the underlying files, and refusing any block-device mounts?
If the issue is "you don't know how to define the quality or implementation of the underlying filesystem drivers", you could list your detailed concerns (I think people at least trust the individual filesystem maintainers' judgement); otherwise there will be endless sets of new immutable filesystems for this requirement (previously composefs, puzzlefs, and tarfs were all for this; I admit I didn't get the point of FS_USERNS_MOUNT back in 2023, but now I also think FS_USERNS_MOUNT is a strong requirement, for DinD for example), because that idea should be sensible according to Darrick's and Jan's replies, and I think more people will agree with that. Another point is that an immutable FUSE fs could still return arbitrary metadata and let users get garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT; if user and mount namespaces are isolated, why bother? I just hope to know why. And as you may notice, Demi Marie Obenour wrote: > The only exceptions are if the filesystem is incredibly simple > or formal methods are used, and neither is the case for existing > filesystems in the Linux kernel. I still strongly disagree with that judgement: a minimal EROFS image with a superblock, dirs, and files with xattrs fits in 4k, and a 4k image should be enough for fuzzing; also, the in-core EROFS format never allocates any extra buffers, which is much simpler than FUSE. In brief, how can your requirement be met? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 9:30 ` Gao Xiang @ 2026-03-24 9:49 ` Demi Marie Obenour 2026-03-24 9:53 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-24 9:49 UTC (permalink / raw) To: Gao Xiang, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 4926 bytes --] On 3/24/26 05:30, Gao Xiang wrote: > Hi Christian, > > On 2026/3/24 16:48, Christian Brauner wrote: >> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>> think that is the corner cases if you don't claim the >>>>>>> limitation of FUSE approaches. >>>>>>> >>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>> it provides strong isolation and stability, but I really >>>>>>> suspect this approach could be abused to mount totally >>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>> with unknown status and mount without fscking, that is >>>>>>> really disappointing.) >>>>> >>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>> using it for sensitive application, that falls to "insane" category for me >>>>> :) We agree on that. And I agree that depending on the application using >>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>>> >>>> That is my overall goal, I just want to make it clear >>>> the difference out of write isolation, but of course, >>>> "secure" or not is relative, and according to the >>>> system design. 
>>>> >>>> If isolation and system stability are enough for >>>> a system and can be called "secure", yes, they are >>>> both the same in such aspects. >>>> >>>>> I would still consider such design highly suspicious but without more >>>>> detailed knowledge about the application I cannot say it's outright broken >>>>> :). >>>> >>>> What do you mean "such design"? "Writable untrusted >>>> remote EXT4 images mounting on the host"? Really, we have >>>> such applications for containers for many years but I don't >>>> want to name it here, but I'm totally exhaused by such >>>> usage (since I explained many many times, and they even >>>> never bother with LWN.net) and the internal team. >>> >>> By "such design" I meant generally the concept that you fetch filesystem >>> images (regardless whether ext4 or some other type) from untrusted source. >>> Unless you do cryptographical verification of the data, you never know what >>> kind of garbage your application is processing which is always invitation >>> for nasty exploits and bugs... >> >> If this is another 500 mail discussion about FS_USERNS_MOUNT on >> block-backed filesystems then my verdict still stands that the only >> condition under which I will let the VFS allow this if the underlying >> device is signed and dm-verity protected. The kernel will continue to >> refuse unprivileged policy in general and specifically based on quality >> or implementation of the underlying filesystem driver. > > > First, if block devices are your concern, fine, how about > allowing it if EROFS file-backed mounts and S_IMMUTABLE > for underlay files is set, and refuse any block device > mounts. 
> > If the issue is "you don't know how to define the quality > or implementation of the underlying filesystem drivers", > you could list your detailed concerns (I think at least > people have trust to the individual filesystem > maintainers' judgements), otherwise there will be endless > new sets of new immutable filesystems for this requirement > (previously, composefs , puzzlefs, and tarfs are all for > this; I admit I didn't get the point of FS_USERNS_MOUNT > at that time of 2023; but know I also think FS_USERNS_MOUNT > is a strong requirement for DinD for example), because that > idea should be sensible according to Darrick and Jan's > reply, and I think more people will agree with that. > > And another idea is that you still could return arbitary > metadata with immutable FUSE fses and let users get > garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, > and if user and mount namespaces are isolated, why bothering > it? > > I just hope know why? And as you may notice, > "Demi Marie Obenour wrote:" > >> The only exceptions are if the filesystem is incredibly simple >> or formal methods are used, and neither is the case for existing >> filesystems in the Linux kernel. > > I still strong disagree with that judgement, a minimal EROFS > can build an image with superblock, dirs, and files with > xattrs in a 4k-size image; and 4k image should be enough for > fuzzing; also the in-core EROFS format even never allocates > any extra buffers, which is much simplar than FUSE. > > In brief, so how to meet your requirement? > > Thanks, > Gao Xiang Rewriting the code in Rust would dramatically reduce the attack surface when it comes to memory corruption. That's a lot to ask, though, and a lot of work. 
-- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
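Jan's point about cryptographic verification and Christian's dm-verity condition quoted above reduce to the same rule: never hand an image to a filesystem parser until its contents have been authenticated against a trusted value. The toy Python sketch below is a whole-file simplification of that idea (dm-verity actually verifies per block against a Merkle hash tree); the function name and the use of a single pinned SHA-256 digest are assumptions for illustration, not any real mount-tooling API.

```python
import hashlib

def verify_image(image: bytes, trusted_sha256_hex: str) -> bool:
    """Accept an image only if its digest matches a pinned, trusted value.

    Whole-file stand-in for what dm-verity does per block with a hash
    tree: garbage fetched from an untrusted source fails closed before
    any filesystem code ever parses it.
    """
    return hashlib.sha256(image).hexdigest() == trusted_sha256_hex

# usage sketch: only proceed to mount when verification passes
# if verify_image(blob, PINNED_DIGEST): mount(...)
```

The per-block hash tree that dm-verity uses additionally lets verification happen lazily on each read, instead of hashing the whole image up front as this simplification does.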
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 9:49 ` Demi Marie Obenour @ 2026-03-24 9:53 ` Gao Xiang 2026-03-24 10:02 ` Demi Marie Obenour 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-24 9:53 UTC (permalink / raw) To: Demi Marie Obenour, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/24 17:49, Demi Marie Obenour wrote: > On 3/24/26 05:30, Gao Xiang wrote: >> Hi Christian, >> >> On 2026/3/24 16:48, Christian Brauner wrote: >>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>>> think that is the corner cases if you don't claim the >>>>>>>> limitation of FUSE approaches. >>>>>>>> >>>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>>> it provides strong isolation and stability, but I really >>>>>>>> suspect this approach could be abused to mount totally >>>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>>> with unknown status and mount without fscking, that is >>>>>>>> really disappointing.) >>>>>> >>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>>> using it for sensitive application, that falls to "insane" category for me >>>>>> :) We agree on that. And I agree that depending on the application using >>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>>>> >>>>> That is my overall goal, I just want to make it clear >>>>> the difference out of write isolation, but of course, >>>>> "secure" or not is relative, and according to the >>>>> system design. 
>>>>> >>>>> If isolation and system stability are enough for >>>>> a system and can be called "secure", yes, they are >>>>> both the same in such aspects. >>>>> >>>>>> I would still consider such design highly suspicious but without more >>>>>> detailed knowledge about the application I cannot say it's outright broken >>>>>> :). >>>>> >>>>> What do you mean "such design"? "Writable untrusted >>>>> remote EXT4 images mounting on the host"? Really, we have >>>>> such applications for containers for many years but I don't >>>>> want to name it here, but I'm totally exhaused by such >>>>> usage (since I explained many many times, and they even >>>>> never bother with LWN.net) and the internal team. >>>> >>>> By "such design" I meant generally the concept that you fetch filesystem >>>> images (regardless whether ext4 or some other type) from untrusted source. >>>> Unless you do cryptographical verification of the data, you never know what >>>> kind of garbage your application is processing which is always invitation >>>> for nasty exploits and bugs... >>> >>> If this is another 500 mail discussion about FS_USERNS_MOUNT on >>> block-backed filesystems then my verdict still stands that the only >>> condition under which I will let the VFS allow this if the underlying >>> device is signed and dm-verity protected. The kernel will continue to >>> refuse unprivileged policy in general and specifically based on quality >>> or implementation of the underlying filesystem driver. >> >> >> First, if block devices are your concern, fine, how about >> allowing it if EROFS file-backed mounts and S_IMMUTABLE >> for underlay files is set, and refuse any block device >> mounts. 
>> >> If the issue is "you don't know how to define the quality >> or implementation of the underlying filesystem drivers", >> you could list your detailed concerns (I think at least >> people have trust to the individual filesystem >> maintainers' judgements), otherwise there will be endless >> new sets of new immutable filesystems for this requirement >> (previously, composefs , puzzlefs, and tarfs are all for >> this; I admit I didn't get the point of FS_USERNS_MOUNT >> at that time of 2023; but know I also think FS_USERNS_MOUNT >> is a strong requirement for DinD for example), because that >> idea should be sensible according to Darrick and Jan's >> reply, and I think more people will agree with that. >> >> And another idea is that you still could return arbitary >> metadata with immutable FUSE fses and let users get >> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, >> and if user and mount namespaces are isolated, why bothering >> it? >> >> I just hope know why? And as you may notice, >> "Demi Marie Obenour wrote:" >> >>> The only exceptions are if the filesystem is incredibly simple >>> or formal methods are used, and neither is the case for existing >>> filesystems in the Linux kernel. >> >> I still strong disagree with that judgement, a minimal EROFS >> can build an image with superblock, dirs, and files with >> xattrs in a 4k-size image; and 4k image should be enough for >> fuzzing; also the in-core EROFS format even never allocates >> any extra buffers, which is much simplar than FUSE. >> >> In brief, so how to meet your requirement? >> >> Thanks, >> Gao Xiang > > Rewriting the code in Rust would dramatically reduce the attack > surface when it comes to memory corruption. That's a lot to ask, > though, and a lot of work. I don't think so: FUSE supports FS_USERNS_MOUNT and is written in C, and its attack surface is already huge. 
> > EROFS will switch to Rust at some point, but your judgement will push people to create yet another batch of brand-new toy Rust kernel filesystems --- just because EROFS is currently not written in Rust. I'm completely exhausted by this game: if the requirement is that I address every single fuzzing bug and CVE, why not? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 9:53 ` Gao Xiang @ 2026-03-24 10:02 ` Demi Marie Obenour 2026-03-24 10:14 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-24 10:02 UTC (permalink / raw) To: Gao Xiang, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 6316 bytes --] On 3/24/26 05:53, Gao Xiang wrote: > > > On 2026/3/24 17:49, Demi Marie Obenour wrote: >> On 3/24/26 05:30, Gao Xiang wrote: >>> Hi Christian, >>> >>> On 2026/3/24 16:48, Christian Brauner wrote: >>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>>>> think that is the corner cases if you don't claim the >>>>>>>>> limitation of FUSE approaches. >>>>>>>>> >>>>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>>>> it provides strong isolation and stability, but I really >>>>>>>>> suspect this approach could be abused to mount totally >>>>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>>>> with unknown status and mount without fscking, that is >>>>>>>>> really disappointing.) >>>>>>> >>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>>>> using it for sensitive application, that falls to "insane" category for me >>>>>>> :) We agree on that. And I agree that depending on the application using >>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior. 
>>>>>> >>>>>> That is my overall goal, I just want to make it clear >>>>>> the difference out of write isolation, but of course, >>>>>> "secure" or not is relative, and according to the >>>>>> system design. >>>>>> >>>>>> If isolation and system stability are enough for >>>>>> a system and can be called "secure", yes, they are >>>>>> both the same in such aspects. >>>>>> >>>>>>> I would still consider such design highly suspicious but without more >>>>>>> detailed knowledge about the application I cannot say it's outright broken >>>>>>> :). >>>>>> >>>>>> What do you mean "such design"? "Writable untrusted >>>>>> remote EXT4 images mounting on the host"? Really, we have >>>>>> such applications for containers for many years but I don't >>>>>> want to name it here, but I'm totally exhaused by such >>>>>> usage (since I explained many many times, and they even >>>>>> never bother with LWN.net) and the internal team. >>>>> >>>>> By "such design" I meant generally the concept that you fetch filesystem >>>>> images (regardless whether ext4 or some other type) from untrusted source. >>>>> Unless you do cryptographical verification of the data, you never know what >>>>> kind of garbage your application is processing which is always invitation >>>>> for nasty exploits and bugs... >>>> >>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on >>>> block-backed filesystems then my verdict still stands that the only >>>> condition under which I will let the VFS allow this if the underlying >>>> device is signed and dm-verity protected. The kernel will continue to >>>> refuse unprivileged policy in general and specifically based on quality >>>> or implementation of the underlying filesystem driver. >>> >>> >>> First, if block devices are your concern, fine, how about >>> allowing it if EROFS file-backed mounts and S_IMMUTABLE >>> for underlay files is set, and refuse any block device >>> mounts. 
>>> >>> If the issue is "you don't know how to define the quality >>> or implementation of the underlying filesystem drivers", >>> you could list your detailed concerns (I think at least >>> people have trust to the individual filesystem >>> maintainers' judgements), otherwise there will be endless >>> new sets of new immutable filesystems for this requirement >>> (previously, composefs , puzzlefs, and tarfs are all for >>> this; I admit I didn't get the point of FS_USERNS_MOUNT >>> at that time of 2023; but know I also think FS_USERNS_MOUNT >>> is a strong requirement for DinD for example), because that >>> idea should be sensible according to Darrick and Jan's >>> reply, and I think more people will agree with that. >>> >>> And another idea is that you still could return arbitary >>> metadata with immutable FUSE fses and let users get >>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, >>> and if user and mount namespaces are isolated, why bothering >>> it? >>> >>> I just hope know why? And as you may notice, >>> "Demi Marie Obenour wrote:" >>> >>>> The only exceptions are if the filesystem is incredibly simple >>>> or formal methods are used, and neither is the case for existing >>>> filesystems in the Linux kernel. >>> >>> I still strong disagree with that judgement, a minimal EROFS >>> can build an image with superblock, dirs, and files with >>> xattrs in a 4k-size image; and 4k image should be enough for >>> fuzzing; also the in-core EROFS format even never allocates >>> any extra buffers, which is much simplar than FUSE. >>> >>> In brief, so how to meet your requirement? >>> >>> Thanks, >>> Gao Xiang >> >> Rewriting the code in Rust would dramatically reduce the attack >> surface when it comes to memory corruption. That's a lot to ask, >> though, and a lot of work. > > I don't think so, FUSE can do FS_USERNS_MOUNT and written in C > , and the attack surface is already huge. 
> > EROFS will switch to Rust some time, but your judgement will > make people to make another complete new toys of Rust kernel > filesystems --- just because EROFS is currently not written > in Rust. > > I'm completely exhaused with such game: If I will address > every single fuzzing bug and CVE, why not? > > Thanks, > Gao Xiang I should have written that rewriting in Rust could help convince people that it is in fact safe. One *can* make safe C code, as shown by OpenSSH. It's just *harder* to write safe C code, and harder to demonstrate to others that C code is in fact safe. Whether the burden of proof being placed on you is excessive is a separate question that I do not have the experience to comment on. That said: > I will address every single fuzzing bug and CVE is very different than the view of most filesystem developers. If the fuzzers have good code coverage in EROFS, this is a very strong argument for making an exception. -- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 10:02 ` Demi Marie Obenour @ 2026-03-24 10:14 ` Gao Xiang 2026-03-24 10:17 ` Demi Marie Obenour 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-24 10:14 UTC (permalink / raw) To: Demi Marie Obenour, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/24 18:02, Demi Marie Obenour wrote: > On 3/24/26 05:53, Gao Xiang wrote: >> >> >> On 2026/3/24 17:49, Demi Marie Obenour wrote: >>> On 3/24/26 05:30, Gao Xiang wrote: >>>> Hi Christian, >>>> >>>> On 2026/3/24 16:48, Christian Brauner wrote: >>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>>>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>>>>> think that is the corner cases if you don't claim the >>>>>>>>>> limitation of FUSE approaches. >>>>>>>>>> >>>>>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>>>>> it provides strong isolation and stability, but I really >>>>>>>>>> suspect this approach could be abused to mount totally >>>>>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>>>>> with unknown status and mount without fscking, that is >>>>>>>>>> really disappointing.) >>>>>>>> >>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>>>>> using it for sensitive application, that falls to "insane" category for me >>>>>>>> :) We agree on that. And I agree that depending on the application using >>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior. 
>>>>>>> >>>>>>> That is my overall goal, I just want to make it clear >>>>>>> the difference out of write isolation, but of course, >>>>>>> "secure" or not is relative, and according to the >>>>>>> system design. >>>>>>> >>>>>>> If isolation and system stability are enough for >>>>>>> a system and can be called "secure", yes, they are >>>>>>> both the same in such aspects. >>>>>>> >>>>>>>> I would still consider such design highly suspicious but without more >>>>>>>> detailed knowledge about the application I cannot say it's outright broken >>>>>>>> :). >>>>>>> >>>>>>> What do you mean "such design"? "Writable untrusted >>>>>>> remote EXT4 images mounting on the host"? Really, we have >>>>>>> such applications for containers for many years but I don't >>>>>>> want to name it here, but I'm totally exhaused by such >>>>>>> usage (since I explained many many times, and they even >>>>>>> never bother with LWN.net) and the internal team. >>>>>> >>>>>> By "such design" I meant generally the concept that you fetch filesystem >>>>>> images (regardless whether ext4 or some other type) from untrusted source. >>>>>> Unless you do cryptographical verification of the data, you never know what >>>>>> kind of garbage your application is processing which is always invitation >>>>>> for nasty exploits and bugs... >>>>> >>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on >>>>> block-backed filesystems then my verdict still stands that the only >>>>> condition under which I will let the VFS allow this if the underlying >>>>> device is signed and dm-verity protected. The kernel will continue to >>>>> refuse unprivileged policy in general and specifically based on quality >>>>> or implementation of the underlying filesystem driver. >>>> >>>> >>>> First, if block devices are your concern, fine, how about >>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE >>>> for underlay files is set, and refuse any block device >>>> mounts. 
>>>> >>>> If the issue is "you don't know how to define the quality >>>> or implementation of the underlying filesystem drivers", >>>> you could list your detailed concerns (I think at least >>>> people have trust to the individual filesystem >>>> maintainers' judgements), otherwise there will be endless >>>> new sets of new immutable filesystems for this requirement >>>> (previously, composefs , puzzlefs, and tarfs are all for >>>> this; I admit I didn't get the point of FS_USERNS_MOUNT >>>> at that time of 2023; but know I also think FS_USERNS_MOUNT >>>> is a strong requirement for DinD for example), because that >>>> idea should be sensible according to Darrick and Jan's >>>> reply, and I think more people will agree with that. >>>> >>>> And another idea is that you still could return arbitary >>>> metadata with immutable FUSE fses and let users get >>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, >>>> and if user and mount namespaces are isolated, why bothering >>>> it? >>>> >>>> I just hope know why? And as you may notice, >>>> "Demi Marie Obenour wrote:" >>>> >>>>> The only exceptions are if the filesystem is incredibly simple >>>>> or formal methods are used, and neither is the case for existing >>>>> filesystems in the Linux kernel. >>>> >>>> I still strong disagree with that judgement, a minimal EROFS >>>> can build an image with superblock, dirs, and files with >>>> xattrs in a 4k-size image; and 4k image should be enough for >>>> fuzzing; also the in-core EROFS format even never allocates >>>> any extra buffers, which is much simplar than FUSE. >>>> >>>> In brief, so how to meet your requirement? >>>> >>>> Thanks, >>>> Gao Xiang >>> >>> Rewriting the code in Rust would dramatically reduce the attack >>> surface when it comes to memory corruption. That's a lot to ask, >>> though, and a lot of work. >> >> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C >> , and the attack surface is already huge. 
>> >> EROFS will switch to Rust some time, but your judgement will >> make people to make another complete new toys of Rust kernel >> filesystems --- just because EROFS is currently not written >> in Rust. >> >> I'm completely exhaused with such game: If I will address >> every single fuzzing bug and CVE, why not? >> >> Thanks, >> Gao Xiang > > I should have written that rewriting in Rust could help convince > people that it is in fact safe. One *can* make safe C code, as shown > by OpenSSH. It's just *harder* to write safe C code, and harder to > demonstrate to others that C code is in fact safe. How do you define a formal `safe C`? "C without pointers"? Actually, we tried to switch to Rust, but Rust developers resist incremental change; they just want pure Rust and a wholesale switch, which is impossible for any mature kernel filesystem. > > Whether the burden of proof being placed on you is excessive is a > separate question that I do not have the experience to comment on. That is funny, TBH, because the whole policy here is broken: if you go by the LOC of each codebase, I believe FUSE, OverlayFS and even TCP/IP are all more complex than EROFS. If you still think LOC is the issue, I'm perfectly fine with isolating a `fs/simple_erofs` and dropping all advanced runtime features and even compression. > > That said: > >> I will address every single fuzzing bug and CVE > > is very different than the view of most filesystem developers. > If the fuzzers have good code coverage in EROFS, this is a very strong > argument for making an exception. I don't know if it's just your judgement or Christian's as well. 
Currently EROFS is well fuzzed by syzkaller and I keep it at zero active issues (as I said, 4k images are enough to fuzz the entire EROFS metadata format; almost all previous syzkaller issues came from compressed inodes, and we could simply disable the compression formats for FS_USERNS_MOUNT, since compression algorithms are inherently harder to fuzz), and we will definitely improve this part even further if that is the real concern here. And we will accept any fuzzing bug as a CVE, and fix them as 0-day bugs like other subsystems written in C that accept untrusted (meta)data. Is that the end of this game? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
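The claim that a 4k image can exercise the whole EROFS metadata format comes down to defensive parsing of a small, fixed on-disk layout. The Python sketch below is a toy model of that discipline, not the kernel's code: only a magic check at the 1 KiB superblock offset and one `blkszbits` field are modeled, and the exact field offset should be treated as an assumption for illustration rather than the authoritative `struct erofs_super_block` layout. Every read is bounds-checked and bad input fails closed with an error, which is exactly the property a byte-flipping fuzzer verifies.

```python
import struct

EROFS_SUPER_OFFSET = 1024       # superblock lives at a fixed 1 KiB offset
EROFS_SUPER_MAGIC = 0xE0F5E1E2  # little-endian on-disk magic

def parse_superblock(img: bytes) -> dict:
    """Defensively parse an untrusted image; raise ValueError on garbage."""
    if len(img) < EROFS_SUPER_OFFSET + 128:
        raise ValueError("image too small to hold a superblock")
    magic, = struct.unpack_from("<I", img, EROFS_SUPER_OFFSET)
    if magic != EROFS_SUPER_MAGIC:
        raise ValueError("bad magic")
    # blkszbits: log2 of the block size (field offset assumed for illustration)
    blkszbits = img[EROFS_SUPER_OFFSET + 12]
    if not 9 <= blkszbits <= 16:  # reject absurd block sizes up front
        raise ValueError("unsupported block size")
    return {"magic": magic, "blkszbits": blkszbits}

def fuzz_once(img: bytes, off: int, val: int):
    """Flip one byte; the parser must either reject or parse, never crash."""
    mutated = bytearray(img)
    mutated[off] = val
    try:
        return parse_superblock(bytes(mutated))
    except ValueError:
        return None
```

A minimal fuzz loop then just iterates `fuzz_once` over all 4096 offsets and 256 byte values; the small fixed image size is what makes exhaustively covering the metadata format tractable, which is the point being made above.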
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 10:14 ` Gao Xiang @ 2026-03-24 10:17 ` Demi Marie Obenour 2026-03-24 10:25 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-24 10:17 UTC (permalink / raw) To: Gao Xiang, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 8218 bytes --] On 3/24/26 06:14, Gao Xiang wrote: > > > On 2026/3/24 18:02, Demi Marie Obenour wrote: >> On 3/24/26 05:53, Gao Xiang wrote: >>> >>> >>> On 2026/3/24 17:49, Demi Marie Obenour wrote: >>>> On 3/24/26 05:30, Gao Xiang wrote: >>>>> Hi Christian, >>>>> >>>>> On 2026/3/24 16:48, Christian Brauner wrote: >>>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>>>>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>>>>>> think that is the corner cases if you don't claim the >>>>>>>>>>> limitation of FUSE approaches. >>>>>>>>>>> >>>>>>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>>>>>> it provides strong isolation and stability, but I really >>>>>>>>>>> suspect this approach could be abused to mount totally >>>>>>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>>>>>> with unknown status and mount without fscking, that is >>>>>>>>>>> really disappointing.) >>>>>>>>> >>>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>>>>>> using it for sensitive application, that falls to "insane" category for me >>>>>>>>> :) We agree on that. 
And I agree that depending on the application using >>>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>>>>>>> >>>>>>>> That is my overall goal, I just want to make it clear >>>>>>>> the difference out of write isolation, but of course, >>>>>>>> "secure" or not is relative, and according to the >>>>>>>> system design. >>>>>>>> >>>>>>>> If isolation and system stability are enough for >>>>>>>> a system and can be called "secure", yes, they are >>>>>>>> both the same in such aspects. >>>>>>>> >>>>>>>>> I would still consider such design highly suspicious but without more >>>>>>>>> detailed knowledge about the application I cannot say it's outright broken >>>>>>>>> :). >>>>>>>> >>>>>>>> What do you mean "such design"? "Writable untrusted >>>>>>>> remote EXT4 images mounting on the host"? Really, we have >>>>>>>> such applications for containers for many years but I don't >>>>>>>> want to name it here, but I'm totally exhaused by such >>>>>>>> usage (since I explained many many times, and they even >>>>>>>> never bother with LWN.net) and the internal team. >>>>>>> >>>>>>> By "such design" I meant generally the concept that you fetch filesystem >>>>>>> images (regardless whether ext4 or some other type) from untrusted source. >>>>>>> Unless you do cryptographical verification of the data, you never know what >>>>>>> kind of garbage your application is processing which is always invitation >>>>>>> for nasty exploits and bugs... >>>>>> >>>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on >>>>>> block-backed filesystems then my verdict still stands that the only >>>>>> condition under which I will let the VFS allow this if the underlying >>>>>> device is signed and dm-verity protected. 
The kernel will continue to >>>>>> refuse unprivileged policy in general and specifically based on quality >>>>>> or implementation of the underlying filesystem driver. >>>>> >>>>> >>>>> First, if block devices are your concern, fine, how about >>>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE >>>>> for underlay files is set, and refuse any block device >>>>> mounts. >>>>> >>>>> If the issue is "you don't know how to define the quality >>>>> or implementation of the underlying filesystem drivers", >>>>> you could list your detailed concerns (I think at least >>>>> people have trust to the individual filesystem >>>>> maintainers' judgements), otherwise there will be endless >>>>> new sets of new immutable filesystems for this requirement >>>>> (previously, composefs , puzzlefs, and tarfs are all for >>>>> this; I admit I didn't get the point of FS_USERNS_MOUNT >>>>> at that time of 2023; but know I also think FS_USERNS_MOUNT >>>>> is a strong requirement for DinD for example), because that >>>>> idea should be sensible according to Darrick and Jan's >>>>> reply, and I think more people will agree with that. >>>>> >>>>> And another idea is that you still could return arbitary >>>>> metadata with immutable FUSE fses and let users get >>>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, >>>>> and if user and mount namespaces are isolated, why bothering >>>>> it? >>>>> >>>>> I just hope know why? And as you may notice, >>>>> "Demi Marie Obenour wrote:" >>>>> >>>>>> The only exceptions are if the filesystem is incredibly simple >>>>>> or formal methods are used, and neither is the case for existing >>>>>> filesystems in the Linux kernel. 
>>>>> >>>>> I still strong disagree with that judgement, a minimal EROFS >>>>> can build an image with superblock, dirs, and files with >>>>> xattrs in a 4k-size image; and 4k image should be enough for >>>>> fuzzing; also the in-core EROFS format even never allocates >>>>> any extra buffers, which is much simplar than FUSE. >>>>> >>>>> In brief, so how to meet your requirement? >>>>> >>>>> Thanks, >>>>> Gao Xiang >>>> >>>> Rewriting the code in Rust would dramatically reduce the attack >>>> surface when it comes to memory corruption. That's a lot to ask, >>>> though, and a lot of work. >>> >>> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C >>> , and the attack surface is already huge. >>> >>> EROFS will switch to Rust some time, but your judgement will >>> make people to make another complete new toys of Rust kernel >>> filesystems --- just because EROFS is currently not written >>> in Rust. >>> >>> I'm completely exhaused with such game: If I will address >>> every single fuzzing bug and CVE, why not? >>> >>> Thanks, >>> Gao Xiang >> >> I should have written that rewriting in Rust could help convince >> people that it is in fact safe. One *can* make safe C code, as shown >> by OpenSSH. It's just *harder* to write safe C code, and harder to >> demonstrate to others that C code is in fact safe. > > How do you define a formal `safe C`? "C without pointers"? Safe = "history of not having many vulnerabilities" > Actually, we tried to switch to Rust but Rust developpers > resist with incremental change, they just want a pure Rust > and switch to it all the time, that is impossible for all > mature kernel filesystems. Incremental change is definitely good. >> Whether the burden of proof being placed on you is excessive is a >> separate question that I do not have the experience to comment on. 
> > That is funny TBH, just because the whole policy here > is broken, if you call out the LOC of codebase, I > believe FUSE, OverlayFS and even TCP/IP are all complex > than EROFS. > > If you still think LOC is the issue, I'm pretty fine to > isolate a `fs/simple_erofs` and drop all advanced runtime > features and even compression. I don't think LOC is the main problem. >> That said: >> >>> I will address every single fuzzing bug and CVE >> >> is very different than the view of most filesystem developers. >> If the fuzzers have good code coverage in EROFS, this is a very strong >> argument for making an exception. > > I don't know if it's just your judgement or Christian's > judgement. > > Currently EROFS is well-fuzzed by syzkaller and I keep > maintaining it as 0 active issue (as I said, 4k images > are enough for fuzzing all EROFS metadata format, almost > all previous syzkaller issues are out of compressed > inodes but we can just disable compression formats for > FS_USERNS_MOUNT, just because compression algorithms > are already complex for fuzzing) and we will definitely > improve this part even further if that is the real > concern of this. > > And we will accept any fuzzing bug as CVE, and fix them > as 0day bugs like other subsystems written in C which > accept untrusted (meta)data. Is that end of story of > this game? It should be! -- Sincerely, Demi Marie Obenour (she/her/hers) [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 7253 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 10:17 ` Demi Marie Obenour @ 2026-03-24 10:25 ` Gao Xiang 0 siblings, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-03-24 10:25 UTC (permalink / raw) To: Demi Marie Obenour, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/24 18:17, Demi Marie Obenour wrote: > On 3/24/26 06:14, Gao Xiang wrote: >> >> >> On 2026/3/24 18:02, Demi Marie Obenour wrote: >>> On 3/24/26 05:53, Gao Xiang wrote: >>>> >>>> >>>> On 2026/3/24 17:49, Demi Marie Obenour wrote: >>>>> On 3/24/26 05:30, Gao Xiang wrote: >>>>>> Hi Christian, >>>>>> >>>>>> On 2026/3/24 16:48, Christian Brauner wrote: >>>>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >>>>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>>>>>>>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>>>>>>>> think that is the corner cases if you don't claim the >>>>>>>>>>>> limitation of FUSE approaches. >>>>>>>>>>>> >>>>>>>>>>>> If none expects that, that is absolute be fine, as I said, >>>>>>>>>>>> it provides strong isolation and stability, but I really >>>>>>>>>>>> suspect this approach could be abused to mount totally >>>>>>>>>>>> untrusted remote filesystems (Actually as I said, some >>>>>>>>>>>> business of ours already did: fetching EXT4 filesystems >>>>>>>>>>>> with unknown status and mount without fscking, that is >>>>>>>>>>>> really disappointing.) >>>>>>>>>> >>>>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>>>>>>>> using it for sensitive application, that falls to "insane" category for me >>>>>>>>>> :) We agree on that. 
And I agree that depending on the application using >>>>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>>>>>>>> >>>>>>>>> That is my overall goal, I just want to make it clear >>>>>>>>> the difference out of write isolation, but of course, >>>>>>>>> "secure" or not is relative, and according to the >>>>>>>>> system design. >>>>>>>>> >>>>>>>>> If isolation and system stability are enough for >>>>>>>>> a system and can be called "secure", yes, they are >>>>>>>>> both the same in such aspects. >>>>>>>>> >>>>>>>>>> I would still consider such design highly suspicious but without more >>>>>>>>>> detailed knowledge about the application I cannot say it's outright broken >>>>>>>>>> :). >>>>>>>>> >>>>>>>>> What do you mean "such design"? "Writable untrusted >>>>>>>>> remote EXT4 images mounting on the host"? Really, we have >>>>>>>>> such applications for containers for many years but I don't >>>>>>>>> want to name it here, but I'm totally exhaused by such >>>>>>>>> usage (since I explained many many times, and they even >>>>>>>>> never bother with LWN.net) and the internal team. >>>>>>>> >>>>>>>> By "such design" I meant generally the concept that you fetch filesystem >>>>>>>> images (regardless whether ext4 or some other type) from untrusted source. >>>>>>>> Unless you do cryptographical verification of the data, you never know what >>>>>>>> kind of garbage your application is processing which is always invitation >>>>>>>> for nasty exploits and bugs... >>>>>>> >>>>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on >>>>>>> block-backed filesystems then my verdict still stands that the only >>>>>>> condition under which I will let the VFS allow this if the underlying >>>>>>> device is signed and dm-verity protected. 
The kernel will continue to >>>>>>> refuse unprivileged policy in general and specifically based on quality >>>>>>> or implementation of the underlying filesystem driver. >>>>>> >>>>>> >>>>>> First, if block devices are your concern, fine, how about >>>>>> allowing it only for EROFS file-backed mounts with S_IMMUTABLE >>>>>> set on the underlay files, and refusing any block device >>>>>> mounts? >>>>>> >>>>>> If the issue is "you don't know how to define the quality >>>>>> or implementation of the underlying filesystem drivers", >>>>>> you could list your detailed concerns (I think at least >>>>>> people trust the individual filesystem >>>>>> maintainers' judgements), otherwise there will be endless >>>>>> new sets of immutable filesystems for this requirement >>>>>> (previously, composefs, puzzlefs, and tarfs were all for >>>>>> this; I admit I didn't get the point of FS_USERNS_MOUNT >>>>>> back in 2023, but now I also think FS_USERNS_MOUNT >>>>>> is a strong requirement for DinD, for example), because that >>>>>> idea should be sensible according to Darrick and Jan's >>>>>> replies, and I think more people will agree with that. >>>>>> >>>>>> And another point is that you still could return arbitrary >>>>>> metadata with immutable FUSE fses and let users get >>>>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT, >>>>>> and if user and mount namespaces are isolated, why bother >>>>>> with it? >>>>>> >>>>>> I just want to know why. And as you may notice, >>>>>> "Demi Marie Obenour wrote:" >>>>>> >>>>>>> The only exceptions are if the filesystem is incredibly simple >>>>>>> or formal methods are used, and neither is the case for existing >>>>>>> filesystems in the Linux kernel.
>>>>>> >>>>>> I still strongly disagree with that judgement; a minimal EROFS >>>>>> image with a superblock, dirs, and files with >>>>>> xattrs fits in 4k, and a 4k image should be enough for >>>>>> fuzzing; also the in-core EROFS format never allocates >>>>>> any extra buffers, which is much simpler than FUSE. >>>>>> >>>>>> In brief, how do I meet your requirement? >>>>>> >>>>>> Thanks, >>>>>> Gao Xiang >>>>> >>>>> Rewriting the code in Rust would dramatically reduce the attack >>>>> surface when it comes to memory corruption. That's a lot to ask, >>>>> though, and a lot of work. >>>> >>>> I don't think so; FUSE can do FS_USERNS_MOUNT and is written in >>>> C, and the attack surface is already huge. >>>> >>>> EROFS will switch to Rust some time, but your judgement will >>>> push people to write complete new toy Rust kernel >>>> filesystems --- just because EROFS is currently not written >>>> in Rust. >>>> >>>> I'm completely exhausted by such games: if I will address >>>> every single fuzzing bug and CVE, why not? >>>> >>>> Thanks, >>>> Gao Xiang >>> >>> I should have written that rewriting in Rust could help convince >>> people that it is in fact safe. One *can* write safe C code, as shown >>> by OpenSSH. It's just *harder* to write safe C code, and harder to >>> demonstrate to others that C code is in fact safe. >> >> How do you define a formal `safe C`? "C without pointers"? > > Safe = "history of not having many vulnerabilities" So that would mean no pointers, but that is almost impossible for filesystems, since filesystem APIs work with pointers. > >> Actually, we tried to switch to Rust but the Rust developers >> resisted incremental change; they just want pure Rust >> and a wholesale switch, which is impossible for any >> mature kernel filesystem. > > Incremental change is definitely good. Those developers resisted this two years ago.
> >>> Whether the burden of proof being placed on you is excessive is a >>> separate question that I do not have the experience to comment on. >> >> That is funny TBH, just because the whole policy here >> is broken, if you call out the LOC of codebase, I >> believe FUSE, OverlayFS and even TCP/IP are all more complex >> than EROFS. >> >> If you still think LOC is the issue, I'm pretty fine to >> isolate a `fs/simple_erofs` and drop all advanced runtime >> features and even compression. > > I don't think LOC is the main problem. But folks come to me telling me EROFS is unsafe because its LOC is larger than their new stuff. How should I react? > >>> That said: >>> >>>> I will address every single fuzzing bug and CVE >>> >>> is very different from the view of most filesystem developers. >>> If the fuzzers have good code coverage in EROFS, this is a very strong >>> argument for making an exception. >> >> I don't know if it's just your judgement or Christian's >> judgement. >> >> Currently EROFS is well-fuzzed by syzkaller and I keep >> maintaining it at 0 active issues (as I said, 4k images >> are enough for fuzzing the entire EROFS metadata format; almost >> all previous syzkaller issues stem from compressed >> inodes, but we can just disable compression formats for >> FS_USERNS_MOUNT, because compression algorithms >> are already complex for fuzzing) and we will definitely >> improve this part even further if that is the real >> concern here. >> >> And we will accept any fuzzing bug as a CVE, and fix them >> as 0day bugs like other subsystems written in C which >> accept untrusted (meta)data. Is that the end of >> this game? > > It should be! So why? Who can tell me why? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 8:48 ` Christian Brauner 2026-03-24 9:30 ` Gao Xiang @ 2026-03-24 11:58 ` Demi Marie Obenour 2026-03-24 12:21 ` Gao Xiang 1 sibling, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-24 11:58 UTC (permalink / raw) To: Christian Brauner, Jan Kara Cc: Gao Xiang, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 3426 bytes --] On 3/24/26 04:48, Christian Brauner wrote: > On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote: >> On Mon 23-03-26 22:36:46, Gao Xiang wrote: >>> On 2026/3/23 22:13, Jan Kara wrote: >>>>>> think that is the corner cases if you don't claim the >>>>>> limitation of FUSE approaches. >>>>>> >>>>>> If none expects that, that is absolute be fine, as I said, >>>>>> it provides strong isolation and stability, but I really >>>>>> suspect this approach could be abused to mount totally >>>>>> untrusted remote filesystems (Actually as I said, some >>>>>> business of ours already did: fetching EXT4 filesystems >>>>>> with unknown status and mount without fscking, that is >>>>>> really disappointing.) >>>> >>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and >>>> using it for sensitive application, that falls to "insane" category for me >>>> :) We agree on that. And I agree that depending on the application using >>>> FUSE to access such filesystem needn't be safe enough and immutable fs + >>>> overlayfs writeable layer may provide better guarantees about fs behavior. >>> >>> That is my overall goal, I just want to make it clear >>> the difference out of write isolation, but of course, >>> "secure" or not is relative, and according to the >>> system design. 
>>> >>> If isolation and system stability are enough for >>> a system and can be called "secure", yes, they are >>> both the same in such aspects. >>> >>>> I would still consider such design highly suspicious but without more >>>> detailed knowledge about the application I cannot say it's outright broken >>>> :). >>> >>> What do you mean "such design"? "Writable untrusted >>> remote EXT4 images mounting on the host"? Really, we have >>> such applications for containers for many years but I don't >>> want to name it here, but I'm totally exhausted by such >>> usage (since I explained many many times, and they even >>> never bother with LWN.net) and the internal team. >> >> By "such design" I meant generally the concept that you fetch filesystem >> images (regardless whether ext4 or some other type) from untrusted source. >> Unless you do cryptographical verification of the data, you never know what >> kind of garbage your application is processing which is always invitation >> for nasty exploits and bugs... > > If this is another 500 mail discussion about FS_USERNS_MOUNT on > block-backed filesystems then my verdict still stands that the only > condition under which I will let the VFS allow this is if the underlying > device is signed and dm-verity protected. The kernel will continue to > refuse unprivileged policy in general and specifically based on quality > or implementation of the underlying filesystem driver. As far as I can tell, the main problems are: 1. Most filesystems can only be run in kernel mode, so one needs a VM and an expensive RPC protocol if one wants to run them in a sandboxed environment. 2. Context switch overhead is so high that running filesystems entirely in userspace, without some form of in-kernel I/O acceleration, is a performance problem. 3. Filesystems are written in C and not designed to be secure against malicious on-disk images. Gao Xiang is working on problem 3 for EROFS. FUSE iomap support solves problem 2. lklfuse solves problem 1.
-- Sincerely, Demi Marie Obenour (she/her/hers) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 11:58 ` Demi Marie Obenour @ 2026-03-24 12:21 ` Gao Xiang 2026-03-26 14:39 ` Christian Brauner 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-24 12:21 UTC (permalink / raw) To: Demi Marie Obenour, Christian Brauner, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/24 19:58, Demi Marie Obenour wrote: > On 3/24/26 04:48, Christian Brauner wrote: ... >>>> >>>>> I would still consider such design highly suspicious but without more >>>>> detailed knowledge about the application I cannot say it's outright broken >>>>> :). >>>> >>>> What do you mean "such design"? "Writable untrusted >>>> remote EXT4 images mounting on the host"? Really, we have >>>> such applications for containers for many years but I don't >>>> want to name it here, but I'm totally exhaused by such >>>> usage (since I explained many many times, and they even >>>> never bother with LWN.net) and the internal team. >>> >>> By "such design" I meant generally the concept that you fetch filesystem >>> images (regardless whether ext4 or some other type) from untrusted source. >>> Unless you do cryptographical verification of the data, you never know what >>> kind of garbage your application is processing which is always invitation >>> for nasty exploits and bugs... >> >> If this is another 500 mail discussion about FS_USERNS_MOUNT on >> block-backed filesystems then my verdict still stands that the only >> condition under which I will let the VFS allow this if the underlying >> device is signed and dm-verity protected. The kernel will continue to >> refuse unprivileged policy in general and specifically based on quality >> or implementation of the underlying filesystem driver. > > As far as I can tell, the main problems are: > > 1. 
Most filesystems can only be run in kernel mode, so one needs a > VM and an expensive RPC protocol if one wants to run them in a > sandboxed environment. > > 2. Context switch overhead is so high that running filesystems entirely > in userspace, without some form of in-kernel I/O acceleration, > is a performance problem. > > 3. Filesystems are written in C and not designed to be secure against > malicious on-disk images. > > Gao Xiang is working on problem 3 for EROFS. > FUSE iomap support solves problem 2. lklfuse solves problem 1. Sigh, I just would like to say, as Darrick's and Jan's previous replies noted, immutable on-disk fses are a special kind of filesystem: the overall on-disk format only provides the vfs/MM with basic information (LOOKUP, GETATTR, READDIR, READ), so even if some metadata values could be considered inconsistent, it's just like a FUSE unprivileged daemon returning garbage (meta)data and/or TAR extracting garbage (meta)data -- it shouldn't matter at all. Why I'm here is that I'm totally exhausted by arbitrary claims like "all kernel filesystems are insecure". Again, that is absolutely untrue: the feature set, the working model and the implementation complexity of immutable filesystems make them more secure by design. Also, the reason for "another 500 mail discussion about FS_USERNS_MOUNT" is simply that FS_USERNS_MOUNT is very useful to containers, and this special kind of immutable on-disk filesystem can fit that goal technically, quite unlike generic writable on-disk fses or NFS. Why I am working on EROFS is also because I believe immutable on-disk filesystems are absolutely useful and more secure than generic writable fses by design, especially for containers and for handling untrusted remote data. I claim here again that every implementation vulnerability of EROFS will be treated as a 0-day bug, and I've already worked this way for many years.
Let's step back: even if it weren't me, if there were other sane immutable filesystems aiming at containers, they would definitely make the same claim, why not? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-24 12:21 ` Gao Xiang @ 2026-03-26 14:39 ` Christian Brauner 0 siblings, 0 replies; 79+ messages in thread From: Christian Brauner @ 2026-03-26 14:39 UTC (permalink / raw) To: Gao Xiang Cc: Demi Marie Obenour, Jan Kara, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Tue, Mar 24, 2026 at 08:21:00PM +0800, Gao Xiang wrote: > > > On 2026/3/24 19:58, Demi Marie Obenour wrote: > > On 3/24/26 04:48, Christian Brauner wrote: > > ... > > > > > > > > > > > > I would still consider such design highly suspicious but without more > > > > > > detailed knowledge about the application I cannot say it's outright broken > > > > > > :). > > > > > > > > > > What do you mean "such design"? "Writable untrusted > > > > > remote EXT4 images mounting on the host"? Really, we have > > > > > such applications for containers for many years but I don't > > > > > want to name it here, but I'm totally exhaused by such > > > > > usage (since I explained many many times, and they even > > > > > never bother with LWN.net) and the internal team. > > > > > > > > By "such design" I meant generally the concept that you fetch filesystem > > > > images (regardless whether ext4 or some other type) from untrusted source. > > > > Unless you do cryptographical verification of the data, you never know what > > > > kind of garbage your application is processing which is always invitation > > > > for nasty exploits and bugs... > > > > > > If this is another 500 mail discussion about FS_USERNS_MOUNT on > > > block-backed filesystems then my verdict still stands that the only > > > condition under which I will let the VFS allow this if the underlying > > > device is signed and dm-verity protected. 
The kernel will continue to > > > refuse unprivileged policy in general and specifically based on quality > > > or implementation of the underlying filesystem driver. > > > > As far as I can tell, the main problems are: > > > > 1. Most filesystems can only be run in kernel mode, so one needs a > > VM and an expensive RPC protocol if one wants to run them in a > > sandboxed environment. > > > > 2. Context switch overhead is so high that running filesystems entirely > > in userspace, without some form of in-kernel I/O acceleration, > > is a performance problem. > > > > 3. Filesystems are written in C and not designed to be secure against > > malicious on-disk images. > > > > Gao Xiang is working on problem for EROFS. > > FUSE iomap support solves 2. lklfuse solves problem 1. > > Sigh, I just would like to say, as Darrick and Jan's previous > replies, immutable on-disk fses are a special kind of filesystems > and the overall on-disk format is to provide vfs/MM basic > informattion (like LOOKUP, GETATTR, and READDIR, READ), and the > reason is that even some values of metadata could be considered > as inconsistent, it's just like FUSE unprivileged daemon returns > garbage (meta)data and/or TAR extracts garbage (meta)data -- > shouldn't matter at all. > > Why I'm here is I'm totally exhaused by arbitary claim like > "all kernel filesystem are insecure". Again, that is absolutely > untrue: the feature set, the working model and the implementation > complexity of immutable filesystems make it more secure by > design. 
> > Also the reason of "another 500 mail discussion about > FS_USERNS_MOUNT" is just because "FS_USERNS_MOUNT is very very > useful to containers", and the special kind of immutable on-disk > filesystems can fit this goal technically which is much much > unlike to generic writable ondisk fses or NFS and why I working > on EROFS is also because I believe immutable ondisk filesystems > are absolutely useful, more secure than other generic writable > fses by design especially on containers and handling untrusted > remote data. > > I here claim again that all implementation vulnerability of > EROFS will claim as 0-day bug, and I've already did in this way > for many years. Let's step back, even not me, if there are > some other sane immutable filesystems aiming for containers, > they will definitely claim the same, why not? If you want unprivileged filesystem drivers mountable by arbitrary users and containers then get behind the effort to move this completely out of the kernel and into fuse making fuse fast enough so that we don't have to think about it anymore. The whole push over the last years has been that if users want to mount arbitrary in-kernel filesystems in userspace then they better built a delegation and security model _in userspace_ to make this happen. This is why we built mountfsd in userspace which works just fine today. I don't understand what exactly people think is going to happen once we start promising that mounting untrusted images in the kernel for even one filesystem is fine. This will march us down security madness we have not experienced before with all of the k8s and container workloads out there. For me it is currently still completely irrelevant what filesystem driver this is and whether it is immutable or not. Look at the size of your attack surface in your codebase and your algorithms and the ever expanding functionality it exposes. 
This pipe dream of "rootless" containers being able to mount arbitrary images in-kernel without userspace policy is not workable. We debate this over and over because userspace is unwilling to accept that there are fundamental policy problems that are not solved in the kernel. And that includes when it is safe to mount arbitrary data. This is especially true now as we're being flooded with (valid and invalid) CVEs due to everyone believing their personal LLM companion. You're going to be at LSF/MM/BPF and I'm sure there'll be more discussion around this. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 11:14 ` Jan Kara 2026-03-23 11:42 ` Gao Xiang @ 2026-03-23 12:08 ` Demi Marie Obenour 2026-03-23 12:13 ` Gao Xiang 1 sibling, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-23 12:08 UTC (permalink / raw) To: Jan Kara, Gao Xiang Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 5606 bytes --] On 3/23/26 07:14, Jan Kara wrote: > Hi Gao! > > On Mon 23-03-26 18:19:16, Gao Xiang wrote: >> On 2026/3/23 17:54, Jan Kara wrote: >>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>> >>>> >>>> Let me try to reply the remaining part: >>>> >>>>> To me, the attacks mentioned above are all either user error, >>>>> or vulnerabilities in software accessing the filesystem. If one >>>> >>>> There are many consequences if users try to use potential inconsistent >>>> writable filesystems directly (without full fsck), what I can think >>>> out including but not limited to: >>>> >>>> - data loss (considering data block double free issue); >>>> - data theft (for example, users keep sensitive information in the >>>> workload in a high permission inode but it can be read with >>>> low permission malicious inode later); >>>> - data tamper (the same principle). >>>> >>>> All vulnerabilities above happen after users try to write the >>>> inconsistent filesystem, which is hard to prevent by on-disk >>>> design. >>>> >>>> But if users write with copy-on-write to another local consistent >>>> filesystem, all the vulnerabilities above won't exist. 
>>> >>> OK, so if I understand correctly you are advocating that untrusted initial data >>> should be provided on immutable filesystem and any needed modification >>> would be handled by overlayfs (or some similar layer) and stored on >>> (initially empty) writeable filesystem. >>> >>> That's a sensible design for usecase like containers but what started this >>> thread about FUSE drivers for filesystems were usecases like access to >>> filesystems on drives attached at USB port of your laptop. There it isn't >>> really practical to use your design. You need a standard writeable >>> filesystem for that but at the same time you cannot quite trust the content >>> of everything that gets attached to your USB port... >> >> Yes, that is my proposal and my overall interest now. I know >> your interest but I'm here I just would like to say: >> >> Without full scan fsck, even with FUSE, the system is still >> vulnerable if the FUSE approch is used. >> >> I could give a detailed example, for example: >> >> There are passwd files `/etc/passwd` and `/etc/shadow` with >> proper permissions (for example, you could audit the file >> permission with e2fsprogs/xfsprogs without a full fsck scan) in >> the inconsistent remote filesystems, but there are some other >> malicious files called "foo" and "bar" somewhere with low >> permissions but sharing the same blocks which is disallowed >> by filesystem on-disk formats illegally (because they violate >> copy-on-write semantics by design), also see my previous >> reply: >> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >> >> The initial data of `/etc/passwd` and `/etc/shadow` in the >> filesystem image doesn't matter, but users could then keep >> very sensitive information later just out of the >> inconsistent filesystems, which could cause "data theft" >> above. > > Yes, I've seen you mentioning this case earlier in this thread. But let me > say I consider it rather contrived :). 
For the container usecase if you are > fetching say a root fs image and don't trust the content of the image, then > how do you know it doesn't contain a malicious code that sends all the > sensitive data to some third party? So I believe the owner of the container > has to trust the content of the image, otherwise you've already lost. > > The container environment *provider* doesn't necessarily trust either the > container owner or the image so they need to make sure their infrastructure > isn't compromised by malicious actions from these - and for that either > your immutable image scheme or FUSE mounting works. > > Similarly with the USB drive content. Either some malicious actor plugs USB > drive into a laptop, it gets automounted, and that must not crash the > kernel or give attacker more priviledge - but that's all - no data is > stored on the drive. Or I myself plug some not-so-trusted USB drive to my > laptop to read some content from it or possibly put there some data for a > friend - again that must not compromise my machine but I'd be really dumb > and already lost the security game if I'd put any sensitive data to such > drive. And for this USB drive case FUSE mounting solves these problems > nicely. > > So in my opinion for practical usecases the FUSE solution addresses the > real security concerns. > > Honza I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot mess with things like my home directory. Personally, I would run the FUSE filesystem in a VM but that's a separate concern. There are also (very severe) concerns about USB devices *specifically*. These are off-topic for this discussion, though. Of course, the FUSE filesystem must be mounted with nosuid, nodev, and nosymfollow. Otherwise there are lots of attacks possible. Finally, it is very much possible to use storage that one does not have complete trust in, provided that one uses cryptography to ensure that the damage it can do is limited. Many backup systems work this way. 
-- Sincerely, Demi Marie Obenour (she/her/hers) ^ permalink raw reply [flat|nested] 79+ messages in thread
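The nosuid/nodev/nosymfollow hardening mentioned above can be written down as mount options. A hypothetical fstab entry for an sshfs mount might look like the following (the host, paths and filesystem type are illustrative, and nosymfollow requires Linux 5.10 or newer with a recent util-linux):

```shell
# Hypothetical /etc/fstab entry for a FUSE filesystem hardened as
# described above: setuid bits, device nodes, and symlink traversal on
# the untrusted filesystem are all neutralized.
# <source>           <target>        <type>      <options>                                <dump> <pass>
user@host:/export    /mnt/untrusted  fuse.sshfs  ro,nosuid,nodev,nosymfollow,noauto,user  0      0
```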
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 12:08 ` Demi Marie Obenour @ 2026-03-23 12:13 ` Gao Xiang 2026-03-23 12:19 ` Demi Marie Obenour 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 12:13 UTC (permalink / raw) To: Demi Marie Obenour, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 20:08, Demi Marie Obenour wrote: > On 3/23/26 07:14, Jan Kara wrote: >> Hi Gao! >> >> On Mon 23-03-26 18:19:16, Gao Xiang wrote: >>> On 2026/3/23 17:54, Jan Kara wrote: >>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>>> >>>>> >>>>> Let me try to reply the remaining part: >>>>> >>>>>> To me, the attacks mentioned above are all either user error, >>>>>> or vulnerabilities in software accessing the filesystem. If one >>>>> >>>>> There are many consequences if users try to use potential inconsistent >>>>> writable filesystems directly (without full fsck), what I can think >>>>> out including but not limited to: >>>>> >>>>> - data loss (considering data block double free issue); >>>>> - data theft (for example, users keep sensitive information in the >>>>> workload in a high permission inode but it can be read with >>>>> low permission malicious inode later); >>>>> - data tamper (the same principle). >>>>> >>>>> All vulnerabilities above happen after users try to write the >>>>> inconsistent filesystem, which is hard to prevent by on-disk >>>>> design. >>>>> >>>>> But if users write with copy-on-write to another local consistent >>>>> filesystem, all the vulnerabilities above won't exist. 
>>>> >>>> OK, so if I understand correctly you are advocating that untrusted initial data >>>> should be provided on immutable filesystem and any needed modification >>>> would be handled by overlayfs (or some similar layer) and stored on >>>> (initially empty) writeable filesystem. >>>> >>>> That's a sensible design for usecase like containers but what started this >>>> thread about FUSE drivers for filesystems were usecases like access to >>>> filesystems on drives attached at USB port of your laptop. There it isn't >>>> really practical to use your design. You need a standard writeable >>>> filesystem for that but at the same time you cannot quite trust the content >>>> of everything that gets attached to your USB port... >>> >>> Yes, that is my proposal and my overall interest now. I know >>> your interest but I'm here I just would like to say: >>> >>> Without full scan fsck, even with FUSE, the system is still >>> vulnerable if the FUSE approch is used. >>> >>> I could give a detailed example, for example: >>> >>> There are passwd files `/etc/passwd` and `/etc/shadow` with >>> proper permissions (for example, you could audit the file >>> permission with e2fsprogs/xfsprogs without a full fsck scan) in >>> the inconsistent remote filesystems, but there are some other >>> malicious files called "foo" and "bar" somewhere with low >>> permissions but sharing the same blocks which is disallowed >>> by filesystem on-disk formats illegally (because they violate >>> copy-on-write semantics by design), also see my previous >>> reply: >>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >>> >>> The initial data of `/etc/passwd` and `/etc/shadow` in the >>> filesystem image doesn't matter, but users could then keep >>> very sensitive information later just out of the >>> inconsistent filesystems, which could cause "data theft" >>> above. >> >> Yes, I've seen you mentioning this case earlier in this thread. 
But let me >> say I consider it rather contrived :). For the container usecase if you are >> fetching say a root fs image and don't trust the content of the image, then >> how do you know it doesn't contain a malicious code that sends all the >> sensitive data to some third party? So I believe the owner of the container >> has to trust the content of the image, otherwise you've already lost. >> >> The container environment *provider* doesn't necessarily trust either the >> container owner or the image so they need to make sure their infrastructure >> isn't compromised by malicious actions from these - and for that either >> your immutable image scheme or FUSE mounting works. >> >> Similarly with the USB drive content. Either some malicious actor plugs USB >> drive into a laptop, it gets automounted, and that must not crash the >> kernel or give attacker more priviledge - but that's all - no data is >> stored on the drive. Or I myself plug some not-so-trusted USB drive to my >> laptop to read some content from it or possibly put there some data for a >> friend - again that must not compromise my machine but I'd be really dumb >> and already lost the security game if I'd put any sensitive data to such >> drive. And for this USB drive case FUSE mounting solves these problems >> nicely. >> >> So in my opinion for practical usecases the FUSE solution addresses the >> real security concerns. >> >> Honza > > I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot > mess with things like my home directory. Personally, I would run > the FUSE filesystem in a VM but that's a separate concern. > > There are also (very severe) concerns about USB devices *specifically*. > These are off-topic for this discussion, though. > > Of course, the FUSE filesystem must be mounted with nosuid, nodev, > and nosymfollow. Otherwise there are lots of attacks possible. 
> > Finally, it is very much possible to use storage that one does not have > complete trust in, provided that one uses cryptography to ensure that > the damage it can do is limited. Many backup systems work this way. In brief, as I said, that is _not_ always a security concern: - If you don't fsck, and FUSE mount it, the data you write to that filesystem could be lost if the writable filesystem is inconsistent; - But if you fsck in advance, the kernel implementation should make sure it fixes all bugs that can be triggered by consistent filesystems. So what is the point of "no fsck" here if you cannot safely write anything to the filesystem with the FUSE approach? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
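[Editor's note: the "fsck in advance, then FUSE mount" flow debated here (Darrick's `e2fsck -fn` suggestion is quoted later in the thread) can be sketched as a small wrapper. This is only an illustration, assuming `e2fsck` and `fuse2fs` from e2fsprogs are installed; the `check_and_mount` name and the paths are made up.]

```shell
# check_and_mount: hypothetical wrapper that refuses to FUSE-mount an
# ext2/3/4 image unless a full, read-only fsck pass comes back clean.
check_and_mount() {
    img=$1
    mnt=$2
    # -f forces a full scan even if the fs is marked clean;
    # -n opens the image read-only and answers "no" to all repairs.
    if ! e2fsck -fn "$img"; then
        echo "fsck reported problems: refusing to mount $img" >&2
        return 1
    fi
    # fuse2fs (from e2fsprogs) serves the image over FUSE; nosuid and
    # nodev limit what a crafted image can do even after a clean check.
    fuse2fs "$img" "$mnt" -o nosuid,nodev
}
```

Usage would be e.g. `check_and_mount /dev/sdb1 /mnt/usb`. Note this still has the TOCTOU window raised below: the device can change between the check and the mount.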
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 12:13 ` Gao Xiang @ 2026-03-23 12:19 ` Demi Marie Obenour 2026-03-23 12:30 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Demi Marie Obenour @ 2026-03-23 12:19 UTC (permalink / raw) To: Gao Xiang, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc [-- Attachment #1.1.1: Type: text/plain, Size: 6733 bytes --] On 3/23/26 08:13, Gao Xiang wrote: > > > On 2026/3/23 20:08, Demi Marie Obenour wrote: >> On 3/23/26 07:14, Jan Kara wrote: >>> Hi Gao! >>> >>> On Mon 23-03-26 18:19:16, Gao Xiang wrote: >>>> On 2026/3/23 17:54, Jan Kara wrote: >>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>>>> >>>>>> >>>>>> Let me try to reply the remaining part: >>>>>> >>>>>>> To me, the attacks mentioned above are all either user error, >>>>>>> or vulnerabilities in software accessing the filesystem. If one >>>>>> >>>>>> There are many consequences if users try to use potential inconsistent >>>>>> writable filesystems directly (without full fsck), what I can think >>>>>> out including but not limited to: >>>>>> >>>>>> - data loss (considering data block double free issue); >>>>>> - data theft (for example, users keep sensitive information in the >>>>>> workload in a high permission inode but it can be read with >>>>>> low permission malicious inode later); >>>>>> - data tamper (the same principle). >>>>>> >>>>>> All vulnerabilities above happen after users try to write the >>>>>> inconsistent filesystem, which is hard to prevent by on-disk >>>>>> design. 
>>>>>> >>>>>> But if users write with copy-on-write to another local consistent >>>>>> filesystem, all the vulnerabilities above won't exist. >>>>> >>>>> OK, so if I understand correctly you are advocating that untrusted initial data >>>>> should be provided on immutable filesystem and any needed modification >>>>> would be handled by overlayfs (or some similar layer) and stored on >>>>> (initially empty) writeable filesystem. >>>>> >>>>> That's a sensible design for usecase like containers but what started this >>>>> thread about FUSE drivers for filesystems were usecases like access to >>>>> filesystems on drives attached at USB port of your laptop. There it isn't >>>>> really practical to use your design. You need a standard writeable >>>>> filesystem for that but at the same time you cannot quite trust the content >>>>> of everything that gets attached to your USB port... >>>> >>>> Yes, that is my proposal and my overall interest now. I know >>>> your interest but I'm here I just would like to say: >>>> >>>> Without full scan fsck, even with FUSE, the system is still >>>> vulnerable if the FUSE approch is used. 
>>>> >>>> I could give a detailed example, for example: >>>> >>>> There are passwd files `/etc/passwd` and `/etc/shadow` with >>>> proper permissions (for example, you could audit the file >>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in >>>> the inconsistent remote filesystems, but there are some other >>>> malicious files called "foo" and "bar" somewhere with low >>>> permissions but sharing the same blocks which is disallowed >>>> by filesystem on-disk formats illegally (because they violate >>>> copy-on-write semantics by design), also see my previous >>>> reply: >>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >>>> >>>> The initial data of `/etc/passwd` and `/etc/shadow` in the >>>> filesystem image doesn't matter, but users could then keep >>>> very sensitive information later just out of the >>>> inconsistent filesystems, which could cause "data theft" >>>> above. >>> >>> Yes, I've seen you mentioning this case earlier in this thread. But let me >>> say I consider it rather contrived :). For the container usecase if you are >>> fetching say a root fs image and don't trust the content of the image, then >>> how do you know it doesn't contain a malicious code that sends all the >>> sensitive data to some third party? So I believe the owner of the container >>> has to trust the content of the image, otherwise you've already lost. >>> >>> The container environment *provider* doesn't necessarily trust either the >>> container owner or the image so they need to make sure their infrastructure >>> isn't compromised by malicious actions from these - and for that either >>> your immutable image scheme or FUSE mounting works. >>> >>> Similarly with the USB drive content. Either some malicious actor plugs USB >>> drive into a laptop, it gets automounted, and that must not crash the >>> kernel or give attacker more priviledge - but that's all - no data is >>> stored on the drive. 
Or I myself plug some not-so-trusted USB drive to my >>> laptop to read some content from it or possibly put there some data for a >>> friend - again that must not compromise my machine but I'd be really dumb >>> and already lost the security game if I'd put any sensitive data to such >>> drive. And for this USB drive case FUSE mounting solves these problems >>> nicely. >>> >>> So in my opinion for practical usecases the FUSE solution addresses the >>> real security concerns. >>> >>> Honza >> >> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot >> mess with things like my home directory. Personally, I would run >> the FUSE filesystem in a VM but that's a separate concern. >> >> There are also (very severe) concerns about USB devices *specifically*. >> These are off-topic for this discussion, though. >> >> Of course, the FUSE filesystem must be mounted with nosuid, nodev, >> and nosymfollow. Otherwise there are lots of attacks possible. >> >> Finally, it is very much possible to use storage that one does not have >> complete trust in, provided that one uses cryptography to ensure that >> the damage it can do is limited. Many backup systems work this way. > > In brief, as I said, that is _not_ always a security concern: > > - If you don't fsck, and FUSE mount it, your write data to that > filesystem could be lost if the writable filesystem is > inconsistent; In the applications I am thinking of, one _hopes_ that the filesystem is consistent, which it almost always will be. However, one wants to be safe in the unlikely case of it being inconsistent. > - But if you fsck in advance and the filesystem, the kernel > implementation should make sure they should fix all bugs of > consistent filesystems. > > So what's the meaning of "no fsck" here if you cannot write > anything in it with FUSE approaches. FUSE can (and usually does) have write support. Also, fsck does not protect against TOCTOU attacks. 
-- Sincerely, Demi Marie Obenour (she/her/hers) ^ permalink raw reply [flat|nested] 79+ messages in thread
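[Editor's note: the hardening options Demi lists can be spelled out concretely. A sketch with placeholder paths; sshfs is only a stand-in for any FUSE filesystem, `nosymfollow` requires Linux 5.10+, and whether it is passed through depends on the fusermount version, so treat the exact invocation as an assumption.]

```shell
# Sketch: hardening options for mounting an untrusted FUSE filesystem,
# per the discussion above.
#
#   nosuid      - ignore setuid/setgid bits on files from the untrusted fs
#   nodev       - refuse device nodes provided by the untrusted fs
#   nosymfollow - never follow symlinks inside the mount (Linux >= 5.10)
opts=nosuid,nodev,nosymfollow

# Kernels before 5.10 reject nosymfollow, so drop it there.
case $(uname -r) in
  [0-4].*|5.[0-9].*) opts=nosuid,nodev ;;
esac

# Hypothetical usage (sshfs is a stand-in for any FUSE filesystem):
#   sshfs user@host:/export /mnt/untrusted -o "$opts"
# or, applied to an already-mounted filesystem:
#   mount -o "remount,$opts" /mnt/untrusted
echo "mount options: $opts"
```

Unprivileged FUSE mounts already get `nosuid,nodev` implicitly; spelling them out only matters for privileged mounts.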
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 12:19 ` Demi Marie Obenour @ 2026-03-23 12:30 ` Gao Xiang 2026-03-23 12:33 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 12:30 UTC (permalink / raw) To: Demi Marie Obenour, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 20:19, Demi Marie Obenour wrote: > On 3/23/26 08:13, Gao Xiang wrote: >> >> >> On 2026/3/23 20:08, Demi Marie Obenour wrote: >>> On 3/23/26 07:14, Jan Kara wrote: >>>> Hi Gao! >>>> >>>> On Mon 23-03-26 18:19:16, Gao Xiang wrote: >>>>> On 2026/3/23 17:54, Jan Kara wrote: >>>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>>>>> >>>>>>> >>>>>>> Let me try to reply the remaining part: >>>>>>> >>>>>>>> To me, the attacks mentioned above are all either user error, >>>>>>>> or vulnerabilities in software accessing the filesystem. If one >>>>>>> >>>>>>> There are many consequences if users try to use potential inconsistent >>>>>>> writable filesystems directly (without full fsck), what I can think >>>>>>> out including but not limited to: >>>>>>> >>>>>>> - data loss (considering data block double free issue); >>>>>>> - data theft (for example, users keep sensitive information in the >>>>>>> workload in a high permission inode but it can be read with >>>>>>> low permission malicious inode later); >>>>>>> - data tamper (the same principle). 
>>>>>>> >>>>>>> All vulnerabilities above happen after users try to write the >>>>>>> inconsistent filesystem, which is hard to prevent by on-disk >>>>>>> design. >>>>>>> >>>>>>> But if users write with copy-on-write to another local consistent >>>>>>> filesystem, all the vulnerabilities above won't exist. >>>>>> >>>>>> OK, so if I understand correctly you are advocating that untrusted initial data >>>>>> should be provided on immutable filesystem and any needed modification >>>>>> would be handled by overlayfs (or some similar layer) and stored on >>>>>> (initially empty) writeable filesystem. >>>>>> >>>>>> That's a sensible design for usecase like containers but what started this >>>>>> thread about FUSE drivers for filesystems were usecases like access to >>>>>> filesystems on drives attached at USB port of your laptop. There it isn't >>>>>> really practical to use your design. You need a standard writeable >>>>>> filesystem for that but at the same time you cannot quite trust the content >>>>>> of everything that gets attached to your USB port... >>>>> >>>>> Yes, that is my proposal and my overall interest now. I know >>>>> your interest but I'm here I just would like to say: >>>>> >>>>> Without full scan fsck, even with FUSE, the system is still >>>>> vulnerable if the FUSE approch is used. 
>>>>> >>>>> I could give a detailed example, for example: >>>>> >>>>> There are passwd files `/etc/passwd` and `/etc/shadow` with >>>>> proper permissions (for example, you could audit the file >>>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in >>>>> the inconsistent remote filesystems, but there are some other >>>>> malicious files called "foo" and "bar" somewhere with low >>>>> permissions but sharing the same blocks which is disallowed >>>>> by filesystem on-disk formats illegally (because they violate >>>>> copy-on-write semantics by design), also see my previous >>>>> reply: >>>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >>>>> >>>>> The initial data of `/etc/passwd` and `/etc/shadow` in the >>>>> filesystem image doesn't matter, but users could then keep >>>>> very sensitive information later just out of the >>>>> inconsistent filesystems, which could cause "data theft" >>>>> above. >>>> >>>> Yes, I've seen you mentioning this case earlier in this thread. But let me >>>> say I consider it rather contrived :). For the container usecase if you are >>>> fetching say a root fs image and don't trust the content of the image, then >>>> how do you know it doesn't contain a malicious code that sends all the >>>> sensitive data to some third party? So I believe the owner of the container >>>> has to trust the content of the image, otherwise you've already lost. >>>> >>>> The container environment *provider* doesn't necessarily trust either the >>>> container owner or the image so they need to make sure their infrastructure >>>> isn't compromised by malicious actions from these - and for that either >>>> your immutable image scheme or FUSE mounting works. >>>> >>>> Similarly with the USB drive content. Either some malicious actor plugs USB >>>> drive into a laptop, it gets automounted, and that must not crash the >>>> kernel or give attacker more priviledge - but that's all - no data is >>>> stored on the drive. 
Or I myself plug some not-so-trusted USB drive to my >>>> laptop to read some content from it or possibly put there some data for a >>>> friend - again that must not compromise my machine but I'd be really dumb >>>> and already lost the security game if I'd put any sensitive data to such >>>> drive. And for this USB drive case FUSE mounting solves these problems >>>> nicely. >>>> >>>> So in my opinion for practical usecases the FUSE solution addresses the >>>> real security concerns. >>>> >>>> Honza >>> >>> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot >>> mess with things like my home directory. Personally, I would run >>> the FUSE filesystem in a VM but that's a separate concern. >>> >>> There are also (very severe) concerns about USB devices *specifically*. >>> These are off-topic for this discussion, though. >>> >>> Of course, the FUSE filesystem must be mounted with nosuid, nodev, >>> and nosymfollow. Otherwise there are lots of attacks possible. >>> >>> Finally, it is very much possible to use storage that one does not have >>> complete trust in, provided that one uses cryptography to ensure that >>> the damage it can do is limited. Many backup systems work this way. >> >> In brief, as I said, that is _not_ always a security concern: >> >> - If you don't fsck, and FUSE mount it, your write data to that >> filesystem could be lost if the writable filesystem is >> inconsistent; > > In the applications I am thinking of, one _hopes_ that the filesystem > is consistent, which it almost always will be. However, one wants > to be safe in the unlikely case of it being inconsistent. I don't think so: a USB stick can be corrupted too, and data received over the network can be corrupted as well; there are too many practical problems here. > >> - But if you fsck in advance and the filesystem, the kernel >> implementation should make sure they should fix all bugs of >> consistent filesystems.
>> >> So what's the meaning of "no fsck" here if you cannot write >> anything in it with FUSE approaches. > > FUSE can (and usually does) have write support. Also, fsck does not > protect against TOCTOU attacks. If you consider TOCTOU attacks, how can a FUSE filesystem protect against them if the device can change underneath it at any time? Sigh, I think the whole story is that: - The kernel's writable filesystems should fix all bugs as long as the filesystem is consistent, and that condition should be ensured by fsck in advance; - So, alternative approaches like FUSE are not meaningful _only if_ we cannot do "fsck" (let's set aside the untypical TOCTOU case). - But without "fsck", the filesystem can be inconsistent by chance or by attack, so the written data can be lost. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 12:30 ` Gao Xiang @ 2026-03-23 12:33 ` Gao Xiang 0 siblings, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-03-23 12:33 UTC (permalink / raw) To: Demi Marie Obenour, Jan Kara Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 20:30, Gao Xiang wrote: > > > On 2026/3/23 20:19, Demi Marie Obenour wrote: >> On 3/23/26 08:13, Gao Xiang wrote: >>> >>> >>> On 2026/3/23 20:08, Demi Marie Obenour wrote: >>>> On 3/23/26 07:14, Jan Kara wrote: >>>>> Hi Gao! >>>>> >>>>> On Mon 23-03-26 18:19:16, Gao Xiang wrote: >>>>>> On 2026/3/23 17:54, Jan Kara wrote: >>>>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote: >>>>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote: >>>>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it >>>>>>>>>> starts up the rest of the libfuse initialization but who knows if that's >>>>>>>>>> an acceptable risk. Also unclear if you actually want -fy for that. >>>>>>>>> >>>>>>>> >>>>>>>> Let me try to reply the remaining part: >>>>>>>> >>>>>>>>> To me, the attacks mentioned above are all either user error, >>>>>>>>> or vulnerabilities in software accessing the filesystem. If one >>>>>>>> >>>>>>>> There are many consequences if users try to use potential inconsistent >>>>>>>> writable filesystems directly (without full fsck), what I can think >>>>>>>> out including but not limited to: >>>>>>>> >>>>>>>> - data loss (considering data block double free issue); >>>>>>>> - data theft (for example, users keep sensitive information in the >>>>>>>> workload in a high permission inode but it can be read with >>>>>>>> low permission malicious inode later); >>>>>>>> - data tamper (the same principle). 
>>>>>>>> >>>>>>>> All vulnerabilities above happen after users try to write the >>>>>>>> inconsistent filesystem, which is hard to prevent by on-disk >>>>>>>> design. >>>>>>>> >>>>>>>> But if users write with copy-on-write to another local consistent >>>>>>>> filesystem, all the vulnerabilities above won't exist. >>>>>>> >>>>>>> OK, so if I understand correctly you are advocating that untrusted initial data >>>>>>> should be provided on immutable filesystem and any needed modification >>>>>>> would be handled by overlayfs (or some similar layer) and stored on >>>>>>> (initially empty) writeable filesystem. >>>>>>> >>>>>>> That's a sensible design for usecase like containers but what started this >>>>>>> thread about FUSE drivers for filesystems were usecases like access to >>>>>>> filesystems on drives attached at USB port of your laptop. There it isn't >>>>>>> really practical to use your design. You need a standard writeable >>>>>>> filesystem for that but at the same time you cannot quite trust the content >>>>>>> of everything that gets attached to your USB port... >>>>>> >>>>>> Yes, that is my proposal and my overall interest now. I know >>>>>> your interest but I'm here I just would like to say: >>>>>> >>>>>> Without full scan fsck, even with FUSE, the system is still >>>>>> vulnerable if the FUSE approch is used. 
>>>>>> >>>>>> I could give a detailed example, for example: >>>>>> >>>>>> There are passwd files `/etc/passwd` and `/etc/shadow` with >>>>>> proper permissions (for example, you could audit the file >>>>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in >>>>>> the inconsistent remote filesystems, but there are some other >>>>>> malicious files called "foo" and "bar" somewhere with low >>>>>> permissions but sharing the same blocks which is disallowed >>>>>> by filesystem on-disk formats illegally (because they violate >>>>>> copy-on-write semantics by design), also see my previous >>>>>> reply: >>>>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com >>>>>> >>>>>> The initial data of `/etc/passwd` and `/etc/shadow` in the >>>>>> filesystem image doesn't matter, but users could then keep >>>>>> very sensitive information later just out of the >>>>>> inconsistent filesystems, which could cause "data theft" >>>>>> above. >>>>> >>>>> Yes, I've seen you mentioning this case earlier in this thread. But let me >>>>> say I consider it rather contrived :). For the container usecase if you are >>>>> fetching say a root fs image and don't trust the content of the image, then >>>>> how do you know it doesn't contain a malicious code that sends all the >>>>> sensitive data to some third party? So I believe the owner of the container >>>>> has to trust the content of the image, otherwise you've already lost. >>>>> >>>>> The container environment *provider* doesn't necessarily trust either the >>>>> container owner or the image so they need to make sure their infrastructure >>>>> isn't compromised by malicious actions from these - and for that either >>>>> your immutable image scheme or FUSE mounting works. >>>>> >>>>> Similarly with the USB drive content. 
Either some malicious actor plugs USB >>>>> drive into a laptop, it gets automounted, and that must not crash the >>>>> kernel or give attacker more priviledge - but that's all - no data is >>>>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my >>>>> laptop to read some content from it or possibly put there some data for a >>>>> friend - again that must not compromise my machine but I'd be really dumb >>>>> and already lost the security game if I'd put any sensitive data to such >>>>> drive. And for this USB drive case FUSE mounting solves these problems >>>>> nicely. >>>>> >>>>> So in my opinion for practical usecases the FUSE solution addresses the >>>>> real security concerns. >>>>> >>>>> Honza >>>> >>>> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot >>>> mess with things like my home directory. Personally, I would run >>>> the FUSE filesystem in a VM but that's a separate concern. >>>> >>>> There are also (very severe) concerns about USB devices *specifically*. >>>> These are off-topic for this discussion, though. >>>> >>>> Of course, the FUSE filesystem must be mounted with nosuid, nodev, >>>> and nosymfollow. Otherwise there are lots of attacks possible. >>>> >>>> Finally, it is very much possible to use storage that one does not have >>>> complete trust in, provided that one uses cryptography to ensure that >>>> the damage it can do is limited. Many backup systems work this way. >>> >>> In brief, as I said, that is _not_ always a security concern: >>> >>> - If you don't fsck, and FUSE mount it, your write data to that >>> filesystem could be lost if the writable filesystem is >>> inconsistent; >> >> In the applications I am thinking of, one _hopes_ that the filesystem >> is consistent, which it almost always will be. However, one wants >> to be safe in the unlikely case of it being inconsistent. 
> > I don't think so, USB stick can be corrupted too and the > network can receive in , there are too many practical > problems here. Not because of attacks; just because of cheap USB sticks or bad network conditions, for example. > >> >>> - But if you fsck in advance and the filesystem, the kernel >>> implementation should make sure they should fix all bugs of >>> consistent filesystems. >>> >>> So what's the meaning of "no fsck" here if you cannot write >>> anything in it with FUSE approaches. >> >> FUSE can (and usually does) have write support. Also, fsck does not >> protect against TOCTOU attacks. > > If you consider TOCTOU attacks, why FUSE filesystem can > protect if it TOCTOU randomly? > > Sigh, I think the whole story is that: > > - The kernel writable filesystem should fix all bugs if the > filesystem is consistent, and this condition should be > ensured by fsck in advance; > > - So, alternative approaches like FUSE are not meaningful ^ are meaningful > _only if_ we cannot do "fsck" (let's not think untypical > TOCTOU). > > - but without "fsck", the filesystem can be inconsistent > by change or by attack, so the write stuff can be lost. > > Thanks, > Gao Xiang > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 3:25 ` Demi Marie Obenour 2026-03-22 3:52 ` Gao Xiang 2026-03-22 4:51 ` Gao Xiang @ 2026-03-22 5:14 ` Gao Xiang 2026-03-23 9:43 ` [Lsf-pc] " Jan Kara 2 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-22 5:14 UTC (permalink / raw) To: Demi Marie Obenour, Darrick J. Wong Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/22 11:25, Demi Marie Obenour wrote: > The only exceptions are if the filesystem is incredibly simple > or formal methods are used, and neither is the case for existing > filesystems in the Linux kernel. Again, first, I don't think "simple" is a helpful or descriptive word in this area: "simple" formats are all just formats that archive the filesystem data and metadata, without any further use cases. There is nothing simpler than that, because you need to tell the VFS about the file (meta)data (even if the file data is garbage), otherwise they can't be called filesystems. So why do we always fall into comparing which archive filesystem is simpler than the others, unless some of those "simple" filesystems have genuinely buggy designs? Here, I can definitely say the _EROFS uncompressed format_ fits this description, and I will later write down formally what the outcome is if each on-disk field contains unexpected values such as garbage numbers. And the final goal is to allow the EROFS uncompressed format to be mounted as the "root" of totally isolated user/mount namespaces, since that is really useful and carries no practical risk. If any other kernel filesystem maintainers say that they can do the same, why not allow them to do the same thing? I don't think it's a reasonable policy that "because the EXT4, XFS and Btrfs communities say they cannot tolerate the consequences of inconsistency, any other kernel filesystem should follow the same policy even if it doesn't have such issues by design."
In other words, is the TCP/IP protocol simple? And is there no simpler protocol for network data? I don't think so, but then why is untrusted network data allowed to be parsed in the kernel? Is the TCP/IP kernel implementation already bug-free? Quite confusing. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-22 5:14 ` Gao Xiang @ 2026-03-23 9:43 ` Jan Kara 2026-03-23 10:05 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Jan Kara @ 2026-03-23 9:43 UTC (permalink / raw) To: Gao Xiang Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On Sun 22-03-26 13:14:55, Gao Xiang wrote: > > > On 2026/3/22 11:25, Demi Marie Obenour wrote: > > > The only exceptions are if the filesystem is incredibly simple > > or formal methods are used, and neither is the case for existing > > filesystems in the Linux kernel. > > Again, first, I don't think "simple" is a helpful and descriptive > word out of this kind of area: > > "simple" formats are all formats just archive the filesystem > data and metadata, but without any more use cases. No simpler > than that, because you need to tell vfs the file (meta)data > (even the file data is the garbage data), otherwise they won't > be called as filesystems. > > So why we always fall into comparing which archive filesystem > is simpler than others unless some bad buggy designs in those > "simple" filesystems. > > Here, I can definitely say _EROFS uncompressed format_ fits > this kind of area, and I will write down formally later if each > on-disk field has unexpected values like garbage numbers, what > the outcome. And the final goal is to allow EROFS uncompressed > format can be mounted as the "root" into totally isolated > user/mount namespaces since it's really useful and no practical > risk. > > If any other kernel filesystem maintainers say that they can do > the same , why not also allow them do the same thing? 
I don't > think it's a reasonable policy that "due to EXT4, XFS, BtrFS > communities say that they cannot tolerate the inconsistent > consequence, any other kernel filesystem should follow the > same policy even they don't have such issue by design." > > In other words, does TCP/IP protocol simple? and is there no > simplier protocol for network data? I don't think so, but why > untrusted network data can be parsed in the kernel? Does > TCP/IP kernel implementation already bugless? So the amount of state TCP/IP needs to keep around is very small (I'd say kilobytes) compared to the amount of state a filesystem needs to maintain (gigabytes). This leads to very fundamental differences in the complexity of data structures, their verification, etc. So yes, it is much easier to harden TCP/IP against untrusted input than a filesystem implementation. And yes, when you have immutable filesystem, things are much simpler because the data structures and algorithms can be much simpler and as you wrote a lot of these inconsistencies don't matter (at least for the kernel). But once you add ability to modify the filesystem - here I don't think it matters whether through CoW or other means - things get complicated quickly and it gets much more complex to make your code resilient to all kinds of inconsistencies... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 9:43 ` [Lsf-pc] " Jan Kara @ 2026-03-23 10:05 ` Gao Xiang 2026-03-23 10:14 ` Jan Kara 0 siblings, 1 reply; 79+ messages in thread From: Gao Xiang @ 2026-03-23 10:05 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Jan, On 2026/3/23 17:43, Jan Kara wrote: > On Sun 22-03-26 13:14:55, Gao Xiang wrote: >> >> >> On 2026/3/22 11:25, Demi Marie Obenour wrote: >> >>> The only exceptions are if the filesystem is incredibly simple >>> or formal methods are used, and neither is the case for existing >>> filesystems in the Linux kernel. >> >> Again, first, I don't think "simple" is a helpful and descriptive >> word out of this kind of area: >> >> "simple" formats are all formats just archive the filesystem >> data and metadata, but without any more use cases. No simpler >> than that, because you need to tell vfs the file (meta)data >> (even the file data is the garbage data), otherwise they won't >> be called as filesystems. >> >> So why we always fall into comparing which archive filesystem >> is simpler than others unless some bad buggy designs in those >> "simple" filesystems. >> >> Here, I can definitely say _EROFS uncompressed format_ fits >> this kind of area, and I will write down formally later if each >> on-disk field has unexpected values like garbage numbers, what >> the outcome. And the final goal is to allow EROFS uncompressed >> format can be mounted as the "root" into totally isolated >> user/mount namespaces since it's really useful and no practical >> risk. >> >> If any other kernel filesystem maintainers say that they can do >> the same , why not also allow them do the same thing? 
I don't >> think it's a reasonable policy that "due to EXT4, XFS, BtrFS >> communities say that they cannot tolerate the inconsistent >> consequence, any other kernel filesystem should follow the >> same policy even they don't have such issue by design." >> >> In other words, does TCP/IP protocol simple? and is there no >> simplier protocol for network data? I don't think so, but why >> untrusted network data can be parsed in the kernel? Does >> TCP/IP kernel implementation already bugless? > > So the amount of state TCP/IP needs to keep around is very small (I'd say > kilobytes) compared to the amount of state a filesystem needs to maintain > (gigabytes). This leads to very fundamental differences in the complexity > of data structures, their verification, etc. So yes, it is much easier to > harden TCP/IP against untrusted input than a filesystem implementation. Thanks for the reply. I just want to say that I don't think the core EROFS format is complex either, but I don't want to get into yet another comparison of which filesystems count as "simple", since TCP/IP is not the simplest possible protocol either. In brief, mounting as "root" in an isolated user/mount namespace is absolutely our interest and useful to container users, and as an author and the maintainer of EROFS, I can ensure that EROFS copes with untrusted (meta)data. > > And yes, when you have immutable filesystem, things are much simpler > because the data structures and algorithms can be much simpler and as you > wrote a lot of these inconsistencies don't matter (at least for the > kernel). But once you add ability to modify the filesystem - here I don't > think it matters whether through CoW or other means - things get > complicated quickly and it gets much more complex to make your code > resilient to all kinds of inconsistencies...
I only consider the COW approach using OverlayFS for example, it just copies up (meta)data into another filesystem (the semantics is just like copy the file in the userspace) and the immutable filesystem image won't change in any case. Overlayfs write goes through normal user write and the writable filesystem is consistent, so I don't think it does matter. Or am I missing something? (e.g could you point out some case which OverlayFS cannot handle properly?) Thanks, Gao Xiang > > Honza ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 10:05 ` Gao Xiang @ 2026-03-23 10:14 ` Jan Kara 2026-03-23 10:30 ` Gao Xiang 0 siblings, 1 reply; 79+ messages in thread From: Jan Kara @ 2026-03-23 10:14 UTC (permalink / raw) To: Gao Xiang Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc Hi Gao! On Mon 23-03-26 18:05:17, Gao Xiang wrote: > On 2026/3/23 17:43, Jan Kara wrote: > > And yes, when you have immutable filesystem, things are much simpler > > because the data structures and algorithms can be much simpler and as you > > wrote a lot of these inconsistencies don't matter (at least for the > > kernel). But once you add ability to modify the filesystem - here I don't > > think it matters whether through CoW or other means - things get > > complicated quickly and it gets much more complex to make your code > > resilient to all kinds of inconsistencies... > > I only consider the COW approach using OverlayFS for example, > it just copies up (meta)data into another filesystem (the > semantics is just like copy the file in the userspace) and > the immutable filesystem image won't change in any case. > > Overlayfs write goes through normal user write and the > writable filesystem is consistent, so I don't think it does > matter. Or am I missing something? (e.g could you point > out some case which OverlayFS cannot handle properly?) No, you are correct. For the usecases where immutable fs + overlayfs + empty initial writeable filesystem works, this is a safe design. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-23 10:14 ` Jan Kara @ 2026-03-23 10:30 ` Gao Xiang 0 siblings, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-03-23 10:30 UTC (permalink / raw) To: Jan Kara Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc On 2026/3/23 18:14, Jan Kara wrote: > Hi Gao! > > On Mon 23-03-26 18:05:17, Gao Xiang wrote: >> On 2026/3/23 17:43, Jan Kara wrote: >>> And yes, when you have immutable filesystem, things are much simpler >>> because the data structures and algorithms can be much simpler and as you >>> wrote a lot of these inconsistencies don't matter (at least for the >>> kernel). But once you add ability to modify the filesystem - here I don't >>> think it matters whether through CoW or other means - things get >>> complicated quickly and it gets much more complex to make your code >>> resilient to all kinds of inconsistencies... >> >> I only consider the COW approach using OverlayFS for example, >> it just copies up (meta)data into another filesystem (the >> semantics is just like copy the file in the userspace) and >> the immutable filesystem image won't change in any case. >> >> Overlayfs write goes through normal user write and the >> writable filesystem is consistent, so I don't think it does >> matter. Or am I missing something? (e.g could you point >> out some case which OverlayFS cannot handle properly?) > > No, you are correct. For the usecases where immutable fs + overlayfs + > empty initial writeable filesystem works, this is a safe design. (Just BTW, not only to empty initial writable filesystems but also to previously mounted, consistent trusted local filesystems used as upper layers, as they are typical cases for containers.) Thanks, Gao Xiang > > Honza ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 19:06 ` Darrick J. Wong ` (2 preceding siblings ...) 2026-02-04 22:50 ` Gao Xiang @ 2026-02-04 23:19 ` Gao Xiang 2026-02-05 3:33 ` John Groves 4 siblings, 0 replies; 79+ messages in thread From: Gao Xiang @ 2026-02-04 23:19 UTC (permalink / raw) To: Darrick J. Wong, Miklos Szeredi Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On 2026/2/5 03:06, Darrick J. Wong wrote: > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote: ... >> >> - BPF scripts > > Is this an extension of the fuse-bpf filtering discussion that happened > in 2023? (I wondered why you wouldn't just do bpf hooks in the vfs > itself, but maybe hch already NAKed that?) For this part: as far as I can tell, no one NAKed vfs BPF hooks, and I had a similar idea two years ago: https://lore.kernel.org/r/CAOQ4uxjCebxGxkguAh9s4_Vg7QHM=oBoV0LUPZpb+0pcm3z1bw@mail.gmail.com We have some fanotify BPF hook ideas, e.g. to make lazy pulling more efficient with applied BPF filters, and I've asked BPF experts to look into that, but there is no timeline for this either. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-04 19:06 ` Darrick J. Wong ` (3 preceding siblings ...) 2026-02-04 23:19 ` Gao Xiang @ 2026-02-05 3:33 ` John Groves 2026-02-05 9:27 ` Amir Goldstein 4 siblings, 1 reply; 79+ messages in thread From: John Groves @ 2026-02-05 3:33 UTC (permalink / raw) To: Darrick J. Wong Cc: Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer On 26/02/04 11:06AM, Darrick J. Wong wrote: [ ... ] > > - famfs: export distributed memory > > This has been, uh, hanging out for an extraordinarily long time. Um, *yeah*. Although a significant part of that time was on me, because getting it ported into fuse was kinda hard, my users and I are hoping we can get this upstreamed fairly soon now. I'm hoping that after the 6.19 merge window dust settles we can negotiate any needed changes etc. and shoot for the 7.0 merge window. :D Regards, John ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-05 3:33 ` John Groves @ 2026-02-05 9:27 ` Amir Goldstein 2026-02-06 5:52 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Amir Goldstein @ 2026-02-05 9:27 UTC (permalink / raw) To: john Cc: Darrick J. Wong, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert, Luis Henriques, Horst Birthelmer On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > [ ... ] > > > > - famfs: export distributed memory > > > > This has been, uh, hanging out for an extraordinarily long time. > > Um, *yeah*. Although a significant part of that time was on me, because > getting it ported into fuse was kinda hard, my users and I are hoping we > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > merge window dust settles we can negotiate any needed changes etc. and > shoot for the 7.0 merge window. > I think that the work on famfs is setting an example, and I very much hope it will be a good example, of how improving existing infrastructure (FUSE) is a better contribution than adding another fs to the pile. I acknowledge that doing the latter is way easier (not for vfs maintainers) and I very much appreciate your efforts working on the generic FUSE support that will hopefully serve the community and your users better in the long run. Thanks, Amir. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-05 9:27 ` Amir Goldstein @ 2026-02-06 5:52 ` Darrick J. Wong 2026-02-06 20:48 ` John Groves 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-06 5:52 UTC (permalink / raw) To: Amir Goldstein Cc: john, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert, Luis Henriques, Horst Birthelmer On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > [ ... ] > > > > > > - famfs: export distributed memory > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > Um, *yeah*. Although a significant part of that time was on me, because > > getting it ported into fuse was kinda hard, my users and I are hoping we > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > merge window dust settles we can negotiate any needed changes etc. and > > shoot for the 7.0 merge window. I think we've all missed getting merged for 7.0 since 6.19 will be released in 3 days. :/ (Granted most of the maintainers I know are /much/ less conservative than I was about the schedule) > I think that the work on famfs is setting an example, and I very much > hope it will be a good example, of how improving existing infrastructure > (FUSE) is a better contribution than adding another fs to the pile. Yeah. Joanne and I spent a couple of days this week coprogramming a prototype of a way for famfs to create BPF programs to handle INTERLEAVED_EXTENT files. We might be ready to show that off in a couple of weeks, and that might be a way to clear up the GET_FMAP/IOMAP_BEGIN logjam at last. 
--D > I acknowledge that doing the latter is way easier (not for vfs maintainers) > and I very much appreciate your efforts working on the generic FUSE support > that will hopefully serve the community and your users better in the long run. > > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-06 5:52 ` Darrick J. Wong @ 2026-02-06 20:48 ` John Groves 2026-02-07 0:22 ` Joanne Koong 2026-02-20 23:59 ` Darrick J. Wong 0 siblings, 2 replies; 79+ messages in thread From: John Groves @ 2026-02-06 20:48 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert, Luis Henriques, Horst Birthelmer On 26/02/05 09:52PM, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > [ ... ] > > > > > > > > - famfs: export distributed memory > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > merge window dust settles we can negotiate any needed changes etc. and > > > shoot for the 7.0 merge window. > > I think we've all missed getting merged for 7.0 since 6.19 will be > released in 3 days. :/ > > (Granted most of the maintainers I know are /much/ less conservative > than I was about the schedule) Doh - right you are... > > > I think that the work on famfs is setting an example, and I very much > > hope it will be a good example, of how improving existing infrastructure > > (FUSE) is a better contribution than adding another fs to the pile. > > Yeah. Joanne and I spent a couple of days this week coprogramming a > prototype of a way for famfs to create BPF programs to handle > INTERLEAVED_EXTENT files. 
We might be ready to show that off in a > couple of weeks, and that might be a way to clear up the > GET_FMAP/IOMAP_BEGIN logjam at last. I'd love to learn more about this; happy to do a call if that's a good way to get me briefed. I [generally but not specifically] understand how this could avoid GET_FMAP, but not GET_DAXDEV. But I'm not sure it could (or should) avoid dax_iomap_rw() and dax_iomap_fault(). The thing is that those call my begin() function to resolve an offset in a file to an offset on a daxdev, and then dax completes the fault or memcpy. In that dance, famfs never knows the kernel address of the memory at all (also true of xfs in fs-dax mode, unless that's changed fairly recently). I think that's a pretty decent interface all in all. Also: dunno whether y'all have looked at the dax patches in the famfs series, but the solution to working with Alistair's folio-ification and cleanup of the dax layer (which set me back months) was to create drivers/dax/fsdev.c, which, when bound to a daxdev in place of drivers/dax/device.c, configures folios & pages compatibly with fs-dax. So I kinda think I need the dax_iomap* interface. As usual, if I'm overlooking something let me know... Regards, John ^ permalink raw reply [flat|nested] 79+ messages in thread
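[The offset-resolution dance described above, where dax_iomap_rw() and dax_iomap_fault() call the filesystem's begin() function to turn a file offset into a daxdev offset, can be sketched in plain userspace C. This is a minimal illustration only: the struct name and extent table here are hypothetical, not famfs's actual fmap layout.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: maps a range of the file onto the daxdev. */
struct fmap_extent {
	uint64_t file_off;   /* start offset within the file */
	uint64_t dax_off;    /* corresponding offset on the daxdev */
	uint64_t len;        /* length of the mapping in bytes */
};

/*
 * Sketch of what an ->iomap_begin-style callback does: given a file
 * offset, find the covering extent and report the daxdev offset plus
 * how many contiguous bytes remain in that extent.  Returns -1 for a
 * hole (no covering extent); the fault/memcpy itself is then completed
 * by the dax layer, not the filesystem.
 */
static int resolve_offset(const struct fmap_extent *map, size_t nr,
			  uint64_t pos, uint64_t *dax_off, uint64_t *avail)
{
	for (size_t i = 0; i < nr; i++) {
		if (pos >= map[i].file_off &&
		    pos < map[i].file_off + map[i].len) {
			*dax_off = map[i].dax_off + (pos - map[i].file_off);
			*avail = map[i].file_off + map[i].len - pos;
			return 0;
		}
	}
	return -1;
}
```

[A linear scan stands in for whatever lookup structure the real code uses; the point is only the division of labor between the fs callback and the dax layer.]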
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-06 20:48 ` John Groves @ 2026-02-07 0:22 ` Joanne Koong 2026-02-12 4:46 ` Joanne Koong 2026-02-20 23:59 ` Darrick J. Wong 1 sibling, 1 reply; 79+ messages in thread From: Joanne Koong @ 2026-02-07 0:22 UTC (permalink / raw) To: John Groves Cc: Darrick J. Wong, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > [ ... ] > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > shoot for the 7.0 merge window. > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > released in 3 days. :/ > > > > (Granted most of the maintainers I know are /much/ less conservative > > than I was about the schedule) > > Doh - right you are... > > > > > > I think that the work on famfs is setting an example, and I very much > > > hope it will be a good example, of how improving existing infrastructure > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a > > prototype of a way for famfs to create BPF programs to handle > > INTERLEAVED_EXTENT files. 
We might be ready to show that off in a > > couple of weeks, and that might be a way to clear up the > > GET_FMAP/IOMAP_BEGIN logjam at last. > > I'd love to learn more about this; happy to do a call if that's a > good way to get me briefed. > > I [generally but not specifically] understand how this could avoid > GET_FMAP, but not GET_DAXDEV. > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > dax_iomap_fault(). The thing is that those call my begin() function > to resolve an offset in a file to an offset on a daxdev, and then > dax completes the fault or memcpy. In that dance, famfs never knows > the kernel address of the memory at all (also true of xfs in fs-dax > mode, unless that's changed fairly recently). I think that's a pretty > decent interface all in all. > > Also: dunno whether y'all have looked at the dax patches in the famfs > series, but the solution to working with Alistair's folio-ification > and cleanup of the dax layer (which set me back months) was to create > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > drivers/dax/device.c, configures folios & pages compatibly with > fs-dax. So I kinda think I need the dax_iomap* interface. > > As usual, if I'm overlooking something let me know... Hi John, The conversation started [1] on Darrick's containerization patchset about using bpf to a) avoid extra requests / context switching for ->iomap_begin and ->iomap_end calls and b) offload what would otherwise have to be hard-coded kernel logic into userspace, which gives userspace more flexibility / control with updating the logic and is less of a maintenance burden for fuse. There was some musing [2] about whether with bpf infrastructure added, it would allow famfs to move all famfs-specific logic to userspace/bpf. I agree that it makes sense for famfs to go through dax iomap interfaces. 
imo it seems cleanest if fuse has a generic iomap interface with iomap dax going through that plumbing, and any famfs-specific logic that would be needed beyond that (eg computing the interleaved mappings) being moved to custom famfs bpf programs. I started trying to implement this yesterday afternoon because I wanted to make sure it would actually be doable for the famfs logic before bringing it up and I didn't want to derail your project. So far I only have the general iomap interface for fuse added with dax operations going through dax_iomap* and haven't tried out integrating the famfs GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to get to that early next week. The work I did with Darrick this week was on getting a server's bpf programs hooked up to fuse through bpf links and Darrick has fleshed that out and gotten that working now. If it turns out famfs can go through a generic iomap fuse plumbing layer, I'd be curious to hear your thoughts on which approach you'd prefer. Thanks, Joanne [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > Regards, > John > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-07 0:22 ` Joanne Koong @ 2026-02-12 4:46 ` Joanne Koong 2026-02-21 0:37 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Joanne Koong @ 2026-02-12 4:46 UTC (permalink / raw) To: John Groves Cc: Darrick J. Wong, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > > > [ ... ] > > > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > > shoot for the 7.0 merge window. > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > > released in 3 days. :/ > > > > > > (Granted most of the maintainers I know are /much/ less conservative > > > than I was about the schedule) > > > > Doh - right you are... > > > > > > > > > I think that the work on famfs is setting an example, and I very much > > > > hope it will be a good example, of how improving existing infrastructure > > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > > > Yeah. 
Joanne and I spent a couple of days this week coprogramming a > > > prototype of a way for famfs to create BPF programs to handle > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a > > > couple of weeks, and that might be a way to clear up the > > > GET_FMAP/IOMAP_BEGIN logjam at last. > > > > I'd love to learn more about this; happy to do a call if that's a > > good way to get me briefed. > > > > I [generally but not specifically] understand how this could avoid > > GET_FMAP, but not GET_DAXDEV. > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > > dax_iomap_fault(). The thing is that those call my begin() function > > to resolve an offset in a file to an offset on a daxdev, and then > > dax completes the fault or memcpy. In that dance, famfs never knows > > the kernel address of the memory at all (also true of xfs in fs-dax > > mode, unless that's changed fairly recently). I think that's a pretty > > decent interface all in all. > > > > Also: dunno whether y'all have looked at the dax patches in the famfs > > series, but the solution to working with Alistair's folio-ification > > and cleanup of the dax layer (which set me back months) was to create > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > > drivers/dax/device.c, configures folios & pages compatibly with > > fs-dax. So I kinda think I need the dax_iomap* interface. > > > > As usual, if I'm overlooking something let me know... > > Hi John, > > The conversation started [1] on Darrick's containerization patchset > about using bpf to a) avoid extra requests / context switching for > ->iomap_begin and ->iomap_end calls and b) offload what would > otherwise have to be hard-coded kernel logic into userspace, which > gives userspace more flexibility / control with updating the logic and > is less of a maintenance burden for fuse. 
There was some musing [2] > about whether with bpf infrastructure added, it would allow famfs to > move all famfs-specific logic to userspace/bpf. > > I agree that it makes sense for famfs to go through dax iomap > interfaces. imo it seems cleanest if fuse has a generic iomap > interface with iomap dax going through that plumbing, and any > famfs-specific logic that would be needed beyond that (eg computing > the interleaved mappings) being moved to custom famfs bpf programs. I > started trying to implement this yesterday afternoon because I wanted > to make sure it would actually be doable for the famfs logic before > bringing it up and I didn't want to derail your project. So far I only > have the general iomap interface for fuse added with dax operations > going through dax_iomap* and haven't tried out integrating the famfs > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to > get to that early next week. The work I did with Darrick this week was > on getting a server's bpf programs hooked up to fuse through bpf links > and Darrick has fleshed that out and gotten that working now. If it > turns out famfs can go through a generic iomap fuse plumbing layer, > I'd be curious to hear your thoughts on which approach you'd prefer. I put together a quick prototype to test this out - this is what it looks like with fuse having a generic iomap interface that supports dax [1], and the famfs custom logic moved to a bpf program [2]. I didn't change much, I just moved around your famfs code to the bpf side. The kernel side changes are in [3] and the libfuse changes are in [4]. For testing out the prototype, I hooked it up to passthrough_hp to test running the bpf program and verify that it is able to find the extent from the bpf map. In my opinion, this makes the fuse side infrastructure cleaner and more extendable for other servers that will want to go through dax iomap in the future, but I think this also has a few benefits for famfs. 
Instead of needing to issue a FUSE_GET_FMAP request after a file is opened, the server can directly populate the metadata map from userspace with the mapping info when it processes the FUSE_OPEN request, which gets rid of the roundtrip cost. The server can dynamically update the metadata at any time from userspace if the mapping info needs to change in the future. For setting up the daxdevs, I moved your logic to the init side, where the server passes the daxdev info upfront through an IOMAP_CONFIG exchange with the kernel initializing the daxdevs based off that info. I think this will also make deploying future updates for famfs easier, as updating the logic won't need to go through the upstream kernel mailing list process and deploying updates won't require a new kernel release. These are just my two cents based on my (cursory) understanding of famfs. Just wanted to float this alternative approach in case it's useful. Thanks, Joanne [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/ [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/ > > Thanks, > Joanne > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > > > > Regards, > > John > > ^ permalink raw reply [flat|nested] 79+ messages in thread
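[The data flow described above, where the server pushes per-inode mapping metadata into a BPF map while handling FUSE_OPEN and the kernel-side program consults it at iomap time instead of issuing a FUSE_GET_FMAP round trip, can be modeled in a few lines of ordinary C. The fixed-size table below stands in for the BPF hash map; all names are illustrative, not the prototype's actual API.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64

/* Stand-in for the value stored in the BPF map, keyed by FUSE nodeid. */
struct inode_fmap {
	uint64_t nodeid;
	uint64_t dax_base;	/* base offset of the file on the daxdev */
	uint64_t size;		/* file size in bytes */
	int valid;
};

static struct inode_fmap fmap_map[MAP_SLOTS];

/* "Server side": populate the map while handling FUSE_OPEN. */
static void fmap_update(uint64_t nodeid, uint64_t dax_base, uint64_t size)
{
	struct inode_fmap *e = &fmap_map[nodeid % MAP_SLOTS];

	*e = (struct inode_fmap){ nodeid, dax_base, size, 1 };
}

/* "Kernel side": consult the map at iomap_begin time; no round trip. */
static const struct inode_fmap *fmap_lookup(uint64_t nodeid)
{
	const struct inode_fmap *e = &fmap_map[nodeid % MAP_SLOTS];

	return (e->valid && e->nodeid == nodeid) ? e : NULL;
}
```

[A real BPF hash map handles collisions and concurrent updates for you; the toy table above simply drops the old entry on a slot collision, which is enough to show the open-time-populate / io-time-lookup split.]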
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-12 4:46 ` Joanne Koong @ 2026-02-21 0:37 ` Darrick J. Wong 2026-02-26 20:21 ` Joanne Koong 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-02-21 0:37 UTC (permalink / raw) To: Joanne Koong Cc: John Groves, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote: > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > > > shoot for the 7.0 merge window. > > > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > > > released in 3 days. :/ > > > > > > > > (Granted most of the maintainers I know are /much/ less conservative > > > > than I was about the schedule) > > > > > > Doh - right you are... 
> > > > > > > > > > > > I think that the work on famfs is setting an example, and I very much > > > > > hope it will be a good example, of how improving existing infrastructure > > > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a > > > > prototype of a way for famfs to create BPF programs to handle > > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a > > > > couple of weeks, and that might be a way to clear up the > > > > GET_FMAP/IOMAP_BEGIN logjam at last. > > > > > > I'd love to learn more about this; happy to do a call if that's a > > > good way to get me briefed. > > > > > > I [generally but not specifically] understand how this could avoid > > > GET_FMAP, but not GET_DAXDEV. > > > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > > > dax_iomap_fault(). The thing is that those call my begin() function > > > to resolve an offset in a file to an offset on a daxdev, and then > > > dax completes the fault or memcpy. In that dance, famfs never knows > > > the kernel address of the memory at all (also true of xfs in fs-dax > > > mode, unless that's changed fairly recently). I think that's a pretty > > > decent interface all in all. > > > > > > Also: dunno whether y'all have looked at the dax patches in the famfs > > > series, but the solution to working with Alistair's folio-ification > > > and cleanup of the dax layer (which set me back months) was to create > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > > > drivers/dax/device.c, configures folios & pages compatibly with > > > fs-dax. So I kinda think I need the dax_iomap* interface. > > > > > > As usual, if I'm overlooking something let me know... 
> > > > Hi John, > > > > The conversation started [1] on Darrick's containerization patchset > > about using bpf to a) avoid extra requests / context switching for > > ->iomap_begin and ->iomap_end calls and b) offload what would > > otherwise have to be hard-coded kernel logic into userspace, which > > gives userspace more flexibility / control with updating the logic and > > is less of a maintenance burden for fuse. There was some musing [2] > > about whether with bpf infrastructure added, it would allow famfs to > > move all famfs-specific logic to userspace/bpf. > > > > I agree that it makes sense for famfs to go through dax iomap > > interfaces. imo it seems cleanest if fuse has a generic iomap > > interface with iomap dax going through that plumbing, and any > > famfs-specific logic that would be needed beyond that (eg computing > > the interleaved mappings) being moved to custom famfs bpf programs. I > > started trying to implement this yesterday afternoon because I wanted > > to make sure it would actually be doable for the famfs logic before > > bringing it up and I didn't want to derail your project. So far I only > > have the general iomap interface for fuse added with dax operations > > going through dax_iomap* and haven't tried out integrating the famfs > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to > > get to that early next week. The work I did with Darrick this week was > > on getting a server's bpf programs hooked up to fuse through bpf links > > and Darrick has fleshed that out and gotten that working now. If it > > turns out famfs can go through a generic iomap fuse plumbing layer, > > I'd be curious to hear your thoughts on which approach you'd prefer. > > I put together a quick prototype to test this out - this is what it > looks like with fuse having a generic iomap interface that supports > dax [1], and the famfs custom logic moved to a bpf program [2]. 
I The bpf maps that you've used to upload per-inode data into the kernel is a /much/ cleaner method than custom-compiling C into BPF at runtime! You can statically compile the BPF object code into the fuse server, which means that (a) you can take advantage of the bpftool skeletons, and (b) you can in theory vendor-sign the BPF code if and when that becomes a requirement. I think that's way better than having to put vmlinux.h and fuse_iomap_bpf.h on the deployed system. Though there's one hitch in example/Makefile: vmlinux.h: $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@ The build system isn't necessarily running the same kernel as the deploy images. It might be for Meta, but it's not unheard of for our build system to be running (say) OL10+UEK8 kernel, but the build target is OL8 and UEK7. There doesn't seem to be any standardization across distros for where a vmlinux.h file might be found. Fedora puts it under /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I guess SUSE doesn't ship it at all? That's going to be a headache for deployment as I've been muttering for a couple of weeks now. :( Maybe we could reduce the fuse-iomap bpf definitions to use only cardinal types and the types that iomap itself defines. That might not be too hard right now because bpf functions reuse structures from include/uapi/fuse.h, which currently use uint{8,16,32,64}_t. It'll get harder if that __uintXX_t -> __uXX transition actually happens. But getting back to the famfs bpf stuff, I think doing the interleaved mappings via BPF gives the famfs server a lot more flexibility in terms of what it can do when future hardware arrives with even weirder configurations. --D > didn't change much, I just moved around your famfs code to the bpf > side. The kernel side changes are in [3] and the libfuse changes are > in [4]. 
> > For testing out the prototype, I hooked it up to passthrough_hp to > test running the bpf program and verify that it is able to find the > extent from the bpf map. In my opinion, this makes the fuse side > infrastructure cleaner and more extendable for other servers that will > want to go through dax iomap in the future, but I think this also has > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP > request after a file is opened, the server can directly populate the > metadata map from userspace with the mapping info when it processes > the FUSE_OPEN request, which gets rid of the roundtrip cost. The > server can dynamically update the metadata at any time from userspace > if the mapping info needs to change in the future. For setting up the > daxdevs, I moved your logic to the init side, where the server passes > the daxdev info upfront through an IOMAP_CONFIG exchange with the > kernel initializing the daxdevs based off that info. I think this will > also make deploying future updates for famfs easier, as updating the > logic won't need to go through the upstream kernel mailing list > process and deploying updates won't require a new kernel release. > > These are just my two cents based on my (cursory) understanding of > famfs. Just wanted to float this alternative approach in case it's > useful. 
> > Thanks, > Joanne > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/ > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/ > > > > > Thanks, > > Joanne > > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > > > > > > > Regards, > > > John > > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
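Darrick's closing point above -- that doing the interleaved mappings via
BPF gives the famfs server flexibility -- boils down to strided
arithmetic over the backing devices. A minimal plain-C sketch of that
kind of computation (the function name, parameters, and layout here are
purely illustrative; famfs's actual extent format lives in its patch
series, not in this thread):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative interleaved layout: file data striped round-robin across
 * `nstrips` dax devices in units of `chunk` bytes. */
struct strip_loc {
	uint32_t strip;     /* which strip (dax device) holds the byte */
	uint64_t strip_off; /* byte offset within that strip */
};

/* Resolve a file offset to a location in the interleave set -- the kind
 * of lookup an iomap_begin-style BPF program would perform. */
static struct strip_loc resolve_interleaved(uint64_t file_off,
					    uint64_t chunk,
					    uint32_t nstrips)
{
	uint64_t chunk_idx = file_off / chunk;  /* global chunk number */
	struct strip_loc loc = {
		.strip = (uint32_t)(chunk_idx % nstrips),
		/* each strip receives every nstrips-th chunk */
		.strip_off = (chunk_idx / nstrips) * chunk + file_off % chunk,
	};
	return loc;
}
```

With chunk = 4096 and nstrips = 4, file offset 16384 wraps back to strip
0 at strip offset 4096; keeping this logic in a BPF program rather than
hard-coded kernel code is what lets the server adapt it to new hardware
layouts without a kernel release.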
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-21 0:37 ` Darrick J. Wong @ 2026-02-26 20:21 ` Joanne Koong 2026-03-03 4:57 ` Darrick J. Wong 0 siblings, 1 reply; 79+ messages in thread From: Joanne Koong @ 2026-02-26 20:21 UTC (permalink / raw) To: Darrick J. Wong Cc: John Groves, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote: > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > > > > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > > > > shoot for the 7.0 merge window. > > > > > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > > > > released in 3 days. :/ > > > > > > > > > > (Granted most of the maintainers I know are /much/ less conservative > > > > > than I was about the schedule) > > > > > > > > Doh - right you are... 
> > > > > > > > > > > > > > > I think that the work on famfs is setting an example, and I very much > > > > > > hope it will be a good example, of how improving existing infrastructure > > > > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > > > > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a > > > > > prototype of a way for famfs to create BPF programs to handle > > > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a > > > > > couple of weeks, and that might be a way to clear up the > > > > > GET_FMAP/IOMAP_BEGIN logjam at last. > > > > > > > > I'd love to learn more about this; happy to do a call if that's a > > > > good way to get me briefed. > > > > > > > > I [generally but not specifically] understand how this could avoid > > > > GET_FMAP, but not GET_DAXDEV. > > > > > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > > > > dax_iomap_fault(). The thing is that those call my begin() function > > > > to resolve an offset in a file to an offset on a daxdev, and then > > > > dax completes the fault or memcpy. In that dance, famfs never knows > > > > the kernel address of the memory at all (also true of xfs in fs-dax > > > > mode, unless that's changed fairly recently). I think that's a pretty > > > > decent interface all in all. > > > > > > > > Also: dunno whether y'all have looked at the dax patches in the famfs > > > > series, but the solution to working with Alistair's folio-ification > > > > and cleanup of the dax layer (which set me back months) was to create > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > > > > drivers/dax/device.c, configures folios & pages compatibly with > > > > fs-dax. So I kinda think I need the dax_iomap* interface. > > > > > > > > As usual, if I'm overlooking something let me know... 
> > > > > > Hi John, > > > > > > The conversation started [1] on Darrick's containerization patchset > > > about using bpf to a) avoid extra requests / context switching for > > > ->iomap_begin and ->iomap_end calls and b) offload what would > > > otherwise have to be hard-coded kernel logic into userspace, which > > > gives userspace more flexibility / control with updating the logic and > > > is less of a maintenance burden for fuse. There was some musing [2] > > > about whether with bpf infrastructure added, it would allow famfs to > > > move all famfs-specific logic to userspace/bpf. > > > > > > I agree that it makes sense for famfs to go through dax iomap > > > interfaces. imo it seems cleanest if fuse has a generic iomap > > > interface with iomap dax going through that plumbing, and any > > > famfs-specific logic that would be needed beyond that (eg computing > > > the interleaved mappings) being moved to custom famfs bpf programs. I > > > started trying to implement this yesterday afternoon because I wanted > > > to make sure it would actually be doable for the famfs logic before > > > bringing it up and I didn't want to derail your project. So far I only > > > have the general iomap interface for fuse added with dax operations > > > going through dax_iomap* and haven't tried out integrating the famfs > > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to > > > get to that early next week. The work I did with Darrick this week was > > > on getting a server's bpf programs hooked up to fuse through bpf links > > > and Darrick has fleshed that out and gotten that working now. If it > > > turns out famfs can go through a generic iomap fuse plumbing layer, > > > I'd be curious to hear your thoughts on which approach you'd prefer. > > > > I put together a quick prototype to test this out - this is what it > > looks like with fuse having a generic iomap interface that supports > > dax [1], and the famfs custom logic moved to a bpf program [2]. 
> The bpf maps that you've used to upload per-inode data into the kernel
> is a /much/ cleaner method than custom-compiling C into BPF at runtime!
> You can statically compile the BPF object code into the fuse server,
> which means that (a) you can take advantage of the bpftool skeletons,
> and (b) you can in theory vendor-sign the BPF code if and when that
> becomes a requirement.
>
> I think that's way better than having to put vmlinux.h and
> fuse_iomap_bpf.h on the deployed system. Though there's one hitch in
> example/Makefile:
>
> vmlinux.h:
> 	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
>
> The build system isn't necessarily running the same kernel as the deploy
> images. It might be for Meta, but it's not unheard of for our build
> system to be running (say) OL10+UEK8 kernel, but the build target is OL8
> and UEK7.
>
> There doesn't seem to be any standardization across distros for where a
> vmlinux.h file might be found. Fedora puts it under
> /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
> guess SUSE doesn't ship it at all?
>
> That's going to be a headache for deployment as I've been muttering for
> a couple of weeks now. :(

I don't think this is an issue because bpf does dynamic btf-based
relocations (CO-RE) at load time [1]. On the target machine, when
libbpf loads the bpf object it will read the machine's btf and patch
any offsets in bytecode and load the fixed-up version into the kernel.
All that's needed on the target machine for CO-RE is
CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
distributions. I think this addresses the deployment headache you've
been running into?

Thanks,
Joanne

[1] https://docs.ebpf.io/concepts/core/

> Maybe we could reduce the fuse-iomap bpf definitions to use only
> cardinal types and the types that iomap itself defines.
That might not > be too hard right now because bpf functions reuse structures from > include/uapi/fuse.h, which currently use uint{8,16,32,64}_t. It'll get > harder if that __uintXX_t -> __uXX transition actually happens. > > But getting back to the famfs bpf stuff, I think doing the interleaved > mappings via BPF gives the famfs server a lot more flexibility in terms > of what it can do when future hardware arrives with even weirder > configurations. > > --D > > > didn't change much, I just moved around your famfs code to the bpf > > side. The kernel side changes are in [3] and the libfuse changes are > > in [4]. > > > > For testing out the prototype, I hooked it up to passthrough_hp to > > test running the bpf program and verify that it is able to find the > > extent from the bpf map. In my opinion, this makes the fuse side > > infrastructure cleaner and more extendable for other servers that will > > want to go through dax iomap in the future, but I think this also has > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP > > request after a file is opened, the server can directly populate the > > metadata map from userspace with the mapping info when it processes > > the FUSE_OPEN request, which gets rid of the roundtrip cost. The > > server can dynamically update the metadata at any time from userspace > > if the mapping info needs to change in the future. For setting up the > > daxdevs, I moved your logic to the init side, where the server passes > > the daxdev info upfront through an IOMAP_CONFIG exchange with the > > kernel initializing the daxdevs based off that info. I think this will > > also make deploying future updates for famfs easier, as updating the > > logic won't need to go through the upstream kernel mailing list > > process and deploying updates won't require a new kernel release. > > > > These are just my two cents based on my (cursory) understanding of > > famfs. 
Just wanted to float this alternative approach in case it's > > useful. > > > > Thanks, > > Joanne > > > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/ > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/ > > > > > > > > Thanks, > > > Joanne > > > > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > > > > > > > > > > Regards, > > > > John > > > > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
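The per-inode bpf map scheme Joanne describes (userspace populates
mapping metadata at FUSE_OPEN time, kernel-side BPF looks it up without
a request round trip) can be sketched roughly as below. All names and
the single-extent value layout are hypothetical -- the real definitions
live in the prototype branches linked above -- and this is kernel-side
BPF C, buildable only with clang -target bpf against vmlinux.h and the
libbpf headers:

```c
/* Sketch of a BTF-defined per-inode extent map, assuming hypothetical
 * famfs_extent layout.  Userspace inserts entries while handling
 * FUSE_OPEN; an iomap_begin-style BPF program looks extents up by fuse
 * nodeid instead of issuing FUSE_GET_FMAP after open. */
#include "vmlinux.h"            /* generated: bpftool btf dump ... format c */
#include <bpf/bpf_helpers.h>

struct famfs_extent {
	__u64 dev_index;        /* which daxdev backs this extent */
	__u64 dev_offset;       /* byte offset on that daxdev */
	__u64 len;              /* extent length in bytes */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u64);                 /* fuse nodeid */
	__type(value, struct famfs_extent); /* one extent, for brevity */
} inode_extents SEC(".maps");

char LICENSE[] SEC("license") = "GPL";
```

Because the map is writable from userspace via its libbpf handle, the
server can also update an inode's mapping later if it changes, which is
the dynamic-update property mentioned above.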
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-02-26 20:21 ` Joanne Koong @ 2026-03-03 4:57 ` Darrick J. Wong 2026-03-03 17:28 ` Joanne Koong 0 siblings, 1 reply; 79+ messages in thread From: Darrick J. Wong @ 2026-03-03 4:57 UTC (permalink / raw) To: Joanne Koong Cc: John Groves, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Thu, Feb 26, 2026 at 12:21:43PM -0800, Joanne Koong wrote: > On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote: > > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > > > > > > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > > > > > shoot for the 7.0 merge window. > > > > > > > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > > > > > released in 3 days. 
:/ > > > > > > > > > > > > (Granted most of the maintainers I know are /much/ less conservative > > > > > > than I was about the schedule) > > > > > > > > > > Doh - right you are... > > > > > > > > > > > > > > > > > > I think that the work on famfs is setting an example, and I very much > > > > > > > hope it will be a good example, of how improving existing infrastructure > > > > > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > > > > > > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a > > > > > > prototype of a way for famfs to create BPF programs to handle > > > > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a > > > > > > couple of weeks, and that might be a way to clear up the > > > > > > GET_FMAP/IOMAP_BEGIN logjam at last. > > > > > > > > > > I'd love to learn more about this; happy to do a call if that's a > > > > > good way to get me briefed. > > > > > > > > > > I [generally but not specifically] understand how this could avoid > > > > > GET_FMAP, but not GET_DAXDEV. > > > > > > > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > > > > > dax_iomap_fault(). The thing is that those call my begin() function > > > > > to resolve an offset in a file to an offset on a daxdev, and then > > > > > dax completes the fault or memcpy. In that dance, famfs never knows > > > > > the kernel address of the memory at all (also true of xfs in fs-dax > > > > > mode, unless that's changed fairly recently). I think that's a pretty > > > > > decent interface all in all. > > > > > > > > > > Also: dunno whether y'all have looked at the dax patches in the famfs > > > > > series, but the solution to working with Alistair's folio-ification > > > > > and cleanup of the dax layer (which set me back months) was to create > > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > > > > > drivers/dax/device.c, configures folios & pages compatibly with > > > > > fs-dax. 
So I kinda think I need the dax_iomap* interface. > > > > > > > > > > As usual, if I'm overlooking something let me know... > > > > > > > > Hi John, > > > > > > > > The conversation started [1] on Darrick's containerization patchset > > > > about using bpf to a) avoid extra requests / context switching for > > > > ->iomap_begin and ->iomap_end calls and b) offload what would > > > > otherwise have to be hard-coded kernel logic into userspace, which > > > > gives userspace more flexibility / control with updating the logic and > > > > is less of a maintenance burden for fuse. There was some musing [2] > > > > about whether with bpf infrastructure added, it would allow famfs to > > > > move all famfs-specific logic to userspace/bpf. > > > > > > > > I agree that it makes sense for famfs to go through dax iomap > > > > interfaces. imo it seems cleanest if fuse has a generic iomap > > > > interface with iomap dax going through that plumbing, and any > > > > famfs-specific logic that would be needed beyond that (eg computing > > > > the interleaved mappings) being moved to custom famfs bpf programs. I > > > > started trying to implement this yesterday afternoon because I wanted > > > > to make sure it would actually be doable for the famfs logic before > > > > bringing it up and I didn't want to derail your project. So far I only > > > > have the general iomap interface for fuse added with dax operations > > > > going through dax_iomap* and haven't tried out integrating the famfs > > > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to > > > > get to that early next week. The work I did with Darrick this week was > > > > on getting a server's bpf programs hooked up to fuse through bpf links > > > > and Darrick has fleshed that out and gotten that working now. If it > > > > turns out famfs can go through a generic iomap fuse plumbing layer, > > > > I'd be curious to hear your thoughts on which approach you'd prefer. 
> > > I put together a quick prototype to test this out - this is what it
> > > looks like with fuse having a generic iomap interface that supports
> > > dax [1], and the famfs custom logic moved to a bpf program [2].

[ ... ]

> > That's going to be a headache for deployment as I've been muttering for
> > a couple of weeks now. :(
>
> I don't think this is an issue because bpf does dynamic btf-based
> relocations (CO-RE) at load time [1]. On the target machine, when
> libbpf loads the bpf object it will read the machine's btf and patch
> any offsets in bytecode and load the fixed-up version into the kernel.
> All that's needed on the target machine for CO-RE is
> CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
> distributions. I think this addresses the deployment headache you've
> been running into?
Not really -- CO-RE does indeed work quite nicely to smooth over layout
changes in C structures between a BPF program and the kernel it's being
loaded into (thanks, whoever came up with that!) but the problem I have
is how you /get/ those definitions into clang in the first place.

I was under the impression from many of the bpf examples that you're
supposed to #include a distro-provided "vmlinux.h", but there doesn't
seem to be a standard way to find that file. Most -dev packages provide
a pkgconfig file that gives you the appropriate CFLAGS/LDFLAGS to add,
but apparently this is not the case for BPF...?

Perhaps it's the case that distro packages that are building BPF
programs simply add a build dependency on the package providing
vmlinux.h (e.g. Build-Depends: linux-bpf-dev on Debian) and patch in
"CFLAGS=-I/some/path" as needed?

I suppose for a dynamically generated and compiled BPF program, one
could just "bpftool skel" the /sys/kernel/btf files, capture the output,
and "#include </dev/fd/XXX>" the results. Honestly that sounds better
than trusting some weird system package.

But maybe dynamic compilation is a totally stupid idea. I did grow up
in the era of mshtml email wreaking havoc, after all...

--D

> Thanks,
> Joanne
>
> [1] https://docs.ebpf.io/concepts/core/

[ ... ]

> > > didn't change much, I just moved around your famfs code to the bpf
> > > side.
The kernel side changes are in [3] and the libfuse changes are > > > in [4]. > > > > > > For testing out the prototype, I hooked it up to passthrough_hp to > > > test running the bpf program and verify that it is able to find the > > > extent from the bpf map. In my opinion, this makes the fuse side > > > infrastructure cleaner and more extendable for other servers that will > > > want to go through dax iomap in the future, but I think this also has > > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP > > > request after a file is opened, the server can directly populate the > > > metadata map from userspace with the mapping info when it processes > > > the FUSE_OPEN request, which gets rid of the roundtrip cost. The > > > server can dynamically update the metadata at any time from userspace > > > if the mapping info needs to change in the future. For setting up the > > > daxdevs, I moved your logic to the init side, where the server passes > > > the daxdev info upfront through an IOMAP_CONFIG exchange with the > > > kernel initializing the daxdevs based off that info. I think this will > > > also make deploying future updates for famfs easier, as updating the > > > logic won't need to go through the upstream kernel mailing list > > > process and deploying updates won't require a new kernel release. > > > > > > These are just my two cents based on my (cursory) understanding of > > > famfs. Just wanted to float this alternative approach in case it's > > > useful. 
> > > > > > Thanks, > > > Joanne > > > > > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd > > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa > > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/ > > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/ > > > > > > > > > > > Thanks, > > > > Joanne > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d > > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > > > > > > > > > > > > > Regards, > > > > > John > > > > > > > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
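The build flow the two of them are debating condenses to a few commands,
in the style of the Makefile rule quoted above. File names here are
illustrative, and the recipe assumes bpftool and clang are present on
the build machine:

```shell
# Build-machine side: generate vmlinux.h from the *build* kernel's BTF,
# compile the BPF object, and embed a skeleton in the fuse server binary.
# At runtime, libbpf applies CO-RE relocations against the *target*
# kernel's BTF (/sys/kernel/btf/vmlinux) when it loads the object.
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -O2 -g -target bpf -c fuse_iomap.bpf.c -o fuse_iomap.bpf.o
bpftool gen skeleton fuse_iomap.bpf.o > fuse_iomap.skel.h
```

The skeleton header is what lets the server ship the BPF object compiled
in, rather than depending on a distro-packaged vmlinux.h at deploy time.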
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more 2026-03-03 4:57 ` Darrick J. Wong @ 2026-03-03 17:28 ` Joanne Koong 0 siblings, 0 replies; 79+ messages in thread From: Joanne Koong @ 2026-03-03 17:28 UTC (permalink / raw) To: Darrick J. Wong Cc: John Groves, Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Bernd Schubert, Luis Henriques, Horst Birthelmer On Mon, Mar 2, 2026 at 8:57 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Thu, Feb 26, 2026 at 12:21:43PM -0800, Joanne Koong wrote: > > On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote: > > > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > > > > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote: > > > > > > > > > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote: > > > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote: > > > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote: > > > > > > > > > > > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote: > > > > > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > > > > - famfs: export distributed memory > > > > > > > > > > > > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time. > > > > > > > > > > > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because > > > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we > > > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19 > > > > > > > > > merge window dust settles we can negotiate any needed changes etc. and > > > > > > > > > shoot for the 7.0 merge window. > > > > > > > > > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be > > > > > > > released in 3 days. 
:/ > > > > > > > > > > > > > > (Granted most of the maintainers I know are /much/ less conservative > > > > > > > than I was about the schedule) > > > > > > > > > > > > Doh - right you are... > > > > > > > > > > > > > > > > > > > > > I think that the work on famfs is setting an example, and I very much > > > > > > > > hope it will be a good example, of how improving existing infrastructure > > > > > > > > (FUSE) is a better contribution than adding another fs to the pile. > > > > > > > > > > > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a > > > > > > > prototype of a way for famfs to create BPF programs to handle > > > > > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a > > > > > > > couple of weeks, and that might be a way to clear up the > > > > > > > GET_FMAP/IOMAP_BEGIN logjam at last. > > > > > > > > > > > > I'd love to learn more about this; happy to do a call if that's a > > > > > > good way to get me briefed. > > > > > > > > > > > > I [generally but not specifically] understand how this could avoid > > > > > > GET_FMAP, but not GET_DAXDEV. > > > > > > > > > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and > > > > > > dax_iomap_fault(). The thing is that those call my begin() function > > > > > > to resolve an offset in a file to an offset on a daxdev, and then > > > > > > dax completes the fault or memcpy. In that dance, famfs never knows > > > > > > the kernel address of the memory at all (also true of xfs in fs-dax > > > > > > mode, unless that's changed fairly recently). I think that's a pretty > > > > > > decent interface all in all. 
> > > > > > > > > > > > Also: dunno whether y'all have looked at the dax patches in the famfs > > > > > > series, but the solution to working with Alistair's folio-ification > > > > > > and cleanup of the dax layer (which set me back months) was to create > > > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of > > > > > > drivers/dax/device.c, configures folios & pages compatibly with > > > > > > fs-dax. So I kinda think I need the dax_iomap* interface. > > > > > > > > > > > > As usual, if I'm overlooking something let me know... > > > > > > > > > > Hi John, > > > > > > > > > > The conversation started [1] on Darrick's containerization patchset > > > > > about using bpf to a) avoid extra requests / context switching for > > > > > ->iomap_begin and ->iomap_end calls and b) offload what would > > > > > otherwise have to be hard-coded kernel logic into userspace, which > > > > > gives userspace more flexibility / control with updating the logic and > > > > > is less of a maintenance burden for fuse. There was some musing [2] > > > > > about whether with bpf infrastructure added, it would allow famfs to > > > > > move all famfs-specific logic to userspace/bpf. > > > > > > > > > > I agree that it makes sense for famfs to go through dax iomap > > > > > interfaces. imo it seems cleanest if fuse has a generic iomap > > > > > interface with iomap dax going through that plumbing, and any > > > > > famfs-specific logic that would be needed beyond that (eg computing > > > > > the interleaved mappings) being moved to custom famfs bpf programs. I > > > > > started trying to implement this yesterday afternoon because I wanted > > > > > to make sure it would actually be doable for the famfs logic before > > > > > bringing it up and I didn't want to derail your project. 
[ ... ]

> > > There doesn't seem to be any standardization across distros for where a
> > > vmlinux.h file might be found.
Fedora puts it under > > > /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I > > > guess SUSE doesn't ship it at all? > > > > > > That's going to be a headache for deployment as I've been muttering for > > > a couple of weeks now. :( > > > > I don't think this is an issue because bpf does dynamic btf-based > > relocations (CO-RE) at load time [1]. On the target machine, when > > libbpf loads the bpf object it will read the machine's btf and patch > > any offsets in bytecode and load the fixed-up version into the kernel. > > All that's needed on the target machine for CO-RE is > > CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream > > distributions. I think this addresses the deployment headache you've > > been running into? > > Not really -- CO-RE does indeed work quite nicely to smooth over layout > changes in C structures between a BPF program and the kernel it's being > loaded into (thanks, whoever came up with that!) but the problem I have > is how you /get/ those definitions into clang in the first place. > > I was under the impression from many of the bpf examples that you're > supposed to #include a distro-provided "vmlinux.h", but there doesn't > seem to be a standard way to find that file. Most -dev packages provide The vmlinux.h: $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@ line generates the vmlinux.h file. /sys/kernel/btf/vmlinux is a kernel sysfs path and isn't distro dependent. Then CO-RE takes care of the rest with fixing any mismatches between the vmlinux on the build machine vs. the target machine. Thanks, Joanne > a pkgconfig file that give you the appropriate CFLAGS/LDFLAGS to add, > but apparently this is not the case for BPF...? > > Perhaps it's the case that distro packages that are building BPF > programs simply add a build dependency on the package providing > vmlinux.h (e.g. Build-Depends: linux-bpf-dev on Debian) and patch in > "CFLAGS=-I/some/path" as needed? 
> > I suppose for a dynamically generated and compiled BPF program, one > could just "bpftool skel" the /sys/kernel/btf files, capture the output, > and "#include </dev/fd/XXX>" the results. Honestly that sounds better > than trusting some weird system package. > > But maybe dynamic compilation is a totally stupid idea. I did grow up > in the era of mshtml email wreaking havoc, after all... > > --D > > > Thanks, > > Joanne > > > > [1] https://docs.ebpf.io/concepts/core/ > > > > > > > > Maybe we could reduce the fuse-iomap bpf definitions to use only > > > cardinal types and the types that iomap itself defines. That might not > > > be too hard right now because bpf functions reuse structures from > > > include/uapi/fuse.h, which currently use uint{8,16,32,64}_t. It'll get > > > harder if that __uintXX_t -> __uXX transition actually happens. > > > > > > But getting back to the famfs bpf stuff, I think doing the interleaved > > > mappings via BPF gives the famfs server a lot more flexibility in terms > > > of what it can do when future hardware arrives with even weirder > > > configurations. > > > > > > --D > > > > > > > didn't change much, I just moved around your famfs code to the bpf > > > > side. The kernel side changes are in [3] and the libfuse changes are > > > > in [4]. > > > > > > > > For testing out the prototype, I hooked it up to passthrough_hp to > > > > test running the bpf program and verify that it is able to find the > > > > extent from the bpf map. In my opinion, this makes the fuse side > > > > infrastructure cleaner and more extendable for other servers that will > > > > want to go through dax iomap in the future, but I think this also has > > > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP > > > > request after a file is opened, the server can directly populate the > > > > metadata map from userspace with the mapping info when it processes > > > > the FUSE_OPEN request, which gets rid of the roundtrip cost. 
The > > > > server can dynamically update the metadata at any time from userspace > > > > if the mapping info needs to change in the future. For setting up the > > > > daxdevs, I moved your logic to the init side, where the server passes > > > > the daxdev info upfront through an IOMAP_CONFIG exchange with the > > > > kernel initializing the daxdevs based off that info. I think this will > > > > also make deploying future updates for famfs easier, as updating the > > > > logic won't need to go through the upstream kernel mailing list > > > > process and deploying updates won't require a new kernel release. > > > > > > > > These are just my two cents based on my (cursory) understanding of > > > > famfs. Just wanted to float this alternative approach in case it's > > > > useful. > > > > > > > > Thanks, > > > > Joanne > > > > > > > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd > > > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa > > > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/ > > > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/ > > > > > > > > > > > > > > Thanks, > > > > > Joanne > > > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d > > > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u > > > > > > > > > > > > > > > > > Regards, > > > > > > John > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06 20:48 ` John Groves
  2026-02-07  0:22 ` Joanne Koong
@ 2026-02-20 23:59 ` Darrick J. Wong
  1 sibling, 0 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-20 23:59 UTC (permalink / raw)
To: John Groves
Cc: Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert,
	Luis Henriques, Horst Birthelmer

On Fri, Feb 06, 2026 at 02:48:43PM -0600, John Groves wrote:
> On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > >
> > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > >
> > > > [ ... ]
> > > >
> > > > > > - famfs: export distributed memory
> > > > >
> > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > >
> > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > shoot for the 7.0 merge window.
> >
> > I think we've all missed getting merged for 7.0 since 6.19 will be
> > released in 3 days. :/
> >
> > (Granted most of the maintainers I know are /much/ less conservative
> > than I was about the schedule)
>
> Doh - right you are...
>
> > > I think that the work on famfs is setting an example, and I very much
> > > hope it will be a good example, of how improving existing infrastructure
> > > (FUSE) is a better contribution than adding another fs to the pile.
> >
> > Yeah. Joanne and I spent a couple of days this week coprogramming a
> > prototype of a way for famfs to create BPF programs to handle
> > INTERLEAVED_EXTENT files. We might be ready to show that off in a
> > couple of weeks, and that might be a way to clear up the
> > GET_FMAP/IOMAP_BEGIN logjam at last.
>
> I'd love to learn more about this; happy to do a call if that's a
> good way to get me briefed.
>
> I [generally but not specifically] understand how this could avoid
> GET_FMAP, but not GET_DAXDEV.

fuse-iomap requires fuse servers to open block devices and register
them with the fuse_conn as a backing file. The kernel returns a magic
cookie that can then be passed back to the kernel in iomap_begin.
This is (AFAICT) similar to what fuse does w.r.t. passthrough files.

IIRC, GET_DAXDEV is an on-demand fuse request, which is quite different
from the fuse-iomap model where bdevs have to be registered before you
can use them.

> But I'm not sure it could (or should) avoid dax_iomap_rw() and
> dax_iomap_fault(). The thing is that those call my begin() function
> to resolve an offset in a file to an offset on a daxdev, and then
> dax completes the fault or memcpy. In that dance, famfs never knows
> the kernel address of the memory at all (also true of xfs in fs-dax
> mode, unless that's changed fairly recently). I think that's a pretty
> decent interface all in all.

Right. dax_iomap_{rw,fault} call the ->iomap_begin they're given,
which can be fuse_iomap_begin, which will either (a) look in the iext
cache, (b) see if the fuse server supplied a bpf program, or (c)
upcall the fuse server.

I also took another look at my broken fuse-iomap-dax patch and
realized that in addition to corrupting data somewhere, there's also a
gigantic XXX around dax_writeback_mapping_range because it takes a
bdev instead of asking the filesystem for mappings, which means that
it's broken for any fsdax file that stores data on more than one
device.

> Also: dunno whether y'all have looked at the dax patches in the famfs
> series, but the solution to working with Alistair's folio-ification
> and cleanup of the dax layer (which set me back months) was to create
> drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> drivers/dax/device.c, configures folios & pages compatibly with
> fs-dax. So I kinda think I need the dax_iomap* interface.

Oh that's good news!

--D

> As usual, if I'm overlooking something let me know...
>
> Regards,
> John

^ permalink raw reply [flat|nested] 79+ messages in thread
Thread overview: 79+ messages
[not found] <aYIsRc03fGhQ7vbS@groves.net>
2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
2026-02-02 16:14 ` Amir Goldstein
2026-02-03 7:55 ` Miklos Szeredi
2026-02-03 9:19 ` [Lsf-pc] " Jan Kara
2026-02-03 10:31 ` Amir Goldstein
2026-02-04 9:22 ` Joanne Koong
2026-02-04 10:37 ` Amir Goldstein
2026-02-04 10:43 ` [Lsf-pc] " Jan Kara
2026-02-06 6:09 ` Darrick J. Wong
2026-02-21 6:07 ` Demi Marie Obenour
2026-02-21 7:07 ` Darrick J. Wong
2026-02-21 22:16 ` Demi Marie Obenour
2026-02-23 21:58 ` Darrick J. Wong
2026-02-04 20:47 ` Bernd Schubert
2026-02-06 6:26 ` Darrick J. Wong
2026-02-03 10:15 ` Luis Henriques
2026-02-03 10:20 ` Amir Goldstein
2026-02-03 10:38 ` Luis Henriques
2026-02-03 14:20 ` Christian Brauner
2026-02-03 10:36 ` Amir Goldstein
2026-02-03 17:13 ` John Groves
2026-02-04 19:06 ` Darrick J. Wong
2026-02-04 19:38 ` Horst Birthelmer
2026-02-04 20:58 ` Bernd Schubert
2026-02-06 5:47 ` Darrick J. Wong
2026-02-04 22:50 ` Gao Xiang
2026-02-06 5:38 ` Darrick J. Wong
2026-02-06 6:15 ` Gao Xiang
2026-02-21 0:47 ` Darrick J. Wong
2026-03-17 4:17 ` Gao Xiang
2026-03-18 21:51 ` Darrick J. Wong
2026-03-19 8:05 ` Gao Xiang
2026-03-22 3:25 ` Demi Marie Obenour
2026-03-22 3:52 ` Gao Xiang
2026-03-22 4:51 ` Gao Xiang
2026-03-22 5:13 ` Demi Marie Obenour
2026-03-22 5:30 ` Gao Xiang
2026-03-23 9:54 ` [Lsf-pc] " Jan Kara
2026-03-23 10:19 ` Gao Xiang
2026-03-23 11:14 ` Jan Kara
2026-03-23 11:42 ` Gao Xiang
2026-03-23 12:01 ` Gao Xiang
2026-03-23 14:13 ` Jan Kara
2026-03-23 14:36 ` Gao Xiang
2026-03-23 14:47 ` Jan Kara
2026-03-23 14:57 ` Gao Xiang
2026-03-24 8:48 ` Christian Brauner
2026-03-24 9:30 ` Gao Xiang
2026-03-24 9:49 ` Demi Marie Obenour
2026-03-24 9:53 ` Gao Xiang
2026-03-24 10:02 ` Demi Marie Obenour
2026-03-24 10:14 ` Gao Xiang
2026-03-24 10:17 ` Demi Marie Obenour
2026-03-24 10:25 ` Gao Xiang
2026-03-24 11:58 ` Demi Marie Obenour
2026-03-24 12:21 ` Gao Xiang
2026-03-26 14:39 ` Christian Brauner
2026-03-23 12:08 ` Demi Marie Obenour
2026-03-23 12:13 ` Gao Xiang
2026-03-23 12:19 ` Demi Marie Obenour
2026-03-23 12:30 ` Gao Xiang
2026-03-23 12:33 ` Gao Xiang
2026-03-22 5:14 ` Gao Xiang
2026-03-23 9:43 ` [Lsf-pc] " Jan Kara
2026-03-23 10:05 ` Gao Xiang
2026-03-23 10:14 ` Jan Kara
2026-03-23 10:30 ` Gao Xiang
2026-02-04 23:19 ` Gao Xiang
2026-02-05 3:33 ` John Groves
2026-02-05 9:27 ` Amir Goldstein
2026-02-06 5:52 ` Darrick J. Wong
2026-02-06 20:48 ` John Groves
2026-02-07 0:22 ` Joanne Koong
2026-02-12 4:46 ` Joanne Koong
2026-02-21 0:37 ` Darrick J. Wong
2026-02-26 20:21 ` Joanne Koong
2026-03-03 4:57 ` Darrick J. Wong
2026-03-03 17:28 ` Joanne Koong
2026-02-20 23:59 ` Darrick J. Wong