public inbox for linux-fsdevel@vger.kernel.org
* [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
@ 2026-02-02 13:51 ` Miklos Szeredi
  2026-02-02 16:14   ` Amir Goldstein
                     ` (3 more replies)
  0 siblings, 4 replies; 79+ messages in thread
From: Miklos Szeredi @ 2026-02-02 13:51 UTC (permalink / raw)
  To: f-pc, linux-fsdevel
  Cc: Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer

I propose a session where various topics of interest could be
discussed, including but not limited to the list below:

New features being proposed at various stages of readiness:

 - fuse4fs: exporting the iomap interface to userspace

 - famfs: export distributed memory

 - zero copy for fuse-io-uring

 - large folios

 - file handles on the userspace API

 - compound requests

 - BPF scripts

How do these fit into the existing codebase?

Cleaner separation of layers:

 - transport layer: /dev/fuse, io-uring, virtiofs

 - filesystem layer: local fs, distributed fs

Introduce new version of cleaned up API?

 - remove async INIT

 - no fixed ROOT_ID

 - consolidate caching rules

 - who's responsible for updating which metadata?

 - remove legacy and problematic flags

 - get rid of splice on /dev/fuse for new API version?

Unresolved issues:

 - locked / writeback folios vs. reclaim / page migration

 - strictlimiting vs. large folios

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
@ 2026-02-02 16:14   ` Amir Goldstein
  2026-02-03  7:55     ` Miklos Szeredi
  2026-02-03 10:15     ` Luis Henriques
  2026-02-03 10:36   ` Amir Goldstein
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 79+ messages in thread
From: Amir Goldstein @ 2026-02-02 16:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

[Fixed lsf-pc address typo]

On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> I propose a session where various topics of interest could be
> discussed including but not limited to the below list
>
> New features being proposed at various stages of readiness:
>
>  - fuse4fs: exporting the iomap interface to userspace
>
>  - famfs: export distributed memory
>
>  - zero copy for fuse-io-uring
>
>  - large folios
>
>  - file handles on the userspace API
>
>  - compound requests
>
>  - BPF scripts
>
> How do these fit into the existing codebase?
>
> Cleaner separation of layers:
>
>  - transport layer: /dev/fuse, io-uring, virtiofs
>
>  - filesystem layer: local fs, distributed fs
>
> Introduce new version of cleaned up API?
>
>  - remove async INIT
>
>  - no fixed ROOT_ID
>
>  - consolidate caching rules
>
>  - who's responsible for updating which metadata?
>
>  - remove legacy and problematic flags
>
>  - get rid of splice on /dev/fuse for new API version?
>
> Unresolved issues:
>
>  - locked / writeback folios vs. reclaim / page migration
>
>  - strictlimiting vs. large folios

All important topics, which I am sure will be discussed at a FUSE BoF.

I think that at least one question of interest to the wider fs audience is

Can any of the above improvements be used to help phase out some
of the old under-maintained fs and reduce the burden on vfs maintainers?

Thanks,
Amir.


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 16:14   ` Amir Goldstein
@ 2026-02-03  7:55     ` Miklos Szeredi
  2026-02-03  9:19       ` [Lsf-pc] " Jan Kara
  2026-02-04  9:22       ` Joanne Koong
  2026-02-03 10:15     ` Luis Henriques
  1 sibling, 2 replies; 79+ messages in thread
From: Miklos Szeredi @ 2026-02-03  7:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:

> All important topics, which I am sure will be discussed at a FUSE BoF.

I see your point.  Maybe the BPF one could be useful as a cross-track
discussion, though I'm not sure the fuse side of the design is mature
enough for that.  Joanne, you did some experiments with that, no?

> I think that at least one question of interest to the wider fs audience is
>
> Can any of the above improvements be used to help phase out some
> of the old under-maintained fs and reduce the burden on vfs maintainers?

I think the major show stopper is that nobody is going to put a major
effort into porting unmaintained kernel filesystems to a different
framework.

Alternatively someone could implement a "VFS emulator" library.  But
keeping that in sync with the kernel, together with all the old fs
would be an even greater burden...

Thanks,
Miklos


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03  7:55     ` Miklos Szeredi
@ 2026-02-03  9:19       ` Jan Kara
  2026-02-03 10:31         ` Amir Goldstein
  2026-02-04  9:22       ` Joanne Koong
  1 sibling, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-02-03  9:19 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, Joanne Koong, Darrick J . Wong,
	John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer,
	lsf-pc

On Tue 03-02-26 08:55:26, Miklos Szeredi via Lsf-pc wrote:
> On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
> > I think that at least one question of interest to the wider fs audience is
> >
> > Can any of the above improvements be used to help phase out some
> > of the old under-maintained fs and reduce the burden on vfs maintainers?
> 
> I think the major show stopper is that nobody is going to put a major
> effort into porting unmaintained kernel filesystems to a different
> framework.

There's some interest from people doing vfs maintenance work (as it has
potential to save them work) and it is actually a reasonable task for
someone wanting to get acquainted with filesystem development work. So I
think there are chances of some progress. For example there was some
interest in doing this for minix. Of course we'll be sure only when it
happens :)

> Alternatively someone could implement a "VFS emulator" library.  But
> keeping that in sync with the kernel, together with all the old fs
> would be an even greater burden...

A full VFS emulator would be too much I think. Maybe some helper library
to ease some tasks would be useful, but I think the time for coming up
with libraries is when someone commits to actually doing some conversion.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 16:14   ` Amir Goldstein
  2026-02-03  7:55     ` Miklos Szeredi
@ 2026-02-03 10:15     ` Luis Henriques
  2026-02-03 10:20       ` Amir Goldstein
  1 sibling, 1 reply; 79+ messages in thread
From: Luis Henriques @ 2026-02-03 10:15 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong,
	John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc

On Mon, Feb 02 2026, Amir Goldstein wrote:

> [Fixed lsf-pc address typo]
>
> On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> I propose a session where various topics of interest could be
>> discussed, including but not limited to the list below:
>>
>> New features being proposed at various stages of readiness:
>>
>>  - fuse4fs: exporting the iomap interface to userspace
>>
>>  - famfs: export distributed memory
>>
>>  - zero copy for fuse-io-uring
>>
>>  - large folios
>>
>>  - file handles on the userspace API
>>
>>  - compound requests
>>
>>  - BPF scripts
>>
>> How do these fit into the existing codebase?
>>
>> Cleaner separation of layers:
>>
>>  - transport layer: /dev/fuse, io-uring, virtiofs
>>
>>  - filesystem layer: local fs, distributed fs
>>
>> Introduce new version of cleaned up API?
>>
>>  - remove async INIT
>>
>>  - no fixed ROOT_ID
>>
>>  - consolidate caching rules
>>
>>  - who's responsible for updating which metadata?
>>
>>  - remove legacy and problematic flags
>>
>>  - get rid of splice on /dev/fuse for new API version?
>>
>> Unresolved issues:
>>
>>  - locked / writeback folios vs. reclaim / page migration
>>
>>  - strictlimiting vs. large folios
>
> All important topics, which I am sure will be discussed at a FUSE BoF.

I wonder if the topic I proposed separately (on restarting FUSE servers)
should also be merged into this list.  It's already a very comprehensive
list, so I'm not sure it's worth having a separate topic if most of it
will (likely) be touched here already.

What do you think?

Cheers,
-- 
Luís

> I think that at least one question of interest to the wider fs audience is
>
> Can any of the above improvements be used to help phase out some
> of the old under-maintained fs and reduce the burden on vfs maintainers?
>
> Thanks,
> Amir.



* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03 10:15     ` Luis Henriques
@ 2026-02-03 10:20       ` Amir Goldstein
  2026-02-03 10:38         ` Luis Henriques
  2026-02-03 14:20         ` Christian Brauner
  0 siblings, 2 replies; 79+ messages in thread
From: Amir Goldstein @ 2026-02-03 10:20 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong,
	John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc

On Tue, Feb 3, 2026 at 11:15 AM Luis Henriques <luis@igalia.com> wrote:
>
> On Mon, Feb 02 2026, Amir Goldstein wrote:
>
> > [Fixed lsf-pc address typo]
> >
> > On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>
> >> I propose a session where various topics of interest could be
> >> discussed, including but not limited to the list below:
> >>
> >> New features being proposed at various stages of readiness:
> >>
> >>  - fuse4fs: exporting the iomap interface to userspace
> >>
> >>  - famfs: export distributed memory
> >>
> >>  - zero copy for fuse-io-uring
> >>
> >>  - large folios
> >>
> >>  - file handles on the userspace API
> >>
> >>  - compound requests
> >>
> >>  - BPF scripts
> >>
> >> How do these fit into the existing codebase?
> >>
> >> Cleaner separation of layers:
> >>
> >>  - transport layer: /dev/fuse, io-uring, virtiofs
> >>
> >>  - filesystem layer: local fs, distributed fs
> >>
> >> Introduce new version of cleaned up API?
> >>
> >>  - remove async INIT
> >>
> >>  - no fixed ROOT_ID
> >>
> >>  - consolidate caching rules
> >>
> >>  - who's responsible for updating which metadata?
> >>
> >>  - remove legacy and problematic flags
> >>
> >>  - get rid of splice on /dev/fuse for new API version?
> >>
> >> Unresolved issues:
> >>
> >>  - locked / writeback folios vs. reclaim / page migration
> >>
> >>  - strictlimiting vs. large folios
> >
> > All important topics, which I am sure will be discussed at a FUSE BoF.
>
> I wonder if the topic I proposed separately (on restarting FUSE servers)
> should also be merged into this list.  It's already a very comprehensive
> list, so I'm not sure it's worth having a separate topic if most of it
> will (likely) be touched here already.
>
> What do you think?

We are likely going to do a FUSE BoF, probably Wed afternoon,
so we can have an internal schedule for that.

Restartability and stable FUSE handles are among the requirements
for replacing an existing fs if that fs is NFS exportable.

Thanks,
Amir.


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03  9:19       ` [Lsf-pc] " Jan Kara
@ 2026-02-03 10:31         ` Amir Goldstein
  0 siblings, 0 replies; 79+ messages in thread
From: Amir Goldstein @ 2026-02-03 10:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong,
	John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer,
	lsf-pc

On Tue, Feb 3, 2026 at 10:19 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 03-02-26 08:55:26, Miklos Szeredi via Lsf-pc wrote:
> > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
> > > I think that at least one question of interest to the wider fs audience is
> > >
> > > Can any of the above improvements be used to help phase out some
> > > of the old under-maintained fs and reduce the burden on vfs maintainers?
> >
> > I think the major show stopper is that nobody is going to put a major
> > effort into porting unmaintained kernel filesystems to a different
> > framework.
>
> There's some interest from people doing vfs maintenance work (as it has
> potential to save them work) and it is actually a reasonable task for
> someone wanting to get acquainted with filesystem development work. So I
> think there are chances of some progress. For example there was some
> interest in doing this for minix. Of course we'll be sure only when it
> happens :)
>
> > Alternatively someone could implement a "VFS emulator" library.  But
> > keeping that in sync with the kernel, together with all the old fs
> > would be an even greater burden...
>
> A full VFS emulator would be too much I think. Maybe some helper library
> to ease some tasks would be useful, but I think the time for coming up
> with libraries is when someone commits to actually doing some conversion.
>

I think the concept of a VFS emulator is the wrong one to apply here.
A VFS emulator would be needed for running the latest, up-to-date fs
driver.

If we want to fork a kernel driver at a point in time and make it into
a FUSE server, we need a one-time conversion from the kernel/vfs API
to the userspace/lowlevel FUSE API.
LLMs are very good at doing this sort of mechanical conversion, and
after the first few fs have been converted by developers, LLMs would
learn how to do it better for the next fs.

The main challenges I see are verification and package maintenance.
The conversion needs to be tested, so there needs to be a decent
test suite.
If an fs already has a progs/utils package, it would be natural if FUSE
server code could be added to this package, but those packages are not
always maintained.

For a start, we can map out the most likely candidates that have decent
test suites and a fairly well-maintained utils package.

Thanks,
Amir.


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
  2026-02-02 16:14   ` Amir Goldstein
@ 2026-02-03 10:36   ` Amir Goldstein
  2026-02-03 17:13   ` John Groves
  2026-02-04 19:06   ` Darrick J. Wong
  3 siblings, 0 replies; 79+ messages in thread
From: Amir Goldstein @ 2026-02-03 10:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: f-pc, linux-fsdevel, Joanne Koong, Darrick J . Wong, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> I propose a session where various topics of interest could be
> discussed, including but not limited to the list below:
>
> New features being proposed at various stages of readiness:
>
>  - fuse4fs: exporting the iomap interface to userspace
>
>  - famfs: export distributed memory
>
>  - zero copy for fuse-io-uring
>
>  - large folios
>
>  - file handles on the userspace API
>
>  - compound requests
>
>  - BPF scripts
>
> How do these fit into the existing codebase?
>
> Cleaner separation of layers:
>
>  - transport layer: /dev/fuse, io-uring, virtiofs
>
>  - filesystem layer: local fs, distributed fs
>
> Introduce new version of cleaned up API?
>
>  - remove async INIT
>
>  - no fixed ROOT_ID
>
>  - consolidate caching rules
>
>  - who's responsible for updating which metadata?
>
>  - remove legacy and problematic flags
>

- Let the server explicitly declare NFS export support

We could couple that with LOOKUP_HANDLE, because I think there are very few
servers out there that truly provide stable NFS handles with the current
FUSE protocol.

Thanks,
Amir.


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03 10:20       ` Amir Goldstein
@ 2026-02-03 10:38         ` Luis Henriques
  2026-02-03 14:20         ` Christian Brauner
  1 sibling, 0 replies; 79+ messages in thread
From: Luis Henriques @ 2026-02-03 10:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, Darrick J . Wong,
	John Groves, Bernd Schubert, Horst Birthelmer, lsf-pc

On Tue, Feb 03 2026, Amir Goldstein wrote:

> On Tue, Feb 3, 2026 at 11:15 AM Luis Henriques <luis@igalia.com> wrote:
>>
>> On Mon, Feb 02 2026, Amir Goldstein wrote:
>>
>> > [Fixed lsf-pc address typo]
>> >
>> > On Mon, Feb 2, 2026 at 2:51 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> >>
>> >> I propose a session where various topics of interest could be
>> >> discussed, including but not limited to the list below:
>> >>
>> >> New features being proposed at various stages of readiness:
>> >>
>> >>  - fuse4fs: exporting the iomap interface to userspace
>> >>
>> >>  - famfs: export distributed memory
>> >>
>> >>  - zero copy for fuse-io-uring
>> >>
>> >>  - large folios
>> >>
>> >>  - file handles on the userspace API
>> >>
>> >>  - compound requests
>> >>
>> >>  - BPF scripts
>> >>
>> >> How do these fit into the existing codebase?
>> >>
>> >> Cleaner separation of layers:
>> >>
>> >>  - transport layer: /dev/fuse, io-uring, virtiofs
>> >>
>> >>  - filesystem layer: local fs, distributed fs
>> >>
>> >> Introduce new version of cleaned up API?
>> >>
>> >>  - remove async INIT
>> >>
>> >>  - no fixed ROOT_ID
>> >>
>> >>  - consolidate caching rules
>> >>
>> >>  - who's responsible for updating which metadata?
>> >>
>> >>  - remove legacy and problematic flags
>> >>
>> >>  - get rid of splice on /dev/fuse for new API version?
>> >>
>> >> Unresolved issues:
>> >>
>> >>  - locked / writeback folios vs. reclaim / page migration
>> >>
>> >>  - strictlimiting vs. large folios
>> >
>> > All important topics, which I am sure will be discussed at a FUSE BoF.
>>
>> I wonder if the topic I proposed separately (on restarting FUSE servers)
>> should also be merged into this list.  It's already a very comprehensive
>> list, so I'm not sure it's worth having a separate topic if most of it
>> will (likely) be touched here already.
>>
>> What do you think?
>
> We are likely going to do a FUSE BoF, probably Wed afternoon,
> so we can have an internal schedule for that.
>
> Restartability and stable FUSE handles are among the requirements
> for replacing an existing fs if that fs is NFS exportable.

Great, thanks Amir.  So I'll just assume these topics will be pushed into
the BoF.  It looks like it will be a very interesting afternoon! ;-)

Cheers,
-- 
Luís


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03 10:20       ` Amir Goldstein
  2026-02-03 10:38         ` Luis Henriques
@ 2026-02-03 14:20         ` Christian Brauner
  1 sibling, 0 replies; 79+ messages in thread
From: Christian Brauner @ 2026-02-03 14:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Luis Henriques, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	Darrick J . Wong, John Groves, Bernd Schubert, Horst Birthelmer,
	lsf-pc

> Restartability and stable FUSE handles are among the requirements

I'd be interested in this.


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
  2026-02-02 16:14   ` Amir Goldstein
  2026-02-03 10:36   ` Amir Goldstein
@ 2026-02-03 17:13   ` John Groves
  2026-02-04 19:06   ` Darrick J. Wong
  3 siblings, 0 replies; 79+ messages in thread
From: John Groves @ 2026-02-03 17:13 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Joanne Koong, Darrick J . Wong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer

On 26/02/02 02:51PM, Miklos Szeredi wrote:
> I propose a session where various topics of interest could be
> discussed, including but not limited to the list below:
> 
> New features being proposed at various stages of readiness:
> 
>  - fuse4fs: exporting the iomap interface to userspace
> 
>  - famfs: export distributed memory

I plan to attend, and have been on the fence about whether a proper famfs
session is needed. I'm open to ideas on that, but would certainly
participate in this sort of overview session too.

JG

> 
>  - zero copy for fuse-io-uring
> 
>  - large folios
> 
>  - file handles on the userspace API
> 
>  - compound requests
> 
>  - BPF scripts
> 
> How do these fit into the existing codebase?
> 
> Cleaner separation of layers:
> 
>  - transport layer: /dev/fuse, io-uring, virtiofs
> 
>  - filesystem layer: local fs, distributed fs
> 
> Introduce new version of cleaned up API?
> 
>  - remove async INIT
> 
>  - no fixed ROOT_ID
> 
>  - consolidate caching rules
> 
>  - who's responsible for updating which metadata?
> 
>  - remove legacy and problematic flags
> 
>  - get rid of splice on /dev/fuse for new API version?
> 
> Unresolved issues:
> 
>  - locked / writeback folios vs. reclaim / page migration
> 
>  - strictlimiting vs. large folios


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-03  7:55     ` Miklos Szeredi
  2026-02-03  9:19       ` [Lsf-pc] " Jan Kara
@ 2026-02-04  9:22       ` Joanne Koong
  2026-02-04 10:37         ` Amir Goldstein
                           ` (3 more replies)
  1 sibling, 4 replies; 79+ messages in thread
From: Joanne Koong @ 2026-02-04  9:22 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, Darrick J . Wong, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > All important topics, which I am sure will be discussed at a FUSE BoF.

Two other items I'd like to add to the potential discussion list are:

* leveraging io-uring multishot for batching fuse writeback and
readahead requests, ie maximizing the throughput per roundtrip context
switch [1]

* settling how load distribution should be done for configurable
queues. We came to a bit of a standstill on Bernd's patchset [2] and
it would be great to finally get this resolved and the feature landed.
imo configurable queues and incremental buffer consumption are the two
main features needed to make fuse-over-io-uring more feasible on
large-scale systems.

>
> I see your point.  Maybe the BPF one could be useful as a cross-track
> discussion, though I'm not sure the fuse side of the design is mature
> enough for that.  Joanne, you did some experiments with that, no?

The discussion on this was started in response [3] to Darrick's iomap
containerization patchset. I have a prototype based on [4] I can get
into reviewable shape this month or next, if there's interest in
getting something concrete before May. I did a quick check with the
bpf team a few days ago and confirmed with them that struct ops is the
way to go for adding the hook point for fuse. For attaching the bpf
progs to the fuse connection, going through the bpf link interface is
the modern/preferred way of doing this. Discussion-wise, imo what
would be most useful to discuss in May on the fuse side is which
other interception points we think would be most useful in fuse, and
what the API interfaces we expose for those should look like (eg
should these just take the in/out request structs already defined in
the uapi, or expose more state information?). imo, we should take an
incremental approach and add interception points conservatively
rather than liberally, on a per-need basis as use cases actually come
up.
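
To make the struct ops direction concrete, a pseudocode sketch in
libbpf style. Everything fuse-specific in it is invented for
illustration: fuse_bpf_ops, fuse_bpf_lookup_ctx and the "fuse_lookup"
hook name do not exist in the kernel today; only the SEC() conventions
and the .struct_ops.link attach mode are real libbpf mechanisms:

```c
/* PSEUDOCODE sketch of a struct_ops-based fuse hook.  The fuse_bpf_ops
 * type, its context struct and the hook name are hypothetical. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Imagined per-request context; a real API might instead expose the
 * in/out request structs from the fuse uapi, per the question above. */
struct fuse_bpf_lookup_ctx {
	__u64 parent_nodeid;
	const char *name;
};

SEC("struct_ops/fuse_lookup")
int fuse_lookup(struct fuse_bpf_lookup_ctx *ctx)
{
	/* e.g. answer cached negative lookups in the kernel and only
	 * fall through to the userspace server otherwise. */
	return 0;	/* 0 = pass the request on to the server */
}

SEC(".struct_ops.link")	/* attached to the connection via a bpf link */
struct fuse_bpf_ops ops = {
	.lookup = (void *)fuse_lookup,
};
```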

>
> > I think that at least one question of interest to the wider fs audience is
> >
> > Can any of the above improvements be used to help phase out some
> > of the old under-maintained fs and reduce the burden on vfs maintainers?

I think it might be helpful to know ahead of time where the main
hesitation lies. Is it performance? Maybe it'd be helpful if before
May there was a prototype converting a simpler filesystem (Darrick and
I were musing about fat maybe being a good one) to get a sense of
what the delta is between the native kernel implementation and a
fuse-based version? In the past year fuse added a lot of new
capabilities that improved performance by quite a bit so I'm curious
to see where the delta now lies. Or maybe the hesitation is something
else entirely, in which case that's probably a conversation better
left for May.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z3mTdZdfe5rTukKOnU0y5dpM8aFTCqbctBWsa-S301TQ@mail.gmail.com/

[2] https://lore.kernel.org/linux-fsdevel/20251013-reduced-nr-ring-queues_3-v3-4-6d87c8aa31ae@ddn.com/t/#u

[3] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z05QZmos90qmWtnWGF+Kb7rVziJ51UpuJ0O=A+6N1vrg@mail.gmail.com/t/#u

[4] https://lore.kernel.org/linux-fsdevel/176169810144.1424854.11439355400009006946.stgit@frogsfrogsfrogs/T/#m4998d92f6210d50d0bf6760490689c029bda9231

>
> I think the major show stopper is that nobody is going to put a major
> effort into porting unmaintained kernel filesystems to a different
> framework.
>
> Alternatively someone could implement a "VFS emulator" library.  But
> keeping that in sync with the kernel, together with all the old fs
> would be an even greater burden...
>
> Thanks,
> Miklos


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04  9:22       ` Joanne Koong
@ 2026-02-04 10:37         ` Amir Goldstein
  2026-02-04 10:43         ` [Lsf-pc] " Jan Kara
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Amir Goldstein @ 2026-02-04 10:37 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, Darrick J . Wong, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

On Wed, Feb 4, 2026 at 10:22 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > All important topics, which I am sure will be discussed at a FUSE BoF.
>
> Two other items I'd like to add to the potential discussion list are:
>
> * leveraging io-uring multishot for batching fuse writeback and
> readahead requests, ie maximizing the throughput per roundtrip context
> switch [1]
>
> * settling how load distribution should be done for configurable
> queues. We came to a bit of a standstill on Bernd's patchset [2] and
> it would be great to finally get this resolved and the feature landed.
> imo configurable queues and incremental buffer consumption are the two
> main features needed to make fuse-over-io-uring more feasible on
> large-scale systems.
>
> >
> > I see your point.  Maybe the BPF one could be useful as a cross-track
> > discussion, though I'm not sure the fuse side of the design is mature
> > enough for that.  Joanne, you did some experiments with that, no?
>
> The discussion on this was started in response [3] to Darrick's iomap
> containerization patchset. I have a prototype based on [4] I can get
> into reviewable shape this month or next, if there's interest in
> getting something concrete before May. I did a quick check with the
> bpf team a few days ago and confirmed with them that struct ops is the
> way to go for adding the hook point for fuse. For attaching the bpf
> progs to the fuse connection, going through the bpf link interface is
> the modern/preferred way of doing this. Discussion-wise, imo what
> would be most useful to discuss in May on the fuse side is which
> other interception points we think would be most useful in fuse, and
> what the API interfaces we expose for those should look like (eg
> should these just take the in/out request structs already defined in
> the uapi, or expose more state information?). imo, we should take an
> incremental approach and add interception points conservatively
> rather than liberally, on a per-need basis as use cases actually come
> up.
>
> >
> > > I think that at least one question of interest to the wider fs audience is
> > >
> > > Can any of the above improvements be used to help phase out some
> > > of the old under-maintained fs and reduce the burden on vfs maintainers?
>
> I think it might be helpful to know ahead of time where the main
> hesitation lies. Is it performance?

I think that for phasing out unmaintained filesystems, performance
is really the last concern, if a concern at all (call it a nudge).

> Maybe it'd be helpful if before
> May there was a prototype converting a simpler filesystem (Darrick and
> I were musing about fat maybe being a good one) to get a sense of
> what the delta is between the native kernel implementation and a
> fuse-based version? In the past year fuse added a lot of new
> capabilities that improved performance by quite a bit so I'm curious
> to see where the delta now lies.

Yeah, this is a fun exercise. fat could be a good candidate;
I'd do it myself if I find the time.
If anyone starts doing that, maybe post a message here or in the FUSE
thread so we can avoid working on the same fs.

> Or maybe the hesitation is something
> else entirely, in which case that's probably a conversation better
> left for May.
>

Besides testing and maintenance, which I already mentioned, and
functionality (e.g. nfs export), there could be other concerns:
fuse has some unique behaviors, but maybe those could be
fixed for the sake of this sort of project.

I guess we will know better once we start experimenting and
let actual users try the conversion.
Finding those users could be a challenge in itself.

Thanks,
Amir.


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04  9:22       ` Joanne Koong
  2026-02-04 10:37         ` Amir Goldstein
@ 2026-02-04 10:43         ` Jan Kara
  2026-02-06  6:09           ` Darrick J. Wong
  2026-02-04 20:47         ` Bernd Schubert
  2026-02-06  6:26         ` Darrick J. Wong
  3 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-02-04 10:43 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Darrick J . Wong,
	John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer,
	lsf-pc

On Wed 04-02-26 01:22:02, Joanne Koong wrote:
> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > I think that at least one question of interest to the wider fs audience is
> > >
> > > Can any of the above improvements be used to help phase out some
> > > of the old under-maintained fs and reduce the burden on vfs maintainers?
> 
> I think it might be helpful to know ahead of time where the main
> hesitation lies. Is it performance? Maybe it'd be helpful if before
> May there was a prototype converting a simpler filesystem (Darrick and
> I were musing about fat maybe being a good one) and getting a sense of
> what the delta is between the native kernel implementation and a
> fuse-based version? In the past year fuse added a lot of new
> capabilities that improved performance by quite a bit so I'm curious
> to see where the delta now lies. Or maybe the hesitation is something
> else entirely, in which case that's probably a conversation better
> left for May.

I'm not sure which filesystems Amir had exactly in mind but in my opinion
FAT is used widely enough to not be a primary target of this effort. It
would be rather filesystems like (random selection) bfs, adfs, vboxfs,
minix, efs, freevxfs, etc. The user base of these is very small, testing is
minimal if possible at all, and thus the value of keeping these in the
kernel vs the effort they add to infrastructure changes (like folio
conversions, iomap conversion, ...) is not very favorable.

For these the biggest problem IMO is actually finding someone willing to
invest into doing (and testing) the conversion. I don't think there are
severe technical obstacles for most of them.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
                     ` (2 preceding siblings ...)
  2026-02-03 17:13   ` John Groves
@ 2026-02-04 19:06   ` Darrick J. Wong
  2026-02-04 19:38     ` Horst Birthelmer
                       ` (4 more replies)
  3 siblings, 5 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-04 19:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer

On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
> I propose a session where various topics of interest could be
> discussed including but not limited to the below list
> 
> New features being proposed at various stages of readiness:
> 
>  - fuse4fs: exporting the iomap interface to userspace

FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy
working on xfs_healer, but I was planning to repost the patchbomb with
many many cleanups and reorganizations (thanks Joanne!) as soon as
possible after Linus tags 7.0-rc1.

I don't think LSFMM is a good venue for discussing a gigantic pile of
code, because (IMO) LSF is better spent either (a) retrying in person to
reach consensus on things that we couldn't do online; or (b) discussing
roadmaps and/or people problems.  In other words, I'd rather use
in-person time to go through broader topics that affect multiple people,
and the mailing lists for detailed examination of a large body of text.

However -- do you have questions about the design?  That could be a good
topic for email /and/ for a face to face meeting.  Though I strongly
suspect that there are so many other sub-topics that fuse-iomap could
eat up an entire afternoon at LSFMM:

 0 How do we convince $managers to spend money on porting filesystems
   to fuse?  Even if they use the regular slow mode?

 1 What's the process for merging all the code changes into libfuse?
   The iomap parts are pretty straightforward because libfuse passes
   the request/reply straight through to fuse server, but...

 2 ...the fuse service container part involves a bunch of architecture
   shifts to libfuse.  First you need a new mount helper to connect to
   a unix socket to start the service, pass some resources (fds and
   mount options) through the unix socket to the service.  Obviously
   that requires new library code for a fuse server to see the unix
   socket and request those resources.  After that you also need to
   define a systemd service file that stands up the appropriate
   sandboxing.  I've not written examples, but that needs to be in the
   final product.

 3 What tooling changes do we need to make to /sbin/mount so that it
   can discover fuse-service-container support and the caller's
   preferences in using the f-s-c vs. the kernel and whatnot?  Do we
   add another weird x-foo-bar "mount option" so that preferences may
   be specified explicitly?

 4 For default situations, where do we make policy about when to use
   f-s-c and when do we allow use of the kernel driver?  I would guess
   that anything in /etc/fstab could use the kernel driver, and
   everything else should use a fuse container if possible.  For
   unprivileged non-root-ns mounts I think we'd only allow the
   container?
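
For concreteness, the resource handoff in item 2 is essentially
SCM_RIGHTS fd passing over a unix socket. A minimal sketch (Python for
brevity; every name here is made up for illustration, none of this is
libfuse API):

```python
# Toy model of the mount-helper -> fuse-service handoff from item 2:
# the helper passes an open fd plus mount options over a unix socket.
# All names are hypothetical illustrations, not libfuse API.
import json
import os
import socket

def helper_send(sock, fuse_fd, options):
    # Ancillary data (SCM_RIGHTS) carries the fd; the payload carries options.
    payload = json.dumps(options).encode()
    socket.send_fds(sock, [payload], [fuse_fd])

def service_recv(sock):
    # Receive the options payload and the duplicated fd in one message.
    msg, fds, _flags, _addr = socket.recv_fds(sock, 4096, maxfds=1)
    return fds[0], json.loads(msg)

if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()  # stand-in for the /dev/fuse fd
    os.write(w, b"hello")
    helper_send(a, r, {"fsname": "demo", "ro": True})
    fd, opts = service_recv(b)
    print(opts["fsname"], os.read(fd, 5).decode())  # -> demo hello
```

The real helper would hand over the /dev/fuse fd and the parsed mount
options; the pipe above just stands in for an inherited fd.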

<shrug> If we made progress on merging the kernel code in the next three
months, does that clear the way for discussions of 2-4 at LSF?

Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's
in Vancouver BC, which could be a good venue to talk to the DE people.

>  - famfs: export distributed memory

This has been, uh, hanging out for an extraordinarily long time.

>  - zero copy for fuse-io-uring
> 
>  - large folios
> 
>  - file handles on the userspace API

(also all that restart stuff, but I think that was already proposed)

>  - compound requests
> 
>  - BPF scripts

Is this an extension of the fuse-bpf filtering discussion that happened
in 2023?  (I wondered why you wouldn't just do bpf hooks in the vfs
itself, but maybe hch already NAKed that?)

As for fuse-iomap -- this week Joanne and I have been working on making
it so that fuse servers can upload ->iomap_{begin,end,ioend} functions
into the kernel as BPF programs to avoid server upcalls.  This might be
a better way to handle the repeating-pattern-iomapping pattern that
seems to exist in famfs than hardcoding things in yet another "upload
iomap mappings" fuse request.

(Yes I see you FUSE_SETUPMAPPING...)
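
Filesystem details aside, the core job of an ->iomap_begin-style hook
is just "file offset -> (disk offset, length)". A toy sketch of that
lookup (the extent table and return shape are invented for
illustration; real iomap carries much more state):

```python
# Toy illustration of the lookup an ->iomap_begin-style hook performs:
# map a file offset to the extent covering it. Extents are made up.
import bisect

# (file_offset, length, disk_offset) -- sorted, non-overlapping.
EXTENTS = [(0, 4096, 81920), (4096, 8192, 204800), (12288, 4096, 40960)]
STARTS = [e[0] for e in EXTENTS]

def iomap_begin(pos):
    # Find the last extent starting at or before pos.
    i = bisect.bisect_right(STARTS, pos) - 1
    if i < 0:
        raise ValueError("hole before first extent")
    off, length, disk = EXTENTS[i]
    if pos >= off + length:
        raise ValueError("hole")  # no extent covers pos
    # Return the mapping from the requested position to the extent end.
    return disk + (pos - off), off + length - pos

print(iomap_begin(5000))  # maps into the second extent -> (205704, 7288)
```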

> How do these fit into the existing codebase?
> 
> Cleaner separation of layers:
> 
>  - transport layer: /dev/fuse, io-uring, virtiofs

I've noticed that each thread in the libfuse uring backend collects a
pile of CQEs and processes them linearly.  So if it receives 5 CQEs and
the first request takes 30 seconds, the other four just get stuck in
line...?
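
To make that head-of-line concern concrete, here's a toy model (plain
Python threads, nothing libfuse-specific): five requests arrive as one
batch and the first one is slow; a linear loop delays the fast ones,
while handing each completion to a worker does not:

```python
# Toy model of head-of-line blocking in a batched completion loop:
# the first request in the batch is slow. Serial handling makes the
# fast ones wait behind it; a worker pool lets them finish meanwhile.
import time
from concurrent.futures import ThreadPoolExecutor

def handle(req, completed):
    if req == 0:
        time.sleep(0.2)          # the one slow request in the batch
    completed.append(req)        # record completion order

def serial(batch):
    completed = []
    for req in batch:            # what a linear CQE loop effectively does
        handle(req, completed)
    return completed

def pooled(batch):
    completed = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        for req in batch:
            pool.submit(handle, req, completed)
    return completed             # the with-block waits for all workers

if __name__ == "__main__":
    print(serial(range(5)))      # slow request completes first, blocking the rest
    print(pooled(range(5)))      # fast requests complete while it sleeps
```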

>  - filesystem layer: local fs, distributed fs

<nod>

> Introduce new version of cleaned up API?
> 
>  - remove async INIT
> 
>  - no fixed ROOT_ID

Can we just merge this?
https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/

>  - consolidate caching rules
> 
>  - who's responsible for updating which metadata?

These two seem like a good combined session -- "who owns what file
metadata?"

>  - remove legacy and problematic flags
> 
>  - get rid of splice on /dev/fuse for new API version?
> 
> Unresolved issues:
> 
>  - locked / writeback folios vs. reclaim / page migration
> 
>  - strictlimiting vs. large folios

/me has no idea about these last four.

--D

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 19:06   ` Darrick J. Wong
@ 2026-02-04 19:38     ` Horst Birthelmer
  2026-02-04 20:58     ` Bernd Schubert
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 79+ messages in thread
From: Horst Birthelmer @ 2026-02-04 19:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques

On Wed, Feb 04, 2026 at 11:06:49AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
> > I propose a session where various topics of interest could be
> > discussed including but not limited to the below list
> > 
> > New features being proposed at various stages of readiness:
> > 
> >  - fuse4fs: exporting the iomap interface to userspace
> 
> FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy
> working on xfs_healer, but I was planning to repost the patchbomb with
> many many cleanups and reorganizations (thanks Joanne!) as soon as
> possible after Linus tags 7.0-rc1.
> 
> I don't think LSFMM is a good venue for discussing a gigantic pile of
> code, because (IMO) LSF is better spent either (a) retrying in person to
> reach consensus on things that we couldn't do online; or (b) discussing
> roadmaps and/or people problems.  In other words, I'd rather use
> in-person time to go through broader topics that affect multiple people,
> and the mailing lists for detailed examination of a large body of text.
> 
> However -- do you have questions about the design?  That could be a good
> topic for email /and/ for a face to face meeting.  Though I strongly
> suspect that there are so many other sub-topics that fuse-iomap could
> eat up an entire afternoon at LSFMM:
> 
>  0 How do we convince $managers to spend money on porting filesystems
>    to fuse?  Even if they use the regular slow mode?
> 
>  1 What's the process for merging all the code changes into libfuse?
>    The iomap parts are pretty straightforward because libfuse passes
>    the request/reply straight through to fuse server, but...
> 
Just convince Bernd ... ;-)

>  2 ...the fuse service container part involves a bunch of architecture
>    shifts to libfuse.  First you need a new mount helper to connect to
>    a unix socket to start the service, pass some resources (fds and
>    mount options) through the unix socket to the service.  Obviously
>    that requires new library code for a fuse server to see the unix
>    socket and request those resources.  After that you also need to
>    define a systemd service file that stands up the appropriate
>    sandboxing.  I've not written examples, but that needs to be in the
>    final product.
> 
This really sounds like a good topic for an afternoon, and in person the
bandwidth for passing ideas is higher.
I'd be really interested in what those architectural shifts are. There is
clearly a lot more to it than the passage above.

> 
> --D

Looking forward to seeing you there.

Horst

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04  9:22       ` Joanne Koong
  2026-02-04 10:37         ` Amir Goldstein
  2026-02-04 10:43         ` [Lsf-pc] " Jan Kara
@ 2026-02-04 20:47         ` Bernd Schubert
  2026-02-06  6:26         ` Darrick J. Wong
  3 siblings, 0 replies; 79+ messages in thread
From: Bernd Schubert @ 2026-02-04 20:47 UTC (permalink / raw)
  To: Joanne Koong, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, Darrick J . Wong, John Groves,
	Luis Henriques, Horst Birthelmer, lsf-pc



On 2/4/26 10:22, Joanne Koong wrote:
> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>>> All important topics which I am sure will be discussed on a FUSE BoF.
> 
> Two other items I'd like to add to the potential discussion list are:
> 
> * leveraging io-uring multishot for batching fuse writeback and
> readahead requests, ie maximizing the throughput per roundtrip context
> switch [1]
> 
> * settling how load distribution should be done for configurable
> queues. We came to a bit of a standstill on Bernd's patchset [2] and
> it would be great to finally get this resolved and the feature landed.
> imo configurable queues and incremental buffer consumption are the two
> main features needed to make fuse-over-io-uring more feasible on
> large-scale systems.

Coincidentally I looked into this today because we had totally
imbalanced queues when this was activated - a slip-up in the queue
assignment, which should be fixed in our branches. v4 basically has all your
comments addressed; our branch(es) now have 3 bug fixes on top of what I
thought would be v4 - unless I get pulled into other things again (which is
unfortunately likely), v4 will come tomorrow.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 19:06   ` Darrick J. Wong
  2026-02-04 19:38     ` Horst Birthelmer
@ 2026-02-04 20:58     ` Bernd Schubert
  2026-02-06  5:47       ` Darrick J. Wong
  2026-02-04 22:50     ` Gao Xiang
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 79+ messages in thread
From: Bernd Schubert @ 2026-02-04 20:58 UTC (permalink / raw)
  To: Darrick J. Wong, Miklos Szeredi
  Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Amir Goldstein,
	Luis Henriques, Horst Birthelmer



On 2/4/26 20:06, Darrick J. Wong wrote:
> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
>> I propose a session where various topics of interest could be
>> discussed including but not limited to the below list
>>
>> New features being proposed at various stages of readiness:
>>
>>  - fuse4fs: exporting the iomap interface to userspace
> 
> FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy
> working on xfs_healer, but I was planning to repost the patchbomb with
> many many cleanups and reorganizations (thanks Joanne!) as soon as
> possible after Linus tags 7.0-rc1.
> 
> I don't think LSFMM is a good venue for discussing a gigantic pile of
> code, because (IMO) LSF is better spent either (a) retrying in person to
> reach consensus on things that we couldn't do online; or (b) discussing
> roadmaps and/or people problems.  In other words, I'd rather use
> in-person time to go through broader topics that affect multiple people,
> and the mailing lists for detailed examination of a large body of text.
> 
> However -- do you have questions about the design?  That could be a good
> topic for email /and/ for a face to face meeting.  Though I strongly
> suspect that there are so many other sub-topics that fuse-iomap could
> eat up an entire afternoon at LSFMM:
> 
>  0 How do we convince $managers to spend money on porting filesystems
>    to fuse?  Even if they use the regular slow mode?
> 
>  1 What's the process for merging all the code changes into libfuse?
>    The iomap parts are pretty straightforward because libfuse passes
>    the request/reply straight through to fuse server, but...

To be honest, I'm rather lost with your patch bomb - in which order do I
need to review what? And what can be merged without the rest?
Regarding libfuse patches - certainly helpful if you also post them
here, but I don't want to create PRs out of your series, which then
might fail the PR tests and I would have to fix them on my own ;)
So the right order is to create libfuse PRs, let the tests run, let
everyone review here or via PR, and then it gets merged.

> 
>  2 ...the fuse service container part involves a bunch of architecture
>    shifts to libfuse.  First you need a new mount helper to connect to
>    a unix socket to start the service, pass some resources (fds and
>    mount options) through the unix socket to the service.  Obviously
>    that requires new library code for a fuse server to see the unix
>    socket and request those resources.  After that you also need to
>    define a systemd service file that stands up the appropriate
>    sandboxing.  I've not written examples, but that needs to be in the
>    final product.
> 
>  3 What tooling changes do we need to make to /sbin/mount so that it
>    can discover fuse-service-container support and the caller's
>    preferences in using the f-s-c vs. the kernel and whatnot?  Do we
>    add another weird x-foo-bar "mount option" so that preferences may
>    be specified explicitly?
> 
>  4 For default situations, where do we make policy about when to use
>    f-s-c and when do we allow use of the kernel driver?  I would guess
>    that anything in /etc/fstab could use the kernel driver, and
>    everything else should use a fuse container if possible.  For
>    unprivileged non-root-ns mounts I think we'd only allow the
>    container?
> 
> <shrug> If we made progress on merging the kernel code in the next three
> months, does that clear the way for discussions of 2-4 at LSF?
> 
> Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's
> in Vancouver BC, which could be a good venue to talk to the DE people.
> 
>>  - famfs: export distributed memory
> 
> This has been, uh, hanging out for an extraordinarily long time.
> 
>>  - zero copy for fuse-io-uring
>>
>>  - large folios
>>
>>  - file handles on the userspace API
> 
> (also all that restart stuff, but I think that was already proposed)
> 
>>  - compound requests
>>
>>  - BPF scripts
> 
> Is this an extension of the fuse-bpf filtering discussion that happened
> in 2023?  (I wondered why you wouldn't just do bpf hooks in the vfs
> itself, but maybe hch already NAKed that?)
> 
> As for fuse-iomap -- this week Joanne and I have been working on making
> it so that fuse servers can upload ->iomap_{begin,end,ioend} functions
> into the kernel as BPF programs to avoid server upcalls.  This might be
> a better way to handle the repeating-pattern-iomapping pattern that
> seems to exist in famfs than hardcoding things in yet another "upload
> iomap mappings" fuse request.
> 
> (Yes I see you FUSE_SETUPMAPPING...)
> 
>> How do these fit into the existing codebase?
>>
>> Cleaner separation of layers:
>>
> >>  - transport layer: /dev/fuse, io-uring, virtiofs
> 
> I've noticed that each thread in the libfuse uring backend collects a
> pile of CQEs and processes them linearly.  So if it receives 5 CQEs and
> the first request takes 30 seconds, the other four just get stuck in
> line...?

I'm certainly open to suggestions and patches :)
At DDN the queues are polled from reactors (co-routine like); that
additional libfuse API will never go public, but I definitely want to
finish and if possible implement a new API before I leave (less than 2
months left). We had a bit of discussion with Stefan Hajnoczi about that
around last March, but I never even got close to that task the whole year.

> 
>>  - filesystem layer: local fs, distributed fs
> 
> <nod>
> 
>> Introduce new version of cleaned up API?
>>
>>  - remove async INIT
>>
>>  - no fixed ROOT_ID
> 
> Can we just merge this?
> https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/

Could you create a libfuse PR please?


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 19:06   ` Darrick J. Wong
  2026-02-04 19:38     ` Horst Birthelmer
  2026-02-04 20:58     ` Bernd Schubert
@ 2026-02-04 22:50     ` Gao Xiang
  2026-02-06  5:38       ` Darrick J. Wong
  2026-02-04 23:19     ` Gao Xiang
  2026-02-05  3:33     ` John Groves
  4 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-02-04 22:50 UTC (permalink / raw)
  To: Darrick J. Wong, Miklos Szeredi
  Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer



On 2026/2/5 03:06, Darrick J. Wong wrote:
> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:

...

> 
>   4 For default situations, where do we make policy about when to use
>     f-s-c and when do we allow use of the kernel driver?  I would guess
>     that anything in /etc/fstab could use the kernel driver, and
>     everything else should use a fuse container if possible.  For
>     unprivileged non-root-ns mounts I think we'd only allow the
>     container?

Just a side note: As a filesystem for containers, I have to say here
again that one of the goals of EROFS is to allow unprivileged non-root-ns
mounts for container users, because again I've seen no on-disk layout
security risk, especially for the uncompressed layout format, and
container users have already requested this. But, as Christoph said,
I will finish the security model first before I post some code for pure
untrusted images, allowing dm-verity/fs-verity signed images
as the first step.

On the other side, my objective thought is that FUSE is becoming
complex, both in its protocol and implementations (even judging from the
TODO lists here), and it lacks a security design too; it's hard to say
from the attack surface which is better, and the Linux kernel has never
been regarded as a microkernel model. In order to phase out "legacy and
problematic flags", FUSE has to wait until all current users stop
using them.

I really think it should be a per-filesystem policy rather than the
current arbitrary policy pieced together from fragmented discussions, but
I will prepare more materials and bring this up for more formal discussion
once the whole goal is finished.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 19:06   ` Darrick J. Wong
                       ` (2 preceding siblings ...)
  2026-02-04 22:50     ` Gao Xiang
@ 2026-02-04 23:19     ` Gao Xiang
  2026-02-05  3:33     ` John Groves
  4 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-02-04 23:19 UTC (permalink / raw)
  To: Darrick J. Wong, Miklos Szeredi
  Cc: f-pc, linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer



On 2026/2/5 03:06, Darrick J. Wong wrote:
> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:

...

>>
>>   - BPF scripts
> 
> Is this an extension of the fuse-bpf filtering discussion that happened
> in 2023?  (I wondered why you wouldn't just do bpf hooks in the vfs
> itself, but maybe hch already NAKed that?)

For this part: as far as I can tell, no one NAKed vfs BPF hooks,
and I had a similar idea two years ago:
https://lore.kernel.org/r/CAOQ4uxjCebxGxkguAh9s4_Vg7QHM=oBoV0LUPZpb+0pcm3z1bw@mail.gmail.com

We have some fanotify BPF hook ideas, e.g. to make lazy pulling
more efficient with applied BPF filters, and I've asked BPF
experts to look into that, but there's no deadline on this either.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 19:06   ` Darrick J. Wong
                       ` (3 preceding siblings ...)
  2026-02-04 23:19     ` Gao Xiang
@ 2026-02-05  3:33     ` John Groves
  2026-02-05  9:27       ` Amir Goldstein
  4 siblings, 1 reply; 79+ messages in thread
From: John Groves @ 2026-02-05  3:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer

On 26/02/04 11:06AM, Darrick J. Wong wrote:

[ ... ]

> >  - famfs: export distributed memory
> 
> This has been, uh, hanging out for an extraordinarily long time.

Um, *yeah*. Although a significant part of that time was on me, because 
getting it ported into fuse was kinda hard, my users and I are hoping we 
can get this upstreamed fairly soon now. I'm hoping that after the 6.19 
merge window dust settles we can negotiate any needed changes etc. and 
shoot for the 7.0 merge window.

:D

Regards,
John



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-05  3:33     ` John Groves
@ 2026-02-05  9:27       ` Amir Goldstein
  2026-02-06  5:52         ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Amir Goldstein @ 2026-02-05  9:27 UTC (permalink / raw)
  To: john
  Cc: Darrick J. Wong, Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert,
	Luis Henriques, Horst Birthelmer

On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
>
> On 26/02/04 11:06AM, Darrick J. Wong wrote:
>
> [ ... ]
>
> > >  - famfs: export distributed memory
> >
> > This has been, uh, hanging out for an extraordinarily long time.
>
> Um, *yeah*. Although a significant part of that time was on me, because
> getting it ported into fuse was kinda hard, my users and I are hoping we
> can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> merge window dust settles we can negotiate any needed changes etc. and
> shoot for the 7.0 merge window.
>

I think that the work on famfs is setting an example, and I very much
hope it will be a good example, of how improving existing infrastructure
(FUSE) is a better contribution than adding another fs to the pile.

I acknowledge that doing the latter is way easier (not for vfs maintainers)
and I very much appreciate your efforts working on the generic FUSE support
that will hopefully serve the community and your users better in the long run.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 22:50     ` Gao Xiang
@ 2026-02-06  5:38       ` Darrick J. Wong
  2026-02-06  6:15         ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-06  5:38 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer

On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote:
> 
> 
> On 2026/2/5 03:06, Darrick J. Wong wrote:
> > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
> 
> ...
> 
> > 
> >   4 For default situations, where do we make policy about when to use
> >     f-s-c and when do we allow use of the kernel driver?  I would guess
> >     that anything in /etc/fstab could use the kernel driver, and
> >     everything else should use a fuse container if possible.  For
> >     unprivileged non-root-ns mounts I think we'd only allow the
> >     container?
> 
> Just a side note: As a filesystem for containers, I have to say here
> again that one of the goals of EROFS is to allow unprivileged non-root-ns
> mounts for container users, because again I've seen no on-disk layout
> security risk, especially for the uncompressed layout format, and
> container users have already requested this. But, as Christoph said,
> I will finish the security model first before I post some code for pure
> untrusted images, allowing dm-verity/fs-verity signed images
> as the first step.

<nod> I haven't forgotten.  For readonly root fses erofs is probably the
best we're going to get, and it's less clunky than fuse.  There's less
of a firewall due to !microkernel but I'd wager that most immutable
distros will find erofs a good enough balance between performance and
isolation.

Fuse, otoh, is for all the other weird users -- you found an old
cupboard full of wide scsi disks; or management decided that letting
container customers bring their own prepopulated data partitions(!) is a
good idea; or the default when someone plugs in a device that the system
knows nothing about.

> On the other side, my objective thought is that FUSE is becoming
> complex, both in its protocol and implementations (even judging from the

It already is.

> TODO lists here), and it lacks a security design too; it's hard to say
> from the attack surface which is better, and the Linux kernel has never
> been regarded as a microkernel model. In order to phase out "legacy and
> problematic flags", FUSE has to wait until all current users stop
> using them.
> 
> I really think it should be a per-filesystem policy rather than the
> current arbitrary policy pieced together from fragmented discussions, but
> I will prepare more materials and bring this up for more formal discussion
> once the whole goal is finished.

Well yes, the transition from kernel to kernel-or-fuse would be
decided on a per-filesystem basis.  When the fuse driver reaches par
with the kernel driver on functionality and stability then it becomes a
candidate for secure container usage.  Not before.

--D

> Thanks,
> Gao Xiang
> 
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 20:58     ` Bernd Schubert
@ 2026-02-06  5:47       ` Darrick J. Wong
  0 siblings, 0 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-06  5:47 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves,
	Amir Goldstein, Luis Henriques, Horst Birthelmer

On Wed, Feb 04, 2026 at 09:58:51PM +0100, Bernd Schubert wrote:
> 
> 
> On 2/4/26 20:06, Darrick J. Wong wrote:
> > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
> >> I propose a session where various topics of interest could be
> >> discussed including but not limited to the below list
> >>
> >> New features being proposed at various stages of readiness:
> >>
> >>  - fuse4fs: exporting the iomap interface to userspace
> > 
> > FYI, I took a semi-break from fuse-iomap for 7.0 because I was too busy
> > working on xfs_healer, but I was planning to repost the patchbomb with
> > many many cleanups and reorganizations (thanks Joanne!) as soon as
> > possible after Linus tags 7.0-rc1.
> > 
> > I don't think LSFMM is a good venue for discussing a gigantic pile of
> > code, because (IMO) LSF is better spent either (a) retrying in person to
> > reach consensus on things that we couldn't do online; or (b) discussing
> > roadmaps and/or people problems.  In other words, I'd rather use
> > in-person time to go through broader topics that affect multiple people,
> > and the mailing lists for detailed examination of a large body of text.
> > 
> > However -- do you have questions about the design?  That could be a good
> > topic for email /and/ for a face to face meeting.  Though I strongly
> > suspect that there are so many other sub-topics that fuse-iomap could
> > eat up an entire afternoon at LSFMM:
> > 
> >  0 How do we convince $managers to spend money on porting filesystems
> >    to fuse?  Even if they use the regular slow mode?
> > 
> >  1 What's the process for merging all the code changes into libfuse?
> >    The iomap parts are pretty straightforward because libfuse passes
> >    the request/reply straight through to fuse server, but...
> 
> To be honest, I'm rather lost with your patch bomb - in which order do I
> need to review what? And what can be merged without the rest?

If there are any fixes they're usually at the beginning.

At the moment you actually /have/ merged everything that can be. :)

The rest relies on kernel patches that aren't upstream.

> Regarding libfuse patches - certainly helpful if you also post them
> here, but I don't want to create PRs out of your series, which then
> might fail the PR tests and I would have to fix them on my own ;)
> So the right order is to create libfuse PRs, let the tests run, let
> everyone review here or via PR, and then it gets merged.

I can generate pull requests for the libfuse things, no problem.  The
hard question is, can your CI system build a kernel with the relevant
patches or do we have to wait until Miklos merges them into upstream?

> >  2 ...the fuse service container part involves a bunch of architecture
> >    shifts to libfuse.  First you need a new mount helper to connect to
> >    a unix socket to start the service, pass some resources (fds and
> >    mount options) through the unix socket to the service.  Obviously
> >    that requires new library code for a fuse server to see the unix
> >    socket and request those resources.  After that you also need to
> >    define a systemd service file that stands up the appropriate
> >    sandboxing.  I've not written examples, but that needs to be in the
> >    final product.
> > 
> >  3 What tooling changes do we need to make to /sbin/mount so that it
> >    can discover fuse-service-container support and the caller's
> >    preferences in using the f-s-c vs. the kernel and whatnot?  Do we
> >    add another weird x-foo-bar "mount option" so that preferences may
> >    be specified explicitly?
> > 
> >  4 For default situations, where do we make policy about when to use
> >    f-s-c and when do we allow use of the kernel driver?  I would guess
> >    that anything in /etc/fstab could use the kernel driver, and
> >    everything else should use a fuse container if possible.  For
> >    unprivileged non-root-ns mounts I think we'd only allow the
> >    container?
> > 
> > <shrug> If we made progress on merging the kernel code in the next three
> > months, does that clear the way for discussions of 2-4 at LSF?
> > 
> > Also, I hear that FOSSY 2026 will have kernel and KDE tracks, and it's
> > in Vancouver BC, which could be a good venue to talk to the DE people.
> > 
> >>  - famfs: export distributed memory
> > 
> > This has been, uh, hanging out for an extraordinarily long time.
> > 
> >>  - zero copy for fuse-io-uring
> >>
> >>  - large folios
> >>
> >>  - file handles on the userspace API
> > 
> > (also all that restart stuff, but I think that was already proposed)
> > 
> >>  - compound requests
> >>
> >>  - BPF scripts
> > 
> > Is this an extension of the fuse-bpf filtering discussion that happened
> > in 2023?  (I wondered why you wouldn't just do bpf hooks in the vfs
> > itself, but maybe hch already NAKed that?)
> > 
> > As for fuse-iomap -- this week Joanne and I have been working on making
> > it so that fuse servers can upload ->iomap_{begin,end,ioend} functions
> > into the kernel as BPF programs to avoid server upcalls.  This might be
> > a better way to handle the repeating-pattern-iomapping pattern that
> > seems to exist in famfs than hardcoding things in yet another "upload
> > iomap mappings" fuse request.
> > 
> > (Yes I see you FUSE_SETUPMAPPING...)
> > 
> >> How do these fit into the existing codebase?
> >>
> >> Cleaner separation of layers:
> >>
> >>  - transport layer: /dev/fuse, io-uring, virtiofs
> > 
> > I've noticed that each thread in the libfuse uring backend collects a
> > pile of CQEs and processes them linearly.  So if it receives 5 CQEs and
> > the first request takes 30 seconds, the other four just get stuck in
> > line...?
> 
> I'm certainly open for suggestions and patches :)

The only things I can think of are

(a) a pool of threads pinned to the same CPU as the CQE reader, but I
don't think that's going to be good for low latency;

(b) as long as the request is still in libfuse, maybe it can decide "I'm
taking too long" and spawn a pthread to hand the request to; or

(c) can other threads steal a CQE to work on if they go idle?  That
might only work for FUSE_DESTROY though, since there won't be new
requests issued after that.

For the particular problems I was seeing with FUSE_DESTROY I picked (b).
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/commit/?h=djwong-wtf&id=e2784aaa0bc0d396fe1c75b826fc140366f576bc

But that also only happens if your kernel has
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-fixes&id=a9df193a5913e747d8c2830197c4f36d56f42e4c
so there's no action to be taken for libfuse right now.

> At DDN the queues are polled from reactors (co-routine like); that
> additional libfuse API will never go public, but I definitely want to
> finish it and if possible implement a new API before I leave (less than 2
> months left). We had a bit of discussion with Stefan Hajnoczi about that
> around last March, but I never came even close to that task the whole year.

<nod>

> > 
> >>  - filesystem layer: local fs, distributed fs
> > 
> > <nod>
> > 
> >> Introduce new version of cleaned up API?
> >>
> >>  - remove async INIT
> >>
> >>  - no fixed ROOT_ID
> > 
> > Can we just merge this?
> > https://lore.kernel.org/linux-fsdevel/176169811231.1426070.12996939158894110793.stgit@frogsfrogsfrogs/
> 
> Could you create a libfuse PR please?

Well we'd have to get the kernel patch merged first, and (AFAIK) it's
not queued up for Linux 7.0.

--D

> 
> Thanks,
> Bernd
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-05  9:27       ` Amir Goldstein
@ 2026-02-06  5:52         ` Darrick J. Wong
  2026-02-06 20:48           ` John Groves
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-06  5:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: john, Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert,
	Luis Henriques, Horst Birthelmer

On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> >
> > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> >
> > [ ... ]
> >
> > > >  - famfs: export distributed memory
> > >
> > > This has been, uh, hanging out for an extraordinarily long time.
> >
> > Um, *yeah*. Although a significant part of that time was on me, because
> > getting it ported into fuse was kinda hard, my users and I are hoping we
> > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > merge window dust settles we can negotiate any needed changes etc. and
> > shoot for the 7.0 merge window.

I think we've all missed getting merged for 7.0 since 6.19 will be
released in 3 days. :/

(Granted most of the maintainers I know are /much/ less conservative
than I was about the schedule)

> I think that the work on famfs is setting an example, and I very much
> hope it will be a good example, of how improving existing infrastructure
> (FUSE) is a better contribution than adding another fs to the pile.

Yeah.  Joanne and I spent a couple of days this week coprogramming a
prototype of a way for famfs to create BPF programs to handle
INTERLEAVED_EXTENT files.  We might be ready to show that off in a
couple of weeks, and that might be a way to clear up the
GET_FMAP/IOMAP_BEGIN logjam at last.

--D

> I acknowledge that doing the latter is way easier (not for vfs maintainers)
> and I very much appreciate your efforts working on the generic FUSE support
> that will hopefully serve the community and your users better in the long run.
> 
> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04 10:43         ` [Lsf-pc] " Jan Kara
@ 2026-02-06  6:09           ` Darrick J. Wong
  2026-02-21  6:07             ` Demi Marie Obenour
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-06  6:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer,
	lsf-pc

On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote:
> On Wed 04-02-26 01:22:02, Joanne Koong wrote:
> > On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > I think that at least one question of interest to the wider fs audience is
> > > >
> > > > Can any of the above improvements be used to help phase out some
> > > > of the old under maintained fs and reduce the burden on vfs maintainers?
> > 
> > I think it might be helpful to know ahead of time where the main
> > hesitation lies. Is it performance? Maybe it'd be helpful if before
> > May there was a prototype converting a simpler filesystem (Darrick and
> > I were musing about fat maybe being a good one) and getting a sense of
> > what the delta is between the native kernel implementation and a
> > fuse-based version? In the past year fuse added a lot of new
> > capabilities that improved performance by quite a bit so I'm curious
> > to see where the delta now lies. Or maybe the hesitation is something
> > else entirely, in which case that's probably a conversation better
> > left for May.
> 
> I'm not sure which filesystems Amir had exactly in mind but in my opinion
> FAT is used widely enough to not be a primary target of this effort. It

OTOH the ESP and USB sticks needn't be high performance.  <shrug>

> would be rather filesystems like (random selection) bfs, adfs, vboxfs,
> minix, efs, freevxfs, etc. The user base of these is very small, testing is
> minimal if possible at all, and thus the value of keeping these in the
> kernel vs the effort they add to infrastructure changes (like folio
> conversions, iomap conversion, ...) is not very favorable.

But yeah, these ones in the long tail are probably good targets.  Though
I think willy pointed out that the biggest barrier in his fs folio
conversions was that many of them aren't testable (e.g. lack mkfs or
fsck tools) which makes a legacy pivot that much harder.

> For these the biggest problem IMO is actually finding someone willing to
> invest into doing (and testing) the conversion. I don't think there are
> severe technical obstacles for most of them.

Yep, that's the biggest hurdle -- convincing managers to pay for a bunch
of really old filesystems that are no longer mainstream.

--D

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06  5:38       ` Darrick J. Wong
@ 2026-02-06  6:15         ` Gao Xiang
  2026-02-21  0:47           ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-02-06  6:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer

Hi Darrick,

On 2026/2/6 13:38, Darrick J. Wong wrote:
> On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote:
>>
>>
>> On 2026/2/5 03:06, Darrick J. Wong wrote:
>>> On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
>>
>> ...
>>
>>>
> >>    4 For default situations, where do we make policy about when to use
>>>      f-s-c and when do we allow use of the kernel driver?  I would guess
>>>      that anything in /etc/fstab could use the kernel driver, and
>>>      everything else should use a fuse container if possible.  For
>>>      unprivileged non-root-ns mounts I think we'd only allow the
>>>      container?
>>
>> Just a side note: As a filesystem for containers, I have to say here
>> again that one of the goals of EROFS is to allow unprivileged non-root-ns
>> mounts for container users because again I've seen no on-disk layout
>> security risk, especially for the uncompressed layout format, and
>> container users have already requested this, but as Christoph said,
>> I will finish the security model first before I post some code for pure
>> untrusted images.  But first allow dm-verity/fs-verity signed images
>> as the first step.
> 
> <nod> I haven't forgotten.  For readonly root fses erofs is probably the
> best we're going to get, and it's less clunky than fuse.  There's less
> of a firewall due to !microkernel but I'd wager that most immutable
> distros will find erofs a good enough balance between performance and
> isolation.

Thanks, but I can't make decisions for every individual end user.
However, in my view, this approach is valuable for all container
users who don't mind trying it (I'm building these capabilities
with several communities and people): they can achieve nearly
native performance on read-write workloads with a trusted fs,
while the remote data source stays completely isolated behind an
immutable secure filesystem.

I will make signed images work first, but as the next step, I'll
definitely work on defining a clear on-disk boundary (very
likely excluding per-inode compression layouts in the beginning)
to enable most users to leverage untrusted data directly in
a totally isolated user/mount namespace.

> 
> Fuse, otoh, is for all the other weird users -- you found an old
> cupboard full of wide scsi disks; or management decided that letting
> container customers bring their own prepopulated data partitions(!) is a
> good idea; or the default when someone plugs in a device that the system
> knows nothing about.

Honestly, I've checked what Ted, Dave, and you said previously.
For generic COW filesystems, it's surely hard to guarantee
filesystem consistency at all times, mainly because of how those
on-disk formats are designed (lots of duplicated metadata for
different purposes, which can cause extra inconsistency compared
to archive fses). Of course, it's not entirely impossible, but
as Ted pointed out, it becomes a matter of

1) human resources; and
2) enforcing such strict consistency checks harms performance
    in general use cases that just use a trusted filesystem /
    medium directly, like databases.

I'm not against further FUSE improvements, since those are
separate stories, and I do think those items are useful for new
Linux innovation. But as for the topic of allowing "root" in a
non-root user ns to mount, I still insist that it should be a
per-filesystem policy, because filesystems are designed for
different target use cases:

  - either you face and address the issue (by design or by
    engineering), or
  - you find an alternative way to serve users.

But I do hope we won't force some arbitrary policy without any
technical reason; the feature is indeed useful for container users.

> 
>> On the other side, my objective thought of that is FUSE is becoming
>> complex either from its protocol and implementations (even from the
> 
> It already is.
> 
>> TODO lists here) and lack of security design too, it's hard to say
>> from the attack surface which is better, and the Linux kernel is never
>> regarded as a microkernel model. In order to phase out "legacy and
>> problematic flags", FUSE has to wait until all current users don't
>> use them anymore.
>>
>> I really think it should be a per-filesystem policy rather than the
>> current arbitrary policy just out of fragment words, but I will
>> prepare more materials and bring this up for more formal discussion
>> until the whole goal is finished.
> 
> Well yes, the transition from kernel to kernel-or-fuse would be
> decided on a per-filesystem basis.  When the fuse driver reaches par
> with the kernel driver on functionality and stability then it becomes a
> candidate for secure container usage.  Not before.

I respect this path, but just from my own perspective, malicious
userspace is usually much harder to defend against since the
trust boundary is weaker. In order to allow unprivileged daemons,
you have to watch whether the page cache, any metadata cache, or
any potential/undiscovered deadlock vector can be abused by those
malicious daemons. So you naturally have to find more hardening
to limit such abuse, since you can never trust those unprivileged
daemons (which are arbitrary executable code rather than a fixed
binary), and that works against the performance cases in
principle without detailed analysis.

Just my two cents.

Thanks,
Gao Xiang

> 
> --D
> 
>> Thanks,
>> Gao Xiang
>>
>>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-04  9:22       ` Joanne Koong
                           ` (2 preceding siblings ...)
  2026-02-04 20:47         ` Bernd Schubert
@ 2026-02-06  6:26         ` Darrick J. Wong
  3 siblings, 0 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-06  6:26 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, John Groves,
	Bernd Schubert, Luis Henriques, Horst Birthelmer, lsf-pc

On Wed, Feb 04, 2026 at 01:22:02AM -0800, Joanne Koong wrote:
> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 2 Feb 2026 at 17:14, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > All important topics which I am sure will be discussed on a FUSE BoF.
> 
> Two other items I'd like to add to the potential discussion list are:
> 
> * leveraging io-uring multishot for batching fuse writeback and
> readahead requests, ie maximizing the throughput per roundtrip context
> switch [1]
> 
> * settling how load distribution should be done for configurable
> queues. We came to a bit of a standstill on Bernd's patchset [2] and
> it would be great to finally get this resolved and the feature landed.
> imo configurable queues and incremental buffer consumption are the two
> main features needed to make fuse-over-io-uring more feasible on
> large-scale systems.
> 
> >
> > I see your point.   Maybe the BPF one could be useful as a cross track
> > discussion, though I'm not sure the fuse side of the design is mature
> > enough for that.  Joanne, you did some experiments with that, no?
> 
> The discussion on this was started in response [3] to Darrick's iomap
> containerization patchset. I have a prototype based on [4] I can get
> into reviewable shape this month or next, if there's interest in
> getting something concrete before May. I did a quick check with the

(Which we're working on :D)

> bpf team a few days ago and confirmed with them that struct ops is the
> way to go for adding the hook point for fuse. For attaching the bpf
> progs to the fuse connection, going through the bpf link interface is
> the modern/preferred way of doing this.

Yes.  That conversion turned out not to be too difficult, but the
resulting uapi is a little awkward because you have to pass the
/dev/fuse fd in one of the structs that you pass to the bpf syscall,
and then the bpf functions have to go find the struct file and use that
to get back to the fuse_conn.

> Discussion wise, imo on the
> fuse side what would be most useful to discuss in May would be what
> other interception points do we think would be the most useful in fuse
> and what should the API interfaces that we expose for those look like
> (eg should these just take the in/out request structs already defined
> in the uapi? or expose more state information?). imo, we should take
> an incremental approach and add interception points more
> conservatively than liberally, on a per-need basis as use cases
> actually come up.

I would start by only allowing iomap_{begin,end,ioend} bpf functions,
and only let them access the same in-arguments and outarg struct as the
fuse server upcall would have.

(I don't have any opinions on the fuse-bpf filtering stuff that was
discussed at lsfmm 2023)

> > > I think that at least one question of interest to the wider fs audience is
> > >
> > > Can any of the above improvements be used to help phase out some
> > > of the old under maintained fs and reduce the burden on vfs maintainers?
> 
> I think it might be helpful to know ahead of time where the main
> hesitation lies. Is it performance? Maybe it'd be helpful if before
> May there was a prototype converting a simpler filesystem (Darrick and
> I were musing about fat maybe being a good one) and getting a sense of
> what the delta is between the native kernel implementation and a
> fuse-based version? In the past year fuse added a lot of new
> capabilities that improved performance by quite a bit so I'm curious
> to see where the delta now lies. Or maybe the hesitation is something
> else entirely, in which case that's probably a conversation better
> left for May.

TBH I think it's mostly inertia because the current solutions aren't so
bad that our managers are screaming at us to get 'er done now. :P

That and conversion is a lot of work.

--D

> Thanks,
> Joanne
> 
> [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z3mTdZdfe5rTukKOnU0y5dpM8aFTCqbctBWsa-S301TQ@mail.gmail.com/
> 
> [2] https://lore.kernel.org/linux-fsdevel/20251013-reduced-nr-ring-queues_3-v3-4-6d87c8aa31ae@ddn.com/t/#u
> 
> [3] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z05QZmos90qmWtnWGF+Kb7rVziJ51UpuJ0O=A+6N1vrg@mail.gmail.com/t/#u
> 
> [4] https://lore.kernel.org/linux-fsdevel/176169810144.1424854.11439355400009006946.stgit@frogsfrogsfrogs/T/#m4998d92f6210d50d0bf6760490689c029bda9231
> 
> >
> > I think the major show stopper is that nobody is going to put a major
> > effort into porting unmaintained kernel filesystems to a different
> > framework.
> >
> > Alternatively someone could implement a "VFS emulator" library.  But
> > keeping that in sync with the kernel, together with all the old fs
> > would be an even greater burden...
> >
> > Thanks,
> > Miklos
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06  5:52         ` Darrick J. Wong
@ 2026-02-06 20:48           ` John Groves
  2026-02-07  0:22             ` Joanne Koong
  2026-02-20 23:59             ` Darrick J. Wong
  0 siblings, 2 replies; 79+ messages in thread
From: John Groves @ 2026-02-06 20:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert,
	Luis Henriques, Horst Birthelmer

On 26/02/05 09:52PM, Darrick J. Wong wrote:
> On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > >
> > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > >
> > > [ ... ]
> > >
> > > > >  - famfs: export distributed memory
> > > >
> > > > This has been, uh, hanging out for an extraordinarily long time.
> > >
> > > Um, *yeah*. Although a significant part of that time was on me, because
> > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > merge window dust settles we can negotiate any needed changes etc. and
> > > shoot for the 7.0 merge window.
> 
> I think we've all missed getting merged for 7.0 since 6.19 will be
> released in 3 days. :/
> 
> (Granted most of the maintainers I know are /much/ less conservative
> than I was about the schedule)

Doh - right you are...

> 
> > I think that the work on famfs is setting an example, and I very much
> > hope it will be a good example, of how improving existing infrastructure
> > (FUSE) is a better contribution than adding another fs to the pile.
> 
> Yeah.  Joanne and I spent a couple of days this week coprogramming a
> prototype of a way for famfs to create BPF programs to handle
> INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> couple of weeks, and that might be a way to clear up the
> GET_FMAP/IOMAP_BEGIN logjam at last.

I'd love to learn more about this; happy to do a call if that's a
good way to get me briefed.

I [generally but not specifically] understand how this could avoid
GET_FMAP, but not GET_DAXDEV. 

But I'm not sure it could (or should) avoid dax_iomap_rw() and
dax_iomap_fault(). The thing is that those call my begin() function
to resolve an offset in a file to an offset on a daxdev, and then
dax completes the fault or memcpy. In that dance, famfs never knows
the kernel address of the memory at all (also true of xfs in fs-dax
mode, unless that's changed fairly recently). I think that's a pretty
decent interface all in all.

Also: dunno whether y'all have looked at the dax patches in the famfs
series, but the solution to working with Alistair's folio-ification 
and cleanup of the dax layer (which set me back months) was to create 
drivers/dax/fsdev.c, which, when bound to a daxdev in place of 
drivers/dax/device.c, configures folios & pages compatibly with 
fs-dax. So I kinda think I need the dax_iomap* interface.

As usual, if I'm overlooking something let me know...

Regards,
John


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06 20:48           ` John Groves
@ 2026-02-07  0:22             ` Joanne Koong
  2026-02-12  4:46               ` Joanne Koong
  2026-02-20 23:59             ` Darrick J. Wong
  1 sibling, 1 reply; 79+ messages in thread
From: Joanne Koong @ 2026-02-07  0:22 UTC (permalink / raw)
  To: John Groves
  Cc: Darrick J. Wong, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
>
> On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > >
> > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > >
> > > > [ ... ]
> > > >
> > > > > >  - famfs: export distributed memory
> > > > >
> > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > >
> > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > shoot for the 7.0 merge window.
> >
> > I think we've all missed getting merged for 7.0 since 6.19 will be
> > released in 3 days. :/
> >
> > (Granted most of the maintainers I know are /much/ less conservative
> > than I was about the schedule)
>
> Doh - right you are...
>
> >
> > > I think that the work on famfs is setting an example, and I very much
> > > hope it will be a good example, of how improving existing infrastructure
> > > (FUSE) is a better contribution than adding another fs to the pile.
> >
> > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > prototype of a way for famfs to create BPF programs to handle
> > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > couple of weeks, and that might be a way to clear up the
> > GET_FMAP/IOMAP_BEGIN logjam at last.
>
> I'd love to learn more about this; happy to do a call if that's a
> good way to get me briefed.
>
> I [generally but not specifically] understand how this could avoid
> GET_FMAP, but not GET_DAXDEV.
>
> But I'm not sure it could (or should) avoid dax_iomap_rw() and
> dax_iomap_fault(). The thing is that those call my begin() function
> to resolve an offset in a file to an offset on a daxdev, and then
> dax completes the fault or memcpy. In that dance, famfs never knows
> the kernel address of the memory at all (also true of xfs in fs-dax
> mode, unless that's changed fairly recently). I think that's a pretty
> decent interface all in all.
>
> Also: dunno whether y'all have looked at the dax patches in the famfs
> series, but the solution to working with Alistair's folio-ification
> and cleanup of the dax layer (which set me back months) was to create
> drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> drivers/dax/device.c, configures folios & pages compatibly with
> fs-dax. So I kinda think I need the dax_iomap* interface.
>
> As usual, if I'm overlooking something let me know...

Hi John,

The conversation started [1] on Darrick's containerization patchset
about using bpf to a) avoid extra requests / context switching for
->iomap_begin and ->iomap_end calls and b) offload what would
otherwise have to be hard-coded kernel logic into userspace, which
gives userspace more flexibility / control with updating the logic and
is less of a maintenance burden for fuse. There was some musing [2]
about whether with bpf infrastructure added, it would allow famfs to
move all famfs-specific logic to userspace/bpf.

I agree that it makes sense for famfs to go through dax iomap
interfaces. imo it seems cleanest if fuse has a generic iomap
interface with iomap dax going through that plumbing, and any
famfs-specific logic that would be needed beyond that (eg computing
the interleaved mappings) being moved to custom famfs bpf programs. I
started trying to implement this yesterday afternoon because I wanted
to make sure it would actually be doable for the famfs logic before
bringing it up and I didn't want to derail your project. So far I only
have the general iomap interface for fuse added with dax operations
going through dax_iomap* and haven't tried out integrating the famfs
GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
get to that early next week. The work I did with Darrick this week was
on getting a server's bpf programs hooked up to fuse through bpf links
and Darrick has fleshed that out and gotten that working now. If it
turns out famfs can go through a generic iomap fuse plumbing layer,
I'd be curious to hear your thoughts on which approach you'd prefer.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
[2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u

>
> Regards,
> John
>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-07  0:22             ` Joanne Koong
@ 2026-02-12  4:46               ` Joanne Koong
  2026-02-21  0:37                 ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Joanne Koong @ 2026-02-12  4:46 UTC (permalink / raw)
  To: John Groves
  Cc: Darrick J. Wong, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
> >
> > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > > >
> > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > >
> > > > > [ ... ]
> > > > >
> > > > > > >  - famfs: export distributed memory
> > > > > >
> > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > >
> > > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > shoot for the 7.0 merge window.
> > >
> > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > released in 3 days. :/
> > >
> > > (Granted most of the maintainers I know are /much/ less conservative
> > > than I was about the schedule)
> >
> > Doh - right you are...
> >
> > >
> > > > I think that the work on famfs is setting an example, and I very much
> > > > hope it will be a good example, of how improving existing infrastructure
> > > > (FUSE) is a better contribution than adding another fs to the pile.
> > >
> > > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > > prototype of a way for famfs to create BPF programs to handle
> > > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > > couple of weeks, and that might be a way to clear up the
> > > GET_FMAP/IOMAP_BEGIN logjam at last.
> >
> > I'd love to learn more about this; happy to do a call if that's a
> > good way to get me briefed.
> >
> > I [generally but not specifically] understand how this could avoid
> > GET_FMAP, but not GET_DAXDEV.
> >
> > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > dax_iomap_fault(). The thing is that those call my begin() function
> > to resolve an offset in a file to an offset on a daxdev, and then
> > dax completes the fault or memcpy. In that dance, famfs never knows
> > the kernel address of the memory at all (also true of xfs in fs-dax
> > mode, unless that's changed fairly recently). I think that's a pretty
> > decent interface all in all.
> >
> > Also: dunno whether y'all have looked at the dax patches in the famfs
> > series, but the solution to working with Alistair's folio-ification
> > and cleanup of the dax layer (which set me back months) was to create
> > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > drivers/dax/device.c, configures folios & pages compatibly with
> > fs-dax. So I kinda think I need the dax_iomap* interface.
> >
> > As usual, if I'm overlooking something let me know...
>
> Hi John,
>
> The conversation started [1] on Darrick's containerization patchset
> about using bpf to a) avoid extra requests / context switching for
> ->iomap_begin and ->iomap_end calls and b) offload what would
> otherwise have to be hard-coded kernel logic into userspace, which
> gives userspace more flexibility / control with updating the logic and
> is less of a maintenance burden for fuse. There was some musing [2]
> about whether with bpf infrastructure added, it would allow famfs to
> move all famfs-specific logic to userspace/bpf.
>
> I agree that it makes sense for famfs to go through dax iomap
> interfaces. imo it seems cleanest if fuse has a generic iomap
> interface with iomap dax going through that plumbing, and any
> famfs-specific logic that would be needed beyond that (eg computing
> the interleaved mappings) being moved to custom famfs bpf programs. I
> started trying to implement this yesterday afternoon because I wanted
> to make sure it would actually be doable for the famfs logic before
> bringing it up and I didn't want to derail your project. So far I only
> have the general iomap interface for fuse added with dax operations
> going through dax_iomap* and haven't tried out integrating the famfs
> GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> get to that early next week. The work I did with Darrick this week was
> on getting a server's bpf programs hooked up to fuse through bpf links
> and Darrick has fleshed that out and gotten that working now. If it
> turns out famfs can go through a generic iomap fuse plumbing layer,
> I'd be curious to hear your thoughts on which approach you'd prefer.

I put together a quick prototype to test this out - this is what it
looks like with fuse having a generic iomap interface that supports
dax [1], and the famfs custom logic moved to a bpf program [2]. I
didn't change much, I just moved around your famfs code to the bpf
side. The kernel side changes are in [3] and the libfuse changes are
in [4].

For testing out the prototype, I hooked it up to passthrough_hp to
test running the bpf program and verify that it is able to find the
extent from the bpf map. In my opinion, this makes the fuse side
> infrastructure cleaner and more extensible for other servers that will
want to go through dax iomap in the future, but I think this also has
a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
request after a file is opened, the server can directly populate the
metadata map from userspace with the mapping info when it processes
the FUSE_OPEN request, which gets rid of the roundtrip cost. The
server can dynamically update the metadata at any time from userspace
if the mapping info needs to change in the future. For setting up the
daxdevs, I moved your logic to the init side, where the server passes
the daxdev info upfront through an IOMAP_CONFIG exchange with the
> kernel initializing the daxdevs based on that info. I think this will
also make deploying future updates for famfs easier, as updating the
logic won't need to go through the upstream kernel mailing list
process and deploying updates won't require a new kernel release.
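
To make the interleaved case concrete, here is a minimal sketch of the
stripe arithmetic such a bpf program would have to perform in
->iomap_begin. All structure and field names below are invented for
illustration; the real famfs fmap layout differs:

```c
#include <stdint.h>

/* Hypothetical interleaved-extent descriptor: a file striped across
 * ndevs dax devices in chunk_size byte units, each stripe member
 * starting at dev_base[] on its device. */
struct ileave_ext {
	uint64_t chunk_size;
	uint32_t ndevs;
	uint64_t dev_base[8];	/* per-device byte offset of this extent */
};

/* Resolve a file offset to a (device index, device offset) pair --
 * the kind of computation an iomap_begin bpf program would do. */
static void ileave_resolve(const struct ileave_ext *ext, uint64_t file_off,
			   uint32_t *dev, uint64_t *dev_off)
{
	uint64_t chunk = file_off / ext->chunk_size;

	*dev = chunk % ext->ndevs;
	*dev_off = ext->dev_base[*dev] +
		   (chunk / ext->ndevs) * ext->chunk_size +
		   file_off % ext->chunk_size;
}
```

The bpf verifier constrains what a real program can do, but the core
arithmetic would look much the same.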

These are just my two cents based on my (cursory) understanding of
famfs. Just wanted to float this alternative approach in case it's
useful.

Thanks,
Joanne

[1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
[2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
[3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
[4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/

>
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
>
> >
> > Regards,
> > John
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06 20:48           ` John Groves
  2026-02-07  0:22             ` Joanne Koong
@ 2026-02-20 23:59             ` Darrick J. Wong
  1 sibling, 0 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-20 23:59 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Miklos Szeredi, f-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, Joanne Koong, Bernd Schubert,
	Luis Henriques, Horst Birthelmer

On Fri, Feb 06, 2026 at 02:48:43PM -0600, John Groves wrote:
> On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > >
> > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > >
> > > > [ ... ]
> > > >
> > > > > >  - famfs: export distributed memory
> > > > >
> > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > >
> > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > shoot for the 7.0 merge window.
> > 
> > I think we've all missed getting merged for 7.0 since 6.19 will be
> > released in 3 days. :/
> > 
> > (Granted most of the maintainers I know are /much/ less conservative
> > than I was about the schedule)
> 
> Doh - right you are...
> 
> > 
> > > I think that the work on famfs is setting an example, and I very much
> > > hope it will be a good example, of how improving existing infrastructure
> > > (FUSE) is a better contribution than adding another fs to the pile.
> > 
> > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > prototype of a way for famfs to create BPF programs to handle
> > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > couple of weeks, and that might be a way to clear up the
> > GET_FMAP/IOMAP_BEGIN logjam at last.
> 
> I'd love to learn more about this; happy to do a call if that's a
> good way to get me briefed.
> 
> I [generally but not specifically] understand how this could avoid
> GET_FMAP, but not GET_DAXDEV. 

fuse-iomap requires fuse servers to open block devices and register them
with the fuse_conn as a backing file.  The kernel returns a magic cookie
that can then be passed back to the kernel in iomap_begin.  This is
(AFAICT) similar to what fuse does w.r.t. passthrough files.

IIRC, GET_DAXDEV is an on-demand fuse request, which is quite different
from the fuse-iomap model where bdevs have to be registered before you
can use them.
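
In toy form (every name here is invented; the real fuse_conn
bookkeeping is more involved), the register-then-lookup pattern is
just: the server registers each backing device once up front, gets an
opaque cookie, and later mapping responses carry only the cookie:

```c
#include <stddef.h>

#define MAX_BACKING 16

struct backing_table {
	const char *dev[MAX_BACKING];	/* stand-in for a bdev/daxdev handle */
	int ndev;
};

/* Register a backing device; returns a cookie (>= 0), or -1 if full. */
static int backing_register(struct backing_table *t, const char *dev)
{
	if (t->ndev == MAX_BACKING)
		return -1;
	t->dev[t->ndev] = dev;
	return t->ndev++;
}

/* Resolve a cookie back to the device, as ->iomap_begin would. */
static const char *backing_lookup(const struct backing_table *t, int cookie)
{
	if (cookie < 0 || cookie >= t->ndev)
		return NULL;
	return t->dev[cookie];
}
```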

> But I'm not sure it could (or should) avoid dax_iomap_rw() and
> dax_iomap_fault(). The thing is that those call my begin() function
> to resolve an offset in a file to an offset on a daxdev, and then
> dax completes the fault or memcpy. In that dance, famfs never knows
> the kernel address of the memory at all (also true of xfs in fs-dax
> mode, unless that's changed fairly recently). I think that's a pretty
> decent interface all in all.

Right.  dax_iomap_{rw,fault} call the ->iomap_begin they're given, which
can be fuse_iomap_begin, which will either (a) look in the iext cache,
(b) see if the fuse server supplied a bpf program, or (c) upcall the
fuse server.
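
A sketch of that fallback order, with all names hypothetical and each
tier returning 0 on a hit (a NULL tier means it isn't available, e.g.
no bpf program attached):

```c
#include <stdint.h>

enum map_src { SRC_NONE, SRC_CACHE, SRC_BPF, SRC_SERVER };

struct mapper {
	int (*cache_lookup)(uint64_t off, uint64_t *phys);
	int (*bpf_prog)(uint64_t off, uint64_t *phys);
	int (*server_upcall)(uint64_t off, uint64_t *phys);
};

/* Try the in-kernel extent cache first, then a server-supplied bpf
 * program, then fall back to a round trip to the fuse server. */
static enum map_src iomap_begin_resolve(const struct mapper *m,
					uint64_t off, uint64_t *phys)
{
	if (m->cache_lookup && m->cache_lookup(off, phys) == 0)
		return SRC_CACHE;
	if (m->bpf_prog && m->bpf_prog(off, phys) == 0)
		return SRC_BPF;
	if (m->server_upcall && m->server_upcall(off, phys) == 0)
		return SRC_SERVER;
	return SRC_NONE;	/* -EIO territory */
}
```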

I also took another look at my broken fuse-iomap-dax patch and realized
that in addition to corrupting data somewhere, there's also a gigantic
XXX around dax_writeback_mapping_range because it takes a bdev instead
of asking the filesystem for mappings, which means that it's broken for
any fsdax file that stores data on more than one device.

> Also: dunno whether y'all have looked at the dax patches in the famfs
> series, but the solution to working with Alistair's folio-ification 
> and cleanup of the dax layer (which set me back months) was to create 
> drivers/dax/fsdev.c, which, when bound to a daxdev in place of 
> drivers/dax/device.c, configures folios & pages compatibly with 
> fs-dax. So I kinda think I need the dax_iomap* interface.

Oh that's good news!

--D

> As usual, if I'm overlooking something let me know...
> 
> Regards,
> John
> 
> 


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-12  4:46               ` Joanne Koong
@ 2026-02-21  0:37                 ` Darrick J. Wong
  2026-02-26 20:21                   ` Joanne Koong
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-21  0:37 UTC (permalink / raw)
  To: Joanne Koong
  Cc: John Groves, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote:
> On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
> > >
> > > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > > > >
> > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > > >
> > > > > > [ ... ]
> > > > > >
> > > > > > > >  - famfs: export distributed memory
> > > > > > >
> > > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > > >
> > > > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > > shoot for the 7.0 merge window.
> > > >
> > > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > > released in 3 days. :/
> > > >
> > > > (Granted most of the maintainers I know are /much/ less conservative
> > > > than I was about the schedule)
> > >
> > > Doh - right you are...
> > >
> > > >
> > > > > I think that the work on famfs is setting an example, and I very much
> > > > > hope it will be a good example, of how improving existing infrastructure
> > > > > (FUSE) is a better contribution than adding another fs to the pile.
> > > >
> > > > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > > > prototype of a way for famfs to create BPF programs to handle
> > > > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > > > couple of weeks, and that might be a way to clear up the
> > > > GET_FMAP/IOMAP_BEGIN logjam at last.
> > >
> > > I'd love to learn more about this; happy to do a call if that's a
> > > good way to get me briefed.
> > >
> > > I [generally but not specifically] understand how this could avoid
> > > GET_FMAP, but not GET_DAXDEV.
> > >
> > > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > > dax_iomap_fault(). The thing is that those call my begin() function
> > > to resolve an offset in a file to an offset on a daxdev, and then
> > > dax completes the fault or memcpy. In that dance, famfs never knows
> > > the kernel address of the memory at all (also true of xfs in fs-dax
> > > mode, unless that's changed fairly recently). I think that's a pretty
> > > decent interface all in all.
> > >
> > > Also: dunno whether y'all have looked at the dax patches in the famfs
> > > series, but the solution to working with Alistair's folio-ification
> > > and cleanup of the dax layer (which set me back months) was to create
> > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > > drivers/dax/device.c, configures folios & pages compatibly with
> > > fs-dax. So I kinda think I need the dax_iomap* interface.
> > >
> > > As usual, if I'm overlooking something let me know...
> >
> > Hi John,
> >
> > The conversation started [1] on Darrick's containerization patchset
> > about using bpf to a) avoid extra requests / context switching for
> > ->iomap_begin and ->iomap_end calls and b) offload what would
> > otherwise have to be hard-coded kernel logic into userspace, which
> > gives userspace more flexibility / control with updating the logic and
> > is less of a maintenance burden for fuse. There was some musing [2]
> > about whether with bpf infrastructure added, it would allow famfs to
> > move all famfs-specific logic to userspace/bpf.
> >
> > I agree that it makes sense for famfs to go through dax iomap
> > interfaces. imo it seems cleanest if fuse has a generic iomap
> > interface with iomap dax going through that plumbing, and any
> > famfs-specific logic that would be needed beyond that (eg computing
> > the interleaved mappings) being moved to custom famfs bpf programs. I
> > started trying to implement this yesterday afternoon because I wanted
> > to make sure it would actually be doable for the famfs logic before
> > bringing it up and I didn't want to derail your project. So far I only
> > have the general iomap interface for fuse added with dax operations
> > going through dax_iomap* and haven't tried out integrating the famfs
> > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> > get to that early next week. The work I did with Darrick this week was
> > on getting a server's bpf programs hooked up to fuse through bpf links
> > and Darrick has fleshed that out and gotten that working now. If it
> > turns out famfs can go through a generic iomap fuse plumbing layer,
> > I'd be curious to hear your thoughts on which approach you'd prefer.
> 
> I put together a quick prototype to test this out - this is what it
> looks like with fuse having a generic iomap interface that supports
> dax [1], and the famfs custom logic moved to a bpf program [2]. I

Using bpf maps to upload per-inode data into the kernel is a /much/
cleaner method than custom-compiling C into BPF at runtime!
You can statically compile the BPF object code into the fuse server,
which means that (a) you can take advantage of the bpftool skeletons,
and (b) you can in theory vendor-sign the BPF code if and when that
becomes a requirement.

I think that's way better than having to put vmlinux.h and
fuse_iomap_bpf.h on the deployed system.  Though there's one hitch in
example/Makefile:

vmlinux.h:
	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@

The build system isn't necessarily running the same kernel as the deploy
images.  It might be for Meta, but it's not unheard of for our build
system to be running (say) an OL10+UEK8 kernel while the build target
is OL8 and UEK7.

There doesn't seem to be any standardization across distros for where a
vmlinux.h file might be found.  Fedora puts it under
/usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
guess SUSE doesn't ship it at all?

That's going to be a headache for deployment as I've been muttering for
a couple of weeks now. :(

Maybe we could reduce the fuse-iomap bpf definitions to use only
cardinal types and the types that iomap itself defines.  That might not
be too hard right now because bpf functions reuse structures from
include/uapi/fuse.h, which currently use uint{8,16,32,64}_t.  It'll get
harder if that __uintXX_t -> __uXX transition actually happens.
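
For instance, a self-contained ABI fragment along these lines (struct
and field names here are invented purely for illustration) would need
nothing from vmlinux.h, since every field is a fixed-width cardinal
type with predictable layout:

```c
#include <stdint.h>

/* Hypothetical fuse-iomap bpf mapping record built only from
 * fixed-width cardinal types, so bpf programs can be compiled against
 * this header alone, with no vmlinux.h. */
struct fuse_iomap_bpf_mapping {
	uint64_t file_off;	/* file offset being mapped */
	uint64_t len;		/* length of the mapping */
	uint64_t addr;		/* byte offset on the backing device */
	uint32_t dev_cookie;	/* backing-device cookie from registration */
	uint16_t type;		/* IOMAP_MAPPED, IOMAP_HOLE, ... */
	uint16_t flags;
};
```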

But getting back to the famfs bpf stuff, I think doing the interleaved
mappings via BPF gives the famfs server a lot more flexibility in terms
of what it can do when future hardware arrives with even weirder
configurations.

--D

> didn't change much, I just moved around your famfs code to the bpf
> side. The kernel side changes are in [3] and the libfuse changes are
> in [4].
> 
> For testing out the prototype, I hooked it up to passthrough_hp to
> test running the bpf program and verify that it is able to find the
> extent from the bpf map. In my opinion, this makes the fuse side
> infrastructure cleaner and more extendable for other servers that will
> want to go through dax iomap in the future, but I think this also has
> a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
> request after a file is opened, the server can directly populate the
> metadata map from userspace with the mapping info when it processes
> the FUSE_OPEN request, which gets rid of the roundtrip cost. The
> server can dynamically update the metadata at any time from userspace
> if the mapping info needs to change in the future. For setting up the
> daxdevs, I moved your logic to the init side, where the server passes
> the daxdev info upfront through an IOMAP_CONFIG exchange with the
> kernel initializing the daxdevs based off that info. I think this will
> also make deploying future updates for famfs easier, as updating the
> logic won't need to go through the upstream kernel mailing list
> process and deploying updates won't require a new kernel release.
> 
> These are just my two cents based on my (cursory) understanding of
> famfs. Just wanted to float this alternative approach in case it's
> useful.
> 
> Thanks,
> Joanne
> 
> [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
> [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
> [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
> [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/
> 
> >
> > Thanks,
> > Joanne
> >
> > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
> >
> > >
> > > Regards,
> > > John
> > >
> 


* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06  6:15         ` Gao Xiang
@ 2026-02-21  0:47           ` Darrick J. Wong
  2026-03-17  4:17             ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-21  0:47 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Miklos Szeredi, f-pc, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer

On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:
> Hi Darrick,
> 
> On 2026/2/6 13:38, Darrick J. Wong wrote:
> > On Thu, Feb 05, 2026 at 06:50:28AM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2026/2/5 03:06, Darrick J. Wong wrote:
> > > > On Mon, Feb 02, 2026 at 02:51:04PM +0100, Miklos Szeredi wrote:
> > > 
> > > ...
> > > 
> > > > 
> > > >    4 For defaults situations, where do we make policy about when to use
> > > >      f-s-c and when do we allow use of the kernel driver?  I would guess
> > > >      that anything in /etc/fstab could use the kernel driver, and
> > > >      everything else should use a fuse container if possible.  For
> > > >      unprivileged non-root-ns mounts I think we'd only allow the
> > > >      container?
> > > 
> > > Just a side note: as a filesystem for containers, I have to say here
> > > again that one of the goals of EROFS is to allow unprivileged
> > > non-root-ns mounts for container users, because I've seen no on-disk
> > > layout security risk, especially for the uncompressed layout format,
> > > and container users have already requested this.  But as Christoph
> > > said, I will finish the security model first before I post code for
> > > purely untrusted images; allowing dm-verity/fs-verity signed images
> > > is the first step.
> > 
> > <nod> I haven't forgotten.  For readonly root fses erofs is probably the
> > best we're going to get, and it's less clunky than fuse.  There's less
> > of a firewall due to !microkernel but I'd wager that most immutable
> > distros will find erofs a good enough balance between performance and
> > isolation.
> 
> Thanks, but I can't make decisions for every individual end user.
> However, in my view, this approach is valuable for all container
> users who don't mind trying it (I'm building these capabilities with
> several communities and people): they can achieve nearly native
> performance on read-write workloads with a trusted fs, while the
> remote data source is kept completely isolated behind an immutable
> secure filesystem.
> 
> I will make signed images work first, but as the next step, I'll
> definitely work on defining a clear on-disk boundary (very
> likely excluding per-inode compression layouts in the beginning)
> to enable most users to leverage untrusted data directly in
> a totally isolated user/mount namespace.

<nod> I hope you succeed!

> > 
> > Fuse, otoh, is for all the other weird users -- you found an old
> > cupboard full of wide scsi disks; or management decided that letting
> > container customers bring their own prepopulated data partitions(!) is a
> > good idea; or the default when someone plugs in a device that the system
> > knows nothing about.
> 
> Honestly, I've checked what Ted, Dave, and you said previously.
> For generic COW filesystems, it's surely hard to guarantee
> filesystem consistency at all times, mainly because of those
> on-disk formats by design (lots of duplicated metadata for
> different purposes, which can cause extra inconsistency compared
> to archive fses.) Of course, it's not entirely impossible, but
> as Ted pointed out, it becomes a matter of
> 
> 1) human resources;
> 2) the cost of enforcement, since such strict consistency checks harm
>    performance in general use cases (like databases) that just use a
>    trusted filesystem / media directly.
> 
> I'm not against further FUSE improvements because they are separate
> stories, and I do think those items are useful for new Linux innovation,
> but as for the topic of allowing "root" in non-root-user-ns to mount,
> I still insist that it should be a per-filesystem policy, because
> filesystems are designed for different targeted use cases:
> 
>  - either you face and address the issue (by design or by
>    engineering), or
>  - find another alternative way to serve users.
> 
> But I do hope we won't force some arbitrary policy without any
> technical reason; the feature is indeed useful for container users.

Oh yes, the policy question is a very large one; for a specific given
filesystem, you need to trust:

A> whatever user is asking to do the mount

B> the quality of the kernel or userspace drivers

C> the provenance of the filesystem image

This is a hugely personal (or institutional) question; all we can do is
provide mechanisms for kernel and userspace drivers, a sensible default
policy, and a reasonable way to relate all three properties to action.

Or just go with IT policy, which is deny, delete, destroy. :P

> > 
> > > On the other side, my objective thought is that FUSE is becoming
> > > complex in both its protocol and its implementations (even from the
> > 
> > It already is.
> > 
> > > TODO lists here), and it lacks a security design too; it's hard to
> > > say from the attack surface which is better, and the Linux kernel
> > > was never designed as a microkernel.  In order to phase out "legacy
> > > and problematic flags", FUSE has to wait until all current users
> > > stop using them.
> > > 
> > > I really think it should be a per-filesystem policy rather than the
> > > current arbitrary policy based on fragmentary wording, but I will
> > > prepare more materials and bring this up for a more formal discussion
> > > once the whole goal is finished.
> > 
> > Well yes, the transition from kernel to kernel-or-fuse would be
> > decided on a per-filesystem basis.  When the fuse driver reaches par
> > with the kernel driver on functionality and stability then it becomes a
> > candidate for secure container usage.  Not before.
> 
> I respect this path, but just from my own perspective, userspace
> malicious problems are usually much harder to defend against since
> the trusted boundary is weaker.  In order to allow unprivileged
> daemons, you have to monitor whether the page cache, any metadata
> cache, or any potential/undiscovered deadlock vectors can be abused
> by those malicious daemons, so you have to find harsher ways to
> limit such abuse, since you never trust those unprivileged daemons
> (which are arbitrary executable code rather than a known binary);
> that is opposed to the performance cases in principle, absent
> detailed analysis.

I'm well aware that going to userspace opens a whole floodgate of weird
dynamic behavior possibilities.  Though obviously my experiences with
kernel XFS have shown me that those challenges exist there too. :/

The kernel does have the nice property that you can set NOFS and ignore
SIGSTOP/KILL if necessary to get things done.

--D

> Just my two cents.
> 
> Thanks,
> Gao Xiang
> 
> > 
> > --D
> > 
> > > Thanks,
> > > Gao Xiang
> > > 
> > > 
> 
> 


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-06  6:09           ` Darrick J. Wong
@ 2026-02-21  6:07             ` Demi Marie Obenour
  2026-02-21  7:07               ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-02-21  6:07 UTC (permalink / raw)
  To: Darrick J. Wong, Jan Kara
  Cc: Joanne Koong, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	John Groves, Bernd Schubert, Luis Henriques, Horst Birthelmer,
	lsf-pc


On 2/6/26 01:09, Darrick J. Wong wrote:
> On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote:
>> On Wed 04-02-26 01:22:02, Joanne Koong wrote:
>>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>> I think that at least one question of interest to the wider fs audience is
>>>>>
>>>>> Can any of the above improvements be used to help phase out some
>>>>> of the old under maintained fs and reduce the burden on vfs maintainers?
>>>
>>> I think it might be helpful to know ahead of time where the main
>>> hesitation lies. Is it performance? Maybe it'd be helpful if before
>>> May there was a prototype converting a simpler filesystem (Darrick and
>>> I were musing about fat maybe being a good one) and getting a sense of
>>> what the delta is between the native kernel implementation and a
>>> fuse-based version? In the past year fuse added a lot of new
>>> capabilities that improved performance by quite a bit so I'm curious
>>> to see where the delta now lies. Or maybe the hesitation is something
>>> else entirely, in which case that's probably a conversation better
>>> left for May.
>>
>> I'm not sure which filesystems Amir had exactly in mind but in my opinion
>> FAT is used widely enough to not be a primary target of this effort. It
> 
> OTOH the ESP and USB sticks needn't be high performance.  <shrug>

Yup.  Also USB sticks are not trusted.

>> would be rather filesystems like (random selection) bfs, adfs, vboxfs,
>> minix, efs, freevxfs, etc. The user base of these is very small, testing is
>> minimal if possible at all, and thus the value of keeping these in the
>> kernel vs the effort they add to infrastructure changes (like folio
>> conversions, iomap conversion, ...) is not very favorable.
> 
> But yeah, these ones in the long tail are probably good targets.  Though
> I think willy pointed out that the biggest barrier in his fs folio
> conversions was that many of them aren't testable (e.g. lack mkfs or
> fsck tools) which makes a legacy pivot that much harder.

Does it make sense to keep these filesystems around?  If all one cares
about is getting the data off of the filesystem, libguestfs with an
old kernel is sufficient.  If the VFS changes introduced bugs, an old
kernel might even be more reliable.  If there is a way to make sure
the FUSE port works, that would be great.  However, if there is no
way to test them, then maybe they should just be dropped.

>> For these the biggest problem IMO is actually finding someone willing to
>> invest into doing (and testing) the conversion. I don't think there are
>> severe technical obstacles for most of them.
> 
> Yep, that's the biggest hurdle -- convincing managers to pay for a bunch
> of really old filesystems that are no longer mainstream.

Could libguestfs with old guest kernels be a sufficient replacement?
It's not going to be fast, but it's enough for data preservation.

libguestfs supports "fixed appliances", which allow using whatever
kernel one wants.  They even provide some as precompiled binaries.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-21  6:07             ` Demi Marie Obenour
@ 2026-02-21  7:07               ` Darrick J. Wong
  2026-02-21 22:16                 ` Demi Marie Obenour
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-21  7:07 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques,
	Horst Birthelmer, lsf-pc

On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote:
> On 2/6/26 01:09, Darrick J. Wong wrote:
> > On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote:
> >> On Wed 04-02-26 01:22:02, Joanne Koong wrote:
> >>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>>> I think that at least one question of interest to the wider fs audience is
> >>>>>
> >>>>> Can any of the above improvements be used to help phase out some
> >>>>> of the old under maintained fs and reduce the burden on vfs maintainers?
> >>>
> >>> I think it might be helpful to know ahead of time where the main
> >>> hesitation lies. Is it performance? Maybe it'd be helpful if before
> >>> May there was a prototype converting a simpler filesystem (Darrick and
> >>> I were musing about fat maybe being a good one) and getting a sense of
> >>> what the delta is between the native kernel implementation and a
> >>> fuse-based version? In the past year fuse added a lot of new
> >>> capabilities that improved performance by quite a bit so I'm curious
> >>> to see where the delta now lies. Or maybe the hesitation is something
> >>> else entirely, in which case that's probably a conversation better
> >>> left for May.
> >>
> >> I'm not sure which filesystems Amir had exactly in mind but in my opinion
> >> FAT is used widely enough to not be a primary target of this effort. It
> > 
> > OTOH the ESP and USB sticks needn't be high performance.  <shrug>
> 
> Yup.  Also USB sticks are not trusted.
> 
> >> would be rather filesystems like (random selection) bfs, adfs, vboxfs,
> >> minix, efs, freevxfs, etc. The user base of these is very small, testing is
> >> minimal if possible at all, and thus the value of keeping these in the
> >> kernel vs the effort they add to infrastructure changes (like folio
> >> conversions, iomap conversion, ...) is not very favorable.
> > 
> > But yeah, these ones in the long tail are probably good targets.  Though
> > I think willy pointed out that the biggest barrier in his fs folio
> > conversions was that many of them aren't testable (e.g. lack mkfs or
> > fsck tools) which makes a legacy pivot that much harder.
> 
> Does it make sense to keep these filesystems around?  If all one cares
> about is getting the data off of the filesystem, libguestfs with an
> old kernel is sufficient.  If the VFS changes introduced bugs, an old
> kernel might even be more reliable.  If there is a way to make sure
> the FUSE port works, that would be great.  However, if there is no
> way to test them, then maybe they should just be dropped.
> 
> >> For these the biggest problem IMO is actually finding someone willing to
> >> invest into doing (and testing) the conversion. I don't think there are
> >> severe technical obstacles for most of them.
> > 
> > Yep, that's the biggest hurdle -- convincing managers to pay for a bunch
> > of really old filesystems that are no longer mainstream.
> 
> Could libguestfs with old guest kernels be a sufficient replacement?
> It's not going to be fast, but it's enough for data preservation.

In principle it might work, though I have questions about the quality of
whatever's internally driving guestmount.

Do you know how exactly libguestfs/guestmount accesses (say) an XFS
filesystem?  I'm curious because libxfs isn't a shared library, so
either it would have to manipulate xfs_db (ugh!), run the kernel in a VM
layer, or ... do they have their own implementation a la grub?

--D

> libguestfs supports "fixed appliances", which allow using whatever
> kernel one wants.  They even provide some as precompiled binaries.
> -- 
> Sincerely,
> Demi Marie Obenour (she/her/hers)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-21  7:07               ` Darrick J. Wong
@ 2026-02-21 22:16                 ` Demi Marie Obenour
  2026-02-23 21:58                   ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-02-21 22:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques,
	Horst Birthelmer, lsf-pc


On 2/21/26 02:07, Darrick J. Wong wrote:
> On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote:
>> On 2/6/26 01:09, Darrick J. Wong wrote:
>>> On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote:
>>>> On Wed 04-02-26 01:22:02, Joanne Koong wrote:
>>>>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>>>> I think that at least one question of interest to the wider fs audience is
>>>>>>>
>>>>>>> Can any of the above improvements be used to help phase out some
>>>>>>> of the old under maintained fs and reduce the burden on vfs maintainers?
>>>>>
>>>>> I think it might be helpful to know ahead of time where the main
>>>>> hesitation lies. Is it performance? Maybe it'd be helpful if before
>>>>> May there was a prototype converting a simpler filesystem (Darrick and
>>>>> I were musing about fat maybe being a good one) and getting a sense of
>>>>> what the delta is between the native kernel implementation and a
>>>>> fuse-based version? In the past year fuse added a lot of new
>>>>> capabilities that improved performance by quite a bit so I'm curious
>>>>> to see where the delta now lies. Or maybe the hesitation is something
>>>>> else entirely, in which case that's probably a conversation better
>>>>> left for May.
>>>>
>>>> I'm not sure which filesystems Amir had exactly in mind but in my opinion
>>>> FAT is used widely enough to not be a primary target of this effort. It
>>>
>>> OTOH the ESP and USB sticks needn't be high performance.  <shrug>
>>
>> Yup.  Also USB sticks are not trusted.
>>
>>>> would be rather filesystems like (random selection) bfs, adfs, vboxfs,
>>>> minix, efs, freevxfs, etc. The user base of these is very small, testing is
>>>> minimal if possible at all, and thus the value of keeping these in the
>>>> kernel vs the effort they add to infrastructure changes (like folio
>>>> conversions, iomap conversion, ...) is not very favorable.
>>>
>>> But yeah, these ones in the long tail are probably good targets.  Though
>>> I think willy pointed out that the biggest barrier in his fs folio
>>> conversions was that many of them aren't testable (e.g. lack mkfs or
>>> fsck tools) which makes a legacy pivot that much harder.
>>
>> Does it make sense to keep these filesystems around?  If all one cares
>> about is getting the data off of the filesystem, libguestfs with an
>> old kernel is sufficient.  If the VFS changes introduced bugs, an old
>> kernel might even be more reliable.  If there is a way to make sure
>> the FUSE port works, that would be great.  However, if there is no
>> way to test them, then maybe they should just be dropped.
>>
>>>> For these the biggest problem IMO is actually finding someone willing to
>>>> invest into doing (and testing) the conversion. I don't think there are
>>>> severe technical obstacles for most of them.
>>>
>>> Yep, that's the biggest hurdle -- convincing managers to pay for a bunch
>>> of really old filesystems that are no longer mainstream.
>>
>> Could libguestfs with old guest kernels be a sufficient replacement?
>> It's not going to be fast, but it's enough for data preservation.
> 
> In principle it might work, though I have questions about the quality of
> whatever's internally driving guestmount.
> 
> Do you know how exactly libguestfs/guestmount accesses (say) an XFS
> filesystem?  I'm curious because libxfs isn't a shared library, so
> either it would have to manipulate xfs_db (ugh!), run the kernel in a VM
> layer, or ... do they have their own implementation ala grub?

They run Linux in a VM.  Using an old Linux would allow working with
old filesystems that have since been removed.  If KVM is available,
the VM is (or at least should be) strongly sandboxed.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-21 22:16                 ` Demi Marie Obenour
@ 2026-02-23 21:58                   ` Darrick J. Wong
  0 siblings, 0 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-02-23 21:58 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jan Kara, Joanne Koong, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, John Groves, Bernd Schubert, Luis Henriques,
	Horst Birthelmer, lsf-pc

On Sat, Feb 21, 2026 at 05:16:25PM -0500, Demi Marie Obenour wrote:
> On 2/21/26 02:07, Darrick J. Wong wrote:
> > On Sat, Feb 21, 2026 at 01:07:55AM -0500, Demi Marie Obenour wrote:
> >> On 2/6/26 01:09, Darrick J. Wong wrote:
> >>> On Wed, Feb 04, 2026 at 11:43:05AM +0100, Jan Kara wrote:
> >>>> On Wed 04-02-26 01:22:02, Joanne Koong wrote:
> >>>>> On Mon, Feb 2, 2026 at 11:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>>>>> I think that at least one question of interest to the wider fs audience is
> >>>>>>>
> >>>>>>> Can any of the above improvements be used to help phase out some
> >>>>>>> of the old under maintained fs and reduce the burden on vfs maintainers?
> >>>>>
> >>>>> I think it might be helpful to know ahead of time where the main
> >>>>> hesitation lies. Is it performance? Maybe it'd be helpful if before
> >>>>> May there was a prototype converting a simpler filesystem (Darrick and
> >>>>> I were musing about fat maybe being a good one) and getting a sense of
> >>>>> what the delta is between the native kernel implementation and a
> >>>>> fuse-based version? In the past year fuse added a lot of new
> >>>>> capabilities that improved performance by quite a bit so I'm curious
> >>>>> to see where the delta now lies. Or maybe the hesitation is something
> >>>>> else entirely, in which case that's probably a conversation better
> >>>>> left for May.
> >>>>
> >>>> I'm not sure which filesystems Amir had exactly in mind but in my opinion
> >>>> FAT is used widely enough to not be a primary target of this effort. It
> >>>
> >>> OTOH the ESP and USB sticks needn't be high performance.  <shrug>
> >>
> >> Yup.  Also USB sticks are not trusted.
> >>
> >>>> would be rather filesystems like (random selection) bfs, adfs, vboxfs,
> >>>> minix, efs, freevxfs, etc. The user base of these is very small, testing is
> >>>> minimal if possible at all, and thus the value of keeping these in the
> >>>> kernel vs the effort they add to infrastructure changes (like folio
> >>>> conversions, iomap conversion, ...) is not very favorable.
> >>>
> >>> But yeah, these ones in the long tail are probably good targets.  Though
> >>> I think willy pointed out that the biggest barrier in his fs folio
> >>> conversions was that many of them aren't testable (e.g. lack mkfs or
> >>> fsck tools) which makes a legacy pivot that much harder.
> >>
> >> Does it make sense to keep these filesystems around?  If all one cares
> >> about is getting the data off of the filesystem, libguestfs with an
> >> old kernel is sufficient.  If the VFS changes introduced bugs, an old
> >> kernel might even be more reliable.  If there is a way to make sure
> >> the FUSE port works, that would be great.  However, if there is no
> >> way to test them, then maybe they should just be dropped.
> >>
> >>>> For these the biggest problem IMO is actually finding someone willing to
> >>>> invest into doing (and testing) the conversion. I don't think there are
> >>>> severe technical obstacles for most of them.
> >>>
> >>> Yep, that's the biggest hurdle -- convincing managers to pay for a bunch
> >>> of really old filesystems that are no longer mainstream.
> >>
> >> Could libguestfs with old guest kernels be a sufficient replacement?
> >> It's not going to be fast, but it's enough for data preservation.
> > 
> > In principle it might work, though I have questions about the quality of
> > whatever's internally driving guestmount.
> > 
> > Do you know how exactly libguestfs/guestmount accesses (say) an XFS
> > filesystem?  I'm curious because libxfs isn't a shared library, so
> > either it would have to manipulate xfs_db (ugh!), run the kernel in a VM
> > layer, or ... do they have their own implementation ala grub?
> 
> They run Linux in a VM.  Using an old Linux would allow working with
> old filesystems that have since been removed.  If KVM is available,
> the VM is (or at least should be) strongly sandboxed.

/me tries out libguestfs and ... wow it's slow to start.  It does seem
to provide the isolation of the fs parsing code that I want, but the
overheads are quite high.  500MB memory to mount a totally empty XFS
filesystem, and 350MB of disk space to create a rootfs, ouch.

--D

> -- 
> Sincerely,
> Demi Marie Obenour (she/her/hers)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-21  0:37                 ` Darrick J. Wong
@ 2026-02-26 20:21                   ` Joanne Koong
  2026-03-03  4:57                     ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Joanne Koong @ 2026-02-26 20:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote:
> > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
> > > >
> > > > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > > > > >
> > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > > > >
> > > > > > > [ ... ]
> > > > > > >
> > > > > > > > >  - famfs: export distributed memory
> > > > > > > >
> > > > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > > > >
> > > > > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > > > shoot for the 7.0 merge window.
> > > > >
> > > > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > > > released in 3 days. :/
> > > > >
> > > > > (Granted most of the maintainers I know are /much/ less conservative
> > > > > than I was about the schedule)
> > > >
> > > > Doh - right you are...
> > > >
> > > > >
> > > > > > I think that the work on famfs is setting an example, and I very much
> > > > > > hope it will be a good example, of how improving existing infrastructure
> > > > > > (FUSE) is a better contribution than adding another fs to the pile.
> > > > >
> > > > > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > > > > prototype of a way for famfs to create BPF programs to handle
> > > > > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > > > > couple of weeks, and that might be a way to clear up the
> > > > > GET_FMAP/IOMAP_BEGIN logjam at last.
> > > >
> > > > I'd love to learn more about this; happy to do a call if that's a
> > > > good way to get me briefed.
> > > >
> > > > I [generally but not specifically] understand how this could avoid
> > > > GET_FMAP, but not GET_DAXDEV.
> > > >
> > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > > > dax_iomap_fault(). The thing is that those call my begin() function
> > > > to resolve an offset in a file to an offset on a daxdev, and then
> > > > dax completes the fault or memcpy. In that dance, famfs never knows
> > > > the kernel address of the memory at all (also true of xfs in fs-dax
> > > > mode, unless that's changed fairly recently). I think that's a pretty
> > > > decent interface all in all.
> > > >
> > > > Also: dunno whether y'all have looked at the dax patches in the famfs
> > > > series, but the solution to working with Alistair's folio-ification
> > > > and cleanup of the dax layer (which set me back months) was to create
> > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > > > drivers/dax/device.c, configures folios & pages compatibly with
> > > > fs-dax. So I kinda think I need the dax_iomap* interface.
> > > >
> > > > As usual, if I'm overlooking something let me know...
> > >
> > > Hi John,
> > >
> > > The conversation started [1] on Darrick's containerization patchset
> > > about using bpf to a) avoid extra requests / context switching for
> > > ->iomap_begin and ->iomap_end calls and b) offload what would
> > > otherwise have to be hard-coded kernel logic into userspace, which
> > > gives userspace more flexibility / control with updating the logic and
> > > is less of a maintenance burden for fuse. There was some musing [2]
> > > about whether with bpf infrastructure added, it would allow famfs to
> > > move all famfs-specific logic to userspace/bpf.
> > >
> > > I agree that it makes sense for famfs to go through dax iomap
> > > interfaces. imo it seems cleanest if fuse has a generic iomap
> > > interface with iomap dax going through that plumbing, and any
> > > famfs-specific logic that would be needed beyond that (eg computing
> > > the interleaved mappings) being moved to custom famfs bpf programs. I
> > > started trying to implement this yesterday afternoon because I wanted
> > > to make sure it would actually be doable for the famfs logic before
> > > bringing it up and I didn't want to derail your project. So far I only
> > > have the general iomap interface for fuse added with dax operations
> > > going through dax_iomap* and haven't tried out integrating the famfs
> > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> > > get to that early next week. The work I did with Darrick this week was
> > > on getting a server's bpf programs hooked up to fuse through bpf links
> > > and Darrick has fleshed that out and gotten that working now. If it
> > > turns out famfs can go through a generic iomap fuse plumbing layer,
> > > I'd be curious to hear your thoughts on which approach you'd prefer.
> >
> > I put together a quick prototype to test this out - this is what it
> > looks like with fuse having a generic iomap interface that supports
> > dax [1], and the famfs custom logic moved to a bpf program [2]. I
>
> The bpf maps that you've used to upload per-inode data into the kernel
> is a /much/ cleaner method than custom-compiling C into BPF at runtime!
> You can statically compile the BPF object code into the fuse server,
> which means that (a) you can take advantage of the bpftool skeletons,
> and (b) you can in theory vendor-sign the BPF code if and when that
> becomes a requirement.
>
> I think that's way better than having to put vmlinux.h and
> fuse_iomap_bpf.h on the deployed system.  Though there's one hitch in
> example/Makefile:
>
> vmlinux.h:
>         $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
>
> The build system isn't necessarily running the same kernel as the deploy
> images.  It might be for Meta, but it's not unheard of for our build
> system to be running (say) OL10+UEK8 kernel, but the build target is OL8
> and UEK7.
>
> There doesn't seem to be any standardization across distros for where a
> vmlinux.h file might be found.  Fedora puts it under
> /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
> guess SUSE doesn't ship it at all?
>
> That's going to be a headache for deployment as I've been muttering for
> a couple of weeks now. :(

I don't think this is an issue because bpf does dynamic btf-based
relocations (CO-RE) at load time [1]. On the target machine, when
libbpf loads the bpf object it will read the machine's btf and patch
any offsets in bytecode and load the fixed-up version into the kernel.
All that's needed on the target machine for CO-RE is
CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
distributions. I think this addresses the deployment headache you've
been running into?
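To make the mechanism concrete, here's a toy sketch of the relocation step
(this is not libbpf's actual code; the task_struct offsets and the
"instruction" encoding are invented purely for illustration):

```python
# Toy illustration of a CO-RE-style field-offset relocation.
# Real libbpf reads BTF from /sys/kernel/btf/vmlinux and patches
# eBPF bytecode; the layouts and tuples here are made up.

# Field offsets baked in at compile time (the build box's layout).
BUILD_LAYOUT = {"task_struct.pid": 1256, "task_struct.comm": 2648}

# Field offsets on the machine actually loading the program.
TARGET_LAYOUT = {"task_struct.pid": 1272, "task_struct.comm": 2696}

def relocate(insns, build, target):
    """Rewrite each load's offset from the build-time layout to the
    running kernel's layout, keyed by the recorded field name."""
    patched = []
    for op, field, off in insns:
        # The relocation record remembers which field was accessed,
        # so a stale compile-time offset can simply be replaced.
        assert build[field] == off, "stale relocation record"
        patched.append((op, field, target[field]))
    return patched

insns = [("load", "task_struct.pid", 1256),
         ("load", "task_struct.comm", 2648)]
print(relocate(insns, BUILD_LAYOUT, TARGET_LAYOUT))
```

libbpf does the equivalent of this against the target's real BTF at load
time, which is why the build box's struct layouts don't have to match the
kernel the server eventually runs on.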

Thanks,
Joanne

[1] https://docs.ebpf.io/concepts/core/

>
> Maybe we could reduce the fuse-iomap bpf definitions to use only
> cardinal types and the types that iomap itself defines.  That might not
> be too hard right now because bpf functions reuse structures from
> include/uapi/fuse.h, which currently use uint{8,16,32,64}_t.  It'll get
> harder if that __uintXX_t -> __uXX transition actually happens.
>
> But getting back to the famfs bpf stuff, I think doing the interleaved
> mappings via BPF gives the famfs server a lot more flexibility in terms
> of what it can do when future hardware arrives with even weirder
> configurations.
>
> --D
>
> > didn't change much, I just moved around your famfs code to the bpf
> > side. The kernel side changes are in [3] and the libfuse changes are
> > in [4].
> >
> > For testing out the prototype, I hooked it up to passthrough_hp to
> > test running the bpf program and verify that it is able to find the
> > extent from the bpf map. In my opinion, this makes the fuse side
> > infrastructure cleaner and more extendable for other servers that will
> > want to go through dax iomap in the future, but I think this also has
> > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
> > request after a file is opened, the server can directly populate the
> > metadata map from userspace with the mapping info when it processes
> > the FUSE_OPEN request, which gets rid of the roundtrip cost. The
> > server can dynamically update the metadata at any time from userspace
> > if the mapping info needs to change in the future. For setting up the
> > daxdevs, I moved your logic to the init side, where the server passes
> > the daxdev info upfront through an IOMAP_CONFIG exchange with the
> > kernel initializing the daxdevs based off that info. I think this will
> > also make deploying future updates for famfs easier, as updating the
> > logic won't need to go through the upstream kernel mailing list
> > process and deploying updates won't require a new kernel release.
> >
> > These are just my two cents based on my (cursory) understanding of
> > famfs. Just wanted to float this alternative approach in case it's
> > useful.
> >
> > Thanks,
> > Joanne
> >
> > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
> > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
> > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
> > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/
> >
> > >
> > > Thanks,
> > > Joanne
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
> > >
> > > >
> > > > Regards,
> > > > John
> > > >
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-26 20:21                   ` Joanne Koong
@ 2026-03-03  4:57                     ` Darrick J. Wong
  2026-03-03 17:28                       ` Joanne Koong
  0 siblings, 1 reply; 79+ messages in thread
From: Darrick J. Wong @ 2026-03-03  4:57 UTC (permalink / raw)
  To: Joanne Koong
  Cc: John Groves, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Thu, Feb 26, 2026 at 12:21:43PM -0800, Joanne Koong wrote:
> On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote:
> > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
> > > > >
> > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > > > > > >
> > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > > > > >
> > > > > > > > [ ... ]
> > > > > > > >
> > > > > > > > > >  - famfs: export distributed memory
> > > > > > > > >
> > > > > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > > > > >
> > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > > > > shoot for the 7.0 merge window.
> > > > > >
> > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > > > > released in 3 days. :/
> > > > > >
> > > > > > (Granted most of the maintainers I know are /much/ less conservative
> > > > > > than I was about the schedule)
> > > > >
> > > > > Doh - right you are...
> > > > >
> > > > > >
> > > > > > > I think that the work on famfs is setting an example, and I very much
> > > > > > > hope it will be a good example, of how improving existing infrastructure
> > > > > > > (FUSE) is a better contribution than adding another fs to the pile.
> > > > > >
> > > > > > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > > > > > prototype of a way for famfs to create BPF programs to handle
> > > > > > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > > > > > couple of weeks, and that might be a way to clear up the
> > > > > > GET_FMAP/IOMAP_BEGIN logjam at last.
> > > > >
> > > > > I'd love to learn more about this; happy to do a call if that's a
> > > > > good way to get me briefed.
> > > > >
> > > > > I [generally but not specifically] understand how this could avoid
> > > > > GET_FMAP, but not GET_DAXDEV.
> > > > >
> > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > > > > dax_iomap_fault(). The thing is that those call my begin() function
> > > > > to resolve an offset in a file to an offset on a daxdev, and then
> > > > > dax completes the fault or memcpy. In that dance, famfs never knows
> > > > > the kernel address of the memory at all (also true of xfs in fs-dax
> > > > > mode, unless that's changed fairly recently). I think that's a pretty
> > > > > decent interface all in all.
> > > > >
> > > > > Also: dunno whether y'all have looked at the dax patches in the famfs
> > > > > series, but the solution to working with Alistair's folio-ification
> > > > > and cleanup of the dax layer (which set me back months) was to create
> > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > > > > drivers/dax/device.c, configures folios & pages compatibly with
> > > > > fs-dax. So I kinda think I need the dax_iomap* interface.
> > > > >
> > > > > As usual, if I'm overlooking something let me know...
> > > >
> > > > Hi John,
> > > >
> > > > The conversation started [1] on Darrick's containerization patchset
> > > > about using bpf to a) avoid extra requests / context switching for
> > > > ->iomap_begin and ->iomap_end calls and b) offload what would
> > > > otherwise have to be hard-coded kernel logic into userspace, which
> > > > gives userspace more flexibility / control with updating the logic and
> > > > is less of a maintenance burden for fuse. There was some musing [2]
> > > > about whether with bpf infrastructure added, it would allow famfs to
> > > > move all famfs-specific logic to userspace/bpf.
> > > >
> > > > I agree that it makes sense for famfs to go through dax iomap
> > > > interfaces. imo it seems cleanest if fuse has a generic iomap
> > > > interface with iomap dax going through that plumbing, and any
> > > > famfs-specific logic that would be needed beyond that (eg computing
> > > > the interleaved mappings) being moved to custom famfs bpf programs. I
> > > > started trying to implement this yesterday afternoon because I wanted
> > > > to make sure it would actually be doable for the famfs logic before
> > > > bringing it up and I didn't want to derail your project. So far I only
> > > > have the general iomap interface for fuse added with dax operations
> > > > going through dax_iomap* and haven't tried out integrating the famfs
> > > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> > > > get to that early next week. The work I did with Darrick this week was
> > > > on getting a server's bpf programs hooked up to fuse through bpf links
> > > > and Darrick has fleshed that out and gotten that working now. If it
> > > > turns out famfs can go through a generic iomap fuse plumbing layer,
> > > > I'd be curious to hear your thoughts on which approach you'd prefer.
> > >
> > > I put together a quick prototype to test this out - this is what it
> > > looks like with fuse having a generic iomap interface that supports
> > > dax [1], and the famfs custom logic moved to a bpf program [2]. I
> >
> > The bpf maps that you've used to upload per-inode data into the kernel
> > is a /much/ cleaner method than custom-compiling C into BPF at runtime!
> > You can statically compile the BPF object code into the fuse server,
> > which means that (a) you can take advantage of the bpftool skeletons,
> > and (b) you can in theory vendor-sign the BPF code if and when that
> > becomes a requirement.
> >
> > I think that's way better than having to put vmlinux.h and
> > fuse_iomap_bpf.h on the deployed system.  Though there's one hitch in
> > example/Makefile:
> >
> > vmlinux.h:
> >         $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
> >
> > The build system isn't necessarily running the same kernel as the deploy
> > images.  It might be for Meta, but it's not unheard of for our build
> > system to be running (say) OL10+UEK8 kernel, but the build target is OL8
> > and UEK7.
> >
> > There doesn't seem to be any standardization across distros for where a
> > vmlinux.h file might be found.  Fedora puts it under
> > /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
> > guess SUSE doesn't ship it at all?
> >
> > That's going to be a headache for deployment as I've been muttering for
> > a couple of weeks now. :(
> 
> I don't think this is an issue because bpf does dynamic btf-based
> relocations (CO-RE) at load time [1]. On the target machine, when
> libbpf loads the bpf object it will read the machine's btf and patch
> any offsets in bytecode and load the fixed-up version into the kernel.
> All that's needed on the target machine for CO-RE is
> CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
> distributions. I think this addresses the deployment headache you've
> been running into?

Not really -- CO-RE does indeed work quite nicely to smooth over layout
changes in C structures between a BPF program and the kernel it's being
loaded into (thanks, whoever came up with that!) but the problem I have
is how you /get/ those definitions into clang in the first place.

I was under the impression from many of the bpf examples that you're
supposed to #include a distro-provided "vmlinux.h", but there doesn't
seem to be a standard way to find that file.  Most -dev packages provide
a pkgconfig file that give you the appropriate CFLAGS/LDFLAGS to add,
but apparently this is not the case for BPF...?

Perhaps it's the case that distro packages that are building BPF
programs simply add a build dependency on the package providing
vmlinux.h (e.g. Build-Depends: linux-bpf-dev on Debian) and patch in
"CFLAGS=-I/some/path" as needed?
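In Makefile terms that might look something like the fragment below.  This
is hypothetical -- the VMLINUX_H knob is invented here; the only part taken
from reality is the bpftool rule already quoted from example/Makefile:

```make
# Hypothetical: let the packager point at a distro-shipped vmlinux.h;
# otherwise fall back to dumping the running kernel's BTF, same as the
# existing example/Makefile rule.
VMLINUX_H ?=

vmlinux.h:
ifeq ($(VMLINUX_H),)
	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
else
	cp $(VMLINUX_H) $@
endif
```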

I suppose for a dynamically generated and compiled BPF program, one
could just "bpftool skel" the /sys/kernel/btf files, capture the output,
and "#include </dev/fd/XXX>" the results.  Honestly that sounds better
than trusting some weird system package.

But maybe dynamic compilation is a totally stupid idea.  I did grow up
in the era of mshtml email wreaking havoc, after all...

--D

> Thanks,
> Joanne
> 
> [1] https://docs.ebpf.io/concepts/core/
> 
> >
> > Maybe we could reduce the fuse-iomap bpf definitions to use only
> > cardinal types and the types that iomap itself defines.  That might not
> > be too hard right now because bpf functions reuse structures from
> > include/uapi/fuse.h, which currently use uint{8,16,32,64}_t.  It'll get
> > harder if that __uintXX_t -> __uXX transition actually happens.
> >
> > But getting back to the famfs bpf stuff, I think doing the interleaved
> > mappings via BPF gives the famfs server a lot more flexibility in terms
> > of what it can do when future hardware arrives with even weirder
> > configurations.
> >
> > --D
> >
> > > didn't change much, I just moved around your famfs code to the bpf
> > > side. The kernel side changes are in [3] and the libfuse changes are
> > > in [4].
> > >
> > > For testing out the prototype, I hooked it up to passthrough_hp to
> > > test running the bpf program and verify that it is able to find the
> > > extent from the bpf map. In my opinion, this makes the fuse side
> > > infrastructure cleaner and more extendable for other servers that will
> > > want to go through dax iomap in the future, but I think this also has
> > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
> > > request after a file is opened, the server can directly populate the
> > > metadata map from userspace with the mapping info when it processes
> > > the FUSE_OPEN request, which gets rid of the roundtrip cost. The
> > > server can dynamically update the metadata at any time from userspace
> > > if the mapping info needs to change in the future. For setting up the
> > > daxdevs, I moved your logic to the init side, where the server passes
> > > the daxdev info upfront through an IOMAP_CONFIG exchange with the
> > > kernel initializing the daxdevs based off that info. I think this will
> > > also make deploying future updates for famfs easier, as updating the
> > > logic won't need to go through the upstream kernel mailing list
> > > process and deploying updates won't require a new kernel release.
> > >
> > > These are just my two cents based on my (cursory) understanding of
> > > famfs. Just wanted to float this alternative approach in case it's
> > > useful.
> > >
> > > Thanks,
> > > Joanne
> > >
> > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
> > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
> > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
> > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/
> > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
> > > >
> > > > >
> > > > > Regards,
> > > > > John
> > > > >
> > >
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-03  4:57                     ` Darrick J. Wong
@ 2026-03-03 17:28                       ` Joanne Koong
  0 siblings, 0 replies; 79+ messages in thread
From: Joanne Koong @ 2026-03-03 17:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Amir Goldstein, Miklos Szeredi,
	f-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Bernd Schubert, Luis Henriques, Horst Birthelmer

On Mon, Mar 2, 2026 at 8:57 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Feb 26, 2026 at 12:21:43PM -0800, Joanne Koong wrote:
> > On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote:
> > > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves <john@groves.net> wrote:
> > > > > >
> > > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves <john@jagalactic.com> wrote:
> > > > > > > > >
> > > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > > > > > >
> > > > > > > > > [ ... ]
> > > > > > > > >
> > > > > > > > > > >  - famfs: export distributed memory
> > > > > > > > > >
> > > > > > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > > > > > >
> > > > > > > > > Um, *yeah*. Although a significant part of that time was on me, because
> > > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > > > > > shoot for the 7.0 merge window.
> > > > > > >
> > > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > > > > > released in 3 days. :/
> > > > > > >
> > > > > > > (Granted most of the maintainers I know are /much/ less conservative
> > > > > > > than I was about the schedule)
> > > > > >
> > > > > > Doh - right you are...
> > > > > >
> > > > > > >
> > > > > > > > I think that the work on famfs is setting an example, and I very much
> > > > > > > > hope it will be a good example, of how improving existing infrastructure
> > > > > > > > (FUSE) is a better contribution than adding another fs to the pile.
> > > > > > >
> > > > > > > Yeah.  Joanne and I spent a couple of days this week coprogramming a
> > > > > > > prototype of a way for famfs to create BPF programs to handle
> > > > > > > INTERLEAVED_EXTENT files.  We might be ready to show that off in a
> > > > > > > couple of weeks, and that might be a way to clear up the
> > > > > > > GET_FMAP/IOMAP_BEGIN logjam at last.
> > > > > >
> > > > > > I'd love to learn more about this; happy to do a call if that's a
> > > > > > good way to get me briefed.
> > > > > >
> > > > > > I [generally but not specifically] understand how this could avoid
> > > > > > GET_FMAP, but not GET_DAXDEV.
> > > > > >
> > > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > > > > > dax_iomap_fault(). The thing is that those call my begin() function
> > > > > > to resolve an offset in a file to an offset on a daxdev, and then
> > > > > > dax completes the fault or memcpy. In that dance, famfs never knows
> > > > > > the kernel address of the memory at all (also true of xfs in fs-dax
> > > > > > mode, unless that's changed fairly recently). I think that's a pretty
> > > > > > decent interface all in all.
> > > > > >
> > > > > > Also: dunno whether y'all have looked at the dax patches in the famfs
> > > > > > series, but the solution to working with Alistair's folio-ification
> > > > > > and cleanup of the dax layer (which set me back months) was to create
> > > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > > > > > drivers/dax/device.c, configures folios & pages compatibly with
> > > > > > fs-dax. So I kinda think I need the dax_iomap* interface.
> > > > > >
> > > > > > As usual, if I'm overlooking something let me know...
> > > > >
> > > > > Hi John,
> > > > >
> > > > > The conversation started [1] on Darrick's containerization patchset
> > > > > about using bpf to a) avoid extra requests / context switching for
> > > > > ->iomap_begin and ->iomap_end calls and b) offload what would
> > > > > otherwise have to be hard-coded kernel logic into userspace, which
> > > > > gives userspace more flexibility / control with updating the logic and
> > > > > is less of a maintenance burden for fuse. There was some musing [2]
> > > > > about whether with bpf infrastructure added, it would allow famfs to
> > > > > move all famfs-specific logic to userspace/bpf.
> > > > >
> > > > > I agree that it makes sense for famfs to go through dax iomap
> > > > > interfaces. imo it seems cleanest if fuse has a generic iomap
> > > > > interface with iomap dax going through that plumbing, and any
> > > > > famfs-specific logic that would be needed beyond that (eg computing
> > > > > the interleaved mappings) being moved to custom famfs bpf programs. I
> > > > > started trying to implement this yesterday afternoon because I wanted
> > > > > to make sure it would actually be doable for the famfs logic before
> > > > > bringing it up and I didn't want to derail your project. So far I only
> > > > > have the general iomap interface for fuse added with dax operations
> > > > > going through dax_iomap* and haven't tried out integrating the famfs
> > > > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> > > > > get to that early next week. The work I did with Darrick this week was
> > > > > on getting a server's bpf programs hooked up to fuse through bpf links
> > > > > and Darrick has fleshed that out and gotten that working now. If it
> > > > > turns out famfs can go through a generic iomap fuse plumbing layer,
> > > > > I'd be curious to hear your thoughts on which approach you'd prefer.
> > > >
> > > > I put together a quick prototype to test this out - this is what it
> > > > looks like with fuse having a generic iomap interface that supports
> > > > dax [1], and the famfs custom logic moved to a bpf program [2]. I
> > >
> > > The bpf maps that you've used to upload per-inode data into the kernel
> > > is a /much/ cleaner method than custom-compiling C into BPF at runtime!
> > > You can statically compile the BPF object code into the fuse server,
> > > which means that (a) you can take advantage of the bpftool skeletons,
> > > and (b) you can in theory vendor-sign the BPF code if and when that
> > > becomes a requirement.
> > >
> > > I think that's way better than having to put vmlinux.h and
> > > fuse_iomap_bpf.h on the deployed system.  Though there's one hitch in
> > > example/Makefile:
> > >
> > > vmlinux.h:
> > >         $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
> > >
> > > The build system isn't necessarily running the same kernel as the deploy
> > > images.  It might be for Meta, but it's not unheard of for our build
> > > system to be running (say) OL10+UEK8 kernel, but the build target is OL8
> > > and UEK7.
> > >
> > > There doesn't seem to be any standardization across distros for where a
> > > vmlinux.h file might be found.  Fedora puts it under
> > > /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
> > > guess SUSE doesn't ship it at all?
> > >
> > > That's going to be a headache for deployment as I've been muttering for
> > > a couple of weeks now. :(
> >
> > I don't think this is an issue because bpf does dynamic btf-based
> > relocations (CO-RE) at load time [1]. On the target machine, when
> > libbpf loads the bpf object it will read the machine's btf and patch
> > any offsets in bytecode and load the fixed-up version into the kernel.
> > All that's needed on the target machine for CO-RE is
> > CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
> > distributions. I think this addresses the deployment headache you've
> > been running into?
>
> Not really -- CO-RE does indeed work quite nicely to smooth over layout
> changes in C structures between a BPF program and the kernel it's being
> loaded into (thanks, whoever came up with that!) but the problem I have
> is how you /get/ those definitions into clang in the first place.
>
> I was under the impression from many of the bpf examples that you're
> supposed to #include a distro-provided "vmlinux.h", but there doesn't
> seem to be a standard way to find that file.  Most -dev packages provide

The

vmlinux.h:
         $(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@

line generates the vmlinux.h file. /sys/kernel/btf/vmlinux is a kernel
sysfs path and isn't distro dependent.

Then CO-RE takes care of the rest with fixing any mismatches between
the vmlinux on the build machine vs. the target machine.

Thanks,
Joanne

> a pkgconfig file that give you the appropriate CFLAGS/LDFLAGS to add,
> but apparently this is not the case for BPF...?
>
> Perhaps it's the case that distro packages that are building BPF
> programs simply add a build dependency on the package providing
> vmlinux.h (e.g. Build-Depends: linux-bpf-dev on Debian) and patch in
> "CFLAGS=-I/some/path" as needed?
>
> I suppose for a dynamically generated and compiled BPF program, one
> could just "bpftool skel" the /sys/kernel/btf files, capture the output,
> and "#include </dev/fd/XXX>" the results.  Honestly that sounds better
> than trusting some weird system package.
>
> But maybe dynamic compilation is a totally stupid idea.  I did grow up
> in the era of mshtml email wreaking havoc, after all...
>
> --D
>
> > Thanks,
> > Joanne
> >
> > [1] https://docs.ebpf.io/concepts/core/
> >
> > >
> > > Maybe we could reduce the fuse-iomap bpf definitions to use only
> > > cardinal types and the types that iomap itself defines.  That might not
> > > be too hard right now because bpf functions reuse structures from
> > > include/uapi/fuse.h, which currently use uint{8,16,32,64}_t.  It'll get
> > > harder if that __uintXX_t -> __uXX transition actually happens.
> > >
> > > But getting back to the famfs bpf stuff, I think doing the interleaved
> > > mappings via BPF gives the famfs server a lot more flexibility in terms
> > > of what it can do when future hardware arrives with even weirder
> > > configurations.
> > >
> > > --D
> > >
> > > > didn't change much, I just moved around your famfs code to the bpf
> > > > side. The kernel side changes are in [3] and the libfuse changes are
> > > > in [4].
> > > >
> > > > For testing out the prototype, I hooked it up to passthrough_hp to
> > > > test running the bpf program and verify that it is able to find the
> > > > extent from the bpf map. In my opinion, this makes the fuse side
> > > > infrastructure cleaner and more extendable for other servers that will
> > > > want to go through dax iomap in the future, but I think this also has
> > > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
> > > > request after a file is opened, the server can directly populate the
> > > > metadata map from userspace with the mapping info when it processes
> > > > the FUSE_OPEN request, which gets rid of the roundtrip cost. The
> > > > server can dynamically update the metadata at any time from userspace
> > > > if the mapping info needs to change in the future. For setting up the
> > > > daxdevs, I moved your logic to the init side, where the server passes
> > > > the daxdev info upfront through an IOMAP_CONFIG exchange with the
> > > > kernel initializing the daxdevs based off that info. I think this will
> > > > also make deploying future updates for famfs easier, as updating the
> > > > logic won't need to go through the upstream kernel mailing list
> > > > process and deploying updates won't require a new kernel release.
> > > >
> > > > These are just my two cents based on my (cursory) understanding of
> > > > famfs. Just wanted to float this alternative approach in case it's
> > > > useful.
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
> > > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
> > > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
> > > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/
> > > >
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> > > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> > > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
> > > > >
> > > > > >
> > > > > > Regards,
> > > > > > John
> > > > > >
> > > >
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-02-21  0:47           ` Darrick J. Wong
@ 2026-03-17  4:17             ` Gao Xiang
  2026-03-18 21:51               ` Darrick J. Wong
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-17  4:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc

Hi Darrick,

On 2026/2/21 08:47, Darrick J. Wong wrote:
> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:

...

> 
>>>
>>> Fuse, otoh, is for all the other weird users -- you found an old
>>> cupboard full of wide scsi disks; or management decided that letting
>>> container customers bring their own prepopulated data partitions(!) is a
>>> good idea; or the default when someone plugs in a device that the system
>>> knows nothing about.

I brainstormed some more thoughts:

End users would like to mount a filesystem, but it's unknown whether
the filesystem is consistent or not, especially for filesystems that
are intended to be mounted "rw"; it's very hard to know if the
filesystem metadata is fully consistent without a full fsck scan
in advance.

Considering the following metadata inconsistent case (note that
block 0x123 is referenced by the inconsistent metadata, rather
than normal filesystem reflink with correct metadata):

  inode A (with high permission)
  extent [0~4k)               maps to block 0x123

  random inode B (with low permission)
  extent [0~4k)               maps to block 0x123 too

So there will exist at least three attack ways:

  1) Normal users will record the sensitive information to inode
     A (since it's not the normal COW, the block 0x123 will be
     updated in place), but normal users don't know there exists
     the malicious inode B, so the sensitive information can be
     fetched via inode B illegally;

  2) Attackers can write inode B with low permission in the proper
     timing to change the inode A to compromise the computer
     system;

  3) Of course, such two inodes can cause double freeing issues.

I think the normal copy-on-write (including OverlayFS) mechanism
doesn't have the issue (because all changes will just have another
copy). Of course, hardlinking won't have the same issue either,
because there is only one inode for all hardlinks.

I don't think FUSE-implemented userspace drivers will resolve
such issues (I think users can only get the following disclaimer:
"that is not a case that we will handle with userspace FUSE
drivers, because the metadata is seriously broken"); the only way
to resolve such attack vectors is to run

the full-scan fsck consistency check and then mount "rw"

or

using an immutable filesystem like EROFS (so that there will not
be such inconsistency issues by design) and isolate the entire write
traffic with a full copy-on-write mechanism, with OverlayFS for
example (IOWs, make all writes copy-on-write into another trusted
local filesystem).

I hope it's a valid case, and it can indeed happen if an arbitrary
generic filesystem can be mounted "rw".  And my immutable image
filesystem idea can help mitigate this too (just because the immutable
image won't be changed in any way, and all writes are always copied up)
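
As a rough illustration of that pattern (all paths here are
hypothetical, and MNT is overridable purely so the commands can be
exercised without privileges):

```shell
# Sketch: mount an immutable EROFS image read-only, then layer a
# trusted writable upper directory over it with OverlayFS so that
# every write becomes a copy-up instead of an in-place modification
# of the untrusted image.
mount_immutable() {
    img="$1" lower="$2" upper="$3" work="$4" merged="$5"
    MNT="${MNT:-mount}"   # override with MNT=echo for a dry run
    $MNT -t erofs -o ro "$img" "$lower" || return 1
    $MNT -t overlay overlay \
        -o "lowerdir=$lower,upperdir=$upper,workdir=$work" "$merged"
}
```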

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-17  4:17             ` Gao Xiang
@ 2026-03-18 21:51               ` Darrick J. Wong
  2026-03-19  8:05                 ` Gao Xiang
  2026-03-22  3:25                 ` Demi Marie Obenour
  0 siblings, 2 replies; 79+ messages in thread
From: Darrick J. Wong @ 2026-03-18 21:51 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc

On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote:
> Hi Darrick,
> 
> On 2026/2/21 08:47, Darrick J. Wong wrote:
> > On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:
> 
> ...
> 
> > 
> > > > 
> > > > Fuse, otoh, is for all the other weird users -- you found an old
> > > > cupboard full of wide scsi disks; or management decided that letting
> > > > container customers bring their own prepopulated data partitions(!) is a
> > > > good idea; or the default when someone plugs in a device that the system
> > > > knows nothing about.
> 
> I brainstormed some more thoughts:
> 
> End users would like to mount a filesystem, but it's unknown that
> the filesystem is consistent or not, especially for filesystems
> are intended to be mounted as "rw", it's very hard to know if the
> filesystem metadata is fully consistent without a full fsck scan
> in advance.
> 
> Considering the following metadata inconsistent case (note that
> block 0x123 is referenced by the inconsistent metadata, rather
> than normal filesystem reflink with correct metadata):
> 
>  inode A (with high permission)
>  extent [0~4k)               maps to block 0x123
> 
>  random inode B (with low permission)
>  extent [0~4k)               maps to block 0x123 too
> 
> So there will exist at least three attack ways:
> 
>  1) Normal users will record the sensitive information to inode
>     A (since it's not the normal COW, the block 0x123 will be
>     updated in place), but normal users don't know there exists
>     the malicious inode B, so the sensitive information can be
>     fetched via inode B illegally;
> 
>  2) Attackers can write inode B with low permission in the proper
>     timing to change the inode A to compromise the computer
>     system;
> 
>  3) Of course, such two inodes can cause double freeing issues.
> 
> I think the normal copy-on-write (including OverlayFS) mechanism
> doesn't have the issue (because all changes will just have another
> copy). Of course, hardlinking won't have the same issue either,
> because there is only one inode for all hardlinks.

Yes, though you can screw with the link counts to cause other mayhem ;)

> I don't think FUSE-implemented userspace drivers will resolve
> such issues (I think users can only get the following usage reclaim:

Filesystem implementations /can/ detect these sorts of problems, but
most of them have no means to do that quickly.  As you and Demi Marie
have noted, the only reasonable way to guard against these things is
pre-mount fsck.

And even then, attackers still have a window to screw with the fs
metadata after fsck exits but before mount(2) takes the block device.
I guess you'd have to inject the fsck run after the O_EXCL opening.

Technically speaking fuse4fs could just invoke e2fsck -fn before it
starts up the rest of the libfuse initialization but who knows if that's
an acceptable risk.  Also unclear if you actually want -fy for that.
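
A minimal sketch of that gating (the fsck command is parameterized
here only so the control flow can be demonstrated; a real fuse4fs
would invoke e2fsck on the O_EXCL-locked device directly):

```shell
# premount_check: run a read-only fsck pass and refuse to continue
# if it reports uncorrected problems.  e2fsck's documented exit
# codes: 0 = clean, 1/2 = errors corrected (can't happen with -n),
# >= 4 = uncorrected errors remain.
premount_check() {
    img="$1"
    fsck="${2:-e2fsck -fn}"
    rc=0
    $fsck "$img" || rc=$?
    if [ "$rc" -ge 4 ]; then
        echo "refusing to serve $img: fsck exit code $rc" >&2
        return 1
    fi
    return 0
}
```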

> "that is not the case that we will handle with userspace FUSE
> drivers, because the metadata is serious broken"), the only way to
> resolve such attack vectors is to run
> 
> the full-scan fsck consistency check and then mount "rw"
> 
> or
> 
> using the immutable filesystem like EROFS (so that there will not
> be such inconsisteny issues by design) and isolate the entire write
> traffic with a full copy-on-write mechanism with OverlayFS for
> example (IOWs, to make all write copy-on-write into another trusted
> local filesystem).

(Yeah, that's probably the only way to go for prepopulated images like
root filesystems and container packages)

> I hope it's a valid case, and that can indeed happen if the arbitary
> generic filesystem can be mounted in "rw".  And my immutable image
> filesystem idea can help mitigate this too (just because the immutable
> image won't be changed in any way, and all writes are always copy-up)

That, we agree on :)

--D

> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-18 21:51               ` Darrick J. Wong
@ 2026-03-19  8:05                 ` Gao Xiang
  2026-03-22  3:25                 ` Demi Marie Obenour
  1 sibling, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-19  8:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc

Hi Darrick,

On 2026/3/19 05:51, Darrick J. Wong wrote:
> On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote:
>> Hi Darrick,
>>
>> On 2026/2/21 08:47, Darrick J. Wong wrote:
>>> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:
>>
>> ...
>>
>>>
>>>>>
>>>>> Fuse, otoh, is for all the other weird users -- you found an old
>>>>> cupboard full of wide scsi disks; or management decided that letting
>>>>> container customers bring their own prepopulated data partitions(!) is a
>>>>> good idea; or the default when someone plugs in a device that the system
>>>>> knows nothing about.
>>
>> I brainstormed some more thoughts:
>>
>> End users would like to mount a filesystem, but it's unknown that
>> the filesystem is consistent or not, especially for filesystems
>> are intended to be mounted as "rw", it's very hard to know if the
>> filesystem metadata is fully consistent without a full fsck scan
>> in advance.
>>
>> Considering the following metadata inconsistent case (note that
>> block 0x123 is referenced by the inconsistent metadata, rather
>> than normal filesystem reflink with correct metadata):
>>
>>   inode A (with high permission)
>>   extent [0~4k)               maps to block 0x123
>>
>>   random inode B (with low permission)
>>   extent [0~4k)               maps to block 0x123 too
>>
>> So there will exist at least three attack ways:
>>
>>   1) Normal users will record the sensitive information to inode
>>      A (since it's not the normal COW, the block 0x123 will be
>>      updated in place), but normal users don't know there exists
>>      the malicious inode B, so the sensitive information can be
>>      fetched via inode B illegally;
>>
>>   2) Attackers can write inode B with low permission in the proper
>>      timing to change the inode A to compromise the computer
>>      system;
>>
>>   3) Of course, such two inodes can cause double freeing issues.
>>
>> I think the normal copy-on-write (including OverlayFS) mechanism
>> doesn't have the issue (because all changes will just have another
>> copy). Of course, hardlinking won't have the same issue either,
>> because there is only one inode for all hardlinks.
> 
> Yes, though you can screw with the link counts to cause other mayhem ;)


Yes, for generic writable filesystems, incorrect nlink values
can also be another potential attack vector.

However, for strictly immutable filesystems, we never actually
leverage nlink for any writable thing except getattr(), which
is used only to display archived stat information in the image
to users.

This is similar to how FUSE getattr simply returns nlink to
userspace, so corrupted nlink values for immutable fses don't
result in anything serious (again, like a read-only FUSE server
returning arbitrary nlink values to userspace).

Since the filesystem is strictly immutable, any write operation
triggers a copy-up (copy-on-write) to another trusted
filesystem via OverlayFS.  I admit that hardlinking is no
longer valid in this context; however, since we are already
in the containerization era, almost all applications work
well with new OverlayFS semantics.


> 
>> I don't think FUSE-implemented userspace drivers will resolve
>> such issues (I think users can only get the following usage reclaim:
> 
> Filesystem implementations /can/ detect these sorts of problems, but
> most of them have no means to do that quickly.  As you and Demi Marie
> have noted, the only reasonable way to guard against these things is
> pre-mount fsck.
> 
> And even then, attackers still have a window to screw with the fs
> metadata after fsck exits but before mount(2) takes the block device.
> I guess you'd have to inject the fsck run after the O_EXCL opening.

Let's not talk about attacks like malicious block devices; the
typical real use case is that the container runtime fetches
a filesystem image from a remote source, and then mounts it.

Considering such a typical scenario, I still think a full fsck
should be run before mounting, especially for "rw"; otherwise FUSE
won't help against serious metadata corruption attacks.

> 
> Technically speaking fuse4fs could just invoke e2fsck -fn before it
> starts up the rest of the libfuse initialization but who knows if that's
> an acceptable risk.  Also unclear if you actually want -fy for that.

But if `e2fsck -fn` is run, and we scan the image and finally
find no metadata inconsistency, why not just mount it in the
kernel then? ;-)

I guess the main purpose of FUSE was to avoid the impact of
serious malicious inconsistency?  I agree that this approach
will almost never crash the kernel, but like I said,
the security risk is still there, and it doesn't need any
malicious block device either: just fetch untrusted writable
filesystems from remote to local, and mount.

Off topic: some of our Alibaba Cloud serverless
businesses are still mounting untrusted rw filesystems from
arbitrary publishers in the kernel without any fsck in advance.
I have tried to persuade them "don't do that" many, many times,
but who knows? :-)

> 
>> "that is not the case that we will handle with userspace FUSE
>> drivers, because the metadata is serious broken"), the only way to
>> resolve such attack vectors is to run
>>
>> the full-scan fsck consistency check and then mount "rw"
>>
>> or
>>
>> using the immutable filesystem like EROFS (so that there will not
>> be such inconsisteny issues by design) and isolate the entire write
>> traffic with a full copy-on-write mechanism with OverlayFS for
>> example (IOWs, to make all write copy-on-write into another trusted
>> local filesystem).
> 
> (Yeah, that's probably the only way to go for prepopulated images like
> root filesystems and container packages)
> 
>> I hope it's a valid case, and that can indeed happen if the arbitary
>> generic filesystem can be mounted in "rw".  And my immutable image
>> filesystem idea can help mitigate this too (just because the immutable
>> image won't be changed in any way, and all writes are always copy-up)
> 
> That, we agree on :)

:)

Thanks,
Gao Xiang

> 
> --D
> 
>> Thanks,
>> Gao Xiang
>>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-18 21:51               ` Darrick J. Wong
  2026-03-19  8:05                 ` Gao Xiang
@ 2026-03-22  3:25                 ` Demi Marie Obenour
  2026-03-22  3:52                   ` Gao Xiang
                                     ` (2 more replies)
  1 sibling, 3 replies; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-22  3:25 UTC (permalink / raw)
  To: Darrick J. Wong, Gao Xiang
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 6015 bytes --]

On 3/18/26 17:51, Darrick J. Wong wrote:
> On Tue, Mar 17, 2026 at 12:17:48PM +0800, Gao Xiang wrote:
>> Hi Darrick,
>>
>> On 2026/2/21 08:47, Darrick J. Wong wrote:
>>> On Fri, Feb 06, 2026 at 02:15:12PM +0800, Gao Xiang wrote:
>>
>> ...
>>
>>>
>>>>>
>>>>> Fuse, otoh, is for all the other weird users -- you found an old
>>>>> cupboard full of wide scsi disks; or management decided that letting
>>>>> container customers bring their own prepopulated data partitions(!) is a
>>>>> good idea; or the default when someone plugs in a device that the system
>>>>> knows nothing about.
>>
>> I brainstormed some more thoughts:
>>
>> End users would like to mount a filesystem, but it's unknown that
>> the filesystem is consistent or not, especially for filesystems
>> are intended to be mounted as "rw", it's very hard to know if the
>> filesystem metadata is fully consistent without a full fsck scan
>> in advance.
>>
>> Considering the following metadata inconsistent case (note that
>> block 0x123 is referenced by the inconsistent metadata, rather
>> than normal filesystem reflink with correct metadata):
>>
>>  inode A (with high permission)
>>  extent [0~4k)               maps to block 0x123
>>
>>  random inode B (with low permission)
>>  extent [0~4k)               maps to block 0x123 too
>>
>> So there will exist at least three attack ways:
>>
>>  1) Normal users will record the sensitive information to inode
>>     A (since it's not the normal COW, the block 0x123 will be
>>     updated in place), but normal users don't know there exists
>>     the malicious inode B, so the sensitive information can be
>>     fetched via inode B illegally;
>>
>>  2) Attackers can write inode B with low permission in the proper
>>     timing to change the inode A to compromise the computer
>>     system;
>>
>>  3) Of course, such two inodes can cause double freeing issues.
>>
>> I think the normal copy-on-write (including OverlayFS) mechanism
>> doesn't have the issue (because all changes will just have another
>> copy). Of course, hardlinking won't have the same issue either,
>> because there is only one inode for all hardlinks.
> 
> Yes, though you can screw with the link counts to cause other mayhem ;)
> 
>> I don't think FUSE-implemented userspace drivers will resolve
>> such issues (I think users can only get the following usage reclaim:
> 
> Filesystem implementations /can/ detect these sorts of problems, but
> most of them have no means to do that quickly.  As you and Demi Marie
> have noted, the only reasonable way to guard against these things is
> pre-mount fsck.
> 
> And even then, attackers still have a window to screw with the fs
> metadata after fsck exits but before mount(2) takes the block device.
> I guess you'd have to inject the fsck run after the O_EXCL opening.
> 
> Technically speaking fuse4fs could just invoke e2fsck -fn before it
> starts up the rest of the libfuse initialization but who knows if that's
> an acceptable risk.  Also unclear if you actually want -fy for that.

To me, the attacks mentioned above are all either user error,
or vulnerabilities in software accessing the filesystem.  If one
doesn't trust a filesystem image, then any data from the filesystem
can't be trusted either.  The only exception is if one can verify
the data cryptographically, which is what fsverity is for.
If the filesystem is mounted r/o and the image doesn't change, one
could guarantee that accessing the filesystem will at least return
deterministic results even for corrupted images.  That's something that
would need to be guaranteed by individual filesystem implementations,
though.

See the end of this email for a long note about what can and cannot
be guaranteed in the face of corrupt or malicious filesystem images.

>> "that is not the case that we will handle with userspace FUSE
>> drivers, because the metadata is serious broken"), the only way to
>> resolve such attack vectors is to run
>>
>> the full-scan fsck consistency check and then mount "rw"
>>
>> or
>>
>> using the immutable filesystem like EROFS (so that there will not
>> be such inconsisteny issues by design) and isolate the entire write
>> traffic with a full copy-on-write mechanism with OverlayFS for
>> example (IOWs, to make all write copy-on-write into another trusted
>> local filesystem).
> 
> (Yeah, that's probably the only way to go for prepopulated images like
> root filesystems and container packages)

Even an immutable filesystem can still be corrupt.

>> I hope it's a valid case, and that can indeed happen if the arbitary
>> generic filesystem can be mounted in "rw".  And my immutable image
>> filesystem idea can help mitigate this too (just because the immutable
>> image won't be changed in any way, and all writes are always copy-up)
> 
> That, we agree on :)

Indeed, expecting writes to a corrupt filesystem to behave reasonably
is very foolish.

Long note starts here: There is no *fundamental* reason that a crafted
filesystem image must be able to cause crashes, memory corruption, etc.
This applies even if the filesystem image may be written to while
mounted.  It is always *possible* to write a filesystem such that
it never trusts anything it reads from disk and assumes each read
could return arbitrarily malicious results.

Right now, many filesystem maintainers do not consider this to be a
priority.  Even if they did, I don't think *anyone* (myself included)
could write a filesystem implementation in C that didn't have memory
corruption flaws.  The only exceptions are if the filesystem is
incredibly simple or formal methods are used, and neither is the
case for existing filesystems in the Linux kernel.  By sandboxing a
filesystem, one ensures that an attacker who compromises a filesystem
implementation needs to find *another* exploit to compromise the
whole system.
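
The "never trust a read" discipline described above can be sketched as a tiny validation helper. This is a hypothetical, purely illustrative function (not from any real filesystem): every value parsed out of on-disk metadata is range-checked before use instead of being believed.

```python
def checked_extent(offset: int, length: int, device_size: int):
    """Validate an extent read from untrusted on-disk metadata before
    using it.  Reject negative values, zero-length extents, and ranges
    that fall outside the device."""
    if offset < 0 or length <= 0:
        raise ValueError("implausible extent field")
    end = offset + length            # Python ints cannot overflow; in C
    if end > device_size:            # this addition itself must be checked
        raise ValueError("extent beyond end of device")
    return (offset, end)

ok = checked_extent(4096, 4096, 1 << 20)            # a sane extent
try:
    # A crafted extent whose end runs past the device: in C, the
    # unchecked addition is exactly where memory-safety bugs creep in.
    checked_extent((1 << 20) - 512, 4096, 1 << 20)
    bad = False
except ValueError:
    bad = True
```

Doing this for *every* field, on every read, is what makes a hardened implementation expensive to write in C, which is the point of the note above.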
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  3:25                 ` Demi Marie Obenour
@ 2026-03-22  3:52                   ` Gao Xiang
  2026-03-22  4:51                   ` Gao Xiang
  2026-03-22  5:14                   ` Gao Xiang
  2 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22  3:52 UTC (permalink / raw)
  To: Demi Marie Obenour, Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc

Hi Demi,

On 2026/3/22 11:25, Demi Marie Obenour wrote:

...

> 
>>> "that is not the case that we will handle with userspace FUSE
>>> drivers, because the metadata is serious broken"), the only way to
>>> resolve such attack vectors is to run
>>>
>>> the full-scan fsck consistency check and then mount "rw"
>>>
>>> or
>>>
>>> using the immutable filesystem like EROFS (so that there will not
>>> be such inconsisteny issues by design) and isolate the entire write
>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>> example (IOWs, to make all write copy-on-write into another trusted
>>> local filesystem).
>>
>> (Yeah, that's probably the only way to go for prepopulated images like
>> root filesystems and container packages)
> 
> Even an immutable filesystem can still be corrupt.

I disagree with you here: I think we need to define what kinds of
corruption are actually harmful to the system.

I can definitely say that if an immutable filesystem is well designed,
it cannot cause any harmful behavior on the system.

To take one example, nlink can still be inconsistent on an immutable
filesystem, but does that have any real impact?

  1) you can write an unprivileged FUSE daemon that returns an
     arbitrary nlink all the time, so getattr results don't really
     matter;

  2) OverlayFS (and some other filesystems I don't recall right now)
     returns nlink = 1 all the time.
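
Point 2 can be sketched as an attribute-sanitizing step, roughly what an unprivileged FUSE daemon could apply before replying to getattr (the function and dict field names here are hypothetical; a real daemon would do this on its fuse_attr reply):

```python
def sanitize_attrs(attrs: dict) -> dict:
    """Return a copy of getattr results with nlink forced to 1, the
    way overlayfs reports it, so an on-disk nlink mismatch in an
    immutable image never reaches the client."""
    out = dict(attrs)
    out["st_nlink"] = 1
    return out

# An immutable image whose metadata claims a bogus link count:
corrupt = {"st_ino": 42, "st_mode": 0o100644, "st_nlink": 9999}
reply = sanitize_attrs(corrupt)
```

The on-disk value is simply never propagated, which is why this class of "corruption" is harmless for a read-only image.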

As long as the mount/user namespaces are fully isolated (of course
you shouldn't mix them with other namespaces), I cannot think of a
practical attack path against users __coming just from a
well-designed immutable filesystem__.

In the EROFS on-disk format, for example, some fields can of course
still be considered corruption, but so what?  That cannot cause the
kind of harmful behavior seen in generic writable filesystems, which
rely heavily on the allocation metadata, nlink, etc. being
absolutely correct; otherwise their write paths are highly
vulnerable.

To put it another way: in many situations you still need to download
archive files (zip, tar, etc.) from the internet, often without any
verification hash.  Sometimes these archives turn out to be randomly
corrupted, but so what?  They may extract to garbage data or garbage
metadata, but if the namespaces are isolated, what is the real
impact on the computer system or its users?

That is all I want to say: if you find any real impact, let's just
write down the actual attack paths.  But those are all the ideas I
have in mind.

Thanks,
Gao Xiang




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  3:25                 ` Demi Marie Obenour
  2026-03-22  3:52                   ` Gao Xiang
@ 2026-03-22  4:51                   ` Gao Xiang
  2026-03-22  5:13                     ` Demi Marie Obenour
  2026-03-23  9:54                     ` [Lsf-pc] " Jan Kara
  2026-03-22  5:14                   ` Gao Xiang
  2 siblings, 2 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22  4:51 UTC (permalink / raw)
  To: Demi Marie Obenour, Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc



On 2026/3/22 11:25, Demi Marie Obenour wrote:

...

>>
>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>> starts up the rest of the libfuse initialization but who knows if that's
>> an acceptable risk.  Also unclear if you actually want -fy for that.
> 

Let me try to reply to the remaining part:

> To me, the attacks mentioned above are all either user error,
> or vulnerabilities in software accessing the filesystem.  If one

There are many consequences if users use a potentially inconsistent
writable filesystem directly (without a full fsck).  Those I can
think of include, but are not limited to:

  - data loss (consider a data block double-free);
  - data theft (for example, a user stores sensitive information in
       a high-permission inode, but it can later be read through a
       low-permission malicious inode);
  - data tampering (by the same principle).

All of the vulnerabilities above happen once users write to the
inconsistent filesystem, which is hard to prevent through on-disk
design.

But if users redirect writes, copy-on-write style, to another
consistent local filesystem, none of the vulnerabilities above exist.
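
The shared-block inconsistency and its copy-on-write mitigation can be sketched with a toy model (all class and inode names here are hypothetical, purely to illustrate the argument; this is not any real filesystem's code):

```python
# Toy model: corrupt metadata lets two inodes map to the same
# physical block, as in the inconsistency example earlier in the
# thread.

class ToyImage:
    """A flat block array plus per-inode extent maps."""
    def __init__(self):
        self.blocks = {0x123: b"old"}
        # Corrupt metadata: inode A (high permission) and inode B
        # (low permission) both claim block 0x123; a full fsck
        # would reject this.
        self.extents = {"A": 0x123, "B": 0x123}

    def write(self, inode, data):      # in-place write, i.e. "rw" mount
        self.blocks[self.extents[inode]] = data

    def read(self, inode):
        return self.blocks[self.extents[inode]]

class CowOverlay:
    """Writes copy up into a separate consistent store (overlayfs-like);
    the lower image is never modified."""
    def __init__(self, lower):
        self.lower, self.upper = lower, {}

    def write(self, inode, data):
        self.upper[inode] = data       # copy-up, lower stays untouched

    def read(self, inode):
        return self.upper.get(inode, self.lower.read(inode))

img = ToyImage()
img.write("A", b"secret")              # privileged write via inode A
leak = img.read("B")                   # data theft: inode B sees it

cow = CowOverlay(ToyImage())
cow.write("A", b"secret")              # goes to the upper store only
safe = cow.read("B")                   # still the original block data
```

With in-place writes the secret leaks through the aliased block; with copy-up the inconsistent lower mapping can at worst return stale data.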

> doesn't trust a filesystem image, then any data from the filesystem
> can't be trusted either.  The only exception is if one can verify

I don't think trust is the core of this whole topic, because the
Linux namespace & cgroup concepts were _invented_ precisely for
untrusted or isolated workloads.

If you don't trust some workload, fine, isolate it in another
namespace: you can never strictly trust anything.

The kernel always has bugs, but is that really the main reason you
would never run untrusted workloads?  I don't think so.

> the data cryptographically, which is what fsverity is for.
> If the filesystem is mounted r/o and the image doesn't change, one
> could guarantee that accessing the filesystem will at least return
> deterministic results even for corrupted images.  That's something that
> would need to be guaranteed by individual filesystem implementations,
> though.

I just want to say that the real problem with generic writable
filesystems is that their on-disk design makes it difficult to
prevent or detect harmful inconsistencies.

First, the on-disk format includes redundant metadata and possibly
even malicious journal metadata (as I mentioned in previous emails).
This makes it hard to determine whether the filesystem is
inconsistent without performing a full disk scan, which takes a
very long time.

Of course, you could mount a severely inconsistent writable
filesystem in read-only (RO) mode.  However, it is still
inconsistent by definition according to its formal on-disk
specification.  Furthermore, the runtime kernel implementation
mixes read-write and read-only logic within the same codebase,
which complicates the practical consequences.

With immutable filesystem designs, almost all of the typical severe
inconsistencies either cannot happen by design or cannot be regarded
as harmful.  I believe the core issue is not trustworthiness; even
with an untrusted workload, you should be able to audit it easily.
However, severely inconsistent writable filesystems make such
auditing much harder.

> 
> See the end of this email for a long note about what can and cannot
> be guaranteed in the face of corrupt or malicious filesystem images.
> 
>>> "that is not the case that we will handle with userspace FUSE
>>> drivers, because the metadata is serious broken"), the only way to
>>> resolve such attack vectors is to run
>>>
>>> the full-scan fsck consistency check and then mount "rw"
>>>
>>> or
>>>
>>> using the immutable filesystem like EROFS (so that there will not
>>> be such inconsisteny issues by design) and isolate the entire write
>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>> example (IOWs, to make all write copy-on-write into another trusted
>>> local filesystem).
>>
>> (Yeah, that's probably the only way to go for prepopulated images like
>> root filesystems and container packages)
> 
> Even an immutable filesystem can still be corrupt.
> 
>>> I hope it's a valid case, and that can indeed happen if the arbitary
>>> generic filesystem can be mounted in "rw".  And my immutable image
>>> filesystem idea can help mitigate this too (just because the immutable
>>> image won't be changed in any way, and all writes are always copy-up)
>>
>> That, we agree on :)
> 
> Indeed, expecting writes to a corrupt filesystem to behave reasonably
> is very foolish.
> 
> Long note starts here: There is no *fundamental* reason that a crafted
> filesystem image must be able to cause crashes, memory corruption, etc.

I still think security risks that are merely implementation bugs
are the easiest part of the whole issue.

Many Linux kernel bugs can cause crashes and memory corruption; why
do crafted filesystems need special consideration?

> This applies even if the filesystem image may be written to while
> mounted.  It is always *possible* to write a filesystem such that
> it never trusts anything it reads from disk and assumes each read
> could return arbitrarily malicious results.

Linux namespaces were invented for exactly this kind of usage.  A
broken archive image may return garbage data, or the image may even
change randomly at runtime, but what is the real impact if it is
isolated by namespaces?

> 
> Right now, many filesystem maintainers do not consider this to be a
> priority.  Even if they did, I don't think *anyone* (myself included)
> could write a filesystem implementation in C that didn't have memory
> corruption flaws.  The only exceptions are if the filesystem is

I think this still falls under implementation bugs.  My question is
simply: why are filesystems special in this area?  There are many
other kernel subsystems written in C that receive untrusted data,
like the TCP/IP stack.  Why are filesystems singled out for memory
corruption flaws?

I really think different aspects are often mixed together when this
topic comes up, which makes the discussion more and more divergent.

If we are talking about implementation bugs, I don't think
filesystems are special.  But as I said, the main issue is the
on-disk format design of writable filesystems; because of that
design, an inconsistent filesystem has many severe consequences.

> incredibly simple or formal methods are used, and neither is the
> case for existing filesystems in the Linux kernel.  By sandboxing a
> filesystem, one ensures that an attacker who compromises a filesystem
> implementation needs to find *another* exploit to compromise the
> whole system.

Yes, yet sandboxing is only one part.  Of course VM sandboxing is
better than Linux namespace isolation, but VMs cost a lot.

Other than sandboxing, I think auditability is important too,
especially when users provide sensitive data to new workloads.

Of course, dealing only with trusted workloads would be best, no
question.  But in the real world we cannot always have completely
trusted workloads.  For untrusted workloads, we need to find
reliable ways to audit them until they become trusted.

Just like in the real world: accumulate credit, undergo
audits, and eventually earn trust.

Sorry about my English, but I hope I express my whole idea.

Thanks,
Gao Xiang




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  4:51                   ` Gao Xiang
@ 2026-03-22  5:13                     ` Demi Marie Obenour
  2026-03-22  5:30                       ` Gao Xiang
  2026-03-23  9:54                     ` [Lsf-pc] " Jan Kara
  1 sibling, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-22  5:13 UTC (permalink / raw)
  To: Gao Xiang, Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 9023 bytes --]

On 3/22/26 00:51, Gao Xiang wrote:
> 
> 
> On 2026/3/22 11:25, Demi Marie Obenour wrote:
> 
> ...
> 
>>>
>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>> starts up the rest of the libfuse initialization but who knows if that's
>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>
> 
> Let me try to reply the remaining part:
> 
>> To me, the attacks mentioned above are all either user error,
>> or vulnerabilities in software accessing the filesystem.  If one
> 
> There are many consequences if users try to use potential inconsistent
> writable filesystems directly (without full fsck), what I can think
> out including but not limited to:
> 
>   - data loss (considering data block double free issue);
>   - data theft (for example, users keep sensitive information in the
>        workload in a high permission inode but it can be read with
>        low permission malicious inode later);
>   - data tamper (the same principle).
> 
> All vulnerabilities above happen after users try to write the
> inconsistent filesystem, which is hard to prevent by on-disk
> design.
> 
> But if users write with copy-on-write to another local consistent
> filesystem, all the vulnerabilities above won't exist.

That makes sense!  Is this because the reads are at least
deterministic?

>> doesn't trust a filesystem image, then any data from the filesystem
>> can't be trusted either.  The only exception is if one can verify
> 
> I don't think trustiness is the core part of this whole topic,
> because Linux namespace & cgroup concepts are totally _invented_
> for untrusted or isolated workloads.
> 
> If you untrust some workload, fine, isolate into another
> namespace: you cannot strictly trust anything.
> 
> The kernel always has bugs, but is that the real main reason
> you never run untrusted workloads? I don't think so.

I always use VMs for untrusted workloads.

>> the data cryptographically, which is what fsverity is for.
>> If the filesystem is mounted r/o and the image doesn't change, one
>> could guarantee that accessing the filesystem will at least return
>> deterministic results even for corrupted images.  That's something that
>> would need to be guaranteed by individual filesystem implementations,
>> though.
> 
> I just want to say that the real problem with generic writable
> filesystems is that their on-disk design makes it difficult to
> prevent or detect harmful inconsistencies.
> 
> First, the on-disk format includes redundant metadata and even
> malicious journal metadata (as I mentioned in previous emails).
> This makes it hard to determine whether the filesystem is
> inconsistent without performing a full disk scan, which takes
> much long time.
> 
> Of course, you could mount severely inconsistent writable
> filesystems in read-only (RO) mode.  However, they are still
> inconsistent by definition according to their formal on-disk
> specifications.  Furthermore, the runtime kernel implementatio
>   mixes read-write and read-only logic within the same
> codebase, which complicates the practical consequences.
> 
> Due to immutable filesystem designs, almost all typical severe
> inconsistencies cannot happen by design or be regard as harmful.
> I believe the core issue is not trustworthiness; even with
> an untrusted workload, you should be able to audit it easily.
> However, severely inconsistent writable filesystems make such
> auditability much harder.

That actually makes a lot of sense.  I had not considered the journal,
which means one must modify the disk image just to mount it.

>> See the end of this email for a long note about what can and cannot
>> be guaranteed in the face of corrupt or malicious filesystem images.
>>
>>>> "that is not the case that we will handle with userspace FUSE
>>>> drivers, because the metadata is serious broken"), the only way to
>>>> resolve such attack vectors is to run
>>>>
>>>> the full-scan fsck consistency check and then mount "rw"
>>>>
>>>> or
>>>>
>>>> using the immutable filesystem like EROFS (so that there will not
>>>> be such inconsisteny issues by design) and isolate the entire write
>>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>>> example (IOWs, to make all write copy-on-write into another trusted
>>>> local filesystem).
>>>
>>> (Yeah, that's probably the only way to go for prepopulated images like
>>> root filesystems and container packages)
>>
>> Even an immutable filesystem can still be corrupt.
>>
>>>> I hope it's a valid case, and that can indeed happen if the arbitary
>>>> generic filesystem can be mounted in "rw".  And my immutable image
>>>> filesystem idea can help mitigate this too (just because the immutable
>>>> image won't be changed in any way, and all writes are always copy-up)
>>>
>>> That, we agree on :)
>>
>> Indeed, expecting writes to a corrupt filesystem to behave reasonably
>> is very foolish.
>>
>> Long note starts here: There is no *fundamental* reason that a crafted
>> filesystem image must be able to cause crashes, memory corruption, etc.
> 
> I still think those kinds of security risks just of implementation
> bugs are the easist part of the whole issue.
> 
> Many linux kernel bugs can cause crashes, memory corruption, why
> crafted filesystems need to be specially considered?

In the past, filesystem implementations have often not focused on
this.  The Linux Kernel CVE team does not issue CVEs for such bugs.

>> This applies even if the filesystem image may be written to while
>> mounted.  It is always *possible* to write a filesystem such that
>> it never trusts anything it reads from disk and assumes each read
>> could return arbitrarily malicious results.
> 
> Linux namespaces are invented for those kind of usage, the broken
> archive images return garbage data or even archive images can be
> changed randomly at runtime, what's the real impacts if they are
> isolated by the namespaces?

None!  Regardless of whether one considers namespaces sufficient
for isolating malicious code, they can definitely isolate filesystem
operations very well.

>> Right now, many filesystem maintainers do not consider this to be a
>> priority.  Even if they did, I don't think *anyone* (myself included)
>> could write a filesystem implementation in C that didn't have memory
>> corruption flaws.  The only exceptions are if the filesystem is
> 
> I think this is still falling into the aspect of implementation
> bugs, my question is simply: "why filesystem is special in this
> kind of area, there are many other kernel subsystems in C which
> can receive untrusted data, like TCP/IP stack", why filesystem
> is special for particular memory corruption flaws?

See above - the difference is that filesystems have historically
not been written with untrusted input in mind.  This, of course,
can be changed.

> I really think different aspects are often mixed when this topic
> is mentioned, which makes the discussion getting more and more
> divergent.

I agree.

> If we talk about implementation bugs, I think filesystem is not
> special, but as I said, I think the main issue is the writable
> filesystem on-disk format design, due to the design, there are
> many severe consequences out of inconsistent filesystems.

It definitely makes things much harder, and dramatically increases
the attack surface.

Most uses I have (notably backups) have a hard requirement for writable
storage, and when they don't need it they can use dm-verity.

>> incredibly simple or formal methods are used, and neither is the
>> case for existing filesystems in the Linux kernel.  By sandboxing a
>> filesystem, one ensures that an attacker who compromises a filesystem
>> implementation needs to find *another* exploit to compromise the
>> whole system.
> 
> Yes, yet sandboxing is the one part, of course VM sandboxing
> is better than Linux namespace isolation, but VMs cost much.

I use a lot of VMs, but they indeed use significant resources.  I hope
that at some point this can largely be solved with copy-on-write
VM forking.

> Other than sandboxing, I think auditability is important too,
> especially users provide sensitive data to new workloads.
> 
> Of course, only dealing with trusted workloads is the best,
> out of question.  But in the real world, we cannot always
> face complete trusted workloads.  For untrusted workloads,
> we need to find reliable ways to audit them until they
> become trusted.
> 
> Just like in the real world: accumulate credit, undergo
> audits, and eventually earn trust.
> 
> Sorry about my English, but I hope I express my whole idea.
> 
> Thanks,
> Gao Xiang

Don't worry about your English.  It is completely understandable and
more than capable of getting your (very informative) points across.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  3:25                 ` Demi Marie Obenour
  2026-03-22  3:52                   ` Gao Xiang
  2026-03-22  4:51                   ` Gao Xiang
@ 2026-03-22  5:14                   ` Gao Xiang
  2026-03-23  9:43                     ` [Lsf-pc] " Jan Kara
  2 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-22  5:14 UTC (permalink / raw)
  To: Demi Marie Obenour, Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc



On 2026/3/22 11:25, Demi Marie Obenour wrote:

> The only exceptions are if the filesystem is incredibly simple
> or formal methods are used, and neither is the case for existing
> filesystems in the Linux kernel.

Again, I don't think "simple" is a helpful or descriptive word in
this area:

"Simple" formats are all formats that just archive the filesystem
data and metadata, with no further use cases.  Nothing is simpler
than that, because you need to report the file (meta)data to the VFS
(even if the file data is garbage); otherwise it wouldn't be called
a filesystem.

So why do we always fall into comparing which archive filesystem is
simpler than the others, unless some of those "simple" filesystems
have bad, buggy designs?

Here I can definitely say the _EROFS uncompressed format_ fits this
area, and I will later write down formally what the outcome is if
any on-disk field holds unexpected values such as garbage numbers.
The final goal is to allow the EROFS uncompressed format to be
mounted as the "root" of totally isolated user/mount namespaces,
since that is really useful and carries no practical risk.

If the maintainers of any other kernel filesystem say they can do
the same, why not allow them to do the same thing?  I don't think
it's a reasonable policy that "because the ext4, XFS, and btrfs
communities say they cannot tolerate the consequences of
inconsistency, every other kernel filesystem must follow the same
policy even if it doesn't have such an issue by design."

In other words, is the TCP/IP protocol simple?  Is there no simpler
protocol for network data?  I don't think so; so why is untrusted
network data allowed to be parsed in the kernel?  Is the kernel's
TCP/IP implementation already bug-free?

Quite confusing.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  5:13                     ` Demi Marie Obenour
@ 2026-03-22  5:30                       ` Gao Xiang
  0 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-22  5:30 UTC (permalink / raw)
  To: Demi Marie Obenour, Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, Joanne Koong, John Groves,
	Bernd Schubert, Amir Goldstein, Luis Henriques, Horst Birthelmer,
	Gao Xiang, lsf-pc



On 2026/3/22 13:13, Demi Marie Obenour wrote:
> On 3/22/26 00:51, Gao Xiang wrote:
>>
>>
>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>
>> ...
>>
>>>>
>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>
>>
>> Let me try to reply the remaining part:
>>
>>> To me, the attacks mentioned above are all either user error,
>>> or vulnerabilities in software accessing the filesystem.  If one
>>
>> There are many consequences if users try to use potential inconsistent
>> writable filesystems directly (without full fsck), what I can think
>> out including but not limited to:
>>
>>    - data loss (considering data block double free issue);
>>    - data theft (for example, users keep sensitive information in the
>>         workload in a high permission inode but it can be read with
>>         low permission malicious inode later);
>>    - data tamper (the same principle).
>>
>> All vulnerabilities above happen after users try to write the
>> inconsistent filesystem, which is hard to prevent by on-disk
>> design.
>>
>> But if users write with copy-on-write to another local consistent
>> filesystem, all the vulnerabilities above won't exist.
> 
> That makes sense!  Is this because the reads are at least
> deterministic?

I read the remaining parts; I think this is the only point that
needs to be clarified.

As I said in my latest reply, I don't think __simple__ is a suitable
descriptive word.

The fact is

A filesystem needs to provide users enough information about files
and the filesystem hierarchy; otherwise it shouldn't be called a
filesystem.

That is the only thing an immutable filesystem does: provide
filesystem information to the VFS, that is all.

Nothing is simpler than that; it is the minimal feature set of a
filesystem (and comparing slight on-disk differences means nothing
unless an on-disk design is truly bad).  And I think that should
answer your question.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  5:14                   ` Gao Xiang
@ 2026-03-23  9:43                     ` Jan Kara
  2026-03-23 10:05                       ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-03-23  9:43 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Sun 22-03-26 13:14:55, Gao Xiang wrote:
> 
> 
> On 2026/3/22 11:25, Demi Marie Obenour wrote:
> 
> > The only exceptions are if the filesystem is incredibly simple
> > or formal methods are used, and neither is the case for existing
> > filesystems in the Linux kernel.
> 
> Again, first, I don't think "simple" is a helpful and descriptive
> word out of this kind of area:
> 
> "simple" formats are all formats just archive the filesystem
> data and metadata, but without any more use cases. No simpler
> than that, because you need to tell vfs the file (meta)data
> (even the file data is the garbage data), otherwise they won't
> be called as filesystems.
> 
> So why we always fall into comparing which archive filesystem
> is simpler than others unless some bad buggy designs in those
> "simple" filesystems.
> 
> Here, I can definitely say _EROFS uncompressed format_ fits
> this kind of area, and I will write down formally later if each
> on-disk field has unexpected values like garbage numbers, what
> the outcome.  And the final goal is to allow EROFS uncompressed
> format can be mounted as the "root" into totally isolated
> user/mount namespaces since it's really useful and no practical
> risk.
> 
> If any other kernel filesystem maintainers say that they can do
> the same , why not also allow them do the same thing? I don't
> think it's a reasonable policy that "due to EXT4, XFS, BtrFS
> communities say that they cannot tolerate the inconsistent
> consequence, any other kernel filesystem should follow the
> same policy even they don't have such issue by design."
> 
> In other words, does TCP/IP protocol simple? and is there no
> simplier protocol for network data? I don't think so, but why
> untrusted network data can be parsed in the kernel?  Does
> TCP/IP kernel implementation already bugless?

So the amount of state TCP/IP needs to keep around is very small (I'd say
kilobytes) compared to the amount of state a filesystem needs to maintain
(gigabytes). This leads to very fundamental differences in the complexity
of data structures, their verification, etc. So yes, it is much easier to
harden TCP/IP against untrusted input than a filesystem implementation.

And yes, when you have an immutable filesystem, things are much simpler
because the data structures and algorithms can be much simpler and as you
wrote a lot of these inconsistencies don't matter (at least for the
kernel). But once you add ability to modify the filesystem - here I don't
think it matters whether through CoW or other means - things get
complicated quickly and it gets much more complex to make your code
resilient to all kinds of inconsistencies...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-22  4:51                   ` Gao Xiang
  2026-03-22  5:13                     ` Demi Marie Obenour
@ 2026-03-23  9:54                     ` Jan Kara
  2026-03-23 10:19                       ` Gao Xiang
  1 sibling, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-03-23  9:54 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Sun 22-03-26 12:51:57, Gao Xiang wrote:
> On 2026/3/22 11:25, Demi Marie Obenour wrote:
> > > Technically speaking fuse4fs could just invoke e2fsck -fn before it
> > > starts up the rest of the libfuse initialization but who knows if that's
> > > an acceptable risk.  Also unclear if you actually want -fy for that.
> > 
> 
> Let me try to reply the remaining part:
> 
> > To me, the attacks mentioned above are all either user error,
> > or vulnerabilities in software accessing the filesystem.  If one
> 
> There are many consequences if users try to use potential inconsistent
> writable filesystems directly (without full fsck), what I can think
> out including but not limited to:
> 
>  - data loss (considering data block double free issue);
>  - data theft (for example, users keep sensitive information in the
>       workload in a high permission inode but it can be read with
>       low permission malicious inode later);
>  - data tamper (the same principle).
> 
> All vulnerabilities above happen after users try to write the
> inconsistent filesystem, which is hard to prevent by on-disk
> design.
> 
> But if users write with copy-on-write to another local consistent
> filesystem, all the vulnerabilities above won't exist.

OK, so if I understand correctly you are advocating that untrusted initial
data should be provided on an immutable filesystem and any needed
modification would be handled by overlayfs (or some similar layer) and
stored on an (initially empty) writeable filesystem.

That's a sensible design for use cases like containers, but what started this
thread about FUSE drivers for filesystems was use cases like access to
filesystems on drives attached to the USB port of your laptop. There it isn't
really practical to use your design. You need a standard writeable
filesystem for that but at the same time you cannot quite trust the content
of everything that gets attached to your USB port...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23  9:43                     ` [Lsf-pc] " Jan Kara
@ 2026-03-23 10:05                       ` Gao Xiang
  2026-03-23 10:14                         ` Jan Kara
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 10:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Jan,

On 2026/3/23 17:43, Jan Kara wrote:
> On Sun 22-03-26 13:14:55, Gao Xiang wrote:
>>
>>
>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>
>>> The only exceptions are if the filesystem is incredibly simple
>>> or formal methods are used, and neither is the case for existing
>>> filesystems in the Linux kernel.
>>
>> Again, first, I don't think "simple" is a helpful and descriptive
>> word out of this kind of area:
>>
>> "simple" formats are all formats just archive the filesystem
>> data and metadata, but without any more use cases. No simpler
>> than that, because you need to tell vfs the file (meta)data
>> (even the file data is the garbage data), otherwise they won't
>> be called as filesystems.
>>
>> So why we always fall into comparing which archive filesystem
>> is simpler than others unless some bad buggy designs in those
>> "simple" filesystems.
>>
>> Here, I can definitely say _EROFS uncompressed format_ fits
>> this kind of area, and I will write down formally later if each
>> on-disk field has unexpected values like garbage numbers, what
>> the outcome.  And the final goal is to allow EROFS uncompressed
>> format can be mounted as the "root" into totally isolated
>> user/mount namespaces since it's really useful and no practical
>> risk.
>>
>> If any other kernel filesystem maintainers say that they can do
>> the same , why not also allow them do the same thing? I don't
>> think it's a reasonable policy that "due to EXT4, XFS, BtrFS
>> communities say that they cannot tolerate the inconsistent
>> consequence, any other kernel filesystem should follow the
>> same policy even they don't have such issue by design."
>>
>> In other words, does TCP/IP protocol simple? and is there no
>> simplier protocol for network data? I don't think so, but why
>> untrusted network data can be parsed in the kernel?  Does
>> TCP/IP kernel implementation already bugless?
> 
> So the amount of state TCP/IP needs to keep around is very small (I'd say
> kilobytes) compared to the amount of state a filesystem needs to maintain
> (gigabytes). This leads to very fundamental differences in the complexity
> of data structures, their verification, etc. So yes, it is much easier to
> harden TCP/IP against untrusted input than a filesystem implementation.

Thanks for the reply.

I just want to say that I think the core EROFS format is not
complex either, but I don't want to make the deadly-simple
comparison among potentially simple filesystems, since
TCP/IP is not the deadly-simple one either.

In brief, mounting as "root" in an isolated user/mount
namespace is absolutely our interest and useful to
container users, and as one of the authors and maintainers
of EROFS, I can ensure EROFS can cope with untrusted
(meta)data.

> 
> And yes, when you have immutable filesystem, things are much simpler
> because the data structures and algorithms can be much simpler and as you
> wrote a lot of these inconsistencies don't matter (at least for the
> kernel). But once you add ability to modify the filesystem - here I don't
> think it matters whether through CoW or other means - things get
> complicated quickly and it gets much more complex to make your code
> resilient to all kinds of inconsistencies...

I only consider the COW approach, using OverlayFS for example:
it just copies up (meta)data into another filesystem (the
semantics is just like copying the file in userspace) and
the immutable filesystem image won't change in any case.

An overlayfs write goes through a normal user write and the
writable filesystem is consistent, so I don't think it
matters.  Or am I missing something? (e.g. could you point
out some case which OverlayFS cannot handle properly?)
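
To make the invariant concrete, here is a tiny userspace sketch (a toy
model in Python, not overlayfs code; the class and file names are made
up for illustration): every write lands in the upper layer, so the
lower image stays byte-for-byte untouched.

```python
# Toy model of overlayfs-style copy-up (not the real overlayfs):
# reads fall through to the read-only lower layer unless the file
# has been copied up; every write goes to the writable upper layer
# only, so the untrusted lower image is never modified.
class CowOverlay:
    def __init__(self, lower):
        self.lower = lower   # immutable image (e.g. an EROFS mount)
        self.upper = {}      # trusted writable fs, initially empty

    def read(self, name):
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def write(self, name, data):
        # copy-up: all modifications land in the upper layer
        self.upper[name] = data

lower = {"/etc/passwd": b"root:x:0:0:root:/root:/bin/sh\n"}
ov = CowOverlay(lower)
ov.write("/etc/passwd", b"tampered\n")
assert ov.read("/etc/passwd") == b"tampered\n"        # merged view
assert lower["/etc/passwd"].startswith(b"root:x:0:0") # image intact
```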

Thanks,
Gao Xiang

> 
> 								Honza


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 10:05                       ` Gao Xiang
@ 2026-03-23 10:14                         ` Jan Kara
  2026-03-23 10:30                           ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-03-23 10:14 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Gao!

On Mon 23-03-26 18:05:17, Gao Xiang wrote:
> On 2026/3/23 17:43, Jan Kara wrote:
> > And yes, when you have immutable filesystem, things are much simpler
> > because the data structures and algorithms can be much simpler and as you
> > wrote a lot of these inconsistencies don't matter (at least for the
> > kernel). But once you add ability to modify the filesystem - here I don't
> > think it matters whether through CoW or other means - things get
> > complicated quickly and it gets much more complex to make your code
> > resilient to all kinds of inconsistencies...
> 
> I only consider the COW approach using OverlayFS for example,
> it just copies up (meta)data into another filesystem (the
> semantics is just like copy the file in the userspace) and
> the immutable filesystem image won't change in any case.
> 
> Overlayfs write goes through normal user write and the
> writable filesystem is consistent, so I don't think it does
> matter.  Or am I missing something? (e.g could you point
> out some case which OverlayFS cannot handle properly?)

No, you are correct. For the use cases where immutable fs + overlayfs +
an empty initial writeable filesystem works, this is a safe design.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23  9:54                     ` [Lsf-pc] " Jan Kara
@ 2026-03-23 10:19                       ` Gao Xiang
  2026-03-23 11:14                         ` Jan Kara
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 10:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Jan,

On 2026/3/23 17:54, Jan Kara wrote:
> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>
>>
>> Let me try to reply the remaining part:
>>
>>> To me, the attacks mentioned above are all either user error,
>>> or vulnerabilities in software accessing the filesystem.  If one
>>
>> There are many consequences if users try to use potential inconsistent
>> writable filesystems directly (without full fsck), what I can think
>> out including but not limited to:
>>
>>   - data loss (considering data block double free issue);
>>   - data theft (for example, users keep sensitive information in the
>>        workload in a high permission inode but it can be read with
>>        low permission malicious inode later);
>>   - data tamper (the same principle).
>>
>> All vulnerabilities above happen after users try to write the
>> inconsistent filesystem, which is hard to prevent by on-disk
>> design.
>>
>> But if users write with copy-on-write to another local consistent
>> filesystem, all the vulnerabilities above won't exist.
> 
> OK, so if I understand correctly you are advocating that untrusted initial data
> should be provided on immutable filesystem and any needed modification
> would be handled by overlayfs (or some similar layer) and stored on
> (initially empty) writeable filesystem.
> 
> That's a sensible design for usecase like containers but what started this
> thread about FUSE drivers for filesystems were usecases like access to
> filesystems on drives attached at USB port of your laptop. There it isn't
> really practical to use your design. You need a standard writeable
> filesystem for that but at the same time you cannot quite trust the content
> of everything that gets attached to your USB port...

Yes, that is my proposal and my overall interest now.  I know
your interest, but here I just would like to say:

Without a full fsck scan, the system is still vulnerable
even if the FUSE approach is used.

Let me give a detailed example:

Suppose there are passwd files `/etc/passwd` and `/etc/shadow` with
proper permissions (for example, you could audit the file
permissions with e2fsprogs/xfsprogs without a full fsck scan) in
an inconsistent remote filesystem, but somewhere else there are
malicious files called "foo" and "bar" with low permissions
that illegally share the same blocks, which the filesystem
on-disk format disallows (because it violates copy-on-write
semantics by design); also see my previous reply:
https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com

The initial data of `/etc/passwd` and `/etc/shadow` in the
filesystem image doesn't matter, but users could later store
very sensitive information in the inconsistent filesystem,
which could cause the "data theft" above.

So I think an inconsistent filesystem harms users no matter
whether the implementation uses FUSE or not.

You could claim it's not a case we care about, but I think most
users should care, and they should run a full fsck in advance; but
if the image is fscked and consistent, I'm afraid it can then
be handled in the kernel directly.
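
To spell the attack out, here is a toy model (hypothetical paths and
mode bits, not real ext4 on-disk structures) of two inodes illegally
claiming the same data block; per-inode checks all look fine, and only
a full cross-inode scan of the block maps (what e2fsck's
multiply-claimed-blocks pass does) reveals the inconsistency:

```python
# Toy model of "multiply-claimed blocks": a crafted image maps a
# 0600 sensitive file and a world-readable file to the same data
# block. Auditing each inode's permissions in isolation finds
# nothing wrong.
blocks = [b""]                                    # shared data block
inodes = {
    "/etc/shadow": {"mode": 0o600, "block": 0},
    "foo":         {"mode": 0o644, "block": 0},   # illegal alias
}

def write(path, data):
    blocks[inodes[path]["block"]] = data

def read(path):
    return blocks[inodes[path]["block"]]

write("/etc/shadow", b"root:$6$secret$hash")      # privileged write
assert read("foo") == b"root:$6$secret$hash"      # data theft via alias

# What a full fsck does: count claims per block across all inodes.
claims = {}
for ino in inodes.values():
    claims[ino["block"]] = claims.get(ino["block"], 0) + 1
assert any(n > 1 for n in claims.values())        # inconsistency found
```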

Thanks,
Gao Xiang


> 
> 								Honza


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 10:14                         ` Jan Kara
@ 2026-03-23 10:30                           ` Gao Xiang
  0 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 10:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc



On 2026/3/23 18:14, Jan Kara wrote:
> Hi Gao!
> 
> On Mon 23-03-26 18:05:17, Gao Xiang wrote:
>> On 2026/3/23 17:43, Jan Kara wrote:
>>> And yes, when you have immutable filesystem, things are much simpler
>>> because the data structures and algorithms can be much simpler and as you
>>> wrote a lot of these inconsistencies don't matter (at least for the
>>> kernel). But once you add ability to modify the filesystem - here I don't
>>> think it matters whether through CoW or other means - things get
>>> complicated quickly and it gets much more complex to make your code
>>> resilient to all kinds of inconsistencies...
>>
>> I only consider the COW approach using OverlayFS for example,
>> it just copies up (meta)data into another filesystem (the
>> semantics is just like copy the file in the userspace) and
>> the immutable filesystem image won't change in any case.
>>
>> Overlayfs write goes through normal user write and the
>> writable filesystem is consistent, so I don't think it does
>> matter.  Or am I missing something? (e.g could you point
>> out some case which OverlayFS cannot handle properly?)
> 
> No, you are correct. For the usecases where immutable fs + overlayfs +
> empty initial writeable filesystem works, this is a safe design.

(Just BTW, this applies not only to empty initial writable
  filesystems but also to previously mounted, consistent trusted
  local filesystems used as upper layers, as those are typical
  cases for containers.)

Thanks,
Gao Xiang

> 
> 								Honza


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 10:19                       ` Gao Xiang
@ 2026-03-23 11:14                         ` Jan Kara
  2026-03-23 11:42                           ` Gao Xiang
  2026-03-23 12:08                           ` Demi Marie Obenour
  0 siblings, 2 replies; 79+ messages in thread
From: Jan Kara @ 2026-03-23 11:14 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Gao!

On Mon 23-03-26 18:19:16, Gao Xiang wrote:
> On 2026/3/23 17:54, Jan Kara wrote:
> > On Sun 22-03-26 12:51:57, Gao Xiang wrote:
> > > On 2026/3/22 11:25, Demi Marie Obenour wrote:
> > > > > Technically speaking fuse4fs could just invoke e2fsck -fn before it
> > > > > starts up the rest of the libfuse initialization but who knows if that's
> > > > > an acceptable risk.  Also unclear if you actually want -fy for that.
> > > > 
> > > 
> > > Let me try to reply the remaining part:
> > > 
> > > > To me, the attacks mentioned above are all either user error,
> > > > or vulnerabilities in software accessing the filesystem.  If one
> > > 
> > > There are many consequences if users try to use potential inconsistent
> > > writable filesystems directly (without full fsck), what I can think
> > > out including but not limited to:
> > > 
> > >   - data loss (considering data block double free issue);
> > >   - data theft (for example, users keep sensitive information in the
> > >        workload in a high permission inode but it can be read with
> > >        low permission malicious inode later);
> > >   - data tamper (the same principle).
> > > 
> > > All vulnerabilities above happen after users try to write the
> > > inconsistent filesystem, which is hard to prevent by on-disk
> > > design.
> > > 
> > > But if users write with copy-on-write to another local consistent
> > > filesystem, all the vulnerabilities above won't exist.
> > 
> > OK, so if I understand correctly you are advocating that untrusted initial data
> > should be provided on immutable filesystem and any needed modification
> > would be handled by overlayfs (or some similar layer) and stored on
> > (initially empty) writeable filesystem.
> > 
> > That's a sensible design for usecase like containers but what started this
> > thread about FUSE drivers for filesystems were usecases like access to
> > filesystems on drives attached at USB port of your laptop. There it isn't
> > really practical to use your design. You need a standard writeable
> > filesystem for that but at the same time you cannot quite trust the content
> > of everything that gets attached to your USB port...
> 
> Yes, that is my proposal and my overall interest now.  I know
> your interest but I'm here I just would like to say:
> 
> Without full scan fsck, even with FUSE, the system is still
> vulnerable if the FUSE approch is used.
> 
> I could give a detailed example, for example:
> 
> There are passwd files `/etc/passwd` and `/etc/shadow` with
> proper permissions (for example, you could audit the file
> permission with e2fsprogs/xfsprogs without a full fsck scan) in
> the inconsistent remote filesystems, but there are some other
> malicious files called "foo" and "bar" somewhere with low
> permissions but sharing the same blocks which is disallowed
> by filesystem on-disk formats illegally (because they violate
> copy-on-write semantics by design), also see my previous
> reply:
> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
> 
> The initial data of `/etc/passwd` and `/etc/shadow` in the
> filesystem image doesn't matter, but users could then keep
> very sensitive information later just out of the
> inconsistent filesystems, which could cause "data theft"
> above.

Yes, I've seen you mentioning this case earlier in this thread. But let me
say I consider it rather contrived :). For the container use case, if you are
fetching say a root fs image and don't trust the content of the image, then
how do you know it doesn't contain malicious code that sends all the
sensitive data to some third party? So I believe the owner of the container
has to trust the content of the image, otherwise you've already lost.

The container environment *provider* doesn't necessarily trust either the
container owner or the image so they need to make sure their infrastructure
isn't compromised by malicious actions from these - and for that either
your immutable image scheme or FUSE mounting works.

Similarly with the USB drive content. Either some malicious actor plugs a USB
drive into a laptop, it gets automounted, and that must not crash the
kernel or give the attacker more privilege - but that's all - no data is
stored on the drive. Or I myself plug some not-so-trusted USB drive into my
laptop to read some content from it or possibly put some data there for a
friend - again that must not compromise my machine, but I'd be really dumb
and have already lost the security game if I put any sensitive data on such a
drive. And for this USB drive case FUSE mounting solves these problems
nicely.

So in my opinion, for practical use cases the FUSE solution addresses the
real security concerns.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 11:14                         ` Jan Kara
@ 2026-03-23 11:42                           ` Gao Xiang
  2026-03-23 12:01                             ` Gao Xiang
  2026-03-23 12:08                           ` Demi Marie Obenour
  1 sibling, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 11:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Jan!

On 2026/3/23 19:14, Jan Kara wrote:
> Hi Gao!
> 
> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>> On 2026/3/23 17:54, Jan Kara wrote:
>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>
>>>>
>>>> Let me try to reply the remaining part:
>>>>
>>>>> To me, the attacks mentioned above are all either user error,
>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>
>>>> There are many consequences if users try to use potential inconsistent
>>>> writable filesystems directly (without full fsck), what I can think
>>>> out including but not limited to:
>>>>
>>>>    - data loss (considering data block double free issue);
>>>>    - data theft (for example, users keep sensitive information in the
>>>>         workload in a high permission inode but it can be read with
>>>>         low permission malicious inode later);
>>>>    - data tamper (the same principle).
>>>>
>>>> All vulnerabilities above happen after users try to write the
>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>> design.
>>>>
>>>> But if users write with copy-on-write to another local consistent
>>>> filesystem, all the vulnerabilities above won't exist.
>>>
>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>> should be provided on immutable filesystem and any needed modification
>>> would be handled by overlayfs (or some similar layer) and stored on
>>> (initially empty) writeable filesystem.
>>>
>>> That's a sensible design for usecase like containers but what started this
>>> thread about FUSE drivers for filesystems were usecases like access to
>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>> really practical to use your design. You need a standard writeable
>>> filesystem for that but at the same time you cannot quite trust the content
>>> of everything that gets attached to your USB port...
>>
>> Yes, that is my proposal and my overall interest now.  I know
>> your interest but I'm here I just would like to say:
>>
>> Without full scan fsck, even with FUSE, the system is still
>> vulnerable if the FUSE approch is used.
>>
>> I could give a detailed example, for example:
>>
>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>> proper permissions (for example, you could audit the file
>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>> the inconsistent remote filesystems, but there are some other
>> malicious files called "foo" and "bar" somewhere with low
>> permissions but sharing the same blocks which is disallowed
>> by filesystem on-disk formats illegally (because they violate
>> copy-on-write semantics by design), also see my previous
>> reply:
>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>
>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>> filesystem image doesn't matter, but users could then keep
>> very sensitive information later just out of the
>> inconsistent filesystems, which could cause "data theft"
>> above.
> 
> Yes, I've seen you mentioning this case earlier in this thread. But let me
> say I consider it rather contrived :). For the container usecase if you are
> fetching say a root fs image and don't trust the content of the image, then
> how do you know it doesn't contain a malicious code that sends all the
> sensitive data to some third party? So I believe the owner of the container
> has to trust the content of the image, otherwise you've already lost.

The fact is that many cloud vendors have malicious content
scanners, much like virus scanners.

They just scan the filesystem tree and all contents, but I
don't think strict filesystem metadata consistency is
something they have cared about so far.  But of course, you
could ask them to fsck too.

Also see below.

> 
> The container environment *provider* doesn't necessarily trust either the
> container owner or the image so they need to make sure their infrastructure
> isn't compromised by malicious actions from these - and for that either
> your immutable image scheme or FUSE mounting works.
> 
> Similarly with the USB drive content. Either some malicious actor plugs USB
> drive into a laptop, it gets automounted, and that must not crash the
> kernel or give attacker more priviledge - but that's all - no data is
> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
> laptop to read some content from it or possibly put there some data for a
> friend - again that must not compromise my machine but I'd be really dumb
> and already lost the security game if I'd put any sensitive data to such
> drive. And for this USB drive case FUSE mounting solves these problems
> nicely.
> 
> So in my opinion for practical usecases the FUSE solution addresses the
> real security concerns.

Sorry, I shouldn't have spoken that way; as you said, the
security concept depends on the context, the limitations,
and one's point of view.

First, I need to rephrase the above a bit in case the
discussion diverges:

> for example, you could audit the file permission with
e2fsprogs/xfsprogs without a full fsck scan.

To be best-effort, userspace programs should open for
write and fstat() the permission bits before writing
sensitive information; that avoids TOCTOU attacks as much
as a userspace program can.
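
As a sketch of that best-effort check (an illustrative helper, not
code from any of the projects discussed; the function name is made
up), the point is to fstat() the already-open fd so the permission
check and the write refer to the same inode:

```python
import os
import stat
import tempfile

def write_secret(path, data):
    # Open first, then fstat() the open fd: checking the path with a
    # separate stat() before open() could race (TOCTOU) with a rename
    # or replacement of the file.
    fd = os.open(path, os.O_WRONLY)
    try:
        st = os.fstat(fd)
        if st.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise PermissionError("file is group/other accessible")
        os.write(fd, data)
    finally:
        os.close(fd)

# Demo on a freshly created private file.
fd, path = tempfile.mkstemp()        # mkstemp creates the file 0600
os.close(fd)
write_secret(path, b"sensitive\n")
```

Of course this only narrows the window from userspace; it cannot
repair the shared-block inconsistency underneath, which is the point.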


Container users use namespaces of course; namespaces can
only provide isolation, and that is the only security
guarantee a namespace can provide, no question about that.

Strictly speaking, as you mentioned, both ways ensure
isolation (if namespaces are used) and kernel stability
(let's not nitpick about this).  And let's not talk about
malicious block devices or the like, because that's not a
typical setup (maybe it could be typical for some cases,
but that would be another system-wide security design) and
should be clarified by system admins, for example.

What I just want to say is this: the FUSE mount approach _might_
imply stronger security guarantees than it really gives.
Beyond avoiding system crashes etc., I think many users will
expect that they can use a generic writable filesystem
directly with FUSE, without a full-scan fsck in advance, and
keep their sensitive data on it; I don't think that is a
corner case unless you state the limitation of the FUSE
approach.

If no one expects that, that is absolutely fine; as I said,
it provides strong isolation and stability. But I really
suspect this approach could be abused to mount totally
untrusted remote filesystems. (Actually, as I said, some
business of ours already did: fetching EXT4 filesystems
of unknown status and mounting them without fscking, which
is really disappointing.)

Thanks,
Gao Xiang


> 
> 								Honza


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 11:42                           ` Gao Xiang
@ 2026-03-23 12:01                             ` Gao Xiang
  2026-03-23 14:13                               ` Jan Kara
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 12:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc



On 2026/3/23 19:42, Gao Xiang wrote:
> Hi Jan!
> 
> On 2026/3/23 19:14, Jan Kara wrote:
>> Hi Gao!
>>
>> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>>> On 2026/3/23 17:54, Jan Kara wrote:
>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>>
>>>>>
>>>>> Let me try to reply the remaining part:
>>>>>
>>>>>> To me, the attacks mentioned above are all either user error,
>>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>>
>>>>> There are many consequences if users try to use potential inconsistent
>>>>> writable filesystems directly (without full fsck), what I can think
>>>>> out including but not limited to:
>>>>>
>>>>>    - data loss (considering data block double free issue);
>>>>>    - data theft (for example, users keep sensitive information in the
>>>>>         workload in a high permission inode but it can be read with
>>>>>         low permission malicious inode later);
>>>>>    - data tamper (the same principle).
>>>>>
>>>>> All vulnerabilities above happen after users try to write the
>>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>>> design.
>>>>>
>>>>> But if users write with copy-on-write to another local consistent
>>>>> filesystem, all the vulnerabilities above won't exist.
>>>>
>>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>>> should be provided on immutable filesystem and any needed modification
>>>> would be handled by overlayfs (or some similar layer) and stored on
>>>> (initially empty) writeable filesystem.
>>>>
>>>> That's a sensible design for usecase like containers but what started this
>>>> thread about FUSE drivers for filesystems were usecases like access to
>>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>>> really practical to use your design. You need a standard writeable
>>>> filesystem for that but at the same time you cannot quite trust the content
>>>> of everything that gets attached to your USB port...
>>>
>>> Yes, that is my proposal and my overall interest now.  I know
>>> your interest but I'm here I just would like to say:
>>>
>>> Without full scan fsck, even with FUSE, the system is still
>>> vulnerable if the FUSE approch is used.
>>>
>>> I could give a detailed example, for example:
>>>
>>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>>> proper permissions (for example, you could audit the file
>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>>> the inconsistent remote filesystems, but there are some other
>>> malicious files called "foo" and "bar" somewhere with low
>>> permissions but sharing the same blocks which is disallowed
>>> by filesystem on-disk formats illegally (because they violate
>>> copy-on-write semantics by design), also see my previous
>>> reply:
>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>>
>>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>>> filesystem image doesn't matter, but users could then keep
>>> very sensitive information later just out of the
>>> inconsistent filesystems, which could cause "data theft"
>>> above.
>>
>> Yes, I've seen you mentioning this case earlier in this thread. But let me
>> say I consider it rather contrived :). For the container usecase if you are
>> fetching say a root fs image and don't trust the content of the image, then
>> how do you know it doesn't contain a malicious code that sends all the
>> sensitive data to some third party? So I believe the owner of the container
>> has to trust the content of the image, otherwise you've already lost.
> 
> The fact is that many cloud vendors have malicious content
> scanners, much like virus scanners.
> 
> They just scan the filesystem tree and all contents, but I
> think the severe filesystem metadata consistency is not
> what they previously care about.  But of course, you could
> ask them to fsck too.
> 
> Also see below.
> 
>>
>> The container environment *provider* doesn't necessarily trust either the
>> container owner or the image so they need to make sure their infrastructure
>> isn't compromised by malicious actions from these - and for that either
>> your immutable image scheme or FUSE mounting works.
>>
>> Similarly with the USB drive content. Either some malicious actor plugs USB
>> drive into a laptop, it gets automounted, and that must not crash the
>> kernel or give attacker more priviledge - but that's all - no data is
>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
>> laptop to read some content from it or possibly put there some data for a
>> friend - again that must not compromise my machine but I'd be really dumb
>> and already lost the security game if I'd put any sensitive data to such
>> drive. And for this USB drive case FUSE mounting solves these problems
>> nicely.
>>
>> So in my opinion for practical usecases the FUSE solution addresses the
>> real security concerns.
> 
> Sorry, I shouldn't speak in that way, as you said, the
> security concepts depends on the context, limitation
> and how do you think.
> 
> First, I need to rephrase a bit above in case there
> could be some divergent discussion:
> 
>> for example, you could audit the file permission with
> e2fsprogs/xfsprogs without a full fsck scan.
> 
> In order to make the userspace programs best-effort, they
> should open for write and fstat the permission bits
> before writing sensitive informations, it avoids TOCTOU
> attacks as much as possible as userspace programs.
> 
> 
> Container users use namespaces of course, namespace can
> only provide isolations, that is the only security
> guarantees namespace can provide, no question of that.
> 
> Let's just strictly speaking, as you mentioned, both ways
> ensure the isolation (if namespaces are used) and kernel
> stability (let's not nitpick about this).  And let's not
> talk about malicious block devices or likewise, because
> it's not a typical setup (maybe it could be a typical
> setup for some cases, but it should be another system-wide
> security design) and should be clarified by system admins
> for example.
> 
> What I just want to say is that: FUSE mount approach _might_
> give more incorrect security guarantees than the real users
> expect: I think other than avoiding system crashes etc, many
> users should expect that they could use the generic writable
> filesystem directly with FUSE without full-scan fsck
> in advance and keep their sensitive data directly, I don't


If you still think those are corner cases that users only expect
incorrectly: for example, I think double-free issues can make any
useful written data vanish simply because the filesystem is
inconsistent -- and that may be totally unrelated to security.

What I want to say is that the users' interest in the new FUSE
approach is "no full fsck"; otherwise, if a full fsck is required,
why wouldn't they just mount in the kernel (I do think kernel
filesystems should fix all bugs arising from normal, consistent
usage)?

However, "no fsck" plus FUSE mounts brings incorrect assumptions
that users should never rely on: the filesystem is still unreliable
and may not keep any useful data in that storage.

Hopefully that explains my idea.


> think that is the corner cases if you don't claim the
> limitation of FUSE approaches.
> 
> If none expects that, that is absolute be fine, as I said,
> it provides strong isolation and stability, but I really
> suspect this approach could be abused to mount totally
> untrusted remote filesystems (Actually as I said, some
> business of ours already did: fetching EXT4 filesystems
> with unknown status and mount without fscking, that is
> really disappointing.)
> 
> Thanks,
> Gao Xiang
> 
> 
>>
>>                                 Honza
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 11:14                         ` Jan Kara
  2026-03-23 11:42                           ` Gao Xiang
@ 2026-03-23 12:08                           ` Demi Marie Obenour
  2026-03-23 12:13                             ` Gao Xiang
  1 sibling, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-23 12:08 UTC (permalink / raw)
  To: Jan Kara, Gao Xiang
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 3/23/26 07:14, Jan Kara wrote:
> Hi Gao!
> 
> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>> On 2026/3/23 17:54, Jan Kara wrote:
>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>
>>>>
>>>> Let me try to reply the remaining part:
>>>>
>>>>> To me, the attacks mentioned above are all either user error,
>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>
>>>> There are many consequences if users try to use potential inconsistent
>>>> writable filesystems directly (without full fsck), what I can think
>>>> out including but not limited to:
>>>>
>>>>   - data loss (considering data block double free issue);
>>>>   - data theft (for example, users keep sensitive information in the
>>>>        workload in a high permission inode but it can be read with
>>>>        low permission malicious inode later);
>>>>   - data tamper (the same principle).
>>>>
>>>> All vulnerabilities above happen after users try to write the
>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>> design.
>>>>
>>>> But if users write with copy-on-write to another local consistent
>>>> filesystem, all the vulnerabilities above won't exist.
>>>
>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>> should be provided on immutable filesystem and any needed modification
>>> would be handled by overlayfs (or some similar layer) and stored on
>>> (initially empty) writeable filesystem.
>>>
>>> That's a sensible design for usecase like containers but what started this
>>> thread about FUSE drivers for filesystems were usecases like access to
>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>> really practical to use your design. You need a standard writeable
>>> filesystem for that but at the same time you cannot quite trust the content
>>> of everything that gets attached to your USB port...
>>
>> Yes, that is my proposal and my overall interest now.  I know
>> your interest but I'm here I just would like to say:
>>
>> Without full scan fsck, even with FUSE, the system is still
>> vulnerable if the FUSE approch is used.
>>
>> I could give a detailed example, for example:
>>
>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>> proper permissions (for example, you could audit the file
>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>> the inconsistent remote filesystems, but there are some other
>> malicious files called "foo" and "bar" somewhere with low
>> permissions but sharing the same blocks which is disallowed
>> by filesystem on-disk formats illegally (because they violate
>> copy-on-write semantics by design), also see my previous
>> reply:
>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>
>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>> filesystem image doesn't matter, but users could then keep
>> very sensitive information later just out of the
>> inconsistent filesystems, which could cause "data theft"
>> above.
> 
> Yes, I've seen you mentioning this case earlier in this thread. But let me
> say I consider it rather contrived :). For the container usecase if you are
> fetching say a root fs image and don't trust the content of the image, then
> how do you know it doesn't contain a malicious code that sends all the
> sensitive data to some third party? So I believe the owner of the container
> has to trust the content of the image, otherwise you've already lost.
> 
> The container environment *provider* doesn't necessarily trust either the
> container owner or the image so they need to make sure their infrastructure
> isn't compromised by malicious actions from these - and for that either
> your immutable image scheme or FUSE mounting works.
> 
> Similarly with the USB drive content. Either some malicious actor plugs USB
> drive into a laptop, it gets automounted, and that must not crash the
> kernel or give attacker more priviledge - but that's all - no data is
> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
> laptop to read some content from it or possibly put there some data for a
> friend - again that must not compromise my machine but I'd be really dumb
> and already lost the security game if I'd put any sensitive data to such
> drive. And for this USB drive case FUSE mounting solves these problems
> nicely.
> 
> So in my opinion for practical usecases the FUSE solution addresses the
> real security concerns.
> 
> 								Honza

I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot
mess with things like my home directory.  Personally, I would run
the FUSE filesystem in a VM but that's a separate concern.

There are also (very severe) concerns about USB devices *specifically*.
These are off-topic for this discussion, though.

Of course, the FUSE filesystem must be mounted with nosuid, nodev,
and nosymfollow.  Otherwise there are lots of attacks possible.

Finally, it is very much possible to use storage that one does not have
complete trust in, provided that one uses cryptography to ensure that
the damage it can do is limited.  Many backup systems work this way.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 12:08                           ` Demi Marie Obenour
@ 2026-03-23 12:13                             ` Gao Xiang
  2026-03-23 12:19                               ` Demi Marie Obenour
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 12:13 UTC (permalink / raw)
  To: Demi Marie Obenour, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/23 20:08, Demi Marie Obenour wrote:
> On 3/23/26 07:14, Jan Kara wrote:
>> Hi Gao!
>>
>> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>>> On 2026/3/23 17:54, Jan Kara wrote:
>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>>
>>>>>
>>>>> Let me try to reply the remaining part:
>>>>>
>>>>>> To me, the attacks mentioned above are all either user error,
>>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>>
>>>>> There are many consequences if users try to use potential inconsistent
>>>>> writable filesystems directly (without full fsck), what I can think
>>>>> out including but not limited to:
>>>>>
>>>>>    - data loss (considering data block double free issue);
>>>>>    - data theft (for example, users keep sensitive information in the
>>>>>         workload in a high permission inode but it can be read with
>>>>>         low permission malicious inode later);
>>>>>    - data tamper (the same principle).
>>>>>
>>>>> All vulnerabilities above happen after users try to write the
>>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>>> design.
>>>>>
>>>>> But if users write with copy-on-write to another local consistent
>>>>> filesystem, all the vulnerabilities above won't exist.
>>>>
>>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>>> should be provided on immutable filesystem and any needed modification
>>>> would be handled by overlayfs (or some similar layer) and stored on
>>>> (initially empty) writeable filesystem.
>>>>
>>>> That's a sensible design for usecase like containers but what started this
>>>> thread about FUSE drivers for filesystems were usecases like access to
>>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>>> really practical to use your design. You need a standard writeable
>>>> filesystem for that but at the same time you cannot quite trust the content
>>>> of everything that gets attached to your USB port...
>>>
>>> Yes, that is my proposal and my overall interest now.  I know
>>> your interest but I'm here I just would like to say:
>>>
>>> Without full scan fsck, even with FUSE, the system is still
>>> vulnerable if the FUSE approch is used.
>>>
>>> I could give a detailed example, for example:
>>>
>>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>>> proper permissions (for example, you could audit the file
>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>>> the inconsistent remote filesystems, but there are some other
>>> malicious files called "foo" and "bar" somewhere with low
>>> permissions but sharing the same blocks which is disallowed
>>> by filesystem on-disk formats illegally (because they violate
>>> copy-on-write semantics by design), also see my previous
>>> reply:
>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>>
>>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>>> filesystem image doesn't matter, but users could then keep
>>> very sensitive information later just out of the
>>> inconsistent filesystems, which could cause "data theft"
>>> above.
>>
>> Yes, I've seen you mentioning this case earlier in this thread. But let me
>> say I consider it rather contrived :). For the container usecase if you are
>> fetching say a root fs image and don't trust the content of the image, then
>> how do you know it doesn't contain a malicious code that sends all the
>> sensitive data to some third party? So I believe the owner of the container
>> has to trust the content of the image, otherwise you've already lost.
>>
>> The container environment *provider* doesn't necessarily trust either the
>> container owner or the image so they need to make sure their infrastructure
>> isn't compromised by malicious actions from these - and for that either
>> your immutable image scheme or FUSE mounting works.
>>
>> Similarly with the USB drive content. Either some malicious actor plugs USB
>> drive into a laptop, it gets automounted, and that must not crash the
>> kernel or give attacker more priviledge - but that's all - no data is
>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
>> laptop to read some content from it or possibly put there some data for a
>> friend - again that must not compromise my machine but I'd be really dumb
>> and already lost the security game if I'd put any sensitive data to such
>> drive. And for this USB drive case FUSE mounting solves these problems
>> nicely.
>>
>> So in my opinion for practical usecases the FUSE solution addresses the
>> real security concerns.
>>
>> 								Honza
> 
> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot
> mess with things like my home directory.  Personally, I would run
> the FUSE filesystem in a VM but that's a separate concern.
> 
> There are also (very severe) concerns about USB devices *specifically*.
> These are off-topic for this discussion, though.
> 
> Of course, the FUSE filesystem must be mounted with nosuid, nodev,
> and nosymfollow.  Otherwise there are lots of attacks possible.
> 
> Finally, it is very much possible to use storage that one does not have
> complete trust in, provided that one uses cryptography to ensure that
> the damage it can do is limited.  Many backup systems work this way.

In brief, as I said, that is _not_ always a security concern:

  - If you don't fsck and just FUSE-mount it, the data you write to
    that filesystem could be lost if the writable filesystem is
    inconsistent;

  - But if you do fsck in advance and the filesystem is consistent,
    the kernel implementation should make sure to fix all bugs that
    occur on consistent filesystems.

So what's the point of "no fsck" here if you cannot reliably write
anything to it with the FUSE approach?

Thanks,
Gao Xiang



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 12:13                             ` Gao Xiang
@ 2026-03-23 12:19                               ` Demi Marie Obenour
  2026-03-23 12:30                                 ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-23 12:19 UTC (permalink / raw)
  To: Gao Xiang, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 3/23/26 08:13, Gao Xiang wrote:
> 
> 
> On 2026/3/23 20:08, Demi Marie Obenour wrote:
>> On 3/23/26 07:14, Jan Kara wrote:
>>> Hi Gao!
>>>
>>> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>>>> On 2026/3/23 17:54, Jan Kara wrote:
>>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>>>
>>>>>>
>>>>>> Let me try to reply the remaining part:
>>>>>>
>>>>>>> To me, the attacks mentioned above are all either user error,
>>>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>>>
>>>>>> There are many consequences if users try to use potential inconsistent
>>>>>> writable filesystems directly (without full fsck), what I can think
>>>>>> out including but not limited to:
>>>>>>
>>>>>>    - data loss (considering data block double free issue);
>>>>>>    - data theft (for example, users keep sensitive information in the
>>>>>>         workload in a high permission inode but it can be read with
>>>>>>         low permission malicious inode later);
>>>>>>    - data tamper (the same principle).
>>>>>>
>>>>>> All vulnerabilities above happen after users try to write the
>>>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>>>> design.
>>>>>>
>>>>>> But if users write with copy-on-write to another local consistent
>>>>>> filesystem, all the vulnerabilities above won't exist.
>>>>>
>>>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>>>> should be provided on immutable filesystem and any needed modification
>>>>> would be handled by overlayfs (or some similar layer) and stored on
>>>>> (initially empty) writeable filesystem.
>>>>>
>>>>> That's a sensible design for usecase like containers but what started this
>>>>> thread about FUSE drivers for filesystems were usecases like access to
>>>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>>>> really practical to use your design. You need a standard writeable
>>>>> filesystem for that but at the same time you cannot quite trust the content
>>>>> of everything that gets attached to your USB port...
>>>>
>>>> Yes, that is my proposal and my overall interest now.  I know
>>>> your interest but I'm here I just would like to say:
>>>>
>>>> Without full scan fsck, even with FUSE, the system is still
>>>> vulnerable if the FUSE approch is used.
>>>>
>>>> I could give a detailed example, for example:
>>>>
>>>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>>>> proper permissions (for example, you could audit the file
>>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>>>> the inconsistent remote filesystems, but there are some other
>>>> malicious files called "foo" and "bar" somewhere with low
>>>> permissions but sharing the same blocks which is disallowed
>>>> by filesystem on-disk formats illegally (because they violate
>>>> copy-on-write semantics by design), also see my previous
>>>> reply:
>>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>>>
>>>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>>>> filesystem image doesn't matter, but users could then keep
>>>> very sensitive information later just out of the
>>>> inconsistent filesystems, which could cause "data theft"
>>>> above.
>>>
>>> Yes, I've seen you mentioning this case earlier in this thread. But let me
>>> say I consider it rather contrived :). For the container usecase if you are
>>> fetching say a root fs image and don't trust the content of the image, then
>>> how do you know it doesn't contain a malicious code that sends all the
>>> sensitive data to some third party? So I believe the owner of the container
>>> has to trust the content of the image, otherwise you've already lost.
>>>
>>> The container environment *provider* doesn't necessarily trust either the
>>> container owner or the image so they need to make sure their infrastructure
>>> isn't compromised by malicious actions from these - and for that either
>>> your immutable image scheme or FUSE mounting works.
>>>
>>> Similarly with the USB drive content. Either some malicious actor plugs USB
>>> drive into a laptop, it gets automounted, and that must not crash the
>>> kernel or give attacker more priviledge - but that's all - no data is
>>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
>>> laptop to read some content from it or possibly put there some data for a
>>> friend - again that must not compromise my machine but I'd be really dumb
>>> and already lost the security game if I'd put any sensitive data to such
>>> drive. And for this USB drive case FUSE mounting solves these problems
>>> nicely.
>>>
>>> So in my opinion for practical usecases the FUSE solution addresses the
>>> real security concerns.
>>>
>>> 								Honza
>>
>> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot
>> mess with things like my home directory.  Personally, I would run
>> the FUSE filesystem in a VM but that's a separate concern.
>>
>> There are also (very severe) concerns about USB devices *specifically*.
>> These are off-topic for this discussion, though.
>>
>> Of course, the FUSE filesystem must be mounted with nosuid, nodev,
>> and nosymfollow.  Otherwise there are lots of attacks possible.
>>
>> Finally, it is very much possible to use storage that one does not have
>> complete trust in, provided that one uses cryptography to ensure that
>> the damage it can do is limited.  Many backup systems work this way.
> 
> In brief, as I said, that is _not_ always a security concern:
> 
>   - If you don't fsck, and FUSE mount it, your write data to that
>     filesystem could be lost if the writable filesystem is
>     inconsistent;

In the applications I am thinking of, one _hopes_ that the filesystem
is consistent, which it almost always will be.  However, one wants
to be safe in the unlikely case of it being inconsistent.

>   - But if you fsck in advance and the filesystem, the kernel
>     implementation should make sure they should fix all bugs of
>     consistent filesystems.
> 
> So what's the meaning of "no fsck" here if you cannot write
> anything in it with FUSE approaches.

FUSE can (and usually does) have write support.  Also, fsck does not
protect against TOCTOU attacks.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 12:19                               ` Demi Marie Obenour
@ 2026-03-23 12:30                                 ` Gao Xiang
  2026-03-23 12:33                                   ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 12:30 UTC (permalink / raw)
  To: Demi Marie Obenour, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/23 20:19, Demi Marie Obenour wrote:
> On 3/23/26 08:13, Gao Xiang wrote:
>>
>>
>> On 2026/3/23 20:08, Demi Marie Obenour wrote:
>>> On 3/23/26 07:14, Jan Kara wrote:
>>>> Hi Gao!
>>>>
>>>> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>>>>> On 2026/3/23 17:54, Jan Kara wrote:
>>>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>>>>
>>>>>>>
>>>>>>> Let me try to reply the remaining part:
>>>>>>>
>>>>>>>> To me, the attacks mentioned above are all either user error,
>>>>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>>>>
>>>>>>> There are many consequences if users try to use potential inconsistent
>>>>>>> writable filesystems directly (without full fsck), what I can think
>>>>>>> out including but not limited to:
>>>>>>>
>>>>>>>     - data loss (considering data block double free issue);
>>>>>>>     - data theft (for example, users keep sensitive information in the
>>>>>>>          workload in a high permission inode but it can be read with
>>>>>>>          low permission malicious inode later);
>>>>>>>     - data tamper (the same principle).
>>>>>>>
>>>>>>> All vulnerabilities above happen after users try to write the
>>>>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>>>>> design.
>>>>>>>
>>>>>>> But if users write with copy-on-write to another local consistent
>>>>>>> filesystem, all the vulnerabilities above won't exist.
>>>>>>
>>>>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>>>>> should be provided on immutable filesystem and any needed modification
>>>>>> would be handled by overlayfs (or some similar layer) and stored on
>>>>>> (initially empty) writeable filesystem.
>>>>>>
>>>>>> That's a sensible design for usecase like containers but what started this
>>>>>> thread about FUSE drivers for filesystems were usecases like access to
>>>>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>>>>> really practical to use your design. You need a standard writeable
>>>>>> filesystem for that but at the same time you cannot quite trust the content
>>>>>> of everything that gets attached to your USB port...
>>>>>
>>>>> Yes, that is my proposal and my overall interest now.  I know
>>>>> your interest but I'm here I just would like to say:
>>>>>
>>>>> Without full scan fsck, even with FUSE, the system is still
>>>>> vulnerable if the FUSE approch is used.
>>>>>
>>>>> I could give a detailed example, for example:
>>>>>
>>>>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>>>>> proper permissions (for example, you could audit the file
>>>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>>>>> the inconsistent remote filesystems, but there are some other
>>>>> malicious files called "foo" and "bar" somewhere with low
>>>>> permissions but sharing the same blocks which is disallowed
>>>>> by filesystem on-disk formats illegally (because they violate
>>>>> copy-on-write semantics by design), also see my previous
>>>>> reply:
>>>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>>>>
>>>>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>>>>> filesystem image doesn't matter, but users could then keep
>>>>> very sensitive information later just out of the
>>>>> inconsistent filesystems, which could cause "data theft"
>>>>> above.
>>>>
>>>> Yes, I've seen you mentioning this case earlier in this thread. But let me
>>>> say I consider it rather contrived :). For the container usecase if you are
>>>> fetching say a root fs image and don't trust the content of the image, then
>>>> how do you know it doesn't contain a malicious code that sends all the
>>>> sensitive data to some third party? So I believe the owner of the container
>>>> has to trust the content of the image, otherwise you've already lost.
>>>>
>>>> The container environment *provider* doesn't necessarily trust either the
>>>> container owner or the image so they need to make sure their infrastructure
>>>> isn't compromised by malicious actions from these - and for that either
>>>> your immutable image scheme or FUSE mounting works.
>>>>
>>>> Similarly with the USB drive content. Either some malicious actor plugs USB
>>>> drive into a laptop, it gets automounted, and that must not crash the
>>>> kernel or give attacker more priviledge - but that's all - no data is
>>>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
>>>> laptop to read some content from it or possibly put there some data for a
>>>> friend - again that must not compromise my machine but I'd be really dumb
>>>> and already lost the security game if I'd put any sensitive data to such
>>>> drive. And for this USB drive case FUSE mounting solves these problems
>>>> nicely.
>>>>
>>>> So in my opinion for practical usecases the FUSE solution addresses the
>>>> real security concerns.
>>>>
>>>> 								Honza
>>>
>>> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot
>>> mess with things like my home directory.  Personally, I would run
>>> the FUSE filesystem in a VM but that's a separate concern.
>>>
>>> There are also (very severe) concerns about USB devices *specifically*.
>>> These are off-topic for this discussion, though.
>>>
>>> Of course, the FUSE filesystem must be mounted with nosuid, nodev,
>>> and nosymfollow.  Otherwise there are lots of attacks possible.
>>>
>>> Finally, it is very much possible to use storage that one does not have
>>> complete trust in, provided that one uses cryptography to ensure that
>>> the damage it can do is limited.  Many backup systems work this way.
>>
>> In brief, as I said, that is _not_ always a security concern:
>>
>>    - If you don't fsck, and FUSE mount it, your write data to that
>>      filesystem could be lost if the writable filesystem is
>>      inconsistent;
> 
> In the applications I am thinking of, one _hopes_ that the filesystem
> is consistent, which it almost always will be.  However, one wants
> to be safe in the unlikely case of it being inconsistent.

I don't think so: a USB stick can be corrupted too, and data
coming over the network can arrive corrupted as well; there are
too many practical problems here.

> 
>>    - But if you fsck in advance and the filesystem is consistent,
>>      the kernel implementation should be expected to fix all bugs
>>      affecting consistent filesystems.
>>
>> So what's the meaning of "no fsck" here if you cannot safely write
>> anything to it with FUSE approaches?
> 
> FUSE can (and usually does) have write support.  Also, fsck does not
> protect against TOCTOU attacks.

If you consider TOCTOU attacks, how can a FUSE filesystem
protect against them either, when the on-disk data can change
underneath it at any time?

Sigh, I think the whole story is that:

  - The kernel writable filesystem should fix all bugs if the
    filesystem is consistent, and this condition should be
    ensured by fsck in advance;

  - So, alternative approaches like FUSE are meaningful
    _only if_ we cannot do "fsck" (leaving aside atypical
    TOCTOU cases).

  - but without "fsck", the filesystem can be inconsistent
    by chance or by attack, so the written data can be lost.

Thanks,
Gao Xiang
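The "fsck in advance, then mount" condition argued above can be sketched as a small pre-mount gate. This is a hypothetical helper (the function name and flow are illustrative, not from any real tool); the exit codes follow e2fsck(8), where 0 means clean and nonzero with `-n` means problems were found, and `-n` guarantees the check never modifies the device.

```python
import subprocess

# Hypothetical pre-mount helper: run a read-only fsck pass and only
# proceed to mount if the filesystem checks out clean.
#
# e2fsck(8) exit codes: 0 = no errors, 4 = errors left uncorrected,
# 8 = operational error.  With -n the device is never modified.

def device_is_clean(device, fsck_cmd=("e2fsck", "-f", "-n")):
    """Return True iff the read-only check reports no errors."""
    result = subprocess.run([*fsck_cmd, device],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

# Usage sketch (requires root and a real block device):
#   if device_is_clean("/dev/sdb1"):
#       subprocess.run(["mount", "/dev/sdb1", "/mnt"], check=True)
```

The `fsck_cmd` parameter exists only so the gate logic can be exercised without a block device; in practice it would stay at its default.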




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 12:30                                 ` Gao Xiang
@ 2026-03-23 12:33                                   ` Gao Xiang
  0 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 12:33 UTC (permalink / raw)
  To: Demi Marie Obenour, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/23 20:30, Gao Xiang wrote:
> 
> 
> On 2026/3/23 20:19, Demi Marie Obenour wrote:
>> On 3/23/26 08:13, Gao Xiang wrote:
>>>
>>>
>>> On 2026/3/23 20:08, Demi Marie Obenour wrote:
>>>> On 3/23/26 07:14, Jan Kara wrote:
>>>>> Hi Gao!
>>>>>
>>>>> On Mon 23-03-26 18:19:16, Gao Xiang wrote:
>>>>>> On 2026/3/23 17:54, Jan Kara wrote:
>>>>>>> On Sun 22-03-26 12:51:57, Gao Xiang wrote:
>>>>>>>> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>>>>>>>>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>>>>>>>>> starts up the rest of the libfuse initialization but who knows if that's
>>>>>>>>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Let me try to reply the remaining part:
>>>>>>>>
>>>>>>>>> To me, the attacks mentioned above are all either user error,
>>>>>>>>> or vulnerabilities in software accessing the filesystem.  If one
>>>>>>>>
>>>>>>>> There are many consequences if users try to use potential inconsistent
>>>>>>>> writable filesystems directly (without full fsck), what I can think
>>>>>>>> out including but not limited to:
>>>>>>>>
>>>>>>>>     - data loss (considering data block double free issue);
>>>>>>>>     - data theft (for example, users keep sensitive information in the
>>>>>>>>          workload in a high permission inode but it can be read with
>>>>>>>>          low permission malicious inode later);
>>>>>>>>     - data tamper (the same principle).
>>>>>>>>
>>>>>>>> All vulnerabilities above happen after users try to write the
>>>>>>>> inconsistent filesystem, which is hard to prevent by on-disk
>>>>>>>> design.
>>>>>>>>
>>>>>>>> But if users write with copy-on-write to another local consistent
>>>>>>>> filesystem, all the vulnerabilities above won't exist.
>>>>>>>
>>>>>>> OK, so if I understand correctly you are advocating that untrusted initial data
>>>>>>> should be provided on immutable filesystem and any needed modification
>>>>>>> would be handled by overlayfs (or some similar layer) and stored on
>>>>>>> (initially empty) writeable filesystem.
>>>>>>>
>>>>>>> That's a sensible design for usecase like containers but what started this
>>>>>>> thread about FUSE drivers for filesystems were usecases like access to
>>>>>>> filesystems on drives attached at USB port of your laptop. There it isn't
>>>>>>> really practical to use your design. You need a standard writeable
>>>>>>> filesystem for that but at the same time you cannot quite trust the content
>>>>>>> of everything that gets attached to your USB port...
>>>>>>
>>>>>> Yes, that is my proposal and my overall interest now.  I know
>>>>>> your interest but I'm here I just would like to say:
>>>>>>
>>>>>> Without full scan fsck, even with FUSE, the system is still
>>>>>> vulnerable if the FUSE approach is used.
>>>>>>
>>>>>> I could give a detailed example, for example:
>>>>>>
>>>>>> There are passwd files `/etc/passwd` and `/etc/shadow` with
>>>>>> proper permissions (for example, you could audit the file
>>>>>> permission with e2fsprogs/xfsprogs without a full fsck scan) in
>>>>>> the inconsistent remote filesystems, but there are some other
>>>>>> malicious files called "foo" and "bar" somewhere with low
>>>>>> permissions but sharing the same blocks which is disallowed
>>>>>> by filesystem on-disk formats illegally (because they violate
>>>>>> copy-on-write semantics by design), also see my previous
>>>>>> reply:
>>>>>> https://lore.kernel.org/r/7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com
>>>>>>
>>>>>> The initial data of `/etc/passwd` and `/etc/shadow` in the
>>>>>> filesystem image doesn't matter, but users could then keep
>>>>>> very sensitive information later just out of the
>>>>>> inconsistent filesystems, which could cause "data theft"
>>>>>> above.
>>>>>
>>>>> Yes, I've seen you mentioning this case earlier in this thread. But let me
>>>>> say I consider it rather contrived :). For the container usecase if you are
>>>>> fetching say a root fs image and don't trust the content of the image, then
>>>>> how do you know it doesn't contain a malicious code that sends all the
>>>>> sensitive data to some third party? So I believe the owner of the container
>>>>> has to trust the content of the image, otherwise you've already lost.
>>>>>
>>>>> The container environment *provider* doesn't necessarily trust either the
>>>>> container owner or the image so they need to make sure their infrastructure
>>>>> isn't compromised by malicious actions from these - and for that either
>>>>> your immutable image scheme or FUSE mounting works.
>>>>>
>>>>> Similarly with the USB drive content. Either some malicious actor plugs USB
>>>>> drive into a laptop, it gets automounted, and that must not crash the
>>>>> kernel or give attacker more priviledge - but that's all - no data is
>>>>> stored on the drive. Or I myself plug some not-so-trusted USB drive to my
>>>>> laptop to read some content from it or possibly put there some data for a
>>>>> friend - again that must not compromise my machine but I'd be really dumb
>>>>> and already lost the security game if I'd put any sensitive data to such
>>>>> drive. And for this USB drive case FUSE mounting solves these problems
>>>>> nicely.
>>>>>
>>>>> So in my opinion for practical usecases the FUSE solution addresses the
>>>>> real security concerns.
>>>>>
>>>>>                                 Honza
>>>>
>>>> I agree, *if* the FUSE filesystem is strongly sandboxed so it cannot
>>>> mess with things like my home directory.  Personally, I would run
>>>> the FUSE filesystem in a VM but that's a separate concern.
>>>>
>>>> There are also (very severe) concerns about USB devices *specifically*.
>>>> These are off-topic for this discussion, though.
>>>>
>>>> Of course, the FUSE filesystem must be mounted with nosuid, nodev,
>>>> and nosymfollow.  Otherwise there are lots of attacks possible.
>>>>
>>>> Finally, it is very much possible to use storage that one does not have
>>>> complete trust in, provided that one uses cryptography to ensure that
>>>> the damage it can do is limited.  Many backup systems work this way.
>>>
>>> In brief, as I said, that is _not_ always a security concern:
>>>
>>>    - If you don't fsck, and FUSE mount it, your write data to that
>>>      filesystem could be lost if the writable filesystem is
>>>      inconsistent;
>>
>> In the applications I am thinking of, one _hopes_ that the filesystem
>> is consistent, which it almost always will be.  However, one wants
>> to be safe in the unlikely case of it being inconsistent.
> 
> I don't think so: a USB stick can be corrupted too, and data
> coming over the network can arrive corrupted as well; there are
> too many practical problems here.

Not necessarily because of attacks; cheap USB sticks or bad
network conditions, for example, can corrupt data too.

> 
>>
>>>    - But if you fsck in advance and the filesystem is consistent,
>>>      the kernel implementation should be expected to fix all bugs
>>>      affecting consistent filesystems.
>>>
>>> So what's the meaning of "no fsck" here if you cannot safely write
>>> anything to it with FUSE approaches?
>>
>> FUSE can (and usually does) have write support.  Also, fsck does not
>> protect against TOCTOU attacks.
> 
> If you consider TOCTOU attacks, how can a FUSE filesystem
> protect against them either, when the on-disk data can change
> underneath it at any time?
> 
> Sigh, I think the whole story is that:
> 
>   - The kernel writable filesystem should fix all bugs if the
>     filesystem is consistent, and this condition should be
>     ensured by fsck in advance;
> 
>   - So, alternative approaches like FUSE are not meaningful

                                            ^ are meaningful

>     _only if_ we cannot do "fsck" (let's not think untypical
>     TOCTOU).
> 
>   - but without "fsck", the filesystem can be inconsistent
>     by chance or by attack, so the written data can be lost.
> 
> Thanks,
> Gao Xiang
> 
> 
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 12:01                             ` Gao Xiang
@ 2026-03-23 14:13                               ` Jan Kara
  2026-03-23 14:36                                 ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2026-03-23 14:13 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Mon 23-03-26 20:01:32, Gao Xiang wrote:
> On 2026/3/23 19:42, Gao Xiang wrote:
> > > for example, you could audit the file permission with
> > e2fsprogs/xfsprogs without a full fsck scan.
> > 
> > In order to make the userspace programs best-effort, they
> > should open for write and fstat the permission bits
> > before writing sensitive informations, it avoids TOCTOU
> > attacks as much as possible as userspace programs.
> > 
> > 
> > Container users use namespaces of course, namespace can
> > only provide isolations, that is the only security
> > guarantees namespace can provide, no question of that.
> > 
> > Let's just strictly speaking, as you mentioned, both ways
> > ensure the isolation (if namespaces are used) and kernel
> > stability (let's not nitpick about this).  And let's not
> > talk about malicious block devices or likewise, because
> > it's not a typical setup (maybe it could be a typical
> > setup for some cases, but it should be another system-wide
> > security design) and should be clarified by system admins
> > for example.
> > 
> > What I just want to say is that: FUSE mount approach _might_
> > give more incorrect security guarantees than the real users
> > expect: I think other than avoiding system crashes etc, many
> > users should expect that they could use the generic writable
> > filesystem directly with FUSE without full-scan fsck
> > in advance and keep their sensitive data directly, I don't
> 
> 
> If you think that is still the corner cases that users expect
> incorrectly, For example, I think double freeing issues can
> make any useful write stuffs lost just out of inconsistent
> filesystem -- that may be totally unrelated to the security.
> 
> What I want to say is that, the users' interest of new FUSE
> approch is "no full fsck"; Otherwise, if full fsck is used,
> why not they mount in the kernel then (I do think kernel
> filesystems should fix all bugs out of normal consistent
> usage)?
> 
> However, "no fsck" and FUSE mounts bring many incorrect
> assumption that users can never expect: it's still unreliable,
> maybe cannot keep any useful data in that storage.
> 
> Hopefully I explain my idea.

I see, and I agree that for some cases FUSE access to the untrusted
filesystem needn't be enough.
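The "open for write, then fstat" pattern quoted earlier in this message can be sketched as follows. The helper name and the specific policy (owner check, no group/other access) are illustrative assumptions; the point is that checking attributes on the already-open descriptor means the check and the write refer to the same inode, so a path swap between check and use (the classic TOCTOU race) cannot redirect the write.

```python
import os
import stat

# Illustrative sketch, not from any real tool: write sensitive data
# only after verifying the *open* file's ownership and mode via fstat,
# so the checked inode is exactly the one written to.

def write_sensitive(path, data, expected_uid=None):
    fd = os.open(path, os.O_WRONLY | os.O_NOFOLLOW)  # refuse symlinks
    try:
        st = os.fstat(fd)  # attributes of the inode we actually opened
        if not stat.S_ISREG(st.st_mode):
            raise PermissionError("not a regular file")
        if st.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise PermissionError("file is group/other accessible")
        uid = os.getuid() if expected_uid is None else expected_uid
        if st.st_uid != uid:
            raise PermissionError("unexpected owner")
        os.write(fd, data)
    finally:
        os.close(fd)
```

This only narrows the window as far as userspace can; as noted in the thread, it cannot help if the filesystem's own metadata is inconsistent underneath.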

> > think that is the corner cases if you don't claim the
> > limitation of FUSE approaches.
> > 
> > If none expects that, that is absolute be fine, as I said,
> > it provides strong isolation and stability, but I really
> > suspect this approach could be abused to mount totally
> > untrusted remote filesystems (Actually as I said, some
> > business of ours already did: fetching EXT4 filesystems
> > with unknown status and mount without fscking, that is
> > really disappointing.)

Yes, someone downloading untrusted ext4 image, mounting in read-write and
using it for sensitive application, that falls to "insane" category for me
:) We agree on that. And I agree that depending on the application using
FUSE to access such filesystem needn't be safe enough and immutable fs +
overlayfs writeable layer may provide better guarantees about fs behavior.
I would still consider such design highly suspicious but without more
detailed knowledge about the application I cannot say it's outright broken
:).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 14:13                               ` Jan Kara
@ 2026-03-23 14:36                                 ` Gao Xiang
  2026-03-23 14:47                                   ` Jan Kara
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 14:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc



On 2026/3/23 22:13, Jan Kara wrote:
> On Mon 23-03-26 20:01:32, Gao Xiang wrote:
>> On 2026/3/23 19:42, Gao Xiang wrote:
>>>> for example, you could audit the file permission with
>>> e2fsprogs/xfsprogs without a full fsck scan.
>>>
>>> In order to make the userspace programs best-effort, they
>>> should open for write and fstat the permission bits
>>> before writing sensitive informations, it avoids TOCTOU
>>> attacks as much as possible as userspace programs.
>>>
>>>
>>> Container users use namespaces of course, namespace can
>>> only provide isolations, that is the only security
>>> guarantees namespace can provide, no question of that.
>>>
>>> Let's just strictly speaking, as you mentioned, both ways
>>> ensure the isolation (if namespaces are used) and kernel
>>> stability (let's not nitpick about this).  And let's not
>>> talk about malicious block devices or likewise, because
>>> it's not a typical setup (maybe it could be a typical
>>> setup for some cases, but it should be another system-wide
>>> security design) and should be clarified by system admins
>>> for example.
>>>
>>> What I just want to say is that: FUSE mount approach _might_
>>> give more incorrect security guarantees than the real users
>>> expect: I think other than avoiding system crashes etc, many
>>> users should expect that they could use the generic writable
>>> filesystem directly with FUSE without full-scan fsck
>>> in advance and keep their sensitive data directly, I don't
>>
>>
>> If you think that is still the corner cases that users expect
>> incorrectly, For example, I think double freeing issues can
>> make any useful write stuffs lost just out of inconsistent
>> filesystem -- that may be totally unrelated to the security.
>>
>> What I want to say is that, the users' interest of new FUSE
>> approch is "no full fsck"; Otherwise, if full fsck is used,
>> why not they mount in the kernel then (I do think kernel
>> filesystems should fix all bugs out of normal consistent
>> usage)?
>>
>> However, "no fsck" and FUSE mounts bring many incorrect
>> assumption that users can never expect: it's still unreliable,
>> maybe cannot keep any useful data in that storage.
>>
>> Hopefully I explain my idea.
> 
> I see and I agree that for some cases FUSE access to the untrusted
> filesystem needn't be enough

Yes, my opinion is that FUSE approaches are fine, as long as
we clearly document the limitations and restrictions.
For writable filesystems, a full-scan fsck is clearly needed
to keep the filesystem consistent, at least in order to avoid
potential data loss.

Otherwise, I'm pretty sure some aggressive users will abuse
this feature with "no fsck"...

> 
>>> think that is the corner cases if you don't claim the
>>> limitation of FUSE approaches.
>>>
>>> If none expects that, that is absolute be fine, as I said,
>>> it provides strong isolation and stability, but I really
>>> suspect this approach could be abused to mount totally
>>> untrusted remote filesystems (Actually as I said, some
>>> business of ours already did: fetching EXT4 filesystems
>>> with unknown status and mount without fscking, that is
>>> really disappointing.)
> 
> Yes, someone downloading untrusted ext4 image, mounting in read-write and
> using it for sensitive application, that falls to "insane" category for me
> :) We agree on that. And I agree that depending on the application using
> FUSE to access such filesystem needn't be safe enough and immutable fs +
> overlayfs writeable layer may provide better guarantees about fs behavior.

That is my overall goal; I just want to make clear the
difference beyond write isolation.  But of course, "secure"
or not is relative and depends on the system design.

If isolation and system stability are enough for a system
to be called "secure", then yes, the two approaches are
the same in those aspects.

> I would still consider such design highly suspicious but without more
> detailed knowledge about the application I cannot say it's outright broken
> :).

What do you mean by "such design"?  "Mounting writable
untrusted remote EXT4 images on the host"?  Really, we have
had such container applications for many years.  I don't want
to name them here, but I'm totally exhausted by such usage
(since I have explained this many, many times, and they never
even bother with LWN.net) and by the internal team.

Thanks,
Gao Xiang

> 
> 								Honza
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 14:36                                 ` Gao Xiang
@ 2026-03-23 14:47                                   ` Jan Kara
  2026-03-23 14:57                                     ` Gao Xiang
  2026-03-24  8:48                                     ` Christian Brauner
  0 siblings, 2 replies; 79+ messages in thread
From: Jan Kara @ 2026-03-23 14:47 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jan Kara, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Mon 23-03-26 22:36:46, Gao Xiang wrote:
> On 2026/3/23 22:13, Jan Kara wrote:
> > > > think that is the corner cases if you don't claim the
> > > > limitation of FUSE approaches.
> > > > 
> > > > If none expects that, that is absolute be fine, as I said,
> > > > it provides strong isolation and stability, but I really
> > > > suspect this approach could be abused to mount totally
> > > > untrusted remote filesystems (Actually as I said, some
> > > > business of ours already did: fetching EXT4 filesystems
> > > > with unknown status and mount without fscking, that is
> > > > really disappointing.)
> > 
> > Yes, someone downloading untrusted ext4 image, mounting in read-write and
> > using it for sensitive application, that falls to "insane" category for me
> > :) We agree on that. And I agree that depending on the application using
> > FUSE to access such filesystem needn't be safe enough and immutable fs +
> > overlayfs writeable layer may provide better guarantees about fs behavior.
> 
> That is my overall goal, I just want to make it clear
> the difference out of write isolation, but of course,
> "secure" or not is relative, and according to the
> system design.
> 
> If isolation and system stability are enough for
> a system and can be called "secure", yes, they are
> both the same in such aspects.
> 
> > I would still consider such design highly suspicious but without more
> > detailed knowledge about the application I cannot say it's outright broken
> > :).
> 
> What do you mean "such design"?  "Writable untrusted
> remote EXT4 images mounting on the host"? Really, we have
> such applications for containers for many years but I don't
> want to name it here, but I'm totally exhaused by such
> usage (since I explained many many times, and they even
> never bother with LWN.net) and the internal team.

By "such design" I meant generally the concept that you fetch filesystem
images (regardless of whether ext4 or some other type) from an untrusted
source. Unless you do cryptographic verification of the data, you never
know what kind of garbage your application is processing, which is always
an invitation for nasty exploits and bugs...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 14:47                                   ` Jan Kara
@ 2026-03-23 14:57                                     ` Gao Xiang
  2026-03-24  8:48                                     ` Christian Brauner
  1 sibling, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-23 14:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc



On 2026/3/23 22:47, Jan Kara wrote:
> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>> think that is the corner cases if you don't claim the
>>>>> limitation of FUSE approaches.
>>>>>
>>>>> If none expects that, that is absolute be fine, as I said,
>>>>> it provides strong isolation and stability, but I really
>>>>> suspect this approach could be abused to mount totally
>>>>> untrusted remote filesystems (Actually as I said, some
>>>>> business of ours already did: fetching EXT4 filesystems
>>>>> with unknown status and mount without fscking, that is
>>>>> really disappointing.)
>>>
>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>> using it for sensitive application, that falls to "insane" category for me
>>> :) We agree on that. And I agree that depending on the application using
>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>
>> That is my overall goal, I just want to make it clear
>> the difference out of write isolation, but of course,
>> "secure" or not is relative, and according to the
>> system design.
>>
>> If isolation and system stability are enough for
>> a system and can be called "secure", yes, they are
>> both the same in such aspects.
>>
>>> I would still consider such design highly suspicious but without more
>>> detailed knowledge about the application I cannot say it's outright broken
>>> :).
>>
>> What do you mean "such design"?  "Writable untrusted
>> remote EXT4 images mounting on the host"? Really, we have
>> such applications for containers for many years but I don't
>> want to name it here, but I'm totally exhaused by such
>> usage (since I explained many many times, and they even
>> never bother with LWN.net) and the internal team.
> 
> By "such design" I meant generally the concept that you fetch filesystem
> images (regardless whether ext4 or some other type) from untrusted source.
> Unless you do cryptographical verification of the data, you never know what
> kind of garbage your application is processing which is always invitation
> for nasty exploits and bugs...

That is very common for Docker images, for example
(although I admit Docker images now use TAR archives,
which can be seen as an immutable system too).

For example, Docker Hub keeps TAR images with sha256 digests,
but only sha256, without any cryptographic signature.  Of
course, you could rely on your image scanner to audit
malicious contents if you want (e.g. if you keep sensitive
information there), or rely on 3rd-party applications to scan
for you, or never scan at all, since you don't keep any
sensitive information in it --- this works because a Docker
image is unpacked and written into another, trusted local
filesystem, like the model I mentioned before.

IOWs, the image sha256 only ensures that "the tar image
you downloaded" is what "the publisher once uploaded",
no more than that.  And that model is also our interest,
since the core EROFS format should fit such a model too.
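The limits of a digest-only scheme described above can be shown in a few lines (the blob and names below are made up for illustration): sha256 proves the fetched bytes match what was published, and nothing more about whether those bytes are trustworthy.

```python
import hashlib

# sha256 only guarantees integrity of transfer: the downloaded bytes
# are the published bytes.  It says nothing about the publisher.

def verify_digest(blob: bytes, expected_hex: str) -> bool:
    return hashlib.sha256(blob).hexdigest() == expected_hex

image = b"fake-tar-layer-bytes"               # stand-in for a TAR layer
digest = hashlib.sha256(image).hexdigest()    # what the registry records

assert verify_digest(image, digest)           # matches what was uploaded
assert not verify_digest(image + b"!", digest)  # any in-transit tampering shows
```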

Thanks,
Gao Xiang

> 
> 								Honza


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-23 14:47                                   ` Jan Kara
  2026-03-23 14:57                                     ` Gao Xiang
@ 2026-03-24  8:48                                     ` Christian Brauner
  2026-03-24  9:30                                       ` Gao Xiang
  2026-03-24 11:58                                       ` Demi Marie Obenour
  1 sibling, 2 replies; 79+ messages in thread
From: Christian Brauner @ 2026-03-24  8:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Gao Xiang, Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
> > On 2026/3/23 22:13, Jan Kara wrote:
> > > > > think that is the corner cases if you don't claim the
> > > > > limitation of FUSE approaches.
> > > > > 
> > > > > If none expects that, that is absolute be fine, as I said,
> > > > > it provides strong isolation and stability, but I really
> > > > > suspect this approach could be abused to mount totally
> > > > > untrusted remote filesystems (Actually as I said, some
> > > > > business of ours already did: fetching EXT4 filesystems
> > > > > with unknown status and mount without fscking, that is
> > > > > really disappointing.)
> > > 
> > > Yes, someone downloading untrusted ext4 image, mounting in read-write and
> > > using it for sensitive application, that falls to "insane" category for me
> > > :) We agree on that. And I agree that depending on the application using
> > > FUSE to access such filesystem needn't be safe enough and immutable fs +
> > > overlayfs writeable layer may provide better guarantees about fs behavior.
> > 
> > That is my overall goal, I just want to make it clear
> > the difference out of write isolation, but of course,
> > "secure" or not is relative, and according to the
> > system design.
> > 
> > If isolation and system stability are enough for
> > a system and can be called "secure", yes, they are
> > both the same in such aspects.
> > 
> > > I would still consider such design highly suspicious but without more
> > > detailed knowledge about the application I cannot say it's outright broken
> > > :).
> > 
> > What do you mean "such design"?  "Writable untrusted
> > remote EXT4 images mounting on the host"? Really, we have
> > such applications for containers for many years but I don't
> > want to name it here, but I'm totally exhaused by such
> > usage (since I explained many many times, and they even
> > never bother with LWN.net) and the internal team.
> 
> By "such design" I meant generally the concept that you fetch filesystem
> images (regardless whether ext4 or some other type) from untrusted source.
> Unless you do cryptographical verification of the data, you never know what
> kind of garbage your application is processing which is always invitation
> for nasty exploits and bugs...

If this is another 500-mail discussion about FS_USERNS_MOUNT on
block-backed filesystems, then my verdict still stands: the only
condition under which I will let the VFS allow this is if the underlying
device is signed and dm-verity protected. The kernel will continue to
refuse unprivileged policy in general, and specifically policy based on
the quality or implementation of the underlying filesystem driver.
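For readers unfamiliar with the mechanism named above, here is a toy illustration of the hash-tree idea behind dm-verity. This is NOT the real on-disk format (block size, fanout, and function names are arbitrary choices here): data blocks are hashed, the hashes are hashed again up to a single root, and if that root hash is signed and trusted, any later modification of any block is detectable.

```python
import hashlib

BLOCK = 4096  # 4 KiB data blocks, as dm-verity typically uses

def merkle_root(data: bytes, fanout: int = 128) -> bytes:
    """Hash each block, then hash groups of hashes up to one root."""
    level = [hashlib.sha256(data[i:i + BLOCK]).digest()
             for i in range(0, max(len(data), 1), BLOCK)]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + fanout])).digest()
                 for i in range(0, len(level), fanout)]
    return level[0]

image = bytes(3 * BLOCK)              # a tiny "device": three zero blocks
root = merkle_root(image)             # this is what would be signed

tampered = bytearray(image)
tampered[BLOCK] ^= 1                  # flip one bit in the second block
assert merkle_root(bytes(tampered)) != root  # detected via the root hash
```

The real dm-verity verifies blocks lazily on read against a precomputed tree, but the detection property sketched here is the same.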

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24  8:48                                     ` Christian Brauner
@ 2026-03-24  9:30                                       ` Gao Xiang
  2026-03-24  9:49                                         ` Demi Marie Obenour
  2026-03-24 11:58                                       ` Demi Marie Obenour
  1 sibling, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-24  9:30 UTC (permalink / raw)
  To: Christian Brauner, Jan Kara
  Cc: Demi Marie Obenour, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

Hi Christian,

On 2026/3/24 16:48, Christian Brauner wrote:
> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>> think that is the corner cases if you don't claim the
>>>>>> limitation of FUSE approaches.
>>>>>>
>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>> it provides strong isolation and stability, but I really
>>>>>> suspect this approach could be abused to mount totally
>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>> with unknown status and mount without fscking, that is
>>>>>> really disappointing.)
>>>>
>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>> using it for sensitive application, that falls to "insane" category for me
>>>> :) We agree on that. And I agree that depending on the application using
>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>
>>> That is my overall goal, I just want to make it clear
>>> the difference out of write isolation, but of course,
>>> "secure" or not is relative, and according to the
>>> system design.
>>>
>>> If isolation and system stability are enough for
>>> a system and can be called "secure", yes, they are
>>> both the same in such aspects.
>>>
>>>> I would still consider such design highly suspicious but without more
>>>> detailed knowledge about the application I cannot say it's outright broken
>>>> :).
>>>
>>> What do you mean "such design"?  "Writable untrusted
>>> remote EXT4 images mounting on the host"? Really, we have
>>> such applications for containers for many years but I don't
>>> want to name it here, but I'm totally exhaused by such
>>> usage (since I explained many many times, and they even
>>> never bother with LWN.net) and the internal team.
>>
>> By "such design" I meant generally the concept that you fetch filesystem
>> images (regardless whether ext4 or some other type) from untrusted source.
>> Unless you do cryptographical verification of the data, you never know what
>> kind of garbage your application is processing which is always invitation
>> for nasty exploits and bugs...
> 
> If this is another 500 mail discussion about FS_USERNS_MOUNT on
> block-backed filesystems then my verdict still stands that the only
> condition under which I will let the VFS allow this if the underlying
> device is signed and dm-verity protected. The kernel will continue to
> refuse unprivileged policy in general and specifically based on quality
> or implementation of the underlying filesystem driver.


First, if block devices are your concern, fine: how about
allowing it only for EROFS file-backed mounts with
S_IMMUTABLE set on the underlying files, and refusing any
block device mounts?

If the issue is "you don't know how to define the quality
or implementation of the underlying filesystem drivers",
you could list your detailed concerns (I think people at
least trust the individual filesystem maintainers'
judgement); otherwise there will be an endless stream of
new immutable filesystems built for this requirement
(composefs, puzzlefs, and tarfs were all aimed at this
previously; I admit I didn't get the point of
FS_USERNS_MOUNT back in 2023, but now I also think
FS_USERNS_MOUNT is a strong requirement, for DinD for
example), because the idea seems sensible according to
Darrick and Jan's replies, and I think more people will
agree with that.

Another thought: an immutable FUSE filesystem can still
return arbitrary metadata and hand users garbage
(meta)data, and FUSE already allows FS_USERNS_MOUNT.  If
user and mount namespaces are isolated, why does that
matter?

I just want to know why.  And as you may have noticed,
Demi Marie Obenour wrote:

> The only exceptions are if the filesystem is incredibly simple
> or formal methods are used, and neither is the case for existing
> filesystems in the Linux kernel. 

I still strongly disagree with that judgement: a minimal
EROFS image with a superblock, directories, and files with
xattrs fits in a 4KiB image, and a 4KiB image should be
enough for fuzzing; also, the in-kernel EROFS code never
allocates any extra buffers, which is much simpler than
FUSE.

In brief: how can I meet your requirement?
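
To make the "minimal image" claim concrete: the EROFS superblock sits
at a fixed 1024-byte offset and begins with a 32-bit magic, so the
whole metadata format is reachable from a single 4KiB block. A hedged
Python sketch of the very first check a parser (or fuzzer target)
performs; only the magic is modeled here, the stub image is NOT a
mountable filesystem, and the real field layout lives in erofs_fs.h:

```python
import struct

EROFS_SUPER_OFFSET = 1024           # superblock offset within the image
EROFS_SUPER_MAGIC_V1 = 0xE0F5E1E2   # little-endian on-disk magic

def looks_like_erofs(image: bytes) -> bool:
    """First sanity check on an image: is there an EROFS magic
    at the fixed superblock offset?"""
    if len(image) < EROFS_SUPER_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from("<I", image, EROFS_SUPER_OFFSET)
    return magic == EROFS_SUPER_MAGIC_V1

def make_stub_image() -> bytes:
    """Build a 4KiB buffer carrying only the magic -- just enough
    to show that the superblock fits comfortably in one block."""
    buf = bytearray(4096)
    struct.pack_into("<I", buf, EROFS_SUPER_OFFSET, EROFS_SUPER_MAGIC_V1)
    return bytes(buf)
```

Everything else a fuzzer needs to reach (inode table, directory
entries, xattrs) is addressed relative to fields in that same
superblock, which is why a 4KiB corpus seed already covers the
uncompressed metadata paths.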

Thanks,
Gao Xiang



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24  9:30                                       ` Gao Xiang
@ 2026-03-24  9:49                                         ` Demi Marie Obenour
  2026-03-24  9:53                                           ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-24  9:49 UTC (permalink / raw)
  To: Gao Xiang, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 4926 bytes --]

On 3/24/26 05:30, Gao Xiang wrote:
> Hi Christian,
> 
> On 2026/3/24 16:48, Christian Brauner wrote:
>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>> think that is the corner cases if you don't claim the
>>>>>>> limitation of FUSE approaches.
>>>>>>>
>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>> it provides strong isolation and stability, but I really
>>>>>>> suspect this approach could be abused to mount totally
>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>> with unknown status and mount without fscking, that is
>>>>>>> really disappointing.)
>>>>>
>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>> :) We agree on that. And I agree that depending on the application using
>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>
>>>> That is my overall goal, I just want to make it clear
>>>> the difference out of write isolation, but of course,
>>>> "secure" or not is relative, and according to the
>>>> system design.
>>>>
>>>> If isolation and system stability are enough for
>>>> a system and can be called "secure", yes, they are
>>>> both the same in such aspects.
>>>>
>>>>> I would still consider such design highly suspicious but without more
>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>> :).
>>>>
>>>> What do you mean "such design"?  "Writable untrusted
>>>> remote EXT4 images mounting on the host"? Really, we have
>>>> such applications for containers for many years but I don't
>>>> want to name it here, but I'm totally exhaused by such
>>>> usage (since I explained many many times, and they even
>>>> never bother with LWN.net) and the internal team.
>>>
>>> By "such design" I meant generally the concept that you fetch filesystem
>>> images (regardless whether ext4 or some other type) from untrusted source.
>>> Unless you do cryptographical verification of the data, you never know what
>>> kind of garbage your application is processing which is always invitation
>>> for nasty exploits and bugs...
>>
>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>> block-backed filesystems then my verdict still stands that the only
>> condition under which I will let the VFS allow this if the underlying
>> device is signed and dm-verity protected. The kernel will continue to
>> refuse unprivileged policy in general and specifically based on quality
>> or implementation of the underlying filesystem driver.
> 
> 
> First, if block devices are your concern, fine, how about
> allowing it if EROFS file-backed mounts and S_IMMUTABLE
> for underlay files is set, and refuse any block device
> mounts.
> 
> If the issue is "you don't know how to define the quality
> or implementation of the underlying filesystem drivers",
> you could list your detailed concerns (I think at least
> people have trust to the individual filesystem
> maintainers' judgements), otherwise there will be endless
> new sets of new immutable filesystems for this requirement
> (previously, composefs , puzzlefs, and tarfs are all for
> this; I admit I didn't get the point of FS_USERNS_MOUNT
> at that time of 2023; but know I also think FS_USERNS_MOUNT
> is a strong requirement for DinD for example), because that
> idea should be sensible according to Darrick and Jan's
> reply, and I think more people will agree with that.
> 
> And another idea is that you still could return arbitary
> metadata with immutable FUSE fses and let users get
> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
> and if user and mount namespaces are isolated, why bothering
> it?
> 
> I just hope know why? And as you may notice,
> "Demi Marie Obenour wrote:"
> 
>> The only exceptions are if the filesystem is incredibly simple
>> or formal methods are used, and neither is the case for existing
>> filesystems in the Linux kernel. 
> 
> I still strong disagree with that judgement, a minimal EROFS
> can build an image with superblock, dirs, and files with
> xattrs in a 4k-size image; and 4k image should be enough for
> fuzzing; also the in-core EROFS format even never allocates
> any extra buffers, which is much simplar than FUSE.
> 
> In brief, so how to meet your requirement?
> 
> Thanks,
> Gao Xiang

Rewriting the code in Rust would dramatically reduce the attack
surface when it comes to memory corruption.  That's a lot to ask,
though, and a lot of work.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24  9:49                                         ` Demi Marie Obenour
@ 2026-03-24  9:53                                           ` Gao Xiang
  2026-03-24 10:02                                             ` Demi Marie Obenour
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-24  9:53 UTC (permalink / raw)
  To: Demi Marie Obenour, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/24 17:49, Demi Marie Obenour wrote:
> On 3/24/26 05:30, Gao Xiang wrote:
>> Hi Christian,
>>
>> On 2026/3/24 16:48, Christian Brauner wrote:
>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>>> think that is the corner cases if you don't claim the
>>>>>>>> limitation of FUSE approaches.
>>>>>>>>
>>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>>> it provides strong isolation and stability, but I really
>>>>>>>> suspect this approach could be abused to mount totally
>>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>>> with unknown status and mount without fscking, that is
>>>>>>>> really disappointing.)
>>>>>>
>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>>> :) We agree on that. And I agree that depending on the application using
>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>>
>>>>> That is my overall goal, I just want to make it clear
>>>>> the difference out of write isolation, but of course,
>>>>> "secure" or not is relative, and according to the
>>>>> system design.
>>>>>
>>>>> If isolation and system stability are enough for
>>>>> a system and can be called "secure", yes, they are
>>>>> both the same in such aspects.
>>>>>
>>>>>> I would still consider such design highly suspicious but without more
>>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>>> :).
>>>>>
>>>>> What do you mean "such design"?  "Writable untrusted
>>>>> remote EXT4 images mounting on the host"? Really, we have
>>>>> such applications for containers for many years but I don't
>>>>> want to name it here, but I'm totally exhaused by such
>>>>> usage (since I explained many many times, and they even
>>>>> never bother with LWN.net) and the internal team.
>>>>
>>>> By "such design" I meant generally the concept that you fetch filesystem
>>>> images (regardless whether ext4 or some other type) from untrusted source.
>>>> Unless you do cryptographical verification of the data, you never know what
>>>> kind of garbage your application is processing which is always invitation
>>>> for nasty exploits and bugs...
>>>
>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>>> block-backed filesystems then my verdict still stands that the only
>>> condition under which I will let the VFS allow this if the underlying
>>> device is signed and dm-verity protected. The kernel will continue to
>>> refuse unprivileged policy in general and specifically based on quality
>>> or implementation of the underlying filesystem driver.
>>
>>
>> First, if block devices are your concern, fine, how about
>> allowing it if EROFS file-backed mounts and S_IMMUTABLE
>> for underlay files is set, and refuse any block device
>> mounts.
>>
>> If the issue is "you don't know how to define the quality
>> or implementation of the underlying filesystem drivers",
>> you could list your detailed concerns (I think at least
>> people have trust to the individual filesystem
>> maintainers' judgements), otherwise there will be endless
>> new sets of new immutable filesystems for this requirement
>> (previously, composefs , puzzlefs, and tarfs are all for
>> this; I admit I didn't get the point of FS_USERNS_MOUNT
>> at that time of 2023; but know I also think FS_USERNS_MOUNT
>> is a strong requirement for DinD for example), because that
>> idea should be sensible according to Darrick and Jan's
>> reply, and I think more people will agree with that.
>>
>> And another idea is that you still could return arbitary
>> metadata with immutable FUSE fses and let users get
>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
>> and if user and mount namespaces are isolated, why bothering
>> it?
>>
>> I just hope know why? And as you may notice,
>> "Demi Marie Obenour wrote:"
>>
>>> The only exceptions are if the filesystem is incredibly simple
>>> or formal methods are used, and neither is the case for existing
>>> filesystems in the Linux kernel.
>>
>> I still strong disagree with that judgement, a minimal EROFS
>> can build an image with superblock, dirs, and files with
>> xattrs in a 4k-size image; and 4k image should be enough for
>> fuzzing; also the in-core EROFS format even never allocates
>> any extra buffers, which is much simplar than FUSE.
>>
>> In brief, so how to meet your requirement?
>>
>> Thanks,
>> Gao Xiang
> 
> Rewriting the code in Rust would dramatically reduce the attack
> surface when it comes to memory corruption.  That's a lot to ask,
> though, and a lot of work.

I don't think so: FUSE allows FS_USERNS_MOUNT and is
written in C, and its attack surface is already huge.

EROFS will switch to Rust at some point, but your judgement
will push people to create yet more brand-new toy Rust
kernel filesystems --- just because EROFS is currently not
written in Rust.

I'm completely exhausted by this game: if I commit to
addressing every single fuzzing bug and CVE, why not?

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24  9:53                                           ` Gao Xiang
@ 2026-03-24 10:02                                             ` Demi Marie Obenour
  2026-03-24 10:14                                               ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-24 10:02 UTC (permalink / raw)
  To: Gao Xiang, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 6316 bytes --]

On 3/24/26 05:53, Gao Xiang wrote:
> 
> 
> On 2026/3/24 17:49, Demi Marie Obenour wrote:
>> On 3/24/26 05:30, Gao Xiang wrote:
>>> Hi Christian,
>>>
>>> On 2026/3/24 16:48, Christian Brauner wrote:
>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>>>> think that is the corner cases if you don't claim the
>>>>>>>>> limitation of FUSE approaches.
>>>>>>>>>
>>>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>>>> it provides strong isolation and stability, but I really
>>>>>>>>> suspect this approach could be abused to mount totally
>>>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>>>> with unknown status and mount without fscking, that is
>>>>>>>>> really disappointing.)
>>>>>>>
>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>>>> :) We agree on that. And I agree that depending on the application using
>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>>>
>>>>>> That is my overall goal, I just want to make it clear
>>>>>> the difference out of write isolation, but of course,
>>>>>> "secure" or not is relative, and according to the
>>>>>> system design.
>>>>>>
>>>>>> If isolation and system stability are enough for
>>>>>> a system and can be called "secure", yes, they are
>>>>>> both the same in such aspects.
>>>>>>
>>>>>>> I would still consider such design highly suspicious but without more
>>>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>>>> :).
>>>>>>
>>>>>> What do you mean "such design"?  "Writable untrusted
>>>>>> remote EXT4 images mounting on the host"? Really, we have
>>>>>> such applications for containers for many years but I don't
>>>>>> want to name it here, but I'm totally exhaused by such
>>>>>> usage (since I explained many many times, and they even
>>>>>> never bother with LWN.net) and the internal team.
>>>>>
>>>>> By "such design" I meant generally the concept that you fetch filesystem
>>>>> images (regardless whether ext4 or some other type) from untrusted source.
>>>>> Unless you do cryptographical verification of the data, you never know what
>>>>> kind of garbage your application is processing which is always invitation
>>>>> for nasty exploits and bugs...
>>>>
>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>>>> block-backed filesystems then my verdict still stands that the only
>>>> condition under which I will let the VFS allow this if the underlying
>>>> device is signed and dm-verity protected. The kernel will continue to
>>>> refuse unprivileged policy in general and specifically based on quality
>>>> or implementation of the underlying filesystem driver.
>>>
>>>
>>> First, if block devices are your concern, fine, how about
>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE
>>> for underlay files is set, and refuse any block device
>>> mounts.
>>>
>>> If the issue is "you don't know how to define the quality
>>> or implementation of the underlying filesystem drivers",
>>> you could list your detailed concerns (I think at least
>>> people have trust to the individual filesystem
>>> maintainers' judgements), otherwise there will be endless
>>> new sets of new immutable filesystems for this requirement
>>> (previously, composefs , puzzlefs, and tarfs are all for
>>> this; I admit I didn't get the point of FS_USERNS_MOUNT
>>> at that time of 2023; but know I also think FS_USERNS_MOUNT
>>> is a strong requirement for DinD for example), because that
>>> idea should be sensible according to Darrick and Jan's
>>> reply, and I think more people will agree with that.
>>>
>>> And another idea is that you still could return arbitary
>>> metadata with immutable FUSE fses and let users get
>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
>>> and if user and mount namespaces are isolated, why bothering
>>> it?
>>>
>>> I just hope know why? And as you may notice,
>>> "Demi Marie Obenour wrote:"
>>>
>>>> The only exceptions are if the filesystem is incredibly simple
>>>> or formal methods are used, and neither is the case for existing
>>>> filesystems in the Linux kernel.
>>>
>>> I still strong disagree with that judgement, a minimal EROFS
>>> can build an image with superblock, dirs, and files with
>>> xattrs in a 4k-size image; and 4k image should be enough for
>>> fuzzing; also the in-core EROFS format even never allocates
>>> any extra buffers, which is much simplar than FUSE.
>>>
>>> In brief, so how to meet your requirement?
>>>
>>> Thanks,
>>> Gao Xiang
>>
>> Rewriting the code in Rust would dramatically reduce the attack
>> surface when it comes to memory corruption.  That's a lot to ask,
>> though, and a lot of work.
> 
> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C
> , and the attack surface is already huge.
> 
> EROFS will switch to Rust some time, but your judgement will
> make people to make another complete new toys of Rust kernel
> filesystems --- just because EROFS is currently not written
> in Rust.
> 
> I'm completely exhaused with such game: If I will address
> every single fuzzing bug and CVE, why not?
> 
> Thanks,
> Gao Xiang

I should have written that rewriting in Rust could help convince
people that it is in fact safe.  One *can* write safe C code, as shown
by OpenSSH.  It's just *harder* to do so, and harder to demonstrate
to others that C code is in fact safe.

Whether the burden of proof being placed on you is excessive is a
separate question that I do not have the experience to comment on.

That said:

> I will address every single fuzzing bug and CVE

is very different from the view of most filesystem developers.
If the fuzzers have good code coverage of EROFS, that is a very
strong argument for making an exception.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24 10:02                                             ` Demi Marie Obenour
@ 2026-03-24 10:14                                               ` Gao Xiang
  2026-03-24 10:17                                                 ` Demi Marie Obenour
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-24 10:14 UTC (permalink / raw)
  To: Demi Marie Obenour, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/24 18:02, Demi Marie Obenour wrote:
> On 3/24/26 05:53, Gao Xiang wrote:
>>
>>
>> On 2026/3/24 17:49, Demi Marie Obenour wrote:
>>> On 3/24/26 05:30, Gao Xiang wrote:
>>>> Hi Christian,
>>>>
>>>> On 2026/3/24 16:48, Christian Brauner wrote:
>>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>>>>> think that is the corner cases if you don't claim the
>>>>>>>>>> limitation of FUSE approaches.
>>>>>>>>>>
>>>>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>>>>> it provides strong isolation and stability, but I really
>>>>>>>>>> suspect this approach could be abused to mount totally
>>>>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>>>>> with unknown status and mount without fscking, that is
>>>>>>>>>> really disappointing.)
>>>>>>>>
>>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>>>>> :) We agree on that. And I agree that depending on the application using
>>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>>>>
>>>>>>> That is my overall goal, I just want to make it clear
>>>>>>> the difference out of write isolation, but of course,
>>>>>>> "secure" or not is relative, and according to the
>>>>>>> system design.
>>>>>>>
>>>>>>> If isolation and system stability are enough for
>>>>>>> a system and can be called "secure", yes, they are
>>>>>>> both the same in such aspects.
>>>>>>>
>>>>>>>> I would still consider such design highly suspicious but without more
>>>>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>>>>> :).
>>>>>>>
>>>>>>> What do you mean "such design"?  "Writable untrusted
>>>>>>> remote EXT4 images mounting on the host"? Really, we have
>>>>>>> such applications for containers for many years but I don't
>>>>>>> want to name it here, but I'm totally exhaused by such
>>>>>>> usage (since I explained many many times, and they even
>>>>>>> never bother with LWN.net) and the internal team.
>>>>>>
>>>>>> By "such design" I meant generally the concept that you fetch filesystem
>>>>>> images (regardless whether ext4 or some other type) from untrusted source.
>>>>>> Unless you do cryptographical verification of the data, you never know what
>>>>>> kind of garbage your application is processing which is always invitation
>>>>>> for nasty exploits and bugs...
>>>>>
>>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>>>>> block-backed filesystems then my verdict still stands that the only
>>>>> condition under which I will let the VFS allow this if the underlying
>>>>> device is signed and dm-verity protected. The kernel will continue to
>>>>> refuse unprivileged policy in general and specifically based on quality
>>>>> or implementation of the underlying filesystem driver.
>>>>
>>>>
>>>> First, if block devices are your concern, fine, how about
>>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE
>>>> for underlay files is set, and refuse any block device
>>>> mounts.
>>>>
>>>> If the issue is "you don't know how to define the quality
>>>> or implementation of the underlying filesystem drivers",
>>>> you could list your detailed concerns (I think at least
>>>> people have trust to the individual filesystem
>>>> maintainers' judgements), otherwise there will be endless
>>>> new sets of new immutable filesystems for this requirement
>>>> (previously, composefs , puzzlefs, and tarfs are all for
>>>> this; I admit I didn't get the point of FS_USERNS_MOUNT
>>>> at that time of 2023; but know I also think FS_USERNS_MOUNT
>>>> is a strong requirement for DinD for example), because that
>>>> idea should be sensible according to Darrick and Jan's
>>>> reply, and I think more people will agree with that.
>>>>
>>>> And another idea is that you still could return arbitary
>>>> metadata with immutable FUSE fses and let users get
>>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
>>>> and if user and mount namespaces are isolated, why bothering
>>>> it?
>>>>
>>>> I just hope know why? And as you may notice,
>>>> "Demi Marie Obenour wrote:"
>>>>
>>>>> The only exceptions are if the filesystem is incredibly simple
>>>>> or formal methods are used, and neither is the case for existing
>>>>> filesystems in the Linux kernel.
>>>>
>>>> I still strong disagree with that judgement, a minimal EROFS
>>>> can build an image with superblock, dirs, and files with
>>>> xattrs in a 4k-size image; and 4k image should be enough for
>>>> fuzzing; also the in-core EROFS format even never allocates
>>>> any extra buffers, which is much simplar than FUSE.
>>>>
>>>> In brief, so how to meet your requirement?
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>
>>> Rewriting the code in Rust would dramatically reduce the attack
>>> surface when it comes to memory corruption.  That's a lot to ask,
>>> though, and a lot of work.
>>
>> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C
>> , and the attack surface is already huge.
>>
>> EROFS will switch to Rust some time, but your judgement will
>> make people to make another complete new toys of Rust kernel
>> filesystems --- just because EROFS is currently not written
>> in Rust.
>>
>> I'm completely exhaused with such game: If I will address
>> every single fuzzing bug and CVE, why not?
>>
>> Thanks,
>> Gao Xiang
> 
> I should have written that rewriting in Rust could help convince
> people that it is in fact safe.  One *can* make safe C code, as shown
> by OpenSSH.  It's just *harder* to write safe C code, and harder to
> demonstrate to others that C code is in fact safe.

How do you formally define `safe C`?  "C without pointers"?

Actually, we tried to switch to Rust, but the Rust
developers resist incremental change: they just want pure
Rust and a wholesale switch, which is impossible for any
mature kernel filesystem.

> 
> Whether the burden of proof being placed on you is excessive is a
> separate question that I do not have the experience to comment on.

That is funny TBH, because the whole policy here is
broken: if you call out the LOC of a codebase, I believe
FUSE, OverlayFS and even TCP/IP are all more complex than
EROFS.

If you still think LOC is the issue, I'm quite happy to
isolate a `fs/simple_erofs` and drop all advanced runtime
features, and even compression.

> 
> That said:
> 
>> I will address every single fuzzing bug and CVE
> 
> is very different than the view of most filesystem developers.
> If the fuzzers have good code coverage in EROFS, this is a very strong
> argument for making an exception.

I don't know whether that is just your judgement or
Christian's.

Currently EROFS is well fuzzed by syzkaller and I keep it
at zero active issues (as I said, 4k images are enough to
fuzz the entire EROFS metadata format; almost all previous
syzkaller issues came from compressed inodes, but we can
simply disable the compression formats for
FS_USERNS_MOUNT, since compression algorithms are complex
to fuzz anyway), and we will definitely improve this part
even further if that is the real concern here.

And we will accept any fuzzing bug as a CVE and fix them
as 0day bugs, like other subsystems written in C that
accept untrusted (meta)data.  Is that the end of this
game?
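
The fuzzing workflow described above can be sketched as a simple
mutation loop: take a known-good 4KiB seed image, flip bits, and feed
every mutant to the parser under test. In this Python sketch,
`try_parse` is a hypothetical stand-in for the real target (e.g. a
sandboxed mount attempt), not an actual API:

```python
import random

def byte_flip_mutants(seed: bytes, n: int, rng: random.Random):
    """Yield n single-bit-flip mutants of a seed image."""
    for _ in range(n):
        buf = bytearray(seed)
        pos = rng.randrange(len(buf))
        buf[pos] ^= 1 << rng.randrange(8)   # flip exactly one bit
        yield bytes(buf)

def fuzz(seed: bytes, try_parse, n: int = 1000, rng_seed: int = 0) -> int:
    """Run try_parse over n mutants; return how many it rejected.
    try_parse(image) -> bool models the real parser/mount target."""
    rng = random.Random(rng_seed)
    rejected = 0
    for mutant in byte_flip_mutants(seed, n, rng):
        if not try_parse(mutant):
            rejected += 1
    return rejected
```

Real coverage-guided fuzzers (syzkaller, AFL-style mutators) are far
smarter than blind bit flips, but the small fixed seed is what makes
the search space tractable, which is the point being made about 4KiB
EROFS images.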

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24 10:14                                               ` Gao Xiang
@ 2026-03-24 10:17                                                 ` Demi Marie Obenour
  2026-03-24 10:25                                                   ` Gao Xiang
  0 siblings, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-24 10:17 UTC (permalink / raw)
  To: Gao Xiang, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 8218 bytes --]

On 3/24/26 06:14, Gao Xiang wrote:
> 
> 
> On 2026/3/24 18:02, Demi Marie Obenour wrote:
>> On 3/24/26 05:53, Gao Xiang wrote:
>>>
>>>
>>> On 2026/3/24 17:49, Demi Marie Obenour wrote:
>>>> On 3/24/26 05:30, Gao Xiang wrote:
>>>>> Hi Christian,
>>>>>
>>>>> On 2026/3/24 16:48, Christian Brauner wrote:
>>>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>>>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>>>>>> think that is the corner cases if you don't claim the
>>>>>>>>>>> limitation of FUSE approaches.
>>>>>>>>>>>
>>>>>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>>>>>> it provides strong isolation and stability, but I really
>>>>>>>>>>> suspect this approach could be abused to mount totally
>>>>>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>>>>>> with unknown status and mount without fscking, that is
>>>>>>>>>>> really disappointing.)
>>>>>>>>>
>>>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>>>>>> :) We agree on that. And I agree that depending on the application using
>>>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>>>>>
>>>>>>>> That is my overall goal, I just want to make it clear
>>>>>>>> the difference out of write isolation, but of course,
>>>>>>>> "secure" or not is relative, and according to the
>>>>>>>> system design.
>>>>>>>>
>>>>>>>> If isolation and system stability are enough for
>>>>>>>> a system and can be called "secure", yes, they are
>>>>>>>> both the same in such aspects.
>>>>>>>>
>>>>>>>>> I would still consider such design highly suspicious but without more
>>>>>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>>>>>> :).
>>>>>>>>
>>>>>>>> What do you mean "such design"?  "Writable untrusted
>>>>>>>> remote EXT4 images mounting on the host"? Really, we have
>>>>>>>> such applications for containers for many years but I don't
>>>>>>>> want to name it here, but I'm totally exhaused by such
>>>>>>>> usage (since I explained many many times, and they even
>>>>>>>> never bother with LWN.net) and the internal team.
>>>>>>>
>>>>>>> By "such design" I meant generally the concept that you fetch filesystem
>>>>>>> images (regardless whether ext4 or some other type) from untrusted source.
>>>>>>> Unless you do cryptographical verification of the data, you never know what
>>>>>>> kind of garbage your application is processing which is always invitation
>>>>>>> for nasty exploits and bugs...
>>>>>>
>>>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>>>>>> block-backed filesystems then my verdict still stands that the only
>>>>>> condition under which I will let the VFS allow this if the underlying
>>>>>> device is signed and dm-verity protected. The kernel will continue to
>>>>>> refuse unprivileged policy in general and specifically based on quality
>>>>>> or implementation of the underlying filesystem driver.
>>>>>
>>>>>
>>>>> First, if block devices are your concern, fine, how about
>>>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE
>>>>> for underlay files is set, and refuse any block device
>>>>> mounts.
>>>>>
>>>>> If the issue is "you don't know how to define the quality
>>>>> or implementation of the underlying filesystem drivers",
>>>>> you could list your detailed concerns (I think at least
>>>>> people have trust to the individual filesystem
>>>>> maintainers' judgements), otherwise there will be endless
>>>>> new sets of new immutable filesystems for this requirement
>>>>> (previously, composefs , puzzlefs, and tarfs are all for
>>>>> this; I admit I didn't get the point of FS_USERNS_MOUNT
>>>>> at that time of 2023; but know I also think FS_USERNS_MOUNT
>>>>> is a strong requirement for DinD for example), because that
>>>>> idea should be sensible according to Darrick and Jan's
>>>>> reply, and I think more people will agree with that.
>>>>>
>>>>> And another idea is that you still could return arbitary
>>>>> metadata with immutable FUSE fses and let users get
>>>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
>>>>> and if user and mount namespaces are isolated, why bothering
>>>>> it?
>>>>>
>>>>> I just hope know why? And as you may notice,
>>>>> "Demi Marie Obenour wrote:"
>>>>>
>>>>>> The only exceptions are if the filesystem is incredibly simple
>>>>>> or formal methods are used, and neither is the case for existing
>>>>>> filesystems in the Linux kernel.
>>>>>
>>>>> I still strong disagree with that judgement, a minimal EROFS
>>>>> can build an image with superblock, dirs, and files with
>>>>> xattrs in a 4k-size image; and 4k image should be enough for
>>>>> fuzzing; also the in-core EROFS format even never allocates
>>>>> any extra buffers, which is much simplar than FUSE.
>>>>>
>>>>> In brief, so how to meet your requirement?
>>>>>
>>>>> Thanks,
>>>>> Gao Xiang
>>>>
>>>> Rewriting the code in Rust would dramatically reduce the attack
>>>> surface when it comes to memory corruption.  That's a lot to ask,
>>>> though, and a lot of work.
>>>
>>> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C
>>> , and the attack surface is already huge.
>>>
>>> EROFS will switch to Rust some time, but your judgement will
>>> make people to make another complete new toys of Rust kernel
>>> filesystems --- just because EROFS is currently not written
>>> in Rust.
>>>
>>> I'm completely exhaused with such game: If I will address
>>> every single fuzzing bug and CVE, why not?
>>>
>>> Thanks,
>>> Gao Xiang
>>
>> I should have written that rewriting in Rust could help convince
>> people that it is in fact safe.  One *can* make safe C code, as shown
>> by OpenSSH.  It's just *harder* to write safe C code, and harder to
>> demonstrate to others that C code is in fact safe.
> 
> How do you define a formal `safe C`? "C without pointers"?

Safe = "history of not having many vulnerabilities"

> Actually, we tried to switch to Rust but Rust developpers
> resist with incremental change, they just want a pure Rust
> and switch to it all the time, that is impossible for all
> mature kernel filesystems.

Incremental change is definitely good.

>> Whether the burden of proof being placed on you is excessive is a
>> separate question that I do not have the experience to comment on.
> 
> That is funny TBH, just because the whole policy here
> is broken, if you call out the LOC of codebase, I
> believe FUSE, OverlayFS and even TCP/IP are all complex
> than EROFS.
> 
> If you still think LOC is the issue, I'm pretty fine to
> isolate a `fs/simple_erofs` and drop all advanced runtime
> features and even compression.

I don't think LOC is the main problem.

>> That said:
>>
>>> I will address every single fuzzing bug and CVE
>>
>> is very different than the view of most filesystem developers.
>> If the fuzzers have good code coverage in EROFS, this is a very strong
>> argument for making an exception.
> 
> I don't know if it's just your judgement or Christian's
> judgement.
> 
> Currently EROFS is well-fuzzed by syzkaller and I keep
> maintaining it as 0 active issue (as I said, 4k images
> are enough for fuzzing all EROFS metadata format, almost
> all previous syzkaller issues are out of compressed
> inodes but we can just disable compression formats for
> FS_USERNS_MOUNT, just because compression algorithms
> are already complex for fuzzing) and we will definitely
> improve this part even further if that is the real
> concern of this.
> 
> And we will accept any fuzzing bug as CVE, and fix them
> as 0day bugs like other subsystems written in C which
> accept untrusted (meta)data.  Is that end of story of
> this game?

It should be!
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24 10:17                                                 ` Demi Marie Obenour
@ 2026-03-24 10:25                                                   ` Gao Xiang
  0 siblings, 0 replies; 79+ messages in thread
From: Gao Xiang @ 2026-03-24 10:25 UTC (permalink / raw)
  To: Demi Marie Obenour, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/24 18:17, Demi Marie Obenour wrote:
> On 3/24/26 06:14, Gao Xiang wrote:
>>
>>
>> On 2026/3/24 18:02, Demi Marie Obenour wrote:
>>> On 3/24/26 05:53, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2026/3/24 17:49, Demi Marie Obenour wrote:
>>>>> On 3/24/26 05:30, Gao Xiang wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> On 2026/3/24 16:48, Christian Brauner wrote:
>>>>>>> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>>>>>>>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>>>>>>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>>>>>>>> think that is the corner cases if you don't claim the
>>>>>>>>>>>> limitation of FUSE approaches.
>>>>>>>>>>>>
>>>>>>>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>>>>>>>> it provides strong isolation and stability, but I really
>>>>>>>>>>>> suspect this approach could be abused to mount totally
>>>>>>>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>>>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>>>>>>>> with unknown status and mount without fscking, that is
>>>>>>>>>>>> really disappointing.)
>>>>>>>>>>
>>>>>>>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>>>>>>>> using it for sensitive application, that falls to "insane" category for me
>>>>>>>>>> :) We agree on that. And I agree that depending on the application using
>>>>>>>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>>>>>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>>>>>>>
>>>>>>>>> That is my overall goal, I just want to make it clear
>>>>>>>>> the difference out of write isolation, but of course,
>>>>>>>>> "secure" or not is relative, and according to the
>>>>>>>>> system design.
>>>>>>>>>
>>>>>>>>> If isolation and system stability are enough for
>>>>>>>>> a system and can be called "secure", yes, they are
>>>>>>>>> both the same in such aspects.
>>>>>>>>>
>>>>>>>>>> I would still consider such design highly suspicious but without more
>>>>>>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>>>>>>> :).
>>>>>>>>>
>>>>>>>>> What do you mean "such design"?  "Writable untrusted
>>>>>>>>> remote EXT4 images mounting on the host"? Really, we have
>>>>>>>>> such applications for containers for many years but I don't
>>>>>>>>> want to name it here, but I'm totally exhaused by such
>>>>>>>>> usage (since I explained many many times, and they even
>>>>>>>>> never bother with LWN.net) and the internal team.
>>>>>>>>
>>>>>>>> By "such design" I meant generally the concept that you fetch filesystem
>>>>>>>> images (regardless whether ext4 or some other type) from untrusted source.
>>>>>>>> Unless you do cryptographical verification of the data, you never know what
>>>>>>>> kind of garbage your application is processing which is always invitation
>>>>>>>> for nasty exploits and bugs...
>>>>>>>
>>>>>>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>>>>>>> block-backed filesystems then my verdict still stands that the only
>>>>>>> condition under which I will let the VFS allow this if the underlying
>>>>>>> device is signed and dm-verity protected. The kernel will continue to
>>>>>>> refuse unprivileged policy in general and specifically based on quality
>>>>>>> or implementation of the underlying filesystem driver.
>>>>>>
>>>>>>
>>>>>> First, if block devices are your concern, fine, how about
>>>>>> allowing it if EROFS file-backed mounts and S_IMMUTABLE
>>>>>> for underlay files is set, and refuse any block device
>>>>>> mounts.
>>>>>>
>>>>>> If the issue is "you don't know how to define the quality
>>>>>> or implementation of the underlying filesystem drivers",
>>>>>> you could list your detailed concerns (I think at least
>>>>>> people have trust to the individual filesystem
>>>>>> maintainers' judgements), otherwise there will be endless
>>>>>> new sets of new immutable filesystems for this requirement
>>>>>> (previously, composefs , puzzlefs, and tarfs are all for
>>>>>> this; I admit I didn't get the point of FS_USERNS_MOUNT
>>>>>> at that time of 2023; but know I also think FS_USERNS_MOUNT
>>>>>> is a strong requirement for DinD for example), because that
>>>>>> idea should be sensible according to Darrick and Jan's
>>>>>> reply, and I think more people will agree with that.
>>>>>>
>>>>>> And another idea is that you still could return arbitary
>>>>>> metadata with immutable FUSE fses and let users get
>>>>>> garbage (meta)data, and FUSE already allows FS_USERNS_MOUNT,
>>>>>> and if user and mount namespaces are isolated, why bothering
>>>>>> it?
>>>>>>
>>>>>> I just hope know why? And as you may notice,
>>>>>> "Demi Marie Obenour wrote:"
>>>>>>
>>>>>>> The only exceptions are if the filesystem is incredibly simple
>>>>>>> or formal methods are used, and neither is the case for existing
>>>>>>> filesystems in the Linux kernel.
>>>>>>
>>>>>> I still strong disagree with that judgement, a minimal EROFS
>>>>>> can build an image with superblock, dirs, and files with
>>>>>> xattrs in a 4k-size image; and 4k image should be enough for
>>>>>> fuzzing; also the in-core EROFS format even never allocates
>>>>>> any extra buffers, which is much simplar than FUSE.
>>>>>>
>>>>>> In brief, so how to meet your requirement?
>>>>>>
>>>>>> Thanks,
>>>>>> Gao Xiang
>>>>>
>>>>> Rewriting the code in Rust would dramatically reduce the attack
>>>>> surface when it comes to memory corruption.  That's a lot to ask,
>>>>> though, and a lot of work.
>>>>
>>>> I don't think so, FUSE can do FS_USERNS_MOUNT and written in C
>>>> , and the attack surface is already huge.
>>>>
>>>> EROFS will switch to Rust some time, but your judgement will
>>>> make people to make another complete new toys of Rust kernel
>>>> filesystems --- just because EROFS is currently not written
>>>> in Rust.
>>>>
>>>> I'm completely exhaused with such game: If I will address
>>>> every single fuzzing bug and CVE, why not?
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>
>>> I should have written that rewriting in Rust could help convince
>>> people that it is in fact safe.  One *can* make safe C code, as shown
>>> by OpenSSH.  It's just *harder* to write safe C code, and harder to
>>> demonstrate to others that C code is in fact safe.
>>
>> How do you define a formal `safe C`? "C without pointers"?
> 
> Safe = "history of not having many vulnerabilities"

So that would mean no pointers, but that is almost
impossible for filesystems, since filesystem APIs
work with pointers.

> 
>> Actually, we tried to switch to Rust but Rust developpers
>> resist with incremental change, they just want a pure Rust
>> and switch to it all the time, that is impossible for all
>> mature kernel filesystems.
> 
> Incremental change is definitely good.

Those developers resisted this two years ago.

> 
>>> Whether the burden of proof being placed on you is excessive is a
>>> separate question that I do not have the experience to comment on.
>>
>> That is funny TBH, just because the whole policy here
>> is broken, if you call out the LOC of codebase, I
>> believe FUSE, OverlayFS and even TCP/IP are all complex
>> than EROFS.
>>
>> If you still think LOC is the issue, I'm pretty fine to
>> isolate a `fs/simple_erofs` and drop all advanced runtime
>> features and even compression.
> 
> I don't think LOC is the main problem.

But folks come to me, telling me EROFS is unsafe
because its LOC count is larger than their new stuff.

How should I react?

> 
>>> That said:
>>>
>>>> I will address every single fuzzing bug and CVE
>>>
>>> is very different than the view of most filesystem developers.
>>> If the fuzzers have good code coverage in EROFS, this is a very strong
>>> argument for making an exception.
>>
>> I don't know if it's just your judgement or Christian's
>> judgement.
>>
>> Currently EROFS is well-fuzzed by syzkaller and I keep
>> maintaining it as 0 active issue (as I said, 4k images
>> are enough for fuzzing all EROFS metadata format, almost
>> all previous syzkaller issues are out of compressed
>> inodes but we can just disable compression formats for
>> FS_USERNS_MOUNT, just because compression algorithms
>> are already complex for fuzzing) and we will definitely
>> improve this part even further if that is the real
>> concern of this.
>>
>> And we will accept any fuzzing bug as CVE, and fix them
>> as 0day bugs like other subsystems written in C which
>> accept untrusted (meta)data.  Is that end of story of
>> this game?
> 
> It should be!

So why? Can anyone tell me why?

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24  8:48                                     ` Christian Brauner
  2026-03-24  9:30                                       ` Gao Xiang
@ 2026-03-24 11:58                                       ` Demi Marie Obenour
  2026-03-24 12:21                                         ` Gao Xiang
  1 sibling, 1 reply; 79+ messages in thread
From: Demi Marie Obenour @ 2026-03-24 11:58 UTC (permalink / raw)
  To: Christian Brauner, Jan Kara
  Cc: Gao Xiang, Darrick J. Wong, Miklos Szeredi, linux-fsdevel,
	Joanne Koong, John Groves, Bernd Schubert, Amir Goldstein,
	Luis Henriques, Horst Birthelmer, Gao Xiang, lsf-pc


[-- Attachment #1.1.1: Type: text/plain, Size: 3426 bytes --]

On 3/24/26 04:48, Christian Brauner wrote:
> On Mon, Mar 23, 2026 at 03:47:24PM +0100, Jan Kara wrote:
>> On Mon 23-03-26 22:36:46, Gao Xiang wrote:
>>> On 2026/3/23 22:13, Jan Kara wrote:
>>>>>> think that is the corner cases if you don't claim the
>>>>>> limitation of FUSE approaches.
>>>>>>
>>>>>> If none expects that, that is absolute be fine, as I said,
>>>>>> it provides strong isolation and stability, but I really
>>>>>> suspect this approach could be abused to mount totally
>>>>>> untrusted remote filesystems (Actually as I said, some
>>>>>> business of ours already did: fetching EXT4 filesystems
>>>>>> with unknown status and mount without fscking, that is
>>>>>> really disappointing.)
>>>>
>>>> Yes, someone downloading untrusted ext4 image, mounting in read-write and
>>>> using it for sensitive application, that falls to "insane" category for me
>>>> :) We agree on that. And I agree that depending on the application using
>>>> FUSE to access such filesystem needn't be safe enough and immutable fs +
>>>> overlayfs writeable layer may provide better guarantees about fs behavior.
>>>
>>> That is my overall goal, I just want to make it clear
>>> the difference out of write isolation, but of course,
>>> "secure" or not is relative, and according to the
>>> system design.
>>>
>>> If isolation and system stability are enough for
>>> a system and can be called "secure", yes, they are
>>> both the same in such aspects.
>>>
>>>> I would still consider such design highly suspicious but without more
>>>> detailed knowledge about the application I cannot say it's outright broken
>>>> :).
>>>
>>> What do you mean "such design"?  "Writable untrusted
>>> remote EXT4 images mounting on the host"? Really, we have
>>> such applications for containers for many years but I don't
>>> want to name it here, but I'm totally exhaused by such
>>> usage (since I explained many many times, and they even
>>> never bother with LWN.net) and the internal team.
>>
>> By "such design" I meant generally the concept that you fetch filesystem
>> images (regardless whether ext4 or some other type) from untrusted source.
>> Unless you do cryptographical verification of the data, you never know what
>> kind of garbage your application is processing which is always invitation
>> for nasty exploits and bugs...
> 
> If this is another 500 mail discussion about FS_USERNS_MOUNT on
> block-backed filesystems then my verdict still stands that the only
> condition under which I will let the VFS allow this if the underlying
> device is signed and dm-verity protected. The kernel will continue to
> refuse unprivileged policy in general and specifically based on quality
> or implementation of the underlying filesystem driver.

As far as I can tell, the main problems are:

1. Most filesystems can only be run in kernel mode, so one needs a
   VM and an expensive RPC protocol if one wants to run them in a
   sandboxed environment.

2. Context switch overhead is so high that running filesystems entirely
   in userspace, without some form of in-kernel I/O acceleration,
   is a performance problem.

3. Filesystems are written in C and not designed to be secure against
   malicious on-disk images.

Gao Xiang is working on problem 3 for EROFS.
FUSE iomap support solves problem 2.  lklfuse solves problem 1.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24 11:58                                       ` Demi Marie Obenour
@ 2026-03-24 12:21                                         ` Gao Xiang
  2026-03-26 14:39                                           ` Christian Brauner
  0 siblings, 1 reply; 79+ messages in thread
From: Gao Xiang @ 2026-03-24 12:21 UTC (permalink / raw)
  To: Demi Marie Obenour, Christian Brauner, Jan Kara
  Cc: Darrick J. Wong, Miklos Szeredi, linux-fsdevel, Joanne Koong,
	John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
	Horst Birthelmer, Gao Xiang, lsf-pc



On 2026/3/24 19:58, Demi Marie Obenour wrote:
> On 3/24/26 04:48, Christian Brauner wrote:

...

>>>>
>>>>> I would still consider such design highly suspicious but without more
>>>>> detailed knowledge about the application I cannot say it's outright broken
>>>>> :).
>>>>
>>>> What do you mean "such design"?  "Writable untrusted
>>>> remote EXT4 images mounting on the host"? Really, we have
>>>> such applications for containers for many years but I don't
>>>> want to name it here, but I'm totally exhaused by such
>>>> usage (since I explained many many times, and they even
>>>> never bother with LWN.net) and the internal team.
>>>
>>> By "such design" I meant generally the concept that you fetch filesystem
>>> images (regardless whether ext4 or some other type) from untrusted source.
>>> Unless you do cryptographical verification of the data, you never know what
>>> kind of garbage your application is processing which is always invitation
>>> for nasty exploits and bugs...
>>
>> If this is another 500 mail discussion about FS_USERNS_MOUNT on
>> block-backed filesystems then my verdict still stands that the only
>> condition under which I will let the VFS allow this if the underlying
>> device is signed and dm-verity protected. The kernel will continue to
>> refuse unprivileged policy in general and specifically based on quality
>> or implementation of the underlying filesystem driver.
> 
> As far as I can tell, the main problems are:
> 
> 1. Most filesystems can only be run in kernel mode, so one needs a
>     VM and an expensive RPC protocol if one wants to run them in a
>     sandboxed environment.
> 
> 2. Context switch overhead is so high that running filesystems entirely
>     in userspace, without some form of in-kernel I/O acceleration,
>     is a performance problem.
> 
> 3. Filesystems are written in C and not designed to be secure against
>     malicious on-disk images.
> 
> Gao Xiang is working on problem for EROFS.
> FUSE iomap support solves 2.  lklfuse solves problem 1.

Sigh, I would just like to say, as in Darrick's and Jan's previous
replies, that immutable on-disk fses are a special kind of filesystem,
and the overall on-disk format is there to provide the VFS/MM with basic
information (like LOOKUP, GETATTR, READDIR and READ). The
point is that even if some metadata values could be considered
inconsistent, it's just like a FUSE unprivileged daemon returning
garbage (meta)data and/or TAR extracting garbage (meta)data --
it shouldn't matter at all.

Why I'm here is that I'm totally exhausted by arbitrary claims like
"all kernel filesystems are insecure". Again, that is absolutely
untrue: the feature set, the working model and the implementation
complexity of immutable filesystems make them more secure by
design.

Also, the reason for "another 500 mail discussion about
FS_USERNS_MOUNT" is simply that FS_USERNS_MOUNT is very, very
useful for containers, and this special kind of immutable on-disk
filesystem can fit that goal technically, quite unlike
generic writable on-disk fses or NFS. The reason I work
on EROFS is also that I believe immutable on-disk filesystems
are absolutely useful and more secure by design than other generic
writable fses, especially for containers and for handling untrusted
remote data.

I claim here again that every implementation vulnerability in
EROFS will be treated as a 0-day bug, and I've already done so
for many years.  Let's step back: even if it isn't me, if there are
other sane immutable filesystems aiming at containers,
they will definitely make the same claim, why not?

Thanks,
Gao Xiang





^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
  2026-03-24 12:21                                         ` Gao Xiang
@ 2026-03-26 14:39                                           ` Christian Brauner
  0 siblings, 0 replies; 79+ messages in thread
From: Christian Brauner @ 2026-03-26 14:39 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Demi Marie Obenour, Jan Kara, Darrick J. Wong, Miklos Szeredi,
	linux-fsdevel, Joanne Koong, John Groves, Bernd Schubert,
	Amir Goldstein, Luis Henriques, Horst Birthelmer, Gao Xiang,
	lsf-pc

On Tue, Mar 24, 2026 at 08:21:00PM +0800, Gao Xiang wrote:
> 
> 
> On 2026/3/24 19:58, Demi Marie Obenour wrote:
> > On 3/24/26 04:48, Christian Brauner wrote:
> 
> ...
> 
> > > > > 
> > > > > > I would still consider such design highly suspicious but without more
> > > > > > detailed knowledge about the application I cannot say it's outright broken
> > > > > > :).
> > > > > 
> > > > > What do you mean "such design"?  "Writable untrusted
> > > > > remote EXT4 images mounting on the host"? Really, we have
> > > > > such applications for containers for many years but I don't
> > > > > want to name it here, but I'm totally exhaused by such
> > > > > usage (since I explained many many times, and they even
> > > > > never bother with LWN.net) and the internal team.
> > > > 
> > > > By "such design" I meant generally the concept that you fetch filesystem
> > > > images (regardless whether ext4 or some other type) from untrusted source.
> > > > Unless you do cryptographical verification of the data, you never know what
> > > > kind of garbage your application is processing which is always invitation
> > > > for nasty exploits and bugs...
> > > 
> > > If this is another 500 mail discussion about FS_USERNS_MOUNT on
> > > block-backed filesystems then my verdict still stands that the only
> > > condition under which I will let the VFS allow this if the underlying
> > > device is signed and dm-verity protected. The kernel will continue to
> > > refuse unprivileged policy in general and specifically based on quality
> > > or implementation of the underlying filesystem driver.
> > 
> > As far as I can tell, the main problems are:
> > 
> > 1. Most filesystems can only be run in kernel mode, so one needs a
> >     VM and an expensive RPC protocol if one wants to run them in a
> >     sandboxed environment.
> > 
> > 2. Context switch overhead is so high that running filesystems entirely
> >     in userspace, without some form of in-kernel I/O acceleration,
> >     is a performance problem.
> > 
> > 3. Filesystems are written in C and not designed to be secure against
> >     malicious on-disk images.
> > 
> > Gao Xiang is working on problem for EROFS.
> > FUSE iomap support solves 2.  lklfuse solves problem 1.
> 
> Sigh, I just would like to say, as Darrick and Jan's previous
> replies, immutable on-disk fses are a special kind of filesystems
> and the overall on-disk format is to provide vfs/MM basic
> informattion (like LOOKUP, GETATTR, and READDIR, READ), and the
> reason is that even some values of metadata could be considered
> as inconsistent, it's just like FUSE unprivileged daemon returns
> garbage (meta)data and/or TAR extracts garbage (meta)data --
> shouldn't matter at all.
> 
> Why I'm here is I'm totally exhaused by arbitary claim like
> "all kernel filesystem are insecure". Again, that is absolutely
> untrue: the feature set, the working model and the implementation
> complexity of immutable filesystems make it more secure by
> design.
> 
> Also the reason of "another 500 mail discussion about
> FS_USERNS_MOUNT" is just because "FS_USERNS_MOUNT is very very
> useful to containers", and the special kind of immutable on-disk
> filesystems can fit this goal technically which is much much
> unlike to generic writable ondisk fses or NFS and why I working
> on EROFS is also because I believe immutable ondisk filesystems
> are absolutely useful, more secure than other generic writable
> fses by design especially on containers and handling untrusted
> remote data.
> 
> I here claim again that all implementation vulnerability of
> EROFS will claim as 0-day bug, and I've already did in this way
> for many years.  Let's step back, even not me, if there are
> some other sane immutable filesystems aiming for containers,
> they will definitely claim the same, why not?

If you want unprivileged filesystem drivers mountable by arbitrary users
and containers, then get behind the effort to move this completely out of
the kernel and into fuse, making fuse fast enough that we don't have
to think about it anymore.

The whole push over the last few years has been that if users want to
mount arbitrary in-kernel filesystems from userspace, then they had better
build a delegation and security model _in userspace_ to make this happen.
This is why we built mountfsd in userspace, which works just fine today.

I don't understand what exactly people think is going to happen once we
start promising that mounting untrusted images in the kernel is fine for
even one filesystem. This will march us into security madness we have
not experienced before, with all of the k8s and container workloads out
there.

For me it is currently still completely irrelevant which filesystem
driver this is and whether it is immutable or not. Look at the size of
the attack surface in your codebase and your algorithms, and the ever
expanding functionality they expose. This pipe dream of "rootless"
containers being able to mount arbitrary images in-kernel without
userspace policy is not workable.

We debate this over and over because userspace is unwilling to accept
that there are fundamental policy problems that are not solved in the
kernel. And that includes when it is safe to mount arbitrary data. This
is especially true now as we're being flooded with (valid and invalid)
CVEs due to everyone believing their personal LLM companion.

You're going to be at LSF/MM/BPF and I'm sure there'll be more
discussion around this.

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2026-03-26 14:39 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <aYIsRc03fGhQ7vbS@groves.net>
2026-02-02 13:51 ` [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more Miklos Szeredi
2026-02-02 16:14   ` Amir Goldstein
2026-02-03  7:55     ` Miklos Szeredi
2026-02-03  9:19       ` [Lsf-pc] " Jan Kara
2026-02-03 10:31         ` Amir Goldstein
2026-02-04  9:22       ` Joanne Koong
2026-02-04 10:37         ` Amir Goldstein
2026-02-04 10:43         ` [Lsf-pc] " Jan Kara
2026-02-06  6:09           ` Darrick J. Wong
2026-02-21  6:07             ` Demi Marie Obenour
2026-02-21  7:07               ` Darrick J. Wong
2026-02-21 22:16                 ` Demi Marie Obenour
2026-02-23 21:58                   ` Darrick J. Wong
2026-02-04 20:47         ` Bernd Schubert
2026-02-06  6:26         ` Darrick J. Wong
2026-02-03 10:15     ` Luis Henriques
2026-02-03 10:20       ` Amir Goldstein
2026-02-03 10:38         ` Luis Henriques
2026-02-03 14:20         ` Christian Brauner
2026-02-03 10:36   ` Amir Goldstein
2026-02-03 17:13   ` John Groves
2026-02-04 19:06   ` Darrick J. Wong
2026-02-04 19:38     ` Horst Birthelmer
2026-02-04 20:58     ` Bernd Schubert
2026-02-06  5:47       ` Darrick J. Wong
2026-02-04 22:50     ` Gao Xiang
2026-02-06  5:38       ` Darrick J. Wong
2026-02-06  6:15         ` Gao Xiang
2026-02-21  0:47           ` Darrick J. Wong
2026-03-17  4:17             ` Gao Xiang
2026-03-18 21:51               ` Darrick J. Wong
2026-03-19  8:05                 ` Gao Xiang
2026-03-22  3:25                 ` Demi Marie Obenour
2026-03-22  3:52                   ` Gao Xiang
2026-03-22  4:51                   ` Gao Xiang
2026-03-22  5:13                     ` Demi Marie Obenour
2026-03-22  5:30                       ` Gao Xiang
2026-03-23  9:54                     ` [Lsf-pc] " Jan Kara
2026-03-23 10:19                       ` Gao Xiang
2026-03-23 11:14                         ` Jan Kara
2026-03-23 11:42                           ` Gao Xiang
2026-03-23 12:01                             ` Gao Xiang
2026-03-23 14:13                               ` Jan Kara
2026-03-23 14:36                                 ` Gao Xiang
2026-03-23 14:47                                   ` Jan Kara
2026-03-23 14:57                                     ` Gao Xiang
2026-03-24  8:48                                     ` Christian Brauner
2026-03-24  9:30                                       ` Gao Xiang
2026-03-24  9:49                                         ` Demi Marie Obenour
2026-03-24  9:53                                           ` Gao Xiang
2026-03-24 10:02                                             ` Demi Marie Obenour
2026-03-24 10:14                                               ` Gao Xiang
2026-03-24 10:17                                                 ` Demi Marie Obenour
2026-03-24 10:25                                                   ` Gao Xiang
2026-03-24 11:58                                       ` Demi Marie Obenour
2026-03-24 12:21                                         ` Gao Xiang
2026-03-26 14:39                                           ` Christian Brauner
2026-03-23 12:08                           ` Demi Marie Obenour
2026-03-23 12:13                             ` Gao Xiang
2026-03-23 12:19                               ` Demi Marie Obenour
2026-03-23 12:30                                 ` Gao Xiang
2026-03-23 12:33                                   ` Gao Xiang
2026-03-22  5:14                   ` Gao Xiang
2026-03-23  9:43                     ` [Lsf-pc] " Jan Kara
2026-03-23 10:05                       ` Gao Xiang
2026-03-23 10:14                         ` Jan Kara
2026-03-23 10:30                           ` Gao Xiang
2026-02-04 23:19     ` Gao Xiang
2026-02-05  3:33     ` John Groves
2026-02-05  9:27       ` Amir Goldstein
2026-02-06  5:52         ` Darrick J. Wong
2026-02-06 20:48           ` John Groves
2026-02-07  0:22             ` Joanne Koong
2026-02-12  4:46               ` Joanne Koong
2026-02-21  0:37                 ` Darrick J. Wong
2026-02-26 20:21                   ` Joanne Koong
2026-03-03  4:57                     ` Darrick J. Wong
2026-03-03 17:28                       ` Joanne Koong
2026-02-20 23:59             ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox