[RFC] Another take at restarting FUSE servers

* [RFC] Another take at restarting FUSE servers
@ 2025-07-29 13:56 Luis Henriques
  2025-07-29 23:38 ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-07-29 13:56 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert; +Cc: linux-fsdevel, linux-kernel

Hi!

I know this has been discussed several times in several places, and the
recent(ish) addition of NOTIFY_RESEND is an important step towards being
able to restart a user-space FUSE server.

While looking at how to restart a server that uses the libfuse lowlevel
API, I've created an RFC pull request [1] to understand whether adding
support for this operation would be something acceptable in the project.
The PR doesn't do anything sophisticated, it simply hacks into the opaque
libfuse data structures so that a server could set some of the sessions'
fields.

So, a FUSE server simply has to save the /dev/fuse file descriptor and
pass it to libfuse while recovering, after a restart or a crash.  The
mentioned NOTIFY_RESEND should be used so that no requests are lost, of
course.  And there are probably other data structures that user-space file
systems will have to keep track of as well, so that everything can be
restored.  (The parameters set in the INIT phase, for example.)

But, from the discussion with Bernd in the PR, one of the things that
would be good to have is for the kernel to send back to user-space the
information about the inodes it already knows about.

I have been playing with this idea with a patch that simply sends out
LOOKUPs for each of these inodes.  This could be done through a new
NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the
already existing NOTIFY_RESEND.

Anyway, before spending any more time with this, I wanted to ask whether
this is something that could be acceptable in the kernel, if people think
a different approach should be followed, or if I'm simply trying to solve
the wrong problem.

Thanks in advance for any feedback on this.

[1] https://github.com/libfuse/libfuse/pull/1219

Cheers,
-- 
Luís

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-29 13:56 [RFC] Another take at restarting FUSE servers Luis Henriques
@ 2025-07-29 23:38 ` Darrick J. Wong
  2025-07-30 14:04   ` Luis Henriques
  2025-07-31 13:04   ` Theodore Ts'o
  0 siblings, 2 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-07-29 23:38 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel

On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
> Hi!
> 
> I know this has been discussed several times in several places, and the
> recent(ish) addition of NOTIFY_RESEND is an important step towards being
> able to restart a user-space FUSE server.
> 
> While looking at how to restart a server that uses the libfuse lowlevel
> API, I've created an RFC pull request [1] to understand whether adding
> support for this operation would be something acceptable in the project.

Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
could restart itself.  It's unclear if doing so will actually enable us
to clear the condition that caused the failure in the first place, but I
suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
aren't totally crazy.

> The PR doesn't do anything sophisticated, it simply hacks into the opaque
> libfuse data structures so that a server could set some of the sessions'
> fields.
> 
> So, a FUSE server simply has to save the /dev/fuse file descriptor and
> pass it to libfuse while recovering, after a restart or a crash.  The
> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
> course.  And there are probably other data structures that user-space file
> systems will have to keep track as well, so that everything can be
> restored.  (The parameters set in the INIT phase, for example.)

Yeah, I don't know how that would work in practice.  Would the kernel
send back the old connection flags and whatnot via some sort of
FUSE_REINIT request, and the fuse server can either decide that it will
try to recover, or just bail out?

> But, from the discussion with Bernd in the PR, one of the things that
> would be good to have is for the kernel to send back to user-space the
> information about the inodes it already knows about.
> 
> I have been playing with this idea with a patch that simply sends out
> LOOKUPs for each of these inodes.  This could be done through a new
> NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the
> already existing NOTIFY_RESEND.

I have no idea if NOTIFY_RESEND already does this, but you'd probably
want to purge all the unreferenced dentries/inodes to reduce the amount
of re-querying.

I gather that any fuse server that wants to reboot itself would either
have to persist what the nodeids map to, or otherwise stabilize them?
For example, fuse2fs could set the nodeid to match the ext2 inode
numbers.  Then reconnecting them wouldn't be too hard.

> Anyway, before spending any more time with this, I wanted to ask whether
> this is something that could be acceptable in the kernel, if people think
> a different approach should be followed, or if I'm simply trying to solve
> the wrong problem.
> 
> Thanks in advance for any feedback on this.
> 
> [1] https://github.com/libfuse/libfuse/pull/1219

Who calls fuse_session_reinitialize() ?

--D

> Cheers,
> -- 
> Luís
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-29 23:38 ` Darrick J. Wong
@ 2025-07-30 14:04   ` Luis Henriques
  2025-07-31 11:33     ` Christian Brauner
  2025-07-31 13:04   ` Theodore Ts'o
  1 sibling, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-07-30 14:04 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel

Hi Darrick,

On Tue, Jul 29 2025, Darrick J. Wong wrote:

> On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
>> Hi!
>> 
>> I know this has been discussed several times in several places, and the
>> recent(ish) addition of NOTIFY_RESEND is an important step towards being
>> able to restart a user-space FUSE server.
>> 
>> While looking at how to restart a server that uses the libfuse lowlevel
>> API, I've created an RFC pull request [1] to understand whether adding
>> support for this operation would be something acceptable in the project.
>
> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> could restart itself.  It's unclear if doing so will actually enable us
> to clear the condition that caused the failure in the first place, but I
> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> aren't totally crazy.

Maybe my PR lacks a bit of ambition -- its goal wasn't to have libfuse do
the restart itself.  Instead, it simply adds some visibility into the
opaque data structures so that a FUSE server could re-initialise a session
without having to go through a full remount.

But sure, there are other things that could be added to the library as
well.  For example, in my current experiments, the FUSE server needs to
start some sort of "file descriptor server" to keep the fd alive for the
restart.  This daemon could be optionally provided in libfuse itself,
which could also be used to store all sorts of blobs needed by the file
system after recovery is done.
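
A minimal sketch of how such a "file descriptor server" could hand the fd
back and forth, assuming the two processes share an AF_UNIX socket and pass
the descriptor with SCM_RIGHTS.  The helper names send_fd() and recv_fd()
are hypothetical and are not part of libfuse or of the PR; in a real setup
the descriptor passed would be the open /dev/fuse fd.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* ancillary data must ride on real payload */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {			/* union keeps the control buffer aligned */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	assert(sock >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file description as the one
sent, so the connection stays alive as long as either side keeps it open.
The sketch does no authentication of the peer, which a real daemon would
need.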

>> The PR doesn't do anything sophisticated, it simply hacks into the opaque
>> libfuse data structures so that a server could set some of the sessions'
>> fields.
>> 
>> So, a FUSE server simply has to save the /dev/fuse file descriptor and
>> pass it to libfuse while recovering, after a restart or a crash.  The
>> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
>> course.  And there are probably other data structures that user-space file
>> systems will have to keep track as well, so that everything can be
>> restored.  (The parameters set in the INIT phase, for example.)
>
> Yeah, I don't know how that would work in practice.  Would the kernel
> send back the old connection flags and whatnot via some sort of
> FUSE_REINIT request, and the fuse server can either decide that it will
> try to recover, or just bail out?

That would be an option.  But my current idea would be that the server
would need to store those somewhere and simply assume they are still OK
after reconnecting.  The kernel wouldn't need to know the user-space was
replaced by another server, potentially different, after an upgrade for
example.

Right now, AFAIU, restarting a FUSE server *can* be done without any help
from the kernel side, as long as the fd is kept alive.  The NOTIFY_RESEND
is used only for resending FUSE requests for which the kernel is still
waiting for replies.  So, for example, if the kernel sends a FUSE_READ to
user-space and the server crashes while trying to serve it, the kernel
will still be waiting for that reply.  However, a new server trying to
recover from the crash will have no way to know that.  And this is where
the NOTIFY_RESEND is useful.

>> But, from the discussion with Bernd in the PR, one of the things that
>> would be good to have is for the kernel to send back to user-space the
>> information about the inodes it already knows about.
>> 
>> I have been playing with this idea with a patch that simply sends out
>> LOOKUPs for each of these inodes.  This could be done through a new
>> NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the
>> already existing NOTIFY_RESEND.
>
> I have no idea if NOTIFY_RESEND already does this, but you'd probably
> want to purge all the unreferenced dentries/inodes to reduce the amount
> of re-querying.

No, NOTIFY_RESEND doesn't purge any of those; currently it simply resends
all the pending requests.

> I gather that any fuse server that wants to reboot itself would either
> have to persist what the nodeids map to, or otherwise stabilize them?
> For example, fuse2fs could set the nodeid to match the ext2 inode
> numbers.  Then reconnecting them wouldn't be too hard.

Right, that's my understanding as well -- restarting a server requires
stable nodeids.  IIRC most (all?) examples shipped with libfuse can't be
restarted because they cast a pointer (the memory address of some sort of
inode data struct) and use that as the nodeid.
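
A minimal sketch of what a stable nodeid scheme could look like for an
ext2-style file system, assuming the nodeid is derived from the on-disk
inode number instead of a pointer.  The function and macro names are
hypothetical, and the swap of inodes 1 and 2 is an assumption made here
only because the FUSE root nodeid is 1 while the ext2 root inode is 2; it
is not taken from fuse2fs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FUSE_ROOT_NODEID	1u	/* same value as FUSE_ROOT_ID */
#define SKETCH_EXT2_BAD_INO	1u	/* same value as EXT2_BAD_INO */
#define SKETCH_EXT2_ROOT_INO	2u	/* same value as EXT2_ROOT_INO */

/* Map an on-disk inode number to a nodeid that survives a restart.
 * Identity mapping, except that inodes 1 and 2 are swapped so that
 * the root directory lands on the FUSE root nodeid. */
static uint64_t ino_to_nodeid(uint32_t ino)
{
	assert(ino != 0);
	if (ino == SKETCH_EXT2_ROOT_INO)
		return SKETCH_FUSE_ROOT_NODEID;
	if (ino == SKETCH_EXT2_BAD_INO)
		return SKETCH_EXT2_ROOT_INO;
	return ino;
}

/* Inverse mapping, used when the kernel hands a nodeid back. */
static uint32_t nodeid_to_ino(uint64_t nodeid)
{
	assert(nodeid != 0 && nodeid <= UINT32_MAX);
	if (nodeid == SKETCH_FUSE_ROOT_NODEID)
		return SKETCH_EXT2_ROOT_INO;
	if (nodeid == SKETCH_EXT2_ROOT_INO)
		return SKETCH_EXT2_BAD_INO;
	return (uint32_t)nodeid;
}
```

Because the mapping depends only on on-disk data, a restarted server can
resolve any nodeid the kernel still holds without a saved lookup table.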

>> Anyway, before spending any more time with this, I wanted to ask whether
>> this is something that could be acceptable in the kernel, if people think
>> a different approach should be followed, or if I'm simply trying to solve
>> the wrong problem.
>> 
>> Thanks in advance for any feedback on this.
>> 
>> [1] https://github.com/libfuse/libfuse/pull/1219
>
> Who calls fuse_session_reinitialize() ?

Ah! Good question!  So, my idea was that a FUSE server would do something
like this:

	fuse_session_new()

	if (do_recovery) {
		get_old_fd()
		fuse_session_reinitialize()
		fuse_lowlevel_notify_resend()
	} else
		fuse_session_mount()

	fuse_daemonize()
	fuse_session_loop_mt()

Anyway, my initial concerns with restartability started because it is
currently not possible to restart a server that uses libfuse without
hacking into its internal data structures.  The idea of resending all
LOOKUPs just came from the discussion in the PR.

Cheers,
-- 
Luís

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-30 14:04   ` Luis Henriques
@ 2025-07-31 11:33     ` Christian Brauner
  2025-07-31 12:23       ` Luis Henriques
  2025-07-31 17:29       ` Darrick J. Wong
  0 siblings, 2 replies; 46+ messages in thread
From: Christian Brauner @ 2025-07-31 11:33 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Darrick J. Wong, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote:
> Hi Darrick,
> 
> On Tue, Jul 29 2025, Darrick J. Wong wrote:
> 
> > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
> >> Hi!
> >> 
> >> I know this has been discussed several times in several places, and the
> >> recent(ish) addition of NOTIFY_RESEND is an important step towards being
> >> able to restart a user-space FUSE server.
> >> 
> >> While looking at how to restart a server that uses the libfuse lowlevel
> >> API, I've created an RFC pull request [1] to understand whether adding
> >> support for this operation would be something acceptable in the project.
> >
> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > could restart itself.  It's unclear if doing so will actually enable us
> > to clear the condition that caused the failure in the first place, but I
> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > aren't totally crazy.
> 
> Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do
> the restart itself.  Instead, it simply adds some visibility into the
> opaque data structures so that a FUSE server could re-initialise a session
> without having to go through a full remount.
> 
> But sure, there are other things that could be added to the library as
> well.  For example, in my current experiments, the FUSE server needs start
> some sort of "file descriptor server" to keep the fd alive for the
> restart.  This daemon could be optionally provided in libfuse itself,
> which could also be used to store all sorts of blobs needed by the file
> system after recovery is done.

Fwiw, for most use-cases you really just want to use systemd's file
descriptor store to persist the /dev/fuse connection:
https://systemd.io/FILE_DESCRIPTOR_STORE/

> 
> >> The PR doesn't do anything sophisticated, it simply hacks into the opaque
> >> libfuse data structures so that a server could set some of the sessions'
> >> fields.
> >> 
> >> So, a FUSE server simply has to save the /dev/fuse file descriptor and
> >> pass it to libfuse while recovering, after a restart or a crash.  The
> >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
> >> course.  And there are probably other data structures that user-space file
> >> systems will have to keep track as well, so that everything can be
> >> restored.  (The parameters set in the INIT phase, for example.)
> >
> > Yeah, I don't know how that would work in practice.  Would the kernel
> > send back the old connection flags and whatnot via some sort of
> > FUSE_REINIT request, and the fuse server can either decide that it will
> > try to recover, or just bail out?
> 
> That would be an option.  But my current idea would be that the server
> would need to store those somewhere and simply assume they are still OK

The fdstore currently lets you associate a name with each file descriptor
it holds. That name would allow you to associate the options with
the fuse connection. However, I would not rule it out that additional
metadata could be attached to file descriptors in the fdstore if that's
something that's needed.

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-31 11:33     ` Christian Brauner
@ 2025-07-31 12:23       ` Luis Henriques
  2025-07-31 17:29       ` Darrick J. Wong
  1 sibling, 0 replies; 46+ messages in thread
From: Luis Henriques @ 2025-07-31 12:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Thu, Jul 31 2025, Christian Brauner wrote:

> On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote:
>> Hi Darrick,
>> 
>> On Tue, Jul 29 2025, Darrick J. Wong wrote:
>> 
>> > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
>> >> Hi!
>> >> 
>> >> I know this has been discussed several times in several places, and the
>> >> recent(ish) addition of NOTIFY_RESEND is an important step towards being
>> >> able to restart a user-space FUSE server.
>> >> 
>> >> While looking at how to restart a server that uses the libfuse lowlevel
>> >> API, I've created an RFC pull request [1] to understand whether adding
>> >> support for this operation would be something acceptable in the project.
>> >
>> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>> > could restart itself.  It's unclear if doing so will actually enable us
>> > to clear the condition that caused the failure in the first place, but I
>> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>> > aren't totally crazy.
>> 
>> Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do
>> the restart itself.  Instead, it simply adds some visibility into the
>> opaque data structures so that a FUSE server could re-initialise a session
>> without having to go through a full remount.
>> 
>> But sure, there are other things that could be added to the library as
>> well.  For example, in my current experiments, the FUSE server needs start
>> some sort of "file descriptor server" to keep the fd alive for the
>> restart.  This daemon could be optionally provided in libfuse itself,
>> which could also be used to store all sorts of blobs needed by the file
>> system after recovery is done.
>
> Fwiw, for most use-cases you really just want to use systemd's file
> descriptor store to persist the /dev/fuse connection:
> https://systemd.io/FILE_DESCRIPTOR_STORE/

Thank you, Christian.  I guess I should have mentioned systemd's fdstore
here.  In fact, I knew about it, but in my experiments I decided not to
use it because it's trivial to keep the fd alive[1] (and also because my
test environment doesn't run systemd).

Still, any eventual libfuse support could include the interface with
fdstore for that.

[1] Obviously "it's trivial" for my experiments.  Doing it in a secure way
    is probably a bit more challenging.

Cheers,
-- 
Luís

>
>> 
>> >> The PR doesn't do anything sophisticated, it simply hacks into the opaque
>> >> libfuse data structures so that a server could set some of the sessions'
>> >> fields.
>> >> 
>> >> So, a FUSE server simply has to save the /dev/fuse file descriptor and
>> >> pass it to libfuse while recovering, after a restart or a crash.  The
>> >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
>> >> course.  And there are probably other data structures that user-space file
>> >> systems will have to keep track as well, so that everything can be
>> >> restored.  (The parameters set in the INIT phase, for example.)
>> >
>> > Yeah, I don't know how that would work in practice.  Would the kernel
>> > send back the old connection flags and whatnot via some sort of
>> > FUSE_REINIT request, and the fuse server can either decide that it will
>> > try to recover, or just bail out?
>> 
>> That would be an option.  But my current idea would be that the server
>> would need to store those somewhere and simply assume they are still OK
>
> The fdstore currently allows to associate a name with a file descriptor
> in the fdstore. That name would allow you to associate the options with
> the fuse connection. However, I would not rule it out that additional
> metadata could be attached to file descriptors in the fdstore if that's
> something that's needed.

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-31 11:33     ` Christian Brauner
  2025-07-31 12:23       ` Luis Henriques
@ 2025-07-31 17:29       ` Darrick J. Wong
  2025-08-04  8:45         ` Christian Brauner
  1 sibling, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote:
> On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote:
> > Hi Darrick,
> > 
> > On Tue, Jul 29 2025, Darrick J. Wong wrote:
> > 
> > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
> > >> Hi!
> > >> 
> > >> I know this has been discussed several times in several places, and the
> > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being
> > >> able to restart a user-space FUSE server.
> > >> 
> > >> While looking at how to restart a server that uses the libfuse lowlevel
> > >> API, I've created an RFC pull request [1] to understand whether adding
> > >> support for this operation would be something acceptable in the project.
> > >
> > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > > could restart itself.  It's unclear if doing so will actually enable us
> > > to clear the condition that caused the failure in the first place, but I
> > > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > > aren't totally crazy.
> > 
> > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do
> > the restart itself.  Instead, it simply adds some visibility into the
> > opaque data structures so that a FUSE server could re-initialise a session
> > without having to go through a full remount.
> > 
> > But sure, there are other things that could be added to the library as
> > well.  For example, in my current experiments, the FUSE server needs start
> > some sort of "file descriptor server" to keep the fd alive for the
> > restart.  This daemon could be optionally provided in libfuse itself,
> > which could also be used to store all sorts of blobs needed by the file
> > system after recovery is done.
> 
> Fwiw, for most use-cases you really just want to use systemd's file
> descriptor store to persist the /dev/fuse connection:
> https://systemd.io/FILE_DESCRIPTOR_STORE/

Very nice!  This is exactly what I was looking for to handle the initial
setup, so I'm glad I don't have to go design a protocol around that.

> > 
> > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque
> > >> libfuse data structures so that a server could set some of the sessions'
> > >> fields.
> > >> 
> > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and
> > >> pass it to libfuse while recovering, after a restart or a crash.  The
> > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
> > >> course.  And there are probably other data structures that user-space file
> > >> systems will have to keep track as well, so that everything can be
> > >> restored.  (The parameters set in the INIT phase, for example.)
> > >
> > > Yeah, I don't know how that would work in practice.  Would the kernel
> > > send back the old connection flags and whatnot via some sort of
> > > FUSE_REINIT request, and the fuse server can either decide that it will
> > > try to recover, or just bail out?
> > 
> > That would be an option.  But my current idea would be that the server
> > would need to store those somewhere and simply assume they are still OK
> 
> The fdstore currently allows to associate a name with a file descriptor
> in the fdstore. That name would allow you to associate the options with
> the fuse connection. However, I would not rule it out that additional
> metadata could be attached to file descriptors in the fdstore if that's
> something that's needed.

Names are useful; I'd at least want "fusedev", "fsopen", and "device".

If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to
be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in
the store as 'journal_dev'?" Then it just has to wait until the fd shows
up, and it can continue with the mount process.

Though the "device" argument needn't be a path, so to be fully general
mountfsd and the fuse server would have to handshake that as well.
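
A minimal sketch of how a restarted server could find a stored fd by name,
assuming systemd's documented convention that stored descriptors are passed
starting at fd 3 and that their names arrive colon-separated in the
LISTEN_FDNAMES environment variable.  fd_for_name() is a hypothetical
helper; a real server would more likely call sd_listen_fds_with_names()
from libsystemd and would also check LISTEN_PID.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_LISTEN_FDS_START	3	/* same value as SD_LISTEN_FDS_START */

/* Return the fd number stored under the name `want`, or -1 if absent.
 * `fdnames` is the value of LISTEN_FDNAMES (colon-separated) and
 * `nfds` is the value of LISTEN_FDS. */
static int fd_for_name(const char *fdnames, int nfds, const char *want)
{
	const char *p = fdnames;
	size_t want_len;
	int i;

	assert(fdnames && want);
	want_len = strlen(want);

	for (i = 0; i < nfds; i++) {
		size_t len = strcspn(p, ":");	/* length of current name */

		if (len == want_len && memcmp(p, want, len) == 0)
			return SKETCH_LISTEN_FDS_START + i;
		if (p[len] == '\0')		/* no more names */
			break;
		p += len + 1;			/* skip past the ':' */
	}
	return -1;
}
```

With LISTEN_FDNAMES set to "fusedev:fsopen:device" and LISTEN_FDS set to 3,
looking up "fusedev" yields fd 3 and "device" yields fd 5.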

--D

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-31 17:29       ` Darrick J. Wong
@ 2025-08-04  8:45         ` Christian Brauner
  2025-08-12 19:28           ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Christian Brauner @ 2025-08-04  8:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Thu, Jul 31, 2025 at 10:29:46AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote:
> > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote:
> > > Hi Darrick,
> > > 
> > > On Tue, Jul 29 2025, Darrick J. Wong wrote:
> > > 
> > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
> > > >> Hi!
> > > >> 
> > > >> I know this has been discussed several times in several places, and the
> > > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being
> > > >> able to restart a user-space FUSE server.
> > > >> 
> > > >> While looking at how to restart a server that uses the libfuse lowlevel
> > > >> API, I've created an RFC pull request [1] to understand whether adding
> > > >> support for this operation would be something acceptable in the project.
> > > >
> > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > > > could restart itself.  It's unclear if doing so will actually enable us
> > > > to clear the condition that caused the failure in the first place, but I
> > > > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > > > aren't totally crazy.
> > > 
> > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do
> > > the restart itself.  Instead, it simply adds some visibility into the
> > > opaque data structures so that a FUSE server could re-initialise a session
> > > without having to go through a full remount.
> > > 
> > > But sure, there are other things that could be added to the library as
> > > well.  For example, in my current experiments, the FUSE server needs start
> > > some sort of "file descriptor server" to keep the fd alive for the
> > > restart.  This daemon could be optionally provided in libfuse itself,
> > > which could also be used to store all sorts of blobs needed by the file
> > > system after recovery is done.
> > 
> > Fwiw, for most use-cases you really just want to use systemd's file
> > descriptor store to persist the /dev/fuse connection:
> > https://systemd.io/FILE_DESCRIPTOR_STORE/
> 
> Very nice!  This is exactly what I was looking for to handle the initial
> setup, so I'm glad I don't have to go design a protocol around that.
> 
> > > 
> > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque
> > > >> libfuse data structures so that a server could set some of the sessions'
> > > >> fields.
> > > >> 
> > > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and
> > > >> pass it to libfuse while recovering, after a restart or a crash.  The
> > > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
> > > >> course.  And there are probably other data structures that user-space file
> > > >> systems will have to keep track as well, so that everything can be
> > > >> restored.  (The parameters set in the INIT phase, for example.)
> > > >
> > > > Yeah, I don't know how that would work in practice.  Would the kernel
> > > > send back the old connection flags and whatnot via some sort of
> > > > FUSE_REINIT request, and the fuse server can either decide that it will
> > > > try to recover, or just bail out?
> > > 
> > > That would be an option.  But my current idea would be that the server
> > > would need to store those somewhere and simply assume they are still OK
> > 
> > The fdstore currently allows to associate a name with a file descriptor
> > in the fdstore. That name would allow you to associate the options with
> > the fuse connection. However, I would not rule it out that additional
> > metadata could be attached to file descriptors in the fdstore if that's
> > something that's needed.
> 
> Names are useful, I'd at least want "fusedev", "fsopen", and "device".
> 
> If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to
> be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in
> the store as 'journal_dev'?" Then it just has to wait until the fd shows
> up, and it can continue with the mount process.
> 
> Though the "device" argument needn't be a path, so to be fully general
> mountfsd and the fuse server would have to handshake that as well.

Fwiw, to attach arbitrary metadata to a file descriptor the easiest
thing to do would be to stash both the (fuse server) file descriptor and
a memfd created via memfd_create() that can, e.g., contain all the
server options that you want to store.
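
A minimal sketch of the memfd side of that idea, assuming the server
options are kept as a single string.  The helper names and the memfd name
"fuse-server-options" are hypothetical; pushing the resulting fd into the
fdstore is not shown.

```c
#define _GNU_SOURCE		/* for memfd_create() */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd holding the option string `opts`.
 * Returns the fd, or -1 on error. */
static int options_to_memfd(const char *opts)
{
	size_t len;
	int fd;

	assert(opts);
	len = strlen(opts);

	fd = memfd_create("fuse-server-options", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	if (write(fd, opts, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read the option string back into `buf` (always NUL-terminated).
 * Returns the number of bytes read, or -1 on error. */
static ssize_t options_from_memfd(int fd, char *buf, size_t bufsz)
{
	ssize_t n;

	assert(buf && bufsz > 0);

	/* pread() at offset 0 works regardless of the fd's file position */
	n = pread(fd, buf, bufsz - 1, 0);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return n;
}
```

The new server would retrieve both descriptors after the restart and parse
the option string before resuming the session.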

* Re: [RFC] Another take at restarting FUSE servers
  2025-08-04  8:45         ` Christian Brauner
@ 2025-08-12 19:28           ` Darrick J. Wong
  0 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-08-12 19:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Mon, Aug 04, 2025 at 10:45:44AM +0200, Christian Brauner wrote:
> On Thu, Jul 31, 2025 at 10:29:46AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote:
> > > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote:
> > > > Hi Darrick,
> > > > 
> > > > On Tue, Jul 29 2025, Darrick J. Wong wrote:
> > > > 
> > > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote:
> > > > >> Hi!
> > > > >> 
> > > > >> I know this has been discussed several times in several places, and the
> > > > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being
> > > > >> able to restart a user-space FUSE server.
> > > > >> 
> > > > >> While looking at how to restart a server that uses the libfuse lowlevel
> > > > >> API, I've created an RFC pull request [1] to understand whether adding
> > > > >> support for this operation would be something acceptable in the project.
> > > > >
> > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > > > > could restart itself.  It's unclear if doing so will actually enable us
> > > > > to clear the condition that caused the failure in the first place, but I
> > > > > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > > > > aren't totally crazy.
> > > > 
> > > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do
> > > > the restart itself.  Instead, it simply adds some visibility into the
> > > > opaque data structures so that a FUSE server could re-initialise a session
> > > > without having to go through a full remount.
> > > > 
> > > > But sure, there are other things that could be added to the library as
> > > > well.  For example, in my current experiments, the FUSE server needs start
> > > > some sort of "file descriptor server" to keep the fd alive for the
> > > > restart.  This daemon could be optionally provided in libfuse itself,
> > > > which could also be used to store all sorts of blobs needed by the file
> > > > system after recovery is done.
> > > 
> > > Fwiw, for most use-cases you really just want to use systemd's file
> > > descriptor store to persist the /dev/fuse connection:
> > > https://systemd.io/FILE_DESCRIPTOR_STORE/
> > 
> > Very nice!  This is exactly what I was looking for to handle the initial
> > setup, so I'm glad I don't have to go design a protocol around that.
> > 
> > > > 
> > > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque
> > > > >> libfuse data structures so that a server could set some of the sessions'
> > > > >> fields.
> > > > >> 
> > > > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and
> > > > >> pass it to libfuse while recovering, after a restart or a crash.  The
> > > > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of
> > > > >> course.  And there are probably other data structures that user-space file
> > > > >> systems will have to keep track of as well, so that everything can be
> > > > >> restored.  (The parameters set in the INIT phase, for example.)
> > > > >
> > > > > Yeah, I don't know how that would work in practice.  Would the kernel
> > > > > send back the old connection flags and whatnot via some sort of
> > > > > FUSE_REINIT request, and the fuse server can either decide that it will
> > > > > try to recover, or just bail out?
> > > > 
> > > > That would be an option.  But my current idea would be that the server
> > > > would need to store those somewhere and simply assume they are still OK
> > > 
> > > The fdstore currently allows associating a name with a file descriptor
> > > in the fdstore. That name would allow you to associate the options with
> > > the fuse connection. However, I would not rule it out that additional
> > > metadata could be attached to file descriptors in the fdstore if that's
> > > something that's needed.
> > 
> > Names are useful, I'd at least want "fusedev", "fsopen", and "device".
> > 
> > If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to
> > be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in
> > the store as 'journal_dev'?" Then it just has to wait until the fd shows
> > up, and it can continue with the mount process.
> > 
> > Though the "device" argument needn't be a path, so to be fully general
> > mountfsd and the fuse server would have to handshake that as well.
> 
> Fwiw, to attach arbitrary metadata to a file descriptor the easiest
> thing to do would be to stash both a (fuse server) file descriptor and
> then also a memfd via memfd_create() that e.g., can contain all the
> server options that you want to store.

<nod> I'll keep that in mind when I get to designing those components.
Thanks for the input!

(I'm still working on stabilizing the new fuse4fs server, it's probably
going to be a while yet...)

--D

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-29 23:38 ` Darrick J. Wong
  2025-07-30 14:04   ` Luis Henriques
@ 2025-07-31 13:04   ` Theodore Ts'o
  2025-07-31 17:38     ` Darrick J. Wong
  1 sibling, 1 reply; 46+ messages in thread
From: Theodore Ts'o @ 2025-07-31 13:04 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> 
> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> could restart itself.  It's unclear if doing so will actually enable us
> to clear the condition that caused the failure in the first place, but I
> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> aren't totally crazy.

I'm trying to understand what the failure scenario is here.  Is this
if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
is supposed to happen with respect to open files, metadata and data
modifications which were in transit, etc.?  Sure, fuse2fs could run
e2fsck -fy, but if there are dirty inodes on the system, those are
potentially going to be out of sync, right?

What are the recovery semantics that we hope to be able to provide?

     	     	      		     	     - Ted

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-31 13:04   ` Theodore Ts'o
@ 2025-07-31 17:38     ` Darrick J. Wong
  2025-08-01 10:15       ` Luis Henriques
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > 
> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > could restart itself.  It's unclear if doing so will actually enable us
> > to clear the condition that caused the failure in the first place, but I
> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > aren't totally crazy.
> 
> I'm trying to understand what the failure scenario is here.  Is this
> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> is supposed to happen with respect to open files, metadata and data
> modifications which were in transit, etc.?  Sure, fuse2fs could run
> e2fsck -fy, but if there are dirty inodes on the system, those are
> potentially going to be out of sync, right?
> 
> What are the recovery semantics that we hope to be able to provide?

<echoing what we said on the ext4 call this morning>

With iomap, most of the dirty state is in the kernel, so I think the new
fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
would initiate GETATTR requests on all the cached inodes to validate
that they still exist; and then resend all the unacknowledged requests
that were pending at the time.  It might be the case that you have to
do that in the reverse order; I only know enough about the design of fuse
to suspect that to be true.

Anyhow once those are complete, I think we can resume operations with
the surviving inodes.  The ones that fail the GETATTR revalidation are
fuse_make_bad'd, which effectively revokes them.
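
To make the ordering concrete, here is a toy model of that sequence in
Python.  It is only a sketch under stated assumptions:
FUSE_NOTIFY_RESTARTED does not exist yet, the CachedInode class and the
revalidate()/recover() helpers are invented for this example, and the
"bad" flag merely stands in for what fuse_make_bad() would do in the
kernel.  The sketch revalidates before resending, which may well turn
out to be the wrong order, and it assumes that requests aimed at a
revoked inode are simply dropped.

```python
from dataclasses import dataclass


@dataclass
class CachedInode:
    nodeid: int
    bad: bool = False  # stand-in for the effect of fuse_make_bad()


def revalidate(cached, getattr_fn):
    """Probe every cached inode once and revoke those the server rejects."""
    survivors = set()
    for inode in cached:
        try:
            getattr_fn(inode.nodeid)  # models a GETATTR round trip
        except FileNotFoundError:
            inode.bad = True
        else:
            survivors.add(inode.nodeid)
    return survivors


def recover(cached, pending, getattr_fn, resend_fn):
    """Revalidate first, then resend unacknowledged requests in order."""
    survivors = revalidate(cached, getattr_fn)
    for request in pending:
        # Assumption: requests for revoked inodes are not resent.
        if request["nodeid"] in survivors:
            resend_fn(request)
    return survivors
```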

All of this of course relies on fuse2fs maintaining as little volatile
state of its own as possible.  I think that means disabling the block
cache in the unix io manager, and if we ever implemented delalloc then
either we'd have to save the reservations somewhere or I guess you could
immediately syncfs the whole filesystem to try to push all the dirty
data to disk before we start allowing new free space allocations for new
changes.

--D

>      	     	      		     	     - Ted
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-07-31 17:38     ` Darrick J. Wong
@ 2025-08-01 10:15       ` Luis Henriques
  2025-08-11 15:43         ` Darrick J. Wong
  2025-09-12 10:31         ` Bernd Schubert
  0 siblings, 2 replies; 46+ messages in thread
From: Luis Henriques @ 2025-08-01 10:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Thu, Jul 31 2025, Darrick J. Wong wrote:

> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>> > 
>> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>> > could restart itself.  It's unclear if doing so will actually enable us
>> > to clear the condition that caused the failure in the first place, but I
>> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>> > aren't totally crazy.
>> 
>> I'm trying to understand what the failure scenario is here.  Is this
>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>> is supposed to happen with respect to open files, metadata and data
>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>> e2fsck -fy, but if there are dirty inodes on the system, those are
>> potentially going to be out of sync, right?
>> 
>> What are the recovery semantics that we hope to be able to provide?
>
> <echoing what we said on the ext4 call this morning>
>
> With iomap, most of the dirty state is in the kernel, so I think the new
> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> would initiate GETATTR requests on all the cached inodes to validate
> that they still exist; and then resend all the unacknowledged requests
> that were pending at the time.  It might be the case that you have to
> do that in the reverse order; I only know enough about the design of fuse
> to suspect that to be true.
>
> Anyhow once those are complete, I think we can resume operations with
> the surviving inodes.  The ones that fail the GETATTR revalidation are
> fuse_make_bad'd, which effectively revokes them.

Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
but probably GETATTR is a better option.

So, are you currently working on any of this?  Are you implementing this
new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
look at fuse2fs too.

Cheers,
-- 
Luís

> All of this of course relies on fuse2fs maintaining as little volatile
> state of its own as possible.  I think that means disabling the block
> cache in the unix io manager, and if we ever implemented delalloc then
> either we'd have to save the reservations somewhere or I guess you could
> immediately syncfs the whole filesystem to try to push all the dirty
> data to disk before we start allowing new free space allocations for new
> changes.
>
> --D
>
>>      	     	      		     	     - Ted
>> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-08-01 10:15       ` Luis Henriques
@ 2025-08-11 15:43         ` Darrick J. Wong
  2025-08-13 13:14           ` Luis Henriques
  2025-09-12 10:31         ` Bernd Schubert
  1 sibling, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-08-11 15:43 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Fri, Aug 01, 2025 at 11:15:26AM +0100, Luis Henriques wrote:
> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> 
> > On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >> > 
> >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >> > could restart itself.  It's unclear if doing so will actually enable us
> >> > to clear the condition that caused the failure in the first place, but I
> >> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >> > aren't totally crazy.
> >> 
> >> I'm trying to understand what the failure scenario is here.  Is this
> >> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >> is supposed to happen with respect to open files, metadata and data
> >> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >> e2fsck -fy, but if there are dirty inodes on the system, those are
> >> potentially going to be out of sync, right?
> >> 
> >> What are the recovery semantics that we hope to be able to provide?
> >
> > <echoing what we said on the ext4 call this morning>
> >
> > With iomap, most of the dirty state is in the kernel, so I think the new
> > fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > would initiate GETATTR requests on all the cached inodes to validate
> > that they still exist; and then resend all the unacknowledged requests
> > that were pending at the time.  It might be the case that you have to
> > do that in the reverse order; I only know enough about the design of fuse
> > to suspect that to be true.
> >
> > Anyhow once those are complete, I think we can resume operations with
> > the surviving inodes.  The ones that fail the GETATTR revalidation are
> > fuse_make_bad'd, which effectively revokes them.
> 
> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> but probably GETATTR is a better option.
> 
> So, are you currently working on any of this?  Are you implementing this
> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> look at fuse2fs too.

Nope, right now I'm concentrating on making sure the fuse/iomap IO path
works reliably; and converting fuse2fs to be a lowlevel fuse server.
Eliminating all the path walking stuff that the highlevel fuse library
does reduces the fstests runtime from 7.9 to 3.5h, and turning on iomap
cuts that to 2.2h.

--D

> Cheers,
> -- 
> Luís
> 
> > All of this of course relies on fuse2fs maintaining as little volatile
> > state of its own as possible.  I think that means disabling the block
> > cache in the unix io manager, and if we ever implemented delalloc then
> > either we'd have to save the reservations somewhere or I guess you could
> > immediately syncfs the whole filesystem to try to push all the dirty
> > data to disk before we start allowing new free space allocations for new
> > changes.
> >
> > --D
> >
> >>      	     	      		     	     - Ted
> >> 
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-08-11 15:43         ` Darrick J. Wong
@ 2025-08-13 13:14           ` Luis Henriques
  0 siblings, 0 replies; 46+ messages in thread
From: Luis Henriques @ 2025-08-13 13:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel

On Mon, Aug 11 2025, Darrick J. Wong wrote:

> On Fri, Aug 01, 2025 at 11:15:26AM +0100, Luis Henriques wrote:
>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>> 
>> > On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>> >> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>> >> > 
>> >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>> >> > could restart itself.  It's unclear if doing so will actually enable us
>> >> > to clear the condition that caused the failure in the first place, but I
>> >> > suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>> >> > aren't totally crazy.
>> >> 
>> >> I'm trying to understand what the failure scenario is here.  Is this
>> >> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>> >> is supposed to happen with respect to open files, metadata and data
>> >> modifications which were in transit, etc.?  Sure, fuse2fs could run
>> >> e2fsck -fy, but if there are dirty inodes on the system, those are
>> >> potentially going to be out of sync, right?
>> >> 
>> >> What are the recovery semantics that we hope to be able to provide?
>> >
>> > <echoing what we said on the ext4 call this morning>
>> >
>> > With iomap, most of the dirty state is in the kernel, so I think the new
>> > fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>> > would initiate GETATTR requests on all the cached inodes to validate
>> > that they still exist; and then resend all the unacknowledged requests
>> > that were pending at the time.  It might be the case that you have to
>> > do that in the reverse order; I only know enough about the design of fuse
>> > to suspect that to be true.
>> >
>> > Anyhow once those are complete, I think we can resume operations with
>> > the surviving inodes.  The ones that fail the GETATTR revalidation are
>> > fuse_make_bad'd, which effectively revokes them.
>> 
>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>> but probably GETATTR is a better option.
>> 
>> So, are you currently working on any of this?  Are you implementing this
>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>> look at fuse2fs too.
>
> Nope, right now I'm concentrating on making sure the fuse/iomap IO path
> works reliably; and converting fuse2fs to be a lowlevel fuse server.

Great, thanks for clarifying.

> Eliminating all the path walking stuff that the highlevel fuse library
> does reduces the fstests runtime from 7.9 to 3.5h, and turning on iomap
> cuts that to 2.2h.

Wow! Those are quite impressive numbers.  Looking forward to looking into
those fuse2fs improvements!

Cheers,
-- 
Luís

* Re: [RFC] Another take at restarting FUSE servers
  2025-08-01 10:15       ` Luis Henriques
  2025-08-11 15:43         ` Darrick J. Wong
@ 2025-09-12 10:31         ` Bernd Schubert
  2025-09-12 11:41           ` Amir Goldstein
  1 sibling, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-09-12 10:31 UTC (permalink / raw)
  To: Luis Henriques, Darrick J. Wong
  Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel,
	linux-kernel, Kevin Chen, Amir Goldstein

On 8/1/25 12:15, Luis Henriques wrote:
> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> 
>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>
>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>> to clear the condition that caused the failure in the first place, but I
>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>> aren't totally crazy.
>>>
>>> I'm trying to understand what the failure scenario is here.  Is this
>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>> is supposed to happen with respect to open files, metadata and data
>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>> e2fsck -fy, but if there are dirty inodes on the system, those are
>>> potentially going to be out of sync, right?
>>>
>>> What are the recovery semantics that we hope to be able to provide?
>>
>> <echoing what we said on the ext4 call this morning>
>>
>> With iomap, most of the dirty state is in the kernel, so I think the new
>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>> would initiate GETATTR requests on all the cached inodes to validate
>> that they still exist; and then resend all the unacknowledged requests
>> that were pending at the time.  It might be the case that you have to
>> do that in the reverse order; I only know enough about the design of fuse
>> to suspect that to be true.
>>
>> Anyhow once those are complete, I think we can resume operations with
>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>> fuse_make_bad'd, which effectively revokes them.
> 
> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> but probably GETATTR is a better option.
> 
> So, are you currently working on any of this?  Are you implementing this
> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> look at fuse2fs too.

Sorry for joining the discussion late, I was totally occupied, day and
night. Added Kevin to CC, who is going to work on recovery on our
DDN side.

Issue with GETATTR and LOOKUP is that they need a path, but on fuse
server restart we want kernel to recover inodes and their lookup count.
Now inode recovery might be hard, because we currently only have a 
64-bit node-id, which is used by most fuse applications as a memory
pointer.

As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
outstanding requests. And that ends up in most cases in sending requests
with invalid node-IDs that are cast to pointers and might provoke random memory
access on restart. Kind of the same issue why fuse nfs export or
open_by_handle_at doesn't work well right now.

So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
would not return a 64-bit node ID, but a max 128 byte file handle.
And then FUSE_REVALIDATE_FH on server restart.
The file handles could be stored into the fuse inode and also used for
NFS export. 
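
A rough sketch of why a handle survives a restart where a pointer-valued
node ID does not, written in Python for brevity.  Everything in it is an
assumption made for illustration: the handle layout (64-bit inode number
plus 32-bit generation), the helper names, and the dict standing in for
on-disk state.  FUSE_LOOKUP_FH and FUSE_REVALIDATE_FH are only proposals
at this point, and 128 bytes is just the limit suggested above.

```python
import struct

MAX_HANDLE_BYTES = 128  # upper bound suggested in this thread

# Assumed layout: little-endian 64-bit inode number, 32-bit generation.
_HANDLE = struct.Struct("<QI")


def encode_handle(ino, generation):
    """Build an opaque handle from persistent identifiers only."""
    handle = _HANDLE.pack(ino, generation)
    if len(handle) > MAX_HANDLE_BYTES:
        raise ValueError("file handle too large")
    return handle


def revalidate_handle(handle, on_disk):
    """Return True if the handle still names a live inode.

    on_disk maps inode number -> current generation and stands in for
    whatever the restarted server can read back from stable storage.
    """
    ino, generation = _HANDLE.unpack(handle)
    return on_disk.get(ino) == generation
```

Nothing in the handle depends on the address space of the process that
created it, so a restarted server can decode it without dereferencing
anything.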

I *think* Amir had a similar idea, but I don't find the link quickly.
Adding Amir to CC.

Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
will iterate over all superblock inodes and mark them with fuse_make_bad.
Any objections against that?

Thanks,
Bernd

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 10:31         ` Bernd Schubert
@ 2025-09-12 11:41           ` Amir Goldstein
  2025-09-12 12:29             ` Bernd Schubert
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2025-09-12 11:41 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Luis Henriques, Darrick J. Wong, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
>
> On 8/1/25 12:15, Luis Henriques wrote:
> > On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >
> >> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>
> >>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>> to clear the condition that caused the failure in the first place, but I
> >>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>> aren't totally crazy.
> >>>
> >>> I'm trying to understand what the failure scenario is here.  Is this
> >>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >>> is supposed to happen with respect to open files, metadata and data
> >>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >>> e2fsck -fy, but if there are dirty inodes on the system, those are
> >>> potentially going to be out of sync, right?
> >>>
> >>> What are the recovery semantics that we hope to be able to provide?
> >>
> >> <echoing what we said on the ext4 call this morning>
> >>
> >> With iomap, most of the dirty state is in the kernel, so I think the new
> >> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >> would initiate GETATTR requests on all the cached inodes to validate
> >> that they still exist; and then resend all the unacknowledged requests
> >> that were pending at the time.  It might be the case that you have to
> >> do that in the reverse order; I only know enough about the design of fuse
> >> to suspect that to be true.
> >>
> >> Anyhow once those are complete, I think we can resume operations with
> >> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >> fuse_make_bad'd, which effectively revokes them.
> >
> > Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > but probably GETATTR is a better option.
> >
> > So, are you currently working on any of this?  Are you implementing this
> > new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> > look at fuse2fs too.
>
> Sorry for joining the discussion late, I was totally occupied, day and
> night. Added Kevin to CC, who is going to work on recovery on our
> DDN side.
>
> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> server restart we want kernel to recover inodes and their lookup count.
> Now inode recovery might be hard, because we currently only have a
> 64-bit node-id, which is used by most fuse applications as a memory
> pointer.
>
> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> outstanding requests. And that ends up in most cases in sending requests
> with invalid node-IDs that are cast to pointers and might provoke random memory
> access on restart. Kind of the same issue why fuse nfs export or
> open_by_handle_at doesn't work well right now.
>
> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> would not return a 64-bit node ID, but a max 128 byte file handle.
> And then FUSE_REVALIDATE_FH on server restart.
> The file handles could be stored into the fuse inode and also used for
> NFS export.
>
> I *think* Amir had a similar idea, but I don't find the link quickly.
> Adding Amir to CC.

Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/

>
> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> will iterate over all superblock inodes and mark them with fuse_make_bad.
> Any objections against that?

IDK, it seems much more ugly than implementing LOOKUP_HANDLE
and I am not sure that LOOKUP_HANDLE is that hard to implement, when
compared to this alternative.

I mean a restartable server is going to be a new implementation anyway, right?
So it makes sense to start with a cleaner and more adequate protocol,
does it not?

Thanks,
Amir.

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 11:41           ` Amir Goldstein
@ 2025-09-12 12:29             ` Bernd Schubert
  2025-09-12 14:58               ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-09-12 12:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Luis Henriques, Darrick J. Wong, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen



On 9/12/25 13:41, Amir Goldstein wrote:
> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>>
>>
>> On 8/1/25 12:15, Luis Henriques wrote:
>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>
>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>
>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>> aren't totally crazy.
>>>>>
>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>> is supposed to happen with respect to open files, metadata and data
>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>> e2fsck -fy, but if there are dirty inodes on the system, those are
>>>>> potentially going to be out of sync, right?
>>>>>
>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>
>>>> <echoing what we said on the ext4 call this morning>
>>>>
>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>> that they still exist; and then resend all the unacknowledged requests
>>>> that were pending at the time.  It might be the case that you have to
>>>> do that in the reverse order; I only know enough about the design of fuse
>>>> to suspect that to be true.
>>>>
>>>> Anyhow once those are complete, I think we can resume operations with
>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>> fuse_make_bad'd, which effectively revokes them.
>>>
>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>> but probably GETATTR is a better option.
>>>
>>> So, are you currently working on any of this?  Are you implementing this
>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>> look at fuse2fs too.
>>
>> Sorry for joining the discussion late, I was totally occupied, day and
>> night. Added Kevin to CC, who is going to work on recovery on our
>> DDN side.
>>
>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>> server restart we want kernel to recover inodes and their lookup count.
>> Now inode recovery might be hard, because we currently only have a
>> 64-bit node-id, which is used by most fuse applications as a memory
>> pointer.
>>
>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>> outstanding requests. And that ends up in most cases in sending requests
>> with invalid node-IDs that are cast to pointers and might provoke random memory
>> access on restart. Kind of the same issue why fuse nfs export or
>> open_by_handle_at doesn't work well right now.
>>
>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>> would not return a 64-bit node ID, but a max 128 byte file handle.
>> And then FUSE_REVALIDATE_FH on server restart.
>> The file handles could be stored into the fuse inode and also used for
>> NFS export.
>>
>> I *think* Amir had a similar idea, but I don't find the link quickly.
>> Adding Amir to CC.
> 
> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/

Thanks for the reference, Amir! I had even been in that thread.

> 
>>
>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>> Any objections against that?
> 
> IDK, it seems much more ugly than implementing LOOKUP_HANDLE
> and I am not sure that LOOKUP_HANDLE is that hard to implement, when
> compared to this alternative.
> 
> I mean a restartable server is going to be a new implementation anyway, right?
> So it makes sense to start with a cleaner and more adequate protocol,
> does it not?

Definitely, if we agree on the approach on LOOKUP_HANDLE and using it
for recovery, adding that op seems simple. And reading through the
thread you had posted above, just the implementation was missing.
So let's go ahead with this approach.


Thanks,
Bernd




* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 12:29             ` Bernd Schubert
@ 2025-09-12 14:58               ` Darrick J. Wong
  2025-09-12 15:20                 ` Bernd Schubert
  2025-09-15  7:07                 ` Amir Goldstein
  0 siblings, 2 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-09-12 14:58 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> 
> 
> On 9/12/25 13:41, Amir Goldstein wrote:
> > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>
> >>
> >>
> >> On 8/1/25 12:15, Luis Henriques wrote:
> >>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>
> >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>
> >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>>>> aren't totally crazy.
> >>>>>
> >>>>> I'm trying to understand what the failure scenario is here.  Is this
> >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >>>>> is supposed to happen with respect to open files, metadata and data
> >>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >>>>> e2fsck -fy, but if there are dirty inodes on the system, those are
> >>>>> potentially going to be out of sync, right?
> >>>>>
> >>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>
> >>>> <echoing what we said on the ext4 call this morning>
> >>>>
> >>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>> that they still exist; and then resend all the unacknowledged requests
> >>>> that were pending at the time.  It might be the case that you have to
> >>>> do that in the reverse order; I only know enough about the design of fuse
> >>>> to suspect that to be true.
> >>>>
> >>>> Anyhow once those are complete, I think we can resume operations with
> >>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >>>> fuse_make_bad'd, which effectively revokes them.
> >>>
> >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >>> but probably GETATTR is a better option.
> >>>
> >>> So, are you currently working on any of this?  Are you implementing this
> >>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >>> look at fuse2fs too.
> >>
> >> Sorry for joining the discussion late, I was totally occupied, day and
> >> night. Added Kevin to CC, who is going to work on recovery on our
> >> DDN side.
> >>
> >> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >> server restart we want kernel to recover inodes and their lookup count.
> >> Now inode recovery might be hard, because we currently only have a
> >> 64-bit node-id - which is used by most fuse applications as a memory
> >> pointer.
> >>
> >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >> outstanding requests. And that ends up in most cases in sending requests
> >> with invalid node-IDs, that are cast to pointers and might provoke random memory
> >> access on restart. Kind of the same issue why fuse nfs export or
> >> open_by_handle_at doesn't work well right now.
> >>
> >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >> would not return a 64-bit node ID, but a max 128 byte file handle.
> >> And then FUSE_REVALIDATE_FH on server restart.
> >> The file handles could be stored into the fuse inode and also used for
> >> NFS export.
> >>
> >> I *think* Amir had a similar idea, but I don't find the link quickly.
> >> Adding Amir to CC.
> > 
> > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> 
> Thanks for the reference Amir! I even had been in that thread.
> 
> > 
> >>
> >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >> Any objections against that?

What if you actually /can/ reuse a nodeid after a restart?  Consider
fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
you can reconnect the fuse_inode to the ondisk inode, assuming recovery
didn't delete it, obviously.

I suppose you could just ask for refreshed stat information and either
the server gives it to you and the fuse_inode lives; or the server
returns ENOENT and then we mark it bad.  But I'd have to see code
patches to form a real opinion.
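
A rough sketch of what I mean, as a self-contained model rather than
kernel code.  All names here are hypothetical (nothing below exists in
the kernel or libfuse); server_getattr() merely stands in for a GETATTR
round trip, and the "bad" flag for what fuse_make_bad would do:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's cached fuse inode. */
struct cached_inode {
	uint64_t nodeid;	/* for fuse4fs: the on-disk inode number */
	bool bad;		/* set once the inode has been revoked */
};

/* Hypothetical server-side table of inodes that survived recovery. */
struct server_state {
	const uint64_t *live;
	size_t nr_live;
};

/* Models the server answering a stat request: 0 on success, -ENOENT if gone. */
static int server_getattr(const struct server_state *srv, uint64_t nodeid)
{
	for (size_t i = 0; i < srv->nr_live; i++)
		if (srv->live[i] == nodeid)
			return 0;
	return -ENOENT;
}

/* Revalidate every cached inode; returns how many were marked bad. */
static size_t revalidate_all(const struct server_state *srv,
			     struct cached_inode *inodes, size_t nr)
{
	size_t revoked = 0;

	for (size_t i = 0; i < nr; i++) {
		if (server_getattr(srv, inodes[i].nodeid) == -ENOENT) {
			inodes[i].bad = true;
			revoked++;
		}
	}
	return revoked;
}
```

Inodes whose nodeid the restarted server still recognizes stay usable;
only the ones that come back ENOENT get revoked.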

It's very nice of fuse to have implemented revoke() ;)

--D

> > IDK, it seems much more ugly than implementing LOOKUP_HANDLE
> > and I am not sure that LOOKUP_HANDLE is that hard to implement, when
> > comparing to this alternative.
> > 
> > I mean a restartable server is going to be a new implementation anyway, right?
> > So it makes sense to start with a cleaner and more adequate protocol,
> > does it not?
> 
> Definitely, if we agree on the approach on LOOKUP_HANDLE and using it
> for recovery, adding that op seems simple. And reading through the
> thread you had posted above, just the implementation was missing.
> So let's go ahead to do this approach.
> 
> 
> Thanks,
> Bernd
> 
> 
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 14:58               ` Darrick J. Wong
@ 2025-09-12 15:20                 ` Bernd Schubert
  2025-09-15  4:43                   ` Darrick J. Wong
  2025-09-15  7:07                 ` Amir Goldstein
  1 sibling, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-09-12 15:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen



On 9/12/25 16:58, Darrick J. Wong wrote:
> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>
>>
>> On 9/12/25 13:41, Amir Goldstein wrote:
>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>
>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>
>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>> aren't totally crazy.
>>>>>>>
>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
>>>>>>> potentially to be out of sync, right?
>>>>>>>
>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>
>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>
>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>> do that in the reverse order; I only know enough about the design of fuse
>>>>>> to suspect that to be true.
>>>>>>
>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>
>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>> but probably GETATTR is a better option.
>>>>>
>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>> look at fuse2fs too.
>>>>
>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>> DDN side.
>>>>
>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>> server restart we want kernel to recover inodes and their lookup count.
>>>> Now inode recovery might be hard, because we currently only have a
>>>> 64-bit node-id - which is used by most fuse applications as a memory
>>>> pointer.
>>>>
>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>> outstanding requests. And that ends up in most cases in sending requests
>>>> with invalid node-IDs, that are cast to pointers and might provoke random memory
>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>> open_by_handle_at doesn't work well right now.
>>>>
>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>> The file handles could be stored into the fuse inode and also used for
>>>> NFS export.
>>>>
>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>> Adding Amir to CC.
>>>
>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>
>> Thanks for the reference Amir! I even had been in that thread.
>>
>>>
>>>>
>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>> Any objections against that?
> 
> What if you actually /can/ reuse a nodeid after a restart?  Consider
> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> didn't delete it, obviously.
> 
> I suppose you could just ask for refreshed stat information and either
> the server gives it to you and the fuse_inode lives; or the server
> returns ENOENT and then we mark it bad.  But I'd have to see code
> patches to form a real opinion.
> 
> It's very nice of fuse to have implemented revoke() ;)


Assuming you run with an attr cache timeout equal to 0, would the existing
NOTIFY_RESEND be enough for fuse4fs?


Thanks,
Bernd

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 15:20                 ` Bernd Schubert
@ 2025-09-15  4:43                   ` Darrick J. Wong
  0 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-09-15  4:43 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Fri, Sep 12, 2025 at 05:20:58PM +0200, Bernd Schubert wrote:
> 
> 
> On 9/12/25 16:58, Darrick J. Wong wrote:
> > On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >>
> >>
> >> On 9/12/25 13:41, Amir Goldstein wrote:
> >>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/1/25 12:15, Luis Henriques wrote:
> >>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>>>
> >>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>>>
> >>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>>>>>> aren't totally crazy.
> >>>>>>>
> >>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >>>>>>> is supposed to happen with respect to open files, metadata and data
> >>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> >>>>>>> potentially to be out of sync, right?
> >>>>>>>
> >>>>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>>>
> >>>>>> <echoing what we said on the ext4 call this morning>
> >>>>>>
> >>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>>>> that they still exist; and then resend all the unacknowledged requests
> >>>>>> that were pending at the time.  It might be the case that you have to
> >>>>>> do that in the reverse order; I only know enough about the design of fuse
> >>>>>> to suspect that to be true.
> >>>>>>
> >>>>>> Anyhow once those are complete, I think we can resume operations with
> >>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >>>>>> fuse_make_bad'd, which effectively revokes them.
> >>>>>
> >>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >>>>> but probably GETATTR is a better option.
> >>>>>
> >>>>> So, are you currently working on any of this?  Are you implementing this
> >>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >>>>> look at fuse2fs too.
> >>>>
> >>>> Sorry for joining the discussion late, I was totally occupied, day and
> >>>> night. Added Kevin to CC, who is going to work on recovery on our
> >>>> DDN side.
> >>>>
> >>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >>>> server restart we want kernel to recover inodes and their lookup count.
> >>>> Now inode recovery might be hard, because we currently only have a
> >>>> 64-bit node-id - which is used by most fuse applications as a memory
> >>>> pointer.
> >>>>
> >>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >>>> outstanding requests. And that ends up in most cases in sending requests
> >>>> with invalid node-IDs, that are cast to pointers and might provoke random memory
> >>>> access on restart. Kind of the same issue why fuse nfs export or
> >>>> open_by_handle_at doesn't work well right now.
> >>>>
> >>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >>>> And then FUSE_REVALIDATE_FH on server restart.
> >>>> The file handles could be stored into the fuse inode and also used for
> >>>> NFS export.
> >>>>
> >>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >>>> Adding Amir to CC.
> >>>
> >>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >>
> >> Thanks for the reference Amir! I even had been in that thread.
> >>
> >>>
> >>>>
> >>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >>>> Any objections against that?
> > 
> > What if you actually /can/ reuse a nodeid after a restart?  Consider
> > fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> > you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> > didn't delete it, obviously.
> > 
> > I suppose you could just ask for refreshed stat information and either
> > the server gives it to you and the fuse_inode lives; or the server
> > returns ENOENT and then we mark it bad.  But I'd have to see code
> > patches to form a real opinion.
> > 
> > It's very nice of fuse to have implemented revoke() ;)
> 
> 
> Assuming you run with an attr cache timeout equal to 0, would the existing
> NOTIFY_RESEND be enough for fuse4fs?

That brings up some good questions.  Yes, fuse4fs sets an attr cache
timeout of 0, but (a) would it actually be useful to set it to a higher
value to reduce round trips?  And (b) shouldn't a restart trigger a
revalidation regardless?
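
To make (a) and (b) concrete, here is a minimal model of the cache
check I have in mind.  The struct and the force_revalidate flag are
hypothetical, not existing kernel fields; the point is only that a
timeout of 0 always misses, and that a restart could force a miss even
when the timeout is larger:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached attribute state for one inode. */
struct attr_cache {
	uint64_t fetched_at;	/* time the attributes were fetched, in ms */
	uint64_t timeout_ms;	/* attr cache timeout chosen by the server */
	bool force_revalidate;	/* set for every inode when the server restarts */
};

/* True if a stat can be answered from the cache without a round trip. */
static bool attr_cache_usable(const struct attr_cache *c, uint64_t now)
{
	if (c->force_revalidate)
		return false;
	/* A timeout of 0 never satisfies this, so every stat goes to the server. */
	return now - c->fetched_at < c->timeout_ms;
}
```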

--D

> 
> Thanks,
> Bernd
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-12 14:58               ` Darrick J. Wong
  2025-09-12 15:20                 ` Bernd Schubert
@ 2025-09-15  7:07                 ` Amir Goldstein
  2025-09-15  8:27                   ` Bernd Schubert
  1 sibling, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2025-09-15  7:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >
> >
> > On 9/12/25 13:41, Amir Goldstein wrote:
> > > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> > >>
> > >>
> > >>
> > >> On 8/1/25 12:15, Luis Henriques wrote:
> > >>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> > >>>
> > >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> > >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > >>>>>>
> > >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > >>>>>> could restart itself.  It's unclear if doing so will actually enable us
> > >>>>>> to clear the condition that caused the failure in the first place, but I
> > >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > >>>>>> aren't totally crazy.
> > >>>>>
> > >>>>> I'm trying to understand what the failure scenario is here.  Is this
> > >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> > >>>>> is supposed to happen with respect to open files, metadata and data
> > >>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> > >>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> > >>>>> potentially to be out of sync, right?
> > >>>>>
> > >>>>> What are the recovery semantics that we hope to be able to provide?
> > >>>>
> > >>>> <echoing what we said on the ext4 call this morning>
> > >>>>
> > >>>> With iomap, most of the dirty state is in the kernel, so I think the new
> > >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > >>>> would initiate GETATTR requests on all the cached inodes to validate
> > >>>> that they still exist; and then resend all the unacknowledged requests
> > >>>> that were pending at the time.  It might be the case that you have to
> > >>>> do that in the reverse order; I only know enough about the design of fuse
> > >>>> to suspect that to be true.
> > >>>>
> > >>>> Anyhow once those are complete, I think we can resume operations with
> > >>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> > >>>> fuse_make_bad'd, which effectively revokes them.
> > >>>
> > >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > >>> but probably GETATTR is a better option.
> > >>>
> > >>> So, are you currently working on any of this?  Are you implementing this
> > >>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> > >>> look at fuse2fs too.
> > >>
> > >> Sorry for joining the discussion late, I was totally occupied, day and
> > >> night. Added Kevin to CC, who is going to work on recovery on our
> > >> DDN side.
> > >>
> > >> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> > >> server restart we want kernel to recover inodes and their lookup count.
> > >> Now inode recovery might be hard, because we currently only have a
> > >> 64-bit node-id - which is used by most fuse applications as a memory
> > >> pointer.
> > >>
> > >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> > >> outstanding requests. And that ends up in most cases in sending requests
> > >> with invalid node-IDs, that are cast to pointers and might provoke random memory
> > >> access on restart. Kind of the same issue why fuse nfs export or
> > >> open_by_handle_at doesn't work well right now.
> > >>
> > >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> > >> would not return a 64-bit node ID, but a max 128 byte file handle.
> > >> And then FUSE_REVALIDATE_FH on server restart.
> > >> The file handles could be stored into the fuse inode and also used for
> > >> NFS export.
> > >>
> > >> I *think* Amir had a similar idea, but I don't find the link quickly.
> > >> Adding Amir to CC.
> > >
> > > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >
> > Thanks for the reference Amir! I even had been in that thread.
> >
> > >
> > >>
> > >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> > >> will iterate over all superblock inodes and mark them with fuse_make_bad.
> > >> Any objections against that?
>
> What if you actually /can/ reuse a nodeid after a restart?  Consider
> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> didn't delete it, obviously.

FUSE_LOOKUP_HANDLE is a contract.
If fuse4fs can reuse nodeid after restart then by all means, it should sign
this contract, otherwise there is no way for client to know that the
nodeids are persistent.
If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
API trivial.
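
Something along these lines, as a sketch only.  The struct and helper
names are made up for illustration (there is no such FUSE structure
today); the 128 byte bound is the one floated earlier in this thread:

```c
#include <stdint.h>
#include <string.h>

#define HANDLE_MAX 128	/* upper bound floated earlier in the thread */

/* Hypothetical opaque file handle; not an existing FUSE structure. */
struct fs_handle {
	uint32_t len;
	unsigned char data[HANDLE_MAX];
};

/* Encode: the handle is just the persistent nodeid (on-disk inode number). */
static void handle_from_nodeid(struct fs_handle *fh, uint64_t nodeid)
{
	fh->len = sizeof(nodeid);
	memcpy(fh->data, &nodeid, sizeof(nodeid));
}

/* Decode: returns 0 and fills *nodeid, or -1 if the handle is malformed. */
static int nodeid_from_handle(const struct fs_handle *fh, uint64_t *nodeid)
{
	if (fh->len != sizeof(*nodeid))
		return -1;
	memcpy(nodeid, fh->data, sizeof(*nodeid));
	return 0;
}
```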

>
> I suppose you could just ask for refreshed stat information and either
> the server gives it to you and the fuse_inode lives; or the server
> returns ENOENT and then we mark it bad.  But I'd have to see code
> patches to form a real opinion.
>

You could make fuse4fs_handle := <nodeid:fuse_instance_id>
where fuse_instance_id can be its start time or a random number,
for auto invalidate, or maybe the fuse_instance_id should be
a native part of FUSE protocol so that client knows to only invalidate
attr cache in case of fuse_instance_id change?

In any case, instead of a storm of revalidate messages after
server restart, do it lazily on demand.
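
A toy model of the lazy variant, assuming the client somehow learns
the current instance id (how it would learn it is exactly the open
protocol question above).  All names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handle layout: <nodeid:fuse_instance_id>. */
struct inst_handle {
	uint64_t nodeid;
	uint64_t instance_id;	/* e.g. server start time or a random number */
};

/* Hypothetical per-inode client state. */
struct client_inode {
	struct inst_handle fh;
	bool attr_valid;
};

/*
 * Called on access, not at restart time. If the server instance changed
 * since the handle was issued, drop the cached attributes and adopt the
 * new instance id. Returns true if the caller must refetch attributes.
 */
static bool lazy_revalidate(struct client_inode *ci, uint64_t cur_instance)
{
	if (ci->fh.instance_id != cur_instance) {
		ci->fh.instance_id = cur_instance;
		ci->attr_valid = false;
	}
	return !ci->attr_valid;
}
```

Only inodes that are actually touched after the restart pay for a
round trip; untouched ones never generate a message.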

Thanks,
Amir.

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-15  7:07                 ` Amir Goldstein
@ 2025-09-15  8:27                   ` Bernd Schubert
  2025-09-15  8:41                     ` Amir Goldstein
  0 siblings, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-09-15  8:27 UTC (permalink / raw)
  To: Amir Goldstein, Darrick J. Wong
  Cc: Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert,
	linux-fsdevel, linux-kernel, Kevin Chen



On 9/15/25 09:07, Amir Goldstein wrote:
> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>>
>>>
>>> On 9/12/25 13:41, Amir Goldstein wrote:
>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>>
>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>>
>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>>> aren't totally crazy.
>>>>>>>>
>>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
>>>>>>>> potentially to be out of sync, right?
>>>>>>>>
>>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>>
>>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>>
>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>>> do that in the reverse order; I only know enough about the design of fuse
>>>>>>> to suspect that to be true.
>>>>>>>
>>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>>
>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>>> but probably GETATTR is a better option.
>>>>>>
>>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>>> look at fuse2fs too.
>>>>>
>>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>>> DDN side.
>>>>>
>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>>> server restart we want kernel to recover inodes and their lookup count.
>>>>> Now inode recovery might be hard, because we currently only have a
>>>>> 64-bit node-id - which is used by most fuse applications as a memory
>>>>> pointer.
>>>>>
>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>>> outstanding requests. And that ends up in most cases in sending requests
>>>>> with invalid node-IDs, that are cast to pointers and might provoke random memory
>>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>>> open_by_handle_at doesn't work well right now.
>>>>>
>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>>> The file handles could be stored into the fuse inode and also used for
>>>>> NFS export.
>>>>>
>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>>> Adding Amir to CC.
>>>>
>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>
>>> Thanks for the reference Amir! I even had been in that thread.
>>>
>>>>
>>>>>
>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>>> Any objections against that?
>>
>> What if you actually /can/ reuse a nodeid after a restart?  Consider
>> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>> didn't delete it, obviously.
> 
> FUSE_LOOKUP_HANDLE is a contract.
> If fuse4fs can reuse nodeid after restart then by all means, it should sign
> this contract, otherwise there is no way for client to know that the
> nodeids are persistent.
> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> API trivial.
> 
>>
>> I suppose you could just ask for refreshed stat information and either
>> the server gives it to you and the fuse_inode lives; or the server
>> returns ENOENT and then we mark it bad.  But I'd have to see code
>> patches to form a real opinion.
>>
> 
> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> where fuse_instance_id can be its start time or a random number,
> for auto invalidate, or maybe the fuse_instance_id should be
> a native part of FUSE protocol so that client knows to only invalidate
> attr cache in case of fuse_instance_id change?
> 
> In any case, instead of a storm of revalidate messages after
> server restart, do it lazily on demand.

For a network file system, probably. For fuse4fs or other block-based
file systems, not sure. Darrick has the example of fsck.
Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
fuse-server gets restarted, fsck'ed and some files get removed.
Now reading these inodes would still work - wouldn't it
be better to invalidate the cache before going into operation
again?
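
Roughly what I mean by invalidating up front, as a toy model with
hypothetical names (this is not how the kernel stores its timeouts):
expire every cached attribute and dentry before the restarted server
takes requests, so stale entries cannot be served.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry with attribute and dentry expiry times. */
struct cache_entry {
	uint64_t attr_expires;
	uint64_t entry_expires;
};

/* Expire every cached entry so the next access has to ask the server. */
static size_t invalidate_all(struct cache_entry *entries, size_t nr)
{
	size_t touched = 0;

	for (size_t i = 0; i < nr; i++) {
		if (entries[i].attr_expires || entries[i].entry_expires)
			touched++;
		entries[i].attr_expires = 0;
		entries[i].entry_expires = 0;
	}
	return touched;
}
```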


Thanks,
Bernd

* Re: [RFC] Another take at restarting FUSE servers
  2025-09-15  8:27                   ` Bernd Schubert
@ 2025-09-15  8:41                     ` Amir Goldstein
  2025-09-16  2:53                       ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2025-09-15  8:41 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Darrick J. Wong, Luis Henriques, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
>
> On 9/15/25 09:07, Amir Goldstein wrote:
> > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >>
> >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >>>
> >>>
> >>> On 9/12/25 13:41, Amir Goldstein wrote:
> >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>>>>
> >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>>>>
> >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>>>>>>> aren't totally crazy.
> >>>>>>>>
> >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >>>>>>>> is supposed to happen with respect to open files, metadata and data
> >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> >>>>>>>> potentially to be out of sync, right?
> >>>>>>>>
> >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>>>>
> >>>>>>> <echoing what we said on the ext4 call this morning>
> >>>>>>>
> >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>>>>> that they still exist; and then resend all the unacknowledged requests
> >>>>>>> that were pending at the time.  It might be the case that you have to
> >>>>>>> do that in the reverse order; I only know enough about the design of fuse
> >>>>>>> to suspect that to be true.
> >>>>>>>
> >>>>>>> Anyhow once those are complete, I think we can resume operations with
> >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >>>>>>> fuse_make_bad'd, which effectively revokes them.
> >>>>>>
> >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >>>>>> but probably GETATTR is a better option.
> >>>>>>
> >>>>>> So, are you currently working on any of this?  Are you implementing this
> >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >>>>>> look at fuse2fs too.
> >>>>>
> >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >>>>> DDN side.
> >>>>>
> >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >>>>> server restart we want kernel to recover inodes and their lookup count.
> >>>>> Now inode recovery might be hard, because we currently only have a
> >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> >>>>> pointer.
> >>>>>
> >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >>>>> outstanding requests. And that ends up in most cases in sending requests
> >>>>> with invalid node-IDs, that are cast to pointers and might provoke random memory
> >>>>> access on restart. Kind of the same issue why fuse nfs export or
> >>>>> open_by_handle_at doesn't work well right now.
> >>>>>
> >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >>>>> And then FUSE_REVALIDATE_FH on server restart.
> >>>>> The file handles could be stored into the fuse inode and also used for
> >>>>> NFS export.
> >>>>>
> >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >>>>> Adding Amir to CC.
> >>>>
> >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >>>
> >>> Thanks for the reference Amir! I even had been in that thread.
> >>>
> >>>>
> >>>>>
> >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >>>>> Any objections against that?
> >>
> >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >> didn't delete it, obviously.
> >
> > FUSE_LOOKUP_HANDLE is a contract.
> > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> > this contract, otherwise there is no way for client to know that the
> > nodeids are persistent.
> > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> > API trivial.
> >
> >>
> >> I suppose you could just ask for refreshed stat information and either
> >> the server gives it to you and the fuse_inode lives; or the server
> >> returns ENOENT and then we mark it bad.  But I'd have to see code
> >> patches to form a real opinion.
> >>
> >
> > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> > where fuse_instance_id can be its start time or random number.
> > for auto invalidate, or maybe the fuse_instance_id should be
> > a native part of FUSE protocol so that client knows to only invalidate
> > attr cache in case of fuse_instance_id change?
> >
> > In any case, instead of a storm of revalidate messages after
> > server restart, do it lazily on demand.
>
> For a network file system, probably. For fuse4fs or other block
> based file systems, not sure. Darrick has the example of fsck.
> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> fuse-server gets restarted, fsck'ed and some files get removed.
> Now reading these inodes would still work - wouldn't it
> be better to invalidate the cache before going into operation
> again?

Forgive me, I was making a wrong assumption that fuse4fs
was using ext4 filehandle as nodeid, but of course it does not.

The reason I made this wrong assumption is that fuse4fs *can* already
use an ext4 (64-bit) file handle as the nodeid with the existing FUSE
protocol, which is what my fuse passthrough library [1] does.

My claim was that, although fuse4fs could support a safe restart (one
that cannot read from a recycled inode number) with the current FUSE
protocol, doing so with the FUSE_HANDLE protocol would express a
commitment to this behavior.

Thanks,
Amir.

[1] https://github.com/amir73il/libfuse/commits/fuse_passthrough


* Re: [RFC] Another take at restarting FUSE servers
  2025-09-15  8:41                     ` Amir Goldstein
@ 2025-09-16  2:53                       ` Darrick J. Wong
  2025-09-16  7:59                         ` Amir Goldstein
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-09-16  2:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >
> >
> >
> > On 9/15/25 09:07, Amir Goldstein wrote:
> > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >>
> > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> > >>>
> > >>>
> > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> > >>>>>>
> > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > >>>>>>>>>
> > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > >>>>>>>>> aren't totally crazy.
> > >>>>>>>>
> > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
> > >>>>>>>> potentially to be out of sync, right?
> > >>>>>>>>
> > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> > >>>>>>>
> > >>>>>>> <echoing what we said on the ext4 call this morning>
> > >>>>>>>
> > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> > >>>>>>> that were pending at the time.  It might be the case that you have to
> > >>>>>>> do that in the reverse order; I only know enough about the design of fuse
> > >>>>>>> to suspect that to be true.
> > >>>>>>>
> > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> > >>>>>>
> > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > >>>>>> but probably GETATTR is a better option.
> > >>>>>>
> > >>>>>> So, are you currently working on any of this?  Are you implementing this
> > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> > >>>>>> look at fuse2fs too.
> > >>>>>
> > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> > >>>>> DDN side.
> > >>>>>
> > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> > >>>>> server restart we want kernel to recover inodes and their lookup count.
> > >>>>> Now inode recovery might be hard, because we currently only have a
> > >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> > >>>>> pointer.
> > >>>>>
> > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> > >>>>> outstanding requests. And that ends up in most cases in sending requests
> > >>>>> with invalid node-IDs, that are casted and might provoke random memory
> > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> > >>>>> open_by_handle_at doesn't work well right now.
> > >>>>>
> > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> > >>>>> The file handles could be stored into the fuse inode and also used for
> > >>>>> NFS export.
> > >>>>>
> > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> > >>>>> Adding Amir to CC.
> > >>>>
> > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> > >>>
> > >>> Thanks for the reference Amir! I even had been in that thread.
> > >>>
> > >>>>
> > >>>>>
> > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> > >>>>> Any objections against that?
> > >>
> > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> > >> didn't delete it, obviously.
> > >
> > > FUSE_LOOKUP_HANDLE is a contract.
> > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> > > this contract, otherwise there is no way for client to know that the
> > > nodeids are persistent.
> > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> > > API trivial.
> > >
> > >>
> > >> I suppose you could just ask for refreshed stat information and either
> > >> the server gives it to you and the fuse_inode lives; or the server
> > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> > >> patches to form a real opinion.
> > >>
> > >
> > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> > > where fuse_instance_id can be its start time or random number.
> > > for auto invalidate, or maybe the fuse_instance_id should be
> > > a native part of FUSE protocol so that client knows to only invalidate
> > > attr cache in case of fuse_instance_id change?
> > >
> > > In any case, instead of a storm of revalidate messages after
> > > server restart, do it lazily on demand.
> >
> > For a network file system, probably. For fuse4fs or other block
> > based file systems, not sure. Darrick has the example of fsck.
> > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> > fuse-server gets restarted, fsck'ed and some files get removed.
> > Now reading these inodes would still work - wouldn't it
> > be better to invalidate the cache before going into operation
> > again?
> 
> Forgive me, I was making a wrong assumption that fuse4fs
> was using ext4 filehandle as nodeid, but of course it does not.

Well now that you mention it, there /is/ a risk of shenanigans like
that.  Consider:

1) fuse4fs mount an ext4 filesystem
2) crash the fuse4fs server
<fuse4fs server restart stalls...>
3) e2fsck -fy /dev/XXX deletes inode 17
4) someone else mounts the fs, makes some changes that result in 17
   being reallocated, user says "OOOOOPS", unmounts it
5) fuse4fs server finally restarts, and reconnects to the kernel

Hey, inode 17 is now a different file!!

So maybe the nodeid has to be an actual file handle.  Oh wait, no,
everything's (potentially) fine because fuse4fs supplied i_generation to
the kernel, and fuse_stale_inode will mark it bad if that happens.

Hm ok then, at least there's a way out. :)

> The reason I made this wrong assumption is because fuse4fs *can*
> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> which is what my fuse passthrough library [1] does.
> 
> My claim was that although fuse4fs could support safe restart, which
> cannot read from recycled inode number with current FUSE protocol,
> doing so with FUSE_HANDLE protocol would express a commitment

Pardon my naïvete, but what is FUSE_HANDLE?

$ git grep -w FUSE_HANDLE fs
$

--D

> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/libfuse/commits/fuse_passthrough
> 


* Re: [RFC] Another take at restarting FUSE servers
  2025-09-16  2:53                       ` Darrick J. Wong
@ 2025-09-16  7:59                         ` Amir Goldstein
  2025-09-18 17:50                           ` Darrick J. Wong
  2025-11-04 11:40                           ` Luis Henriques
  0 siblings, 2 replies; 46+ messages in thread
From: Amir Goldstein @ 2025-09-16  7:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> > >
> > >
> > >
> > > On 9/15/25 09:07, Amir Goldstein wrote:
> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >>
> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> > > >>>
> > > >>>
> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> > > >>>>>>
> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > > >>>>>>>>> aren't totally crazy.
> > > >>>>>>>>
> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
> > > >>>>>>>> potentially to be out of sync, right?
> > > >>>>>>>>
> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> > > >>>>>>>
> > > >>>>>>> <echoing what we said on the ext4 call this morning>
> > > >>>>>>>
> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> > > >>>>>>> that were pending at the time.  It might be the case that you have to
> > > >>>>>>> do that in the reverse order; I only know enough about the design of fuse
> > > >>>>>>> to suspect that to be true.
> > > >>>>>>>
> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> > > >>>>>>
> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > > >>>>>> but probably GETATTR is a better option.
> > > >>>>>>
> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> > > >>>>>> look at fuse2fs too.
> > > >>>>>
> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> > > >>>>> DDN side.
> > > >>>>>
> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
> > > >>>>> Now inode recovery might be hard, because we currently only have a
> > > >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> > > >>>>> pointer.
> > > >>>>>
> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> > > >>>>> open_by_handle_at doesn't work well right now.
> > > >>>>>
> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> > > >>>>> The file handles could be stored into the fuse inode and also used for
> > > >>>>> NFS export.
> > > >>>>>
> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> > > >>>>> Adding Amir to CC.
> > > >>>>
> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> > > >>>
> > > >>> Thanks for the reference Amir! I even had been in that thread.
> > > >>>
> > > >>>>
> > > >>>>>
> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> > > >>>>> Any objections against that?
> > > >>
> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> > > >> didn't delete it, obviously.
> > > >
> > > > FUSE_LOOKUP_HANDLE is a contract.
> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> > > > this contract, otherwise there is no way for client to know that the
> > > > nodeids are persistent.
> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> > > > API trivial.
> > > >
> > > >>
> > > >> I suppose you could just ask for refreshed stat information and either
> > > >> the server gives it to you and the fuse_inode lives; or the server
> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> > > >> patches to form a real opinion.
> > > >>
> > > >
> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> > > > where fuse_instance_id can be its start time or random number.
> > > > for auto invalidate, or maybe the fuse_instance_id should be
> > > > a native part of FUSE protocol so that client knows to only invalidate
> > > > attr cache in case of fuse_instance_id change?
> > > >
> > > > In any case, instead of a storm of revalidate messages after
> > > > server restart, do it lazily on demand.
> > >
> > > For a network file system, probably. For fuse4fs or other block
> > > based file systems, not sure. Darrick has the example of fsck.
> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> > > fuse-server gets restarted, fsck'ed and some files get removed.
> > > Now reading these inodes would still work - wouldn't it
> > > be better to invalidate the cache before going into operation
> > > again?
> >
> > Forgive me, I was making a wrong assumption that fuse4fs
> > was using ext4 filehandle as nodeid, but of course it does not.
>
> Well now that you mention it, there /is/ a risk of shenanigans like
> that.  Consider:
>
> 1) fuse4fs mount an ext4 filesystem
> 2) crash the fuse4fs server
> <fuse4fs server restart stalls...>
> 3) e2fsck -fy /dev/XXX deletes inode 17
> 4) someone else mounts the fs, makes some changes that result in 17
>    being reallocated, user says "OOOOOPS", unmounts it
> 5) fuse4fs server finally restarts, and reconnects to the kernel
>
> Hey, inode 17 is now a different file!!
>
> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> everything's (potentially) fine because fuse4fs supplied i_generation to
> the kernel, and fuse_stale_inode will mark it bad if that happens.
>
> Hm ok then, at least there's a way out. :)
>

Right.

> > The reason I made this wrong assumption is because fuse4fs *can*
> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> > which is what my fuse passthrough library [1] does.
> >
> > My claim was that although fuse4fs could support safe restart, which
> > cannot read from recycled inode number with current FUSE protocol,
> > doing so with FUSE_HANDLE protocol would express a commitment
>
> Pardon my naïvete, but what is FUSE_HANDLE?
>
> $ git grep -w FUSE_HANDLE fs
> $

Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/

The idea is to communicate a variable-sized "nodeid", which the server
can also declare to be an object id that survives a server restart.

Basically, the reason that I brought up LOOKUP_HANDLE is to
properly support NFS export of fuse filesystems.

My incentive was to support a proper fuse server restart/remount/re-export
with the same fsid in /etc/exports, but this gives us a better starting point
for fuse server restart/re-connect.
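
To illustrate the variable-sized handle idea, here is a minimal
user-space sketch in C.  Nothing here is an existing FUSE API: struct
example_handle, example_handle_pack() and example_handle_equal() are
hypothetical names, and the <nodeid:instance_id> layout and the 128-byte
upper bound are simply taken from suggestions made earlier in this
thread.  The assumption is that a server with persistent ids packs a
constant instance id, while a server with in-memory ids packs a value
that changes on every start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_HANDLE_MAX 128	/* upper bound suggested in the thread */

/* Hypothetical opaque, variable-sized handle. */
struct example_handle {
	uint8_t len;				/* number of valid bytes */
	uint8_t data[EXAMPLE_HANDLE_MAX];
};

/*
 * Pack <nodeid:instance_id> into the handle in host byte order.
 * instance_id == 0 models a server whose nodeids survive a restart.
 */
static void example_handle_pack(struct example_handle *h,
				uint64_t nodeid, uint64_t instance_id)
{
	assert(h != NULL);

	memset(h, 0, sizeof(*h));
	memcpy(h->data, &nodeid, sizeof(nodeid));
	memcpy(h->data + sizeof(nodeid), &instance_id, sizeof(instance_id));
	h->len = sizeof(nodeid) + sizeof(instance_id);
}

/* Handles match only if both the length and every valid byte match. */
static bool example_handle_equal(const struct example_handle *a,
				 const struct example_handle *b)
{
	return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

With this layout, a handle issued before a restart compares equal to one
issued afterwards only if the server kept the same instance id, which is
the kind of commitment the handle is meant to express.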

Thanks,
Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-09-16  7:59                         ` Amir Goldstein
@ 2025-09-18 17:50                           ` Darrick J. Wong
  2025-11-04 11:40                           ` Luis Henriques
  1 sibling, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-09-18 17:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen

On Tue, Sep 16, 2025 at 09:59:36AM +0200, Amir Goldstein wrote:
> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> > > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> > > >
> > > >
> > > >
> > > > On 9/15/25 09:07, Amir Goldstein wrote:
> > > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >>
> > > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> > > > >>>
> > > > >>>
> > > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> > > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> > > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> > > > >>>>>>
> > > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> > > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> > > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> > > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> > > > >>>>>>>>> aren't totally crazy.
> > > > >>>>>>>>
> > > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> > > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> > > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> > > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> > > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
> > > > >>>>>>>> potentially to be out of sync, right?
> > > > >>>>>>>>
> > > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> > > > >>>>>>>
> > > > >>>>>>> <echoing what we said on the ext4 call this morning>
> > > > >>>>>>>
> > > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> > > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> > > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> > > > >>>>>>> that were pending at the time.  It might be the case that you have to
> > > > >>>>>>> do that in the reverse order; I only know enough about the design of fuse
> > > > >>>>>>> to suspect that to be true.
> > > > >>>>>>>
> > > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> > > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> > > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> > > > >>>>>>
> > > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > > > >>>>>> but probably GETATTR is a better option.
> > > > >>>>>>
> > > > >>>>>> So, are you currently working on any of this?  Are you implementing this
> > > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> > > > >>>>>> look at fuse2fs too.
> > > > >>>>>
> > > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> > > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> > > > >>>>> DDN side.
> > > > >>>>>
> > > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> > > > >>>>> server restart we want kernel to recover inodes and their lookup count.
> > > > >>>>> Now inode recovery might be hard, because we currently only have a
> > > > >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> > > > >>>>> pointer.
> > > > >>>>>
> > > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> > > > >>>>> outstanding requests. And that ends up in most cases in sending requests
> > > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
> > > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> > > > >>>>> open_by_handle_at doesn't work well right now.
> > > > >>>>>
> > > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> > > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> > > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> > > > >>>>> The file handles could be stored into the fuse inode and also used for
> > > > >>>>> NFS export.
> > > > >>>>>
> > > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> > > > >>>>> Adding Amir to CC.
> > > > >>>>
> > > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> > > > >>>
> > > > >>> Thanks for the reference Amir! I even had been in that thread.
> > > > >>>
> > > > >>>>
> > > > >>>>>
> > > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> > > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> > > > >>>>> Any objections against that?
> > > > >>
> > > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> > > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> > > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> > > > >> didn't delete it, obviously.
> > > > >
> > > > > FUSE_LOOKUP_HANDLE is a contract.
> > > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> > > > > this contract, otherwise there is no way for client to know that the
> > > > > nodeids are persistent.
> > > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> > > > > API trivial.
> > > > >
> > > > >>
> > > > >> I suppose you could just ask for refreshed stat information and either
> > > > >> the server gives it to you and the fuse_inode lives; or the server
> > > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> > > > >> patches to form a real opinion.
> > > > >>
> > > > >
> > > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> > > > > where fuse_instance_id can be its start time or random number.
> > > > > for auto invalidate, or maybe the fuse_instance_id should be
> > > > > a native part of FUSE protocol so that client knows to only invalidate
> > > > > attr cache in case of fuse_instance_id change?
> > > > >
> > > > > In any case, instead of a storm of revalidate messages after
> > > > > server restart, do it lazily on demand.
> > > >
> > > > For a network file system, probably. For fuse4fs or other block
> > > > based file systems, not sure. Darrick has the example of fsck.
> > > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> > > > fuse-server gets restarted, fsck'ed and some files get removed.
> > > > Now reading these inodes would still work - wouldn't it
> > > > be better to invalidate the cache before going into operation
> > > > again?
> > >
> > > Forgive me, I was making a wrong assumption that fuse4fs
> > > was using ext4 filehandle as nodeid, but of course it does not.
> >
> > Well now that you mention it, there /is/ a risk of shenanigans like
> > that.  Consider:
> >
> > 1) fuse4fs mount an ext4 filesystem
> > 2) crash the fuse4fs server
> > <fuse4fs server restart stalls...>
> > 3) e2fsck -fy /dev/XXX deletes inode 17
> > 4) someone else mounts the fs, makes some changes that result in 17
> >    being reallocated, user says "OOOOOPS", unmounts it
> > 5) fuse4fs server finally restarts, and reconnects to the kernel
> >
> > Hey, inode 17 is now a different file!!
> >
> > So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> > everything's (potentially) fine because fuse4fs supplied i_generation to
> > the kernel, and fuse_stale_inode will mark it bad if that happens.
> >
> > Hm ok then, at least there's a way out. :)
> >
> 
> Right.
> 
> > > The reason I made this wrong assumption is because fuse4fs *can*
> > > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> > > which is what my fuse passthrough library [1] does.
> > >
> > > My claim was that although fuse4fs could support safe restart, which
> > > cannot read from recycled inode number with current FUSE protocol,
> > > doing so with FUSE_HANDLE protocol would express a commitment
> >
> > Pardon my naïvete, but what is FUSE_HANDLE?
> >
> > $ git grep -w FUSE_HANDLE fs
> > $
> 
> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> 
> Which means to communicate a variable sized "nodeid"
> which can also be declared as an object id that survives server restart.
> 
> Basically, the reason that I brought up LOOKUP_HANDLE is to
> properly support NFS export of fuse filesystems.
> 
> My incentive was to support a proper fuse server restart/remount/re-export
> with the same fsid in /etc/exports, but this gives us a better starting point
> for fuse server restart/re-connect.

Ah.  I don't think that's necessary for ext4, but probably desirable for
fancy filesystems that support things like subvolumes or do weird stuff.

--D

> Thanks,
> Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-09-16  7:59                         ` Amir Goldstein
  2025-09-18 17:50                           ` Darrick J. Wong
@ 2025-11-04 11:40                           ` Luis Henriques
  2025-11-04 13:10                             ` Amir Goldstein
  1 sibling, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-11-04 11:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Tue, Sep 16 2025, Amir Goldstein wrote:

> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
>> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>> > >
>> > >
>> > >
>> > > On 9/15/25 09:07, Amir Goldstein wrote:
>> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>> > > >>
>> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>> > > >>>
>> > > >>>
>> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
>> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
>> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>> > > >>>>>>
>> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>> > > >>>>>>>>>
>> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
>> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>> > > >>>>>>>>> aren't totally crazy.
>> > > >>>>>>>>
>> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
>> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
>> > > >>>>>>>> potentally to be out of sync, right?
>> > > >>>>>>>>
>> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
>> > > >>>>>>>
>> > > >>>>>>> <echoing what we said on the ext4 call this morning>
>> > > >>>>>>>
>> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
>> > > >>>>>>> that were pending at the time.  It might be the case that you have to
>> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse
>> > > >>>>>>> to suspect that to be true.
>> > > >>>>>>>
>> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
>> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
>> > > >>>>>>
>> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>> > > >>>>>> but probably GETATTR is a better option.
>> > > >>>>>>
>> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
>> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>> > > >>>>>> look at fuse2fs too.
>> > > >>>>>
>> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
>> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
>> > > >>>>> DDN side.
>> > > >>>>>
>> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
>> > > >>>>> Now inode recovery might be hard, because we currently only have a
>> > > >>>>> 64-bit node-id - which is used my most fuse application as memory
>> > > >>>>> pointer.
>> > > >>>>>
>> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
>> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
>> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
>> > > >>>>> open_by_handle_at doesn't work well right now.
>> > > >>>>>
>> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
>> > > >>>>> The file handles could be stored into the fuse inode and also used for
>> > > >>>>> NFS export.
>> > > >>>>>
>> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>> > > >>>>> Adding Amir to CC.
>> > > >>>>
>> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>> > > >>>
>> > > >>> Thanks for the reference Amir! I even had been in that thread.
>> > > >>>
>> > > >>>>
>> > > >>>>>
>> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>> > > >>>>> Any objections against that?
>> > > >>
>> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
>> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>> > > >> didn't delete it, obviously.
>> > > >
>> > > > FUSE_LOOKUP_HANDLE is a contract.
>> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
>> > > > this contract, otherwise there is no way for client to know that the
>> > > > nodeids are persistent.
>> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
>> > > > API trivial.
>> > > >
>> > > >>
>> > > >> I suppose you could just ask for refreshed stat information and either
>> > > >> the server gives it to you and the fuse_inode lives; or the server
>> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
>> > > >> patches to form a real opinion.
>> > > >>
>> > > >
>> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
>> > > > where fuse_instance_id can be its start time or random number.
>> > > > for auto invalidate, or maybe the fuse_instance_id should be
>> > > > a native part of FUSE protocol so that client knows to only invalidate
>> > > > attr cache in case of fuse_instance_id change?
>> > > >
>> > > > In any case, instead of a storm of revalidate messages after
>> > > > server restart, do it lazily on demand.
>> > >
>> > > For a network file system, probably. For fuse4fs or other block
>> > > based file systems, not sure. Darrick has the example of fsck.
>> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
>> > > fuse-server gets restarted, fsck'ed and some files get removed.
>> > > Now reading these inodes would still work - wouldn't it
>> > > be better to invalidate the cache before going into operation
>> > > again?
>> >
>> > Forgive me, I was making a wrong assumption that fuse4fs
>> > was using ext4 filehandle as nodeid, but of course it does not.
>>
>> Well now that you mention it, there /is/ a risk of shenanigans like
>> that.  Consider:
>>
>> 1) fuse4fs mount an ext4 filesystem
>> 2) crash the fuse4fs server
>> <fuse4fs server restart stalls...>
>> 3) e2fsck -fy /dev/XXX deletes inode 17
>> 4) someone else mounts the fs, makes some changes that result in 17
>>    being reallocated, user says "OOOOOPS", unmounts it
>> 5) fuse4fs server finally restarts, and reconnects to the kernel
>>
>> Hey, inode 17 is now a different file!!
>>
>> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
>> everything's (potentially) fine because fuse4fs supplied i_generation to
>> the kernel, and fuse_stale_inode will mark it bad if that happens.
>>
>> Hm ok then, at least there's a way out. :)
>>
>
> Right.
>
>> > The reason I made this wrong assumption is because fuse4fs *can*
>> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
>> > which is what my fuse passthough library [1] does.
>> >
>> > My claim was that although fuse4fs could support safe restart, which
>> > cannot read from recycled inode number with current FUSE protocol,
>> > doing so with FUSE_HANDLE protocol would express a commitment
>>
>> Pardon my naïvete, but what is FUSE_HANDLE?
>>
>> $ git grep -w FUSE_HANDLE fs
>> $
>
> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>
> Which means to communicate a variable sized "nodeid"
> which can also be declared as an object id that survives server restart.
>
> Basically, the reason that I brought up LOOKUP_HANDLE is to
> properly support NFS export of fuse filesystems.
>
> My incentive was to support a proper fuse server restart/remount/re-export
> with the same fsid in /etc/exports, but this gives us a better starting point
> for fuse server restart/re-connect.

Sorry for resurrecting (again!) this discussion.  I've been thinking about
this, and trying to put together an initial RFC for this LOOKUP_HANDLE
operation.  However, I feel there are other operations that will need to
return this new handle.

For example, FUSE_CREATE (for atomic_open) also returns a nodeid.
Doesn't this mean that, if the user-space server supports the new
LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
request?  The same question applies to TMPFILE, LINK, etc.  Or is there
something special about the LOOKUP operation that I'm missing?

Cheers,
-- 
Luís

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-04 11:40                           ` Luis Henriques
@ 2025-11-04 13:10                             ` Amir Goldstein
  2025-11-04 14:52                               ` Luis Henriques
  2025-11-05 22:24                               ` Bernd Schubert
  0 siblings, 2 replies; 46+ messages in thread
From: Amir Goldstein @ 2025-11-04 13:10 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
>
> On Tue, Sep 16 2025, Amir Goldstein wrote:
>
> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >>
> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >> > >
> >> > >
> >> > >
> >> > > On 9/15/25 09:07, Amir Goldstein wrote:
> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >> > > >>
> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >> > > >>>
> >> > > >>>
> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >> > > >>>>>
> >> > > >>>>>
> >> > > >>>>>
> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >> > > >>>>>>
> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >> > > >>>>>>>>>
> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >> > > >>>>>>>>> aren't totally crazy.
> >> > > >>>>>>>>
> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> >> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
> >> > > >>>>>>>> potentally to be out of sync, right?
> >> > > >>>>>>>>
> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >> > > >>>>>>>
> >> > > >>>>>>> <echoing what we said on the ext4 call this morning>
> >> > > >>>>>>>
> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> >> > > >>>>>>> that were pending at the time.  It might be the case that you have to
> >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse
> >> > > >>>>>>> to suspect that to be true.
> >> > > >>>>>>>
> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> >> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> >> > > >>>>>>
> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >> > > >>>>>> but probably GETATTR is a better option.
> >> > > >>>>>>
> >> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
> >> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >> > > >>>>>> look at fuse2fs too.
> >> > > >>>>>
> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >> > > >>>>> DDN side.
> >> > > >>>>>
> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
> >> > > >>>>> Now inode recovery might be hard, because we currently only have a
> >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory
> >> > > >>>>> pointer.
> >> > > >>>>>
> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
> >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> >> > > >>>>> open_by_handle_at doesn't work well right now.
> >> > > >>>>>
> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> >> > > >>>>> The file handles could be stored into the fuse inode and also used for
> >> > > >>>>> NFS export.
> >> > > >>>>>
> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >> > > >>>>> Adding Amir to CC.
> >> > > >>>>
> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >> > > >>>
> >> > > >>> Thanks for the reference Amir! I even had been in that thread.
> >> > > >>>
> >> > > >>>>
> >> > > >>>>>
> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >> > > >>>>> Any objections against that?
> >> > > >>
> >> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> >> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >> > > >> didn't delete it, obviously.
> >> > > >
> >> > > > FUSE_LOOKUP_HANDLE is a contract.
> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> >> > > > this contract, otherwise there is no way for client to know that the
> >> > > > nodeids are persistent.
> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> >> > > > API trivial.
> >> > > >
> >> > > >>
> >> > > >> I suppose you could just ask for refreshed stat information and either
> >> > > >> the server gives it to you and the fuse_inode lives; or the server
> >> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> >> > > >> patches to form a real opinion.
> >> > > >>
> >> > > >
> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> >> > > > where fuse_instance_id can be its start time or random number.
> >> > > > for auto invalidate, or maybe the fuse_instance_id should be
> >> > > > a native part of FUSE protocol so that client knows to only invalidate
> >> > > > attr cache in case of fuse_instance_id change?
> >> > > >
> >> > > > In any case, instead of a storm of revalidate messages after
> >> > > > server restart, do it lazily on demand.
> >> > >
> >> > > For a network file system, probably. For fuse4fs or other block
> >> > > based file systems, not sure. Darrick has the example of fsck.
> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> >> > > fuse-server gets restarted, fsck'ed and some files get removed.
> >> > > Now reading these inodes would still work - wouldn't it
> >> > > be better to invalidate the cache before going into operation
> >> > > again?
> >> >
> >> > Forgive me, I was making a wrong assumption that fuse4fs
> >> > was using ext4 filehandle as nodeid, but of course it does not.
> >>
> >> Well now that you mention it, there /is/ a risk of shenanigans like
> >> that.  Consider:
> >>
> >> 1) fuse4fs mount an ext4 filesystem
> >> 2) crash the fuse4fs server
> >> <fuse4fs server restart stalls...>
> >> 3) e2fsck -fy /dev/XXX deletes inode 17
> >> 4) someone else mounts the fs, makes some changes that result in 17
> >>    being reallocated, user says "OOOOOPS", unmounts it
> >> 5) fuse4fs server finally restarts, and reconnects to the kernel
> >>
> >> Hey, inode 17 is now a different file!!
> >>
> >> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> >> everything's (potentially) fine because fuse4fs supplied i_generation to
> >> the kernel, and fuse_stale_inode will mark it bad if that happens.
> >>
> >> Hm ok then, at least there's a way out. :)
> >>
> >
> > Right.
> >
> >> > The reason I made this wrong assumption is because fuse4fs *can*
> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> >> > which is what my fuse passthough library [1] does.
> >> >
> >> > My claim was that although fuse4fs could support safe restart, which
> >> > cannot read from recycled inode number with current FUSE protocol,
> >> > doing so with FUSE_HANDLE protocol would express a commitment
> >>
> >> Pardon my naïvete, but what is FUSE_HANDLE?
> >>
> >> $ git grep -w FUSE_HANDLE fs
> >> $
> >
> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >
> > Which means to communicate a variable sized "nodeid"
> > which can also be declared as an object id that survives server restart.
> >
> > Basically, the reason that I brought up LOOKUP_HANDLE is to
> > properly support NFS export of fuse filesystems.
> >
> > My incentive was to support a proper fuse server restart/remount/re-export
> > with the same fsid in /etc/exports, but this gives us a better starting point
> > for fuse server restart/re-connect.
>
> Sorry for resurrecting (again!) this discussion.  I've been thinking about
> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
> However, I feel there are other operations that will need to return this
> new handle.
>
> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
> Doesn't this means that, if the user-space server supports the new
> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE
> request?

Yes, I think that's what it means.

> The same question applies for TMPFILE, LINK, etc.  Or is there
> something special about the LOOKUP operation that I'm missing?
>

Any command that returns fuse_entry_out, so also:

READDIRPLUS, MKNOD, MKDIR, SYMLINK

fuse_entry_out was extended once, and fuse_reply_entry() sends the size
of the struct.  However, fuse_reply_create() sends it with fuse_open_out
appended, and fuse_add_direntry_plus() does not seem to write the record
size at all, so server and client will need to agree on the size of
fuse_entry_out, and this would need to be backward compatible.
If both server and client declare support for FUSE_LOOKUP_HANDLE,
it should be fine (?).
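
To illustrate why the two sides have to agree on the entry-out size when
records carry no length of their own, here is a minimal sketch.  It is not
the real FUSE ABI: struct toy_dirent, the toy_* helpers and the sizes 128
and 144 are all hypothetical, chosen only for illustration.  Each record is
an opaque entry-out area of the agreed size, then a dirent header, then the
name, padded to 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dirent header, loosely modelled on a readdirplus record. */
struct toy_dirent {
	uint64_t ino;
	uint64_t off;
	uint32_t namelen;
	uint32_t type;
	/* name bytes follow, record padded to 8 bytes */
};

#define TOY_ALIGN(x) (((x) + 7u) & ~(size_t)7u)

/* Length of one record; it depends on the *agreed* entry-out size. */
static size_t toy_direntplus_size(size_t entry_out_size, uint32_t namelen)
{
	return TOY_ALIGN(entry_out_size + sizeof(struct toy_dirent) + namelen);
}

/* Append one record at pos; returns the new position, or 0 if it won't fit. */
static size_t toy_add_direntplus(unsigned char *buf, size_t cap, size_t pos,
				 size_t entry_out_size, uint64_t ino,
				 const char *name)
{
	struct toy_dirent d = { .ino = ino, .namelen = (uint32_t)strlen(name) };
	size_t rec = toy_direntplus_size(entry_out_size, d.namelen);

	if (pos > cap || rec > cap - pos)
		return 0;
	memset(buf + pos, 0, rec);	/* entry-out area left zeroed here */
	memcpy(buf + pos + entry_out_size, &d, sizeof(d));
	memcpy(buf + pos + entry_out_size + sizeof(d), name, d.namelen);
	return pos + rec;
}

/* Walk the buffer; returns the record count, or -1 if a record is cut short. */
static int toy_count_direntplus(const unsigned char *buf, size_t len,
				size_t entry_out_size)
{
	size_t pos = 0;
	int n = 0;

	while (pos < len) {
		struct toy_dirent d;
		size_t rec;

		if (len - pos < entry_out_size + sizeof(d))
			return -1;
		/* The header is found only by skipping the agreed size. */
		memcpy(&d, buf + pos + entry_out_size, sizeof(d));
		rec = toy_direntplus_size(entry_out_size, d.namelen);
		if (rec > len - pos)
			return -1;
		pos += rec;
		n++;
	}
	return n;
}
```

A writer that packs two records with an entry-out size of 128 and a reader
that walks them with 128 sees both records; a reader that assumes 144
misreads the first header and does not recover them.  That is the case the
feature negotiation would have to rule out.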

Thanks,
Amir.

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-04 13:10                             ` Amir Goldstein
@ 2025-11-04 14:52                               ` Luis Henriques
  2025-11-05 10:21                                 ` Amir Goldstein
  2025-11-05 22:24                               ` Bernd Schubert
  1 sibling, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-11-04 14:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Tue, Nov 04 2025, Amir Goldstein wrote:

> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
>>
>> On Tue, Sep 16 2025, Amir Goldstein wrote:
>>
>> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>> >>
>> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
>> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > > On 9/15/25 09:07, Amir Goldstein wrote:
>> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>> >> > > >>
>> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>> >> > > >>>
>> >> > > >>>
>> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
>> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>> >> > > >>>>>
>> >> > > >>>>>
>> >> > > >>>>>
>> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
>> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>> >> > > >>>>>>
>> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>> >> > > >>>>>>>>>
>> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>> >> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
>> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>> >> > > >>>>>>>>> aren't totally crazy.
>> >> > > >>>>>>>>
>> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
>> >> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
>> >> > > >>>>>>>> potentally to be out of sync, right?
>> >> > > >>>>>>>>
>> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
>> >> > > >>>>>>>
>> >> > > >>>>>>> <echoing what we said on the ext4 call this morning>
>> >> > > >>>>>>>
>> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
>> >> > > >>>>>>> that were pending at the time.  It might be the case that you have to
>> >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse
>> >> > > >>>>>>> to suspect that to be true.
>> >> > > >>>>>>>
>> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
>> >> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
>> >> > > >>>>>>
>> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>> >> > > >>>>>> but probably GETATTR is a better option.
>> >> > > >>>>>>
>> >> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
>> >> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>> >> > > >>>>>> look at fuse2fs too.
>> >> > > >>>>>
>> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
>> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
>> >> > > >>>>> DDN side.
>> >> > > >>>>>
>> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
>> >> > > >>>>> Now inode recovery might be hard, because we currently only have a
>> >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory
>> >> > > >>>>> pointer.
>> >> > > >>>>>
>> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
>> >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
>> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
>> >> > > >>>>> open_by_handle_at doesn't work well right now.
>> >> > > >>>>>
>> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
>> >> > > >>>>> The file handles could be stored into the fuse inode and also used for
>> >> > > >>>>> NFS export.
>> >> > > >>>>>
>> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>> >> > > >>>>> Adding Amir to CC.
>> >> > > >>>>
>> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>> >> > > >>>
>> >> > > >>> Thanks for the reference Amir! I even had been in that thread.
>> >> > > >>>
>> >> > > >>>>
>> >> > > >>>>>
>> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>> >> > > >>>>> Any objections against that?
>> >> > > >>
>> >> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
>> >> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>> >> > > >> didn't delete it, obviously.
>> >> > > >
>> >> > > > FUSE_LOOKUP_HANDLE is a contract.
>> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
>> >> > > > this contract, otherwise there is no way for client to know that the
>> >> > > > nodeids are persistent.
>> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
>> >> > > > API trivial.
>> >> > > >
>> >> > > >>
>> >> > > >> I suppose you could just ask for refreshed stat information and either
>> >> > > >> the server gives it to you and the fuse_inode lives; or the server
>> >> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
>> >> > > >> patches to form a real opinion.
>> >> > > >>
>> >> > > >
>> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
>> >> > > > where fuse_instance_id can be its start time or random number.
>> >> > > > for auto invalidate, or maybe the fuse_instance_id should be
>> >> > > > a native part of FUSE protocol so that client knows to only invalidate
>> >> > > > attr cache in case of fuse_instance_id change?
>> >> > > >
>> >> > > > In any case, instead of a storm of revalidate messages after
>> >> > > > server restart, do it lazily on demand.
>> >> > >
>> >> > > For a network file system, probably. For fuse4fs or other block
>> >> > > based file systems, not sure. Darrick has the example of fsck.
>> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
>> >> > > fuse-server gets restarted, fsck'ed and some files get removed.
>> >> > > Now reading these inodes would still work - wouldn't it
>> >> > > be better to invalidate the cache before going into operation
>> >> > > again?
>> >> >
>> >> > Forgive me, I was making a wrong assumption that fuse4fs
>> >> > was using ext4 filehandle as nodeid, but of course it does not.
>> >>
>> >> Well now that you mention it, there /is/ a risk of shenanigans like
>> >> that.  Consider:
>> >>
>> >> 1) fuse4fs mount an ext4 filesystem
>> >> 2) crash the fuse4fs server
>> >> <fuse4fs server restart stalls...>
>> >> 3) e2fsck -fy /dev/XXX deletes inode 17
>> >> 4) someone else mounts the fs, makes some changes that result in 17
>> >>    being reallocated, user says "OOOOOPS", unmounts it
>> >> 5) fuse4fs server finally restarts, and reconnects to the kernel
>> >>
>> >> Hey, inode 17 is now a different file!!
>> >>
>> >> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
>> >> everything's (potentially) fine because fuse4fs supplied i_generation to
>> >> the kernel, and fuse_stale_inode will mark it bad if that happens.
>> >>
>> >> Hm ok then, at least there's a way out. :)
>> >>
>> >
>> > Right.
>> >
>> >> > The reason I made this wrong assumption is because fuse4fs *can*
>> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
>> >> > which is what my fuse passthough library [1] does.
>> >> >
>> >> > My claim was that although fuse4fs could support safe restart, which
>> >> > cannot read from recycled inode number with current FUSE protocol,
>> >> > doing so with FUSE_HANDLE protocol would express a commitment
>> >>
>> >> Pardon my naïvete, but what is FUSE_HANDLE?
>> >>
>> >> $ git grep -w FUSE_HANDLE fs
>> >> $
>> >
>> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
>> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>> >
>> > Which means to communicate a variable sized "nodeid"
>> > which can also be declared as an object id that survives server restart.
>> >
>> > Basically, the reason that I brought up LOOKUP_HANDLE is to
>> > properly support NFS export of fuse filesystems.
>> >
>> > My incentive was to support a proper fuse server restart/remount/re-export
>> > with the same fsid in /etc/exports, but this gives us a better starting point
>> > for fuse server restart/re-connect.
>>
>> Sorry for resurrecting (again!) this discussion.  I've been thinking about
>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
>> However, I feel there are other operations that will need to return this
>> new handle.
>>
>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
>> Doesn't this means that, if the user-space server supports the new
>> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE
>> request?
>
> Yes, I think that's what it means.

Awesome, thank you for confirming this.

>> The same question applies for TMPFILE, LINK, etc.  Or is there
>> something special about the LOOKUP operation that I'm missing?
>>
>
> Any command returning fuse_entry_out.
>
> READDIRPLUS, MKNOD, MKDIR, SYMLINK

Right, I had this list, but totally missed READDIRPLUS.

> fuse_entry_out was extended once and fuse_reply_entry()
> sends the size of the struct.

So, if I'm understanding you correctly, you're suggesting to extend
fuse_entry_out to add the new handle (a 'size' field + the actual handle).
That's probably a good idea.  I was working towards having the
LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
include:

 - An extra inarg: the parent directory handle.  (To be honest, I'm not
   really sure this would be needed.)
 - An extra outarg: for the actual handle.

With your suggestion, only the extra inarg would be required.

> However fuse_reply_create() sends it with fuse_open_out
> appended

This one should be fine...

> and fuse_add_direntry_plus() does not seem to write
> record size at all, so server and client will need to agree on the
> size of fuse_entry_out and this would need to be backward compat.
> If both server and client declare support for FUSE_LOOKUP_HANDLE
> it should be fine (?).

... yeah, this could be a bit trickier.  But I'll need to go look into it.

Thanks a lot for your comments, Amir.  I was trying to get an RFC out
soon(ish) to get early feedback, hoping to avoid going down the wrong
paths.

Cheers,
-- 
Luís


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-04 14:52                               ` Luis Henriques
@ 2025-11-05 10:21                                 ` Amir Goldstein
  2025-11-05 11:50                                   ` Luis Henriques
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2025-11-05 10:21 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen

On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:
>
> On Tue, Nov 04 2025, Amir Goldstein wrote:
>
> > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
> >>
> >> On Tue, Sep 16 2025, Amir Goldstein wrote:
> >>
> >> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >> >>
> >> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> >> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On 9/15/25 09:07, Amir Goldstein wrote:
> >> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >> >> > > >>
> >> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> >> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >> >> > > >>>>>
> >> >> > > >>>>>
> >> >> > > >>>>>
> >> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >> >> > > >>>>>>
> >> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >> >> > > >>>>>>>>>
> >> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >> >> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >> >> > > >>>>>>>>> aren't totally crazy.
> >> >> > > >>>>>>>>
> >> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> >> >> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> >> >> > > >>>>>>>> potentially to be out of sync, right?
> >> >> > > >>>>>>>>
> >> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> <echoing what we said on the ext4 call this morning>
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> >> >> > > >>>>>>> that were pending at the time.  It might be the case that you have to
> >> >> > > >>>>>>> do that in the reverse order; I only know enough about the design of fuse
> >> >> > > >>>>>>> to suspect that to be true.
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> >> >> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> >> >> > > >>>>>>
> >> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >> >> > > >>>>>> but probably GETATTR is a better option.
> >> >> > > >>>>>>
> >> >> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
> >> >> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >> >> > > >>>>>> look at fuse2fs too.
> >> >> > > >>>>>
> >> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >> >> > > >>>>> DDN side.
> >> >> > > >>>>>
> >> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
> >> >> > > >>>>> Now inode recovery might be hard, because we currently only have a
> >> >> > > >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> >> >> > > >>>>> pointer.
> >> >> > > >>>>>
> >> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
> >> >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory
> >> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> >> >> > > >>>>> open_by_handle_at doesn't work well right now.
> >> >> > > >>>>>
> >> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> >> >> > > >>>>> The file handles could be stored into the fuse inode and also used for
> >> >> > > >>>>> NFS export.
> >> >> > > >>>>>
> >> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >> >> > > >>>>> Adding Amir to CC.
> >> >> > > >>>>
> >> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >> >> > > >>>
> >> >> > > >>> Thanks for the reference Amir! I even had been in that thread.
> >> >> > > >>>
> >> >> > > >>>>
> >> >> > > >>>>>
> >> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >> >> > > >>>>> Any objections against that?
> >> >> > > >>
> >> >> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> >> >> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> >> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >> >> > > >> didn't delete it, obviously.
> >> >> > > >
> >> >> > > > FUSE_LOOKUP_HANDLE is a contract.
> >> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> >> >> > > > this contract, otherwise there is no way for client to know that the
> >> >> > > > nodeids are persistent.
> >> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> >> >> > > > API trivial.
> >> >> > > >
> >> >> > > >>
> >> >> > > >> I suppose you could just ask for refreshed stat information and either
> >> >> > > >> the server gives it to you and the fuse_inode lives; or the server
> >> >> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> >> >> > > >> patches to form a real opinion.
> >> >> > > >>
> >> >> > > >
> >> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> >> >> > > > where fuse_instance_id can be its start time or random number.
> >> >> > > > for auto invalidate, or maybe the fuse_instance_id should be
> >> >> > > > a native part of FUSE protocol so that client knows to only invalidate
> >> >> > > > attr cache in case of fuse_instance_id change?
> >> >> > > >
> >> >> > > > In any case, instead of a storm of revalidate messages after
> >> >> > > > server restart, do it lazily on demand.
> >> >> > >
> >> >> > > For a network file system, probably. For fuse4fs or other block
> >> >> > > based file systems, not sure. Darrick has the example of fsck.
> >> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> >> >> > > fuse-server gets restarted, fsck'ed and some files get removed.
> >> >> > > Now reading these inodes would still work - wouldn't it
> >> >> > > be better to invalidate the cache before going into operation
> >> >> > > again?
> >> >> >
> >> >> > Forgive me, I was making a wrong assumption that fuse4fs
> >> >> > was using ext4 filehandle as nodeid, but of course it does not.
> >> >>
> >> >> Well now that you mention it, there /is/ a risk of shenanigans like
> >> >> that.  Consider:
> >> >>
> >> >> 1) fuse4fs mount an ext4 filesystem
> >> >> 2) crash the fuse4fs server
> >> >> <fuse4fs server restart stalls...>
> >> >> 3) e2fsck -fy /dev/XXX deletes inode 17
> >> >> 4) someone else mounts the fs, makes some changes that result in 17
> >> >>    being reallocated, user says "OOOOOPS", unmounts it
> >> >> 5) fuse4fs server finally restarts, and reconnects to the kernel
> >> >>
> >> >> Hey, inode 17 is now a different file!!
> >> >>
> >> >> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> >> >> everything's (potentially) fine because fuse4fs supplied i_generation to
> >> >> the kernel, and fuse_stale_inode will mark it bad if that happens.
> >> >>
> >> >> Hm ok then, at least there's a way out. :)
> >> >>
> >> >
> >> > Right.
> >> >
> >> >> > The reason I made this wrong assumption is because fuse4fs *can*
> >> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> >> >> > which is what my fuse passthrough library [1] does.
> >> >> >
> >> >> > My claim was that although fuse4fs could support safe restart, which
> >> >> > cannot read from recycled inode number with current FUSE protocol,
> >> >> > doing so with FUSE_HANDLE protocol would express a commitment
> >> >>
> >> >> Pardon my naïvete, but what is FUSE_HANDLE?
> >> >>
> >> >> $ git grep -w FUSE_HANDLE fs
> >> >> $
> >> >
> >> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> >> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >> >
> >> > Which means to communicate a variable sized "nodeid"
> >> > which can also be declared as an object id that survives server restart.
> >> >
> >> > Basically, the reason that I brought up LOOKUP_HANDLE is to
> >> > properly support NFS export of fuse filesystems.
> >> >
> >> > My incentive was to support a proper fuse server restart/remount/re-export
> >> > with the same fsid in /etc/exports, but this gives us a better starting point
> >> > for fuse server restart/re-connect.
> >>
> >> Sorry for resurrecting (again!) this discussion.  I've been thinking about
> >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
> >> However, I feel there are other operations that will need to return this
> >> new handle.
> >>
> >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
> >> Doesn't this mean that, if the user-space server supports the new
> >> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
> >> request?
> >
> > Yes, I think that's what it means.
>
> Awesome, thank you for confirming this.
>
> >> The same question applies for TMPFILE, LINK, etc.  Or is there
> >> something special about the LOOKUP operation that I'm missing?
> >>
> >
> > Any command returning fuse_entry_out.
> >
> > READDIRPLUS, MKNOD, MKDIR, SYMLINK
>
> Right, I had this list, but totally missed READDIRPLUS.
>
> > fuse_entry_out was extended once and fuse_reply_entry()
> > sends the size of the struct.
>
> So, if I'm understanding you correctly, you're suggesting to extend
> fuse_entry_out to add the new handle (a 'size' field + the actual handle).

Well it depends...

There are several ways to do it.
I would really like to get Miklos and Bernd's opinion on the preferred way.

So far, it looks like the client determines the size of the output args.

If we want the server to be able to write a different file handle size
per inode that's going to be a bigger challenge.

I think it is enough if the server and client negotiate a max file handle
size and the client then always reserves enough space in the output
args buffer.

One more thing to ask is what "the actual handle" is.
If "the actual handle" is the variable sized struct file_handle then
the size is already available in the file handle header.
If it is not, then I think some sort of type or version of the file handle
encoding should be negotiated in addition to the max handle size.
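
To illustrate the first option, here is a minimal user-space sketch of a
self-describing handle whose header carries both length and type, the way
struct file_handle does.  The names fh_hdr, fh_wire_size() and fh_fits()
are made up for illustration and are not part of the FUSE protocol or of
any patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire header, modelled on the layout of struct file_handle. */
struct fh_hdr {
	uint32_t	handle_bytes;	/* length of the payload that follows */
	int32_t		handle_type;	/* encoding of the payload */
};

/* Total bytes a handle occupies on the wire: header plus payload. */
static size_t fh_wire_size(const struct fh_hdr *h)
{
	return sizeof(*h) + h->handle_bytes;
}

/* A handle is acceptable if it is non-empty and fits the negotiated max. */
static bool fh_fits(const struct fh_hdr *h, uint32_t negotiated_max)
{
	return h->handle_bytes > 0 && h->handle_bytes <= negotiated_max;
}
```

With a header like this the receiver can size and validate each handle on
its own, so only the maximum has to be agreed on up front.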

> That's probably a good idea.  I was working towards having the
> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
> include:
>
>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
>    really sure this would be needed.)

Yes, I think you need extra inarg.
Why would it not be needed?
The problem is that you cannot know if the parent node id in the lookup
command is stale after server restart.

The thing is that the kernel fuse inode will need to store the file handle,
much the same as an NFS client stores the file handle provided by the
NFS server.

FYI, fanotify has an optimized way to store file handles in
struct fanotify_fid_event - small file handles are stored inline
and larger file handles can use an external buffer.

But fuse does not need to support file handles of arbitrary size.
For the first version we could definitely simplify things by limiting the
size of supported file handles, because server and client need to negotiate
the max file handle size anyway.

>  - An extra outarg: for the actual handle.
>
> With your suggestion, only the extra inarg would be required.
>

Yes, either an extra arg or just an extended size of fuse_entry_out
negotiated at init time.

TBH it seems cleaner to add a 2nd outarg to all the commands,
but CREATE already has a 2nd arg, and a 2nd arg does not solve
READDIRPLUS.

> > However fuse_reply_create() sends it with fuse_open_out
> > appended
>
> This one should be fine...
>
> > and fuse_add_direntry_plus() does not seem to write
> > record size at all, so server and client will need to agree on the
> > size of fuse_entry_out and this would need to be backward compat.
> > If both server and client declare support for FUSE_LOOKUP_HANDLE
> > it should be fine (?).
>
> ... yeah, this could be a bit trickier.  But I'll need to go look into it.
>
> Thanks a lot for your comments, Amir.  I was trying to get an RFC out
> soon(ish) to get early feedback, hoping to prevent me following wrong
> paths.
>

Disclaimer: following my advice may well lead you down wrong paths.
Best to wait for confirmation from Miklos and Bernd if you want to have
more certainty...

Thanks,
Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 10:21                                 ` Amir Goldstein
@ 2025-11-05 11:50                                   ` Luis Henriques
  2025-11-05 15:30                                     ` Amir Goldstein
  0 siblings, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-11-05 11:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen, Matt Harvey

Hi Amir,

On Wed, Nov 05 2025, Amir Goldstein wrote:

> On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:

<...>

>> > fuse_entry_out was extended once and fuse_reply_entry()
>> > sends the size of the struct.
>>
>> So, if I'm understanding you correctly, you're suggesting to extend
>> fuse_entry_out to add the new handle (a 'size' field + the actual handle).
>
> Well it depends...
>
> There are several ways to do it.
> I would really like to get Miklos and Bernd's opinion on the preferred way.

Sure, all feedback is welcome!

> So far, it looks like the client determines the size of the output args.
>
> If we want the server to be able to write a different file handle size
> per inode that's going to be a bigger challenge.
>
> I think it's plenty enough if server and client negotiate a max file handle
> size and then the client always reserves enough space in the output
> args buffer.
>
> One more thing to ask is what is "the actual handle".
> If "the actual handle" is the variable sized struct file_handle then
> the size is already available in the file handle header.

Actually, this is exactly what I was trying to mimic for my initial
attempt.  However, I was not going to do any size negotiation but instead
define a maximum size for the handle.  See below.

> If it is not, then I think some sort of type or version of the file handles
> encoding should be negotiated beyond the max handle size.

In my initial stab at this I was going to take a very simple approach and
hard-code a maximum size for the handle.  This would have the advantage of
allowing the server to use different sizes for different inodes (though
I'm not sure how useful that would be in practice).  So, in summary, I
would define the new handle like this:

/* Same value as MAX_HANDLE_SZ */
#define FUSE_MAX_HANDLE_SZ 128

struct fuse_file_handle {
	uint32_t	size;
	uint32_t	padding;
	char		handle[FUSE_MAX_HANDLE_SZ];
};

and this struct would be included in fuse_entry_out.

There's probably a problem with having this (big) fixed size increase to
fuse_entry_out, but maybe that could be fixed once I have all the other
details sorted out.  Hopefully I'm not oversimplifying the problem,
skipping the need for negotiating a handle size.
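
To put numbers on the size concern, here is a minimal sketch that only
relies on the struct proposed above.  The helpers fixed_overhead() and
packed_size() are hypothetical and exist purely to compare "always send
the full array" against "send only the bytes in use, rounded up to 8".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as MAX_HANDLE_SZ */
#define FUSE_MAX_HANDLE_SZ 128

struct fuse_file_handle {
	uint32_t	size;
	uint32_t	padding;
	char		handle[FUSE_MAX_HANDLE_SZ];
};

/* 8-byte header plus the 128-byte payload array. */
static_assert(sizeof(struct fuse_file_handle) == 136,
	      "unexpected fuse_file_handle size");

/* Extra reply bytes for nr entries if each one carries the full struct. */
static size_t fixed_overhead(size_t nr)
{
	return nr * sizeof(struct fuse_file_handle);
}

/* Hypothetical alternative: header plus only the used bytes, 8-byte aligned. */
static size_t packed_size(uint32_t used)
{
	return offsetof(struct fuse_file_handle, handle) + ((used + 7u) & ~7u);
}
```

For example, a READDIRPLUS reply with 100 entries would grow by
fixed_overhead(100) bytes with the fixed layout, while a 12-byte handle
would need packed_size(12) bytes per entry in the packed one.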

>> That's probably a good idea.  I was working towards having the
>> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
>> include:
>>
>>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
>>    really sure this would be needed.)
>
> Yes, I think you need extra inarg.
> Why would it not be needed?
> The problem is that you cannot know if the parent node id in the lookup
> command is stale after server restart.

Ah, of course.  Hence the need for this extra inarg.

> The thing is that the kernel fuse inode will need to store the file handle,
> much the same as an NFS client stores the file handle provided by the
> NFS server.
>
> FYI, fanotify has an optimized way to store file handles in
> struct fanotify_fid_event - small file handles are stored inline
> and larger file handles can use an external buffer.
>
> But fuse does not need to support any size of file handles.
> For first version we could definitely simplify things by limiting the size
> of supported file handles, because server and client need to negotiate
> the max file handle size anyway.

I'll definitely need to have a look at how fanotify does that.  But I
guess that if my simplistic approach with a static array is acceptable for
now, I'll stick with it for the initial attempt to implement this, and
eventually revisit it later to do something more clever.

>>  - An extra outarg: for the actual handle.
>>
>> With your suggestion, only the extra inarg would be required.
>>
>
> Yes, either extra arg or just an extended size of fuse_entry_out
> negotiated at init time.
>
> TBH it seems cleaner to add 2nd outarg to all the commands,
> but CREATE already has a 2nd arg and 2nd arg does not solve
> READDIRPLUS.

Right.  I'm more and more convinced that extending fuse_entry_out is the
way to go.

>> > However fuse_reply_create() sends it with fuse_open_out
>> > appended
>>
>> This one should be fine...
>>
>> > and fuse_add_direntry_plus() does not seem to write
>> > record size at all, so server and client will need to agree on the
>> > size of fuse_entry_out and this would need to be backward compat.
>> > If both server and client declare support for FUSE_LOOKUP_HANDLE
>> > it should be fine (?).
>>
>> ... yeah, this could be a bit trickier.  But I'll need to go look into it.
>>
>> Thanks a lot for your comments, Amir.  I was trying to get an RFC out
>> soon(ish) to get early feedback, hoping to prevent me following wrong
>> paths.
>>
>
> Disclaimer, following my advice may well lead you down wrong paths..
> Best to wait for confirmation from Miklos and Bernd if you want to have
> more certainty...

Haha thanks for the warning :-)

And again, thanks a lot for your feedback, Amir.

Cheers,
-- 
Luís


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 11:50                                   ` Luis Henriques
@ 2025-11-05 15:30                                     ` Amir Goldstein
  2025-11-05 21:38                                       ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2025-11-05 15:30 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel,
	Kevin Chen, Matt Harvey

On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote:
>
> Hi Amir,
>
> On Wed, Nov 05 2025, Amir Goldstein wrote:
>
> > On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:
>
> <...>
>
> >> > fuse_entry_out was extended once and fuse_reply_entry()
> >> > sends the size of the struct.
> >>
> >> So, if I'm understanding you correctly, you're suggesting to extend
> >> fuse_entry_out to add the new handle (a 'size' field + the actual handle).
> >
> > Well it depends...
> >
> > There are several ways to do it.
> > I would really like to get Miklos and Bernd's opinion on the preferred way.
>
> Sure, all feedback is welcome!
>
> > So far, it looks like the client determines the size of the output args.
> >
> > If we want the server to be able to write a different file handle size
> > per inode that's going to be a bigger challenge.
> >
> > I think it's plenty enough if server and client negotiate a max file handle
> > size and then the client always reserves enough space in the output
> > args buffer.
> >
> > One more thing to ask is what is "the actual handle".
> > If "the actual handle" is the variable sized struct file_handle then
> > the size is already available in the file handle header.
>
> Actually, this is exactly what I was trying to mimic for my initial
> attempt.  However, I was not going to do any size negotiation but instead
> define a maximum size for the handle.  See below.
>
> > If it is not, then I think some sort of type or version of the file handles
> > encoding should be negotiated beyond the max handle size.
>
> In my initial stab at this I was going to take a very simple approach and
> hard-code a maximum size for the handle.  This would have the advantage of
> allowing the server to use different sizes for different inodes (though
> I'm not sure how useful that would be in practice).  So, in summary, I
> would define the new handle like this:
>
> /* Same value as MAX_HANDLE_SZ */
> #define FUSE_MAX_HANDLE_SZ 128
>
> struct fuse_file_handle {
>         uint32_t        size;
>         uint32_t        padding;

I think that the handle type is going to be relevant as well.

>         char            handle[FUSE_MAX_HANDLE_SZ];
> };
>
> and this struct would be included in fuse_entry_out.
>
> There's probably a problem with having this (big) fixed size increase to
> fuse_entry_out, but maybe that could be fixed once I have all the other
> details sorted out.  Hopefully I'm not oversimplifying the problem,
> skipping the need for negotiating a handle size.
>

Maybe this fixed size is reasonable for the first version of this FUSE
protocol extension, as long as the overhead is NOT added when the server
does not opt in to the feature.

IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0,
but keep the negotiation protocol extendable to another value later on.
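
A minimal sketch of what such a negotiation could look like, assuming a
single "max handle size" value exchanged at INIT time.  The function name
and the idea of clamping to the smaller value are assumptions made for
illustration; nothing like this exists in the FUSE protocol today.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_MAX_HANDLE_SZ 128

/*
 * Hypothetical INIT-time negotiation: the server asks for a max handle
 * size and the kernel answers with what it will actually reserve.
 * 0 on either side means the feature is off, so no per-entry overhead.
 */
static uint32_t negotiate_handle_sz(uint32_t server_wants, uint32_t kernel_max)
{
	if (server_wants == 0 || kernel_max == 0)
		return 0;
	return server_wants < kernel_max ? server_wants : kernel_max;
}
```

A first version would only ever see 0 or FUSE_MAX_HANDLE_SZ on both
sides, but because the value is a plain number a later kernel or server
could offer something else without changing the exchange itself.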

> >> That's probably a good idea.  I was working towards having the
> >> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
> >> include:
> >>
> >>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
> >>    really sure this would be needed.)
> >
> > Yes, I think you need extra inarg.
> > Why would it not be needed?
> > The problem is that you cannot know if the parent node id in the lookup
> > command is stale after server restart.
>
> Ah, of course.  Hence the need for this extra inarg.
>
> > The thing is that the kernel fuse inode will need to store the file handle,
> > much the same as an NFS client stores the file handle provided by the
> > NFS server.
> >
> > FYI, fanotify has an optimized way to store file handles in
> > struct fanotify_fid_event - small file handles are stored inline
> > and larger file handles can use an external buffer.
> >
> > But fuse does not need to support any size of file handles.
> > For first version we could definitely simplify things by limiting the size
> > of supported file handles, because server and client need to negotiate
> > the max file handle size anyway.
>
> I'll definitely need to have a look at how fanotify does that.  But I
> guess that if my simplistic approach with a static array is acceptable for
> now, I'll stick with it for the initial attempt to implement this, and
> eventually revisit it later to do something more clever.
>

What you proposed is the extension of fuse_entry_out in the fuse
protocol.

My reference to fanotify_fid_event is meant to explain how to store
a file handle in the cached fuse_inode, because the fuse_inode_cachep
cannot have variable sized inodes and in most cases a short
inline file handle should be enough.

Therefore, if you limit the support in the first version to something like
FANOTIFY_INLINE_FH_LEN, you can always store the file handle
in fuse_inode and postpone support for bigger file handles to later.
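
The inline/external split can be sketched in user space as follows.  This
is not the fanotify or fuse code; struct fh_store, its helpers and the
INLINE_FH_LEN value are hypothetical and only show the general shape.
The sketch assumes the store starts out zeroed and is freed before being
set again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_FH_LEN 12	/* illustrative value, not taken from the kernel */

/* Fixed-size container: short handles live inline, long ones on the heap. */
struct fh_store {
	uint8_t len;
	union {
		unsigned char inl[INLINE_FH_LEN];
		unsigned char *ext;
	} u;
};

/* Copy a handle in; allocates only when it does not fit inline. */
static bool fh_store_set(struct fh_store *s, const void *buf, uint8_t len)
{
	unsigned char *dst = s->u.inl;

	if (len > INLINE_FH_LEN) {
		dst = malloc(len);
		if (!dst)
			return false;
		s->u.ext = dst;
	}
	memcpy(dst, buf, len);
	s->len = len;
	return true;
}

/* Return a pointer to the stored bytes, wherever they live. */
static const unsigned char *fh_store_get(const struct fh_store *s)
{
	return s->len > INLINE_FH_LEN ? s->u.ext : s->u.inl;
}

/* Release the external buffer, if any, and mark the store empty. */
static void fh_store_free(struct fh_store *s)
{
	if (s->len > INLINE_FH_LEN)
		free(s->u.ext);
	s->len = 0;
}
```

Limiting the first version to handles that fit inline would amount to
rejecting anything longer than INLINE_FH_LEN and dropping the external
branch entirely.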

Thanks,
Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 15:30                                     ` Amir Goldstein
@ 2025-11-05 21:38                                       ` Darrick J. Wong
  2025-11-05 21:46                                         ` Bernd Schubert
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-11-05 21:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi,
	Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen,
	Matt Harvey

On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote:
> On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote:
> >
> > Hi Amir,
> >
> > On Wed, Nov 05 2025, Amir Goldstein wrote:
> >
> > > On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:
> >
> > <...>
> >
> > >> > fuse_entry_out was extended once and fuse_reply_entry()
> > >> > sends the size of the struct.
> > >>
> > >> So, if I'm understanding you correctly, you're suggesting to extend
> > >> fuse_entry_out to add the new handle (a 'size' field + the actual handle).
> > >
> > > Well it depends...
> > >
> > > There are several ways to do it.
> > > I would really like to get Miklos and Bernd's opinion on the preferred way.
> >
> > Sure, all feedback is welcome!
> >
> > > So far, it looks like the client determines the size of the output args.
> > >
> > > If we want the server to be able to write a different file handle size
> > > per inode that's going to be a bigger challenge.
> > >
> > > I think it's plenty enough if server and client negotiate a max file handle
> > > size and then the client always reserves enough space in the output
> > > args buffer.
> > >
> > > One more thing to ask is what is "the actual handle".
> > > If "the actual handle" is the variable sized struct file_handle then
> > > the size is already available in the file handle header.
> >
> > Actually, this is exactly what I was trying to mimic for my initial
> > attempt.  However, I was not going to do any size negotiation but instead
> > define a maximum size for the handle.  See below.
> >
> > > If it is not, then I think some sort of type or version of the file handles
> > > encoding should be negotiated beyond the max handle size.
> >
> > In my initial stab at this I was going to take a very simple approach and
> > hard-code a maximum size for the handle.  This would have the advantage of
> > allowing the server to use different sizes for different inodes (though
> > I'm not sure how useful that would be in practice).  So, in summary, I
> > would define the new handle like this:
> >
> > /* Same value as MAX_HANDLE_SZ */
> > #define FUSE_MAX_HANDLE_SZ 128
> >
> > struct fuse_file_handle {
> >         uint32_t        size;
> >         uint32_t        padding;
> 
> I think that the handle type is going to be relevant as well.
> 
> >         char            handle[FUSE_MAX_HANDLE_SZ];
> > };
> >
> > and this struct would be included in fuse_entry_out.
> >
> > There's probably a problem with having this (big) fixed size increase to
> > fuse_entry_out, but maybe that could be fixed once I have all the other
> > details sorted out.  Hopefully I'm not oversimplifying the problem,
> > skipping the need for negotiating a handle size.
> >
> 
> Maybe this fixed size is reasonable for the first version of FUSE protocol
> as long as this overhead is NOT added if the server does not opt-in for the
> feature.
> 
> IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0,
> but keep the negotiation protocol extendable to another value later on.
> 
> > >> That's probably a good idea.  I was working towards having the
> > >> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
> > >> include:
> > >>
> > >>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
> > >>    really sure this would be needed.)
> > >
> > > Yes, I think you need extra inarg.
> > > Why would it not be needed?
> > > The problem is that you cannot know if the parent node id in the lookup
> > > command is stale after server restart.
> >
> > Ah, of course.  Hence the need for this extra inarg.
> >
> > > The thing is that the kernel fuse inode will need to store the file handle,
> > > much the same as an NFS client stores the file handle provided by the
> > > NFS server.
> > >
> > > FYI, fanotify has an optimized way to store file handles in
> > > struct fanotify_fid_event - small file handles are stored inline
> > > and larger file handles can use an external buffer.
> > >
> > > But fuse does not need to support any size of file handles.
> > > For first version we could definitely simplify things by limiting the size
> > > of supported file handles, because server and client need to negotiate
> > > the max file handle size anyway.
> >
> > I'll definitely need to have a look at how fanotify does that.  But I
> > guess that if my simplistic approach with a static array is acceptable for
> > now, I'll stick with it for the initial attempt to implement this, and
> > eventually revisit it later to do something more clever.
> >
> 
> What you proposed is the extension of fuse_entry_out for fuse
> protocol.
> 
> My reference to fanotify_fid_event is meant to explain how to encode
> a file handle in fuse_inode in cache, because the fuse_inode_cachep
> cannot have variable sized inodes and in most of the cases, a short
> inline file handle should be enough.
> 
> Therefore, if you limit the support in the first version to something like
> FANOTIFY_INLINE_FH_LEN, you can always store the file handle
> in fuse_inode and postpone support for bigger file handles to later.

I suggest that you also provide a way for the fuse server to tell the
kernel that it can construct its own handles from {fuse_inode::nodeid,
inode::i_generation}, for servers that want something more efficient
than uploading 128-byte blobs.
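
To make the idea concrete, here is a minimal sketch of what such a compact
handle could look like.  The struct layout and the helper names are
hypothetical, nothing like this exists in the FUSE protocol today; the only
point is that a {nodeid, generation} pair fits in 16 bytes and that a
generation mismatch is enough to detect a recycled inode number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, not part of the FUSE protocol. */
struct compact_handle {
	uint64_t nodeid;	/* mirrors fuse_inode::nodeid */
	uint32_t generation;	/* mirrors inode::i_generation */
	uint32_t padding;	/* keeps the struct at 16 bytes */
};

/* Encode into buf; returns the bytes written, or 0 if buf is too small. */
static size_t compact_handle_encode(uint64_t nodeid, uint32_t generation,
				    unsigned char *buf, size_t buflen)
{
	struct compact_handle h = {
		.nodeid = nodeid,
		.generation = generation,
		.padding = 0,
	};

	if (buflen < sizeof(h))
		return 0;
	memcpy(buf, &h, sizeof(h));
	return sizeof(h);
}

/* Returns 1 if the stored handle still names {nodeid, generation}. */
static int compact_handle_matches(const unsigned char *buf, size_t len,
				  uint64_t nodeid, uint32_t generation)
{
	struct compact_handle h;

	if (len != sizeof(h))
		return 0;
	memcpy(&h, buf, sizeof(h));
	return h.nodeid == nodeid && h.generation == generation;
}
```

With something like this, the "inode 17 got reallocated" case shows up as a
generation mismatch, and the handle is an eighth of a 128-byte blob.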

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 21:38                                       ` Darrick J. Wong
@ 2025-11-05 21:46                                         ` Bernd Schubert
  2025-11-05 22:06                                           ` Bernd Schubert
  0 siblings, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-11-05 21:46 UTC (permalink / raw)
  To: Darrick J. Wong, Amir Goldstein
  Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi,
	linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey



On 11/5/25 22:38, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote:
>> On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote:
>>>
>>> Hi Amir,
>>>
>>> On Wed, Nov 05 2025, Amir Goldstein wrote:
>>>
>>>> On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:
>>>
>>> <...>
>>>
>>>>>> fuse_entry_out was extended once and fuse_reply_entry()
>>>>>> sends the size of the struct.
>>>>>
>>>>> So, if I'm understanding you correctly, you're suggesting to extend
>>>>> fuse_entry_out to add the new handle (a 'size' field + the actual handle).
>>>>
>>>> Well it depends...
>>>>
>>>> There are several ways to do it.
>>>> I would really like to get Miklos and Bernd's opinion on the preferred way.
>>>
>>> Sure, all feedback is welcome!
>>>
>>>> So far, it looks like the client determines the size of the output args.
>>>>
>>>> If we want the server to be able to write a different file handle size
>>>> per inode that's going to be a bigger challenge.
>>>>
>>>> I think it's plenty enough if server and client negotiate a max file handle
>>>> size and then the client always reserves enough space in the output
>>>> args buffer.
>>>>
>>>> One more thing to ask is what is "the actual handle".
>>>> If "the actual handle" is the variable sized struct file_handle then
>>>> the size is already available in the file handle header.
>>>
>>> Actually, this is exactly what I was trying to mimic for my initial
>>> attempt.  However, I was not going to do any size negotiation but instead
>>> define a maximum size for the handle.  See below.
>>>
>>>> If it is not, then I think some sort of type or version of the file handles
>>>> encoding should be negotiated beyond the max handle size.
>>>
>>> In my initial stab at this I was going to take a very simple approach and
>>> hard-code a maximum size for the handle.  This would have the advantage of
>>> allowing the server to use different sizes for different inodes (though
>>> I'm not sure how useful that would be in practice).  So, in summary, I
>>> would define the new handle like this:
>>>
>>> /* Same value as MAX_HANDLE_SZ */
>>> #define FUSE_MAX_HANDLE_SZ 128
>>>
>>> struct fuse_file_handle {
>>>         uint32_t        size;
>>>         uint32_t        padding;
>>
>> I think that the handle type is going to be relevant as well.
>>
>>>         char            handle[FUSE_MAX_HANDLE_SZ];
>>> };
>>>
>>> and this struct would be included in fuse_entry_out.
>>>
>>> There's probably a problem with having this (big) fixed size increase to
>>> fuse_entry_out, but maybe that could be fixed once I have all the other
>>> details sorted out.  Hopefully I'm not oversimplifying the problem,
>>> skipping the need for negotiating a handle size.
>>>
>>
>> Maybe this fixed size is reasonable for the first version of FUSE protocol
>> as long as this overhead is NOT added if the server does not opt-in for the
>> feature.
>>
>> IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0,
>> but keep the negotiation protocol extendable to another value later on.
>>
>>>>> That's probably a good idea.  I was working towards having the
>>>>> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
>>>>> include:
>>>>>
>>>>>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
>>>>>    really sure this would be needed.)
>>>>
>>>> Yes, I think you need extra inarg.
>>>> Why would it not be needed?
>>>> The problem is that you cannot know if the parent node id in the lookup
>>>> command is stale after server restart.
>>>
>>> Ah, of course.  Hence the need for this extra inarg.
>>>
>>>> The thing is that the kernel fuse inode will need to store the file handle,
>>>> much the same as an NFS client stores the file handle provided by the
>>>> NFS server.
>>>>
>>>> FYI, fanotify has an optimized way to store file handles in
>>>> struct fanotify_fid_event - small file handles are stored inline
>>>> and larger file handles can use an external buffer.
>>>>
>>>> But fuse does not need to support any size of file handles.
>>>> For first version we could definitely simplify things by limiting the size
>>>> of supported file handles, because server and client need to negotiate
>>>> the max file handle size anyway.
>>>
>>> I'll definitely need to have a look at how fanotify does that.  But I
>>> guess that if my simplistic approach with a static array is acceptable for
>>> now, I'll stick with it for the initial attempt to implement this, and
>>> eventually revisit it later to do something more clever.
>>>
>>
>> What you proposed is the extension of fuse_entry_out for fuse
>> protocol.
>>
>> My reference to fanotify_fid_event is meant to explain how to encode
>> a file handle in fuse_inode in cache, because the fuse_inode_cachep
>> cannot have variable sized inodes and in most of the cases, a short
>> inline file handle should be enough.
>>
>> Therefore, if you limit the support in the first version to something like
>> FANOTIFY_INLINE_FH_LEN, you can always store the file handle
>> in fuse_inode and postpone support for bigger file handles to later.
> 
> I suggest that you also provide a way for the fuse server to tell the
> kernel that it can construct its own handles from {fuse_inode::nodeid,
> inode::i_generation} if they want something more efficient than
> uploading 128b blobs.

Isn't that covered by the handle size defined in the FUSE_INIT reply?
I.e. the handle size would be 0 bytes in this case?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 21:46                                         ` Bernd Schubert
@ 2025-11-05 22:06                                           ` Bernd Schubert
  0 siblings, 0 replies; 46+ messages in thread
From: Bernd Schubert @ 2025-11-05 22:06 UTC (permalink / raw)
  To: Bernd Schubert, Darrick J. Wong, Amir Goldstein
  Cc: Luis Henriques, Theodore Ts'o, Miklos Szeredi, linux-fsdevel,
	linux-kernel, Kevin Chen, Matt Harvey



On 11/5/25 22:46, Bernd Schubert wrote:
> 
> 
> On 11/5/25 22:38, Darrick J. Wong wrote:
>> On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote:
>>> On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote:
>>>>
>>>> Hi Amir,
>>>>
>>>> On Wed, Nov 05 2025, Amir Goldstein wrote:
>>>>
>>>>> On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote:
>>>>
>>>> <...>
>>>>
>>>>>>> fuse_entry_out was extended once and fuse_reply_entry()
>>>>>>> sends the size of the struct.
>>>>>>
>>>>>> So, if I'm understanding you correctly, you're suggesting to extend
>>>>>> fuse_entry_out to add the new handle (a 'size' field + the actual handle).
>>>>>
>>>>> Well it depends...
>>>>>
>>>>> There are several ways to do it.
>>>>> I would really like to get Miklos and Bernd's opinion on the preferred way.
>>>>
>>>> Sure, all feedback is welcome!
>>>>
>>>>> So far, it looks like the client determines the size of the output args.
>>>>>
>>>>> If we want the server to be able to write a different file handle size
>>>>> per inode that's going to be a bigger challenge.
>>>>>
>>>>> I think it's plenty enough if server and client negotiate a max file handle
>>>>> size and then the client always reserves enough space in the output
>>>>> args buffer.
>>>>>
>>>>> One more thing to ask is what is "the actual handle".
>>>>> If "the actual handle" is the variable sized struct file_handle then
>>>>> the size is already available in the file handle header.
>>>>
>>>> Actually, this is exactly what I was trying to mimic for my initial
>>>> attempt.  However, I was not going to do any size negotiation but instead
>>>> define a maximum size for the handle.  See below.
>>>>
>>>>> If it is not, then I think some sort of type or version of the file handles
>>>>> encoding should be negotiated beyond the max handle size.
>>>>
>>>> In my initial stab at this I was going to take a very simple approach and
>>>> hard-code a maximum size for the handle.  This would have the advantage of
>>>> allowing the server to use different sizes for different inodes (though
>>>> I'm not sure how useful that would be in practice).  So, in summary, I
>>>> would define the new handle like this:
>>>>
>>>> /* Same value as MAX_HANDLE_SZ */
>>>> #define FUSE_MAX_HANDLE_SZ 128
>>>>
>>>> struct fuse_file_handle {
>>>>         uint32_t        size;
>>>>         uint32_t        padding;
>>>
>>> I think that the handle type is going to be relevant as well.
>>>
>>>>         char            handle[FUSE_MAX_HANDLE_SZ];
>>>> };
>>>>
>>>> and this struct would be included in fuse_entry_out.
>>>>
>>>> There's probably a problem with having this (big) fixed size increase to
>>>> fuse_entry_out, but maybe that could be fixed once I have all the other
>>>> details sorted out.  Hopefully I'm not oversimplifying the problem,
>>>> skipping the need for negotiating a handle size.
>>>>
>>>
>>> Maybe this fixed size is reasonable for the first version of FUSE protocol
>>> as long as this overhead is NOT added if the server does not opt-in for the
>>> feature.
>>>
>>> IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0,
>>> but keep the negotiation protocol extendable to another value later on.
>>>
>>>>>> That's probably a good idea.  I was working towards having the
>>>>>> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
>>>>>> include:
>>>>>>
>>>>>>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
>>>>>>    really sure this would be needed.)
>>>>>
>>>>> Yes, I think you need extra inarg.
>>>>> Why would it not be needed?
>>>>> The problem is that you cannot know if the parent node id in the lookup
>>>>> command is stale after server restart.
>>>>
>>>> Ah, of course.  Hence the need for this extra inarg.
>>>>
>>>>> The thing is that the kernel fuse inode will need to store the file handle,
>>>>> much the same as an NFS client stores the file handle provided by the
>>>>> NFS server.
>>>>>
>>>>> FYI, fanotify has an optimized way to store file handles in
>>>>> struct fanotify_fid_event - small file handles are stored inline
>>>>> and larger file handles can use an external buffer.
>>>>>
>>>>> But fuse does not need to support any size of file handles.
>>>>> For first version we could definitely simplify things by limiting the size
>>>>> of supported file handles, because server and client need to negotiate
>>>>> the max file handle size anyway.
>>>>
>>>> I'll definitely need to have a look at how fanotify does that.  But I
>>>> guess that if my simplistic approach with a static array is acceptable for
>>>> now, I'll stick with it for the initial attempt to implement this, and
>>>> eventually revisit it later to do something more clever.
>>>>
>>>
>>> What you proposed is the extension of fuse_entry_out for fuse
>>> protocol.
>>>
>>> My reference to fanotify_fid_event is meant to explain how to encode
>>> a file handle in fuse_inode in cache, because the fuse_inode_cachep
>>> cannot have variable sized inodes and in most of the cases, a short
>>> inline file handle should be enough.
>>>
>>> Therefore, if you limit the support in the first version to something like
>>> FANOTIFY_INLINE_FH_LEN, you can always store the file handle
>>> in fuse_inode and postpone support for bigger file handles to later.
>>
>> I suggest that you also provide a way for the fuse server to tell the
>> kernel that it can construct its own handles from {fuse_inode::nodeid,
>> inode::i_generation} if they want something more efficient than
>> uploading 128b blobs.
> 
> Isn't that covered by handle size defined in FUSE_INIT reply? I.e.
> handle size would be 0B in this case? 

Sorry, my fault. Yes, this needs a special flag.
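
One way to read this: a handle size of 0 on its own cannot distinguish "no
handle support" from "the kernel builds handles from nodeid and
generation", so a separate flag is needed.  A rough sketch of how the
negotiated values could be interpreted follows.  The flag, the struct and
the function name are all made up for illustration; only
FUSE_MAX_HANDLE_SZ comes from the proposal quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_MAX_HANDLE_SZ 128	/* value proposed earlier in this thread */

/* Hypothetical flag: server asks the kernel to build handles itself. */
#define SKETCH_HANDLE_FROM_NODEID (1u << 0)

/* Hypothetical subset of an INIT reply; not the real fuse_init_out. */
struct sketch_init_reply {
	uint32_t max_handle_size;	/* 0 means no opaque handles */
	uint32_t handle_flags;
};

enum sketch_handle_mode {
	HANDLE_NONE,		/* feature not negotiated */
	HANDLE_FROM_NODEID,	/* kernel uses {nodeid, generation} */
	HANDLE_OPAQUE,		/* server uploads handle blobs */
	HANDLE_INVALID,		/* contradictory or oversized reply */
};

static enum sketch_handle_mode
sketch_pick_mode(const struct sketch_init_reply *r)
{
	if (r->handle_flags & SKETCH_HANDLE_FROM_NODEID)
		/* The flag only makes sense without an opaque handle size. */
		return r->max_handle_size == 0 ? HANDLE_FROM_NODEID
					       : HANDLE_INVALID;
	if (r->max_handle_size == 0)
		return HANDLE_NONE;
	if (r->max_handle_size > FUSE_MAX_HANDLE_SZ)
		return HANDLE_INVALID;
	return HANDLE_OPAQUE;
}
```

A first version could restrict max_handle_size to exactly 0 or
FUSE_MAX_HANDLE_SZ, as Amir suggested, and still keep this shape.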

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-04 13:10                             ` Amir Goldstein
  2025-11-04 14:52                               ` Luis Henriques
@ 2025-11-05 22:24                               ` Bernd Schubert
  2025-11-05 22:42                                 ` Darrick J. Wong
  1 sibling, 1 reply; 46+ messages in thread
From: Bernd Schubert @ 2025-11-05 22:24 UTC (permalink / raw)
  To: Amir Goldstein, Luis Henriques
  Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen



On 11/4/25 14:10, Amir Goldstein wrote:
> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
>>
>> On Tue, Sep 16 2025, Amir Goldstein wrote:
>>
>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>>>
>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/15/25 09:07, Amir Goldstein wrote:
>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote:
>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>>>>>>>>> aren't totally crazy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>>>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's
>>>>>>>>>>>>>> potentially going to be out of sync, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>>>>>>>>
>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>>>>>>>>
>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>>>>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>>>>>>>>> do that in the reverse order; I only know enough about the design of fuse
>>>>>>>>>>>>> to suspect that to be true.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>>>>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>>>>>>>>
>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>>>>>>>>> but probably GETATTR is a better option.
>>>>>>>>>>>>
>>>>>>>>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>>>>>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>>>>>>>>> look at fuse2fs too.
>>>>>>>>>>>
>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>>>>>>>>> DDN side.
>>>>>>>>>>>
>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count.
>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a
>>>>>>>>>>> 64-bit node-id - which is used by most fuse applications as a memory
>>>>>>>>>>> pointer.
>>>>>>>>>>>
>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests
>>>>>>>>>>> with invalid node-IDs, that are cast and might provoke random memory
>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>>>>>>>>> open_by_handle_at doesn't work well right now.
>>>>>>>>>>>
>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for
>>>>>>>>>>> NFS export.
>>>>>>>>>>>
>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>>>>>>>>> Adding Amir to CC.
>>>>>>>>>>
>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>>>>>>>
>>>>>>>>> Thanks for the reference Amir! I even had been in that thread.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>>>>>>>>> Any objections against that?
>>>>>>>>
>>>>>>>> What if you actually /can/ reuse a nodeid after a restart?  Consider
>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>>>>>>>> didn't delete it, obviously.
>>>>>>>
>>>>>>> FUSE_LOOKUP_HANDLE is a contract.
>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign
>>>>>>> this contract, otherwise there is no way for client to know that the
>>>>>>> nodeids are persistent.
>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
>>>>>>> API trivial.
>>>>>>>
>>>>>>>>
>>>>>>>> I suppose you could just ask for refreshed stat information and either
>>>>>>>> the server gives it to you and the fuse_inode lives; or the server
>>>>>>>> returns ENOENT and then we mark it bad.  But I'd have to see code
>>>>>>>> patches to form a real opinion.
>>>>>>>>
>>>>>>>
>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
>>>>>>> where fuse_instance_id can be its start time or random number.
>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be
>>>>>>> a native part of FUSE protocol so that client knows to only invalidate
>>>>>>> attr cache in case of fuse_instance_id change?
>>>>>>>
>>>>>>> In any case, instead of a storm of revalidate messages after
>>>>>>> server restart, do it lazily on demand.
>>>>>>
>>>>>> For a network file system, probably. For fuse4fs or other block
>>>>>> based file systems, not sure. Darrick has the example of fsck.
>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
>>>>>> fuse-server gets restarted, fsck'ed and some files get removed.
>>>>>> Now reading these inodes would still work - wouldn't it
>>>>>> be better to invalidate the cache before going into operation
>>>>>> again?
>>>>>
>>>>> Forgive me, I was making a wrong assumption that fuse4fs
>>>>> was using ext4 filehandle as nodeid, but of course it does not.
>>>>
>>>> Well now that you mention it, there /is/ a risk of shenanigans like
>>>> that.  Consider:
>>>>
>>>> 1) fuse4fs mount an ext4 filesystem
>>>> 2) crash the fuse4fs server
>>>> <fuse4fs server restart stalls...>
>>>> 3) e2fsck -fy /dev/XXX deletes inode 17
>>>> 4) someone else mounts the fs, makes some changes that result in 17
>>>>    being reallocated, user says "OOOOOPS", unmounts it
>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel
>>>>
>>>> Hey, inode 17 is now a different file!!
>>>>
>>>> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
>>>> everything's (potentially) fine because fuse4fs supplied i_generation to
>>>> the kernel, and fuse_stale_inode will mark it bad if that happens.
>>>>
>>>> Hm ok then, at least there's a way out. :)
>>>>
>>>
>>> Right.
>>>
>>>>> The reason I made this wrong assumption is because fuse4fs *can*
>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
>>>>> which is what my fuse passthrough library [1] does.
>>>>>
>>>>> My claim was that although fuse4fs could support safe restart, which
>>>>> cannot read from recycled inode number with current FUSE protocol,
>>>>> doing so with FUSE_HANDLE protocol would express a commitment
>>>>
>>>> Pardon my naïvete, but what is FUSE_HANDLE?
>>>>
>>>> $ git grep -w FUSE_HANDLE fs
>>>> $
>>>
>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>
>>> Which means to communicate a variable sized "nodeid"
>>> which can also be declared as an object id that survives server restart.
>>>
>>> Basically, the reason that I brought up LOOKUP_HANDLE is to
>>> properly support NFS export of fuse filesystems.
>>>
>>> My incentive was to support a proper fuse server restart/remount/re-export
>>> with the same fsid in /etc/exports, but this gives us a better starting point
>>> for fuse server restart/re-connect.
>>
>> Sorry for resurrecting (again!) this discussion.  I've been thinking about
>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
>> However, I feel there are other operations that will need to return this
>> new handle.
>>
>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
>> Doesn't this mean that, if the user-space server supports the new
>> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
>> request?
> 
> Yes, I think that's what it means.
> 
>> The same question applies for TMPFILE, LINK, etc.  Or is there
>> something special about the LOOKUP operation that I'm missing?
>>
> 
> Any command returning fuse_entry_out.
> 
> READDIRPLUS, MKNOD, MKDIR, SYMLINK

Btw, check out <libfuse>/doc/libfuse-operations.txt for these
things. It needs double checking, though, since the file was mostly
created by AI (I just added a correction today). With that file it is
easy to spot the missing FUSE_TMPFILE.


> 
> fuse_entry_out was extended once and fuse_reply_entry()
> sends the size of the struct.

Sorry, I'm confused. Where does fuse_reply_entry() send the size?

> However fuse_reply_create() sends it with fuse_open_out
> appended and fuse_add_direntry_plus() does not seem to write
> record size at all, so server and client will need to agree on the
> size of fuse_entry_out and this would need to be backward compat.
> If both server and client declare support for FUSE_LOOKUP_HANDLE
> it should be fine (?).

If the max handle size becomes a value in fuse_init_out, server and
client would use it? I think the appended fuse_open_out could just
follow the actual (dynamic) size of the handle; the code that
serializes/deserializes the response then has to look up the actual
handle size. For example, I wouldn't know what handle size to put in
for any of the example/passthrough* file systems: it would need to be
128B, but the actual size will typically be much smaller.
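
To illustrate what I mean by following the actual size, here is a rough
sketch of writing and parsing a variable-sized handle inside a reply
buffer.  The header struct and the function names are hypothetical and not
taken from libfuse or the kernel; the 8-byte rounding is my assumption, so
that whatever gets appended afterwards (fuse_open_out, for instance) stays
64-bit aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FUSE_MAX_HANDLE_SZ 128	/* value proposed earlier in this thread */

/* Hypothetical wire header placed in front of the handle bytes. */
struct sketch_handle_hdr {
	uint32_t size;	/* actual handle length in bytes */
	uint32_t type;	/* handle encoding, as Amir asked for */
};

/* Round up to 8 bytes so the following struct stays 64-bit aligned. */
static size_t sketch_align8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Write header + handle at buf[off]; returns the new offset, 0 on error. */
static size_t sketch_put_handle(unsigned char *buf, size_t buflen, size_t off,
				uint32_t type, const void *handle,
				uint32_t size)
{
	struct sketch_handle_hdr hdr = { .size = size, .type = type };
	size_t need = sizeof(hdr) + sketch_align8(size);

	if (size > FUSE_MAX_HANDLE_SZ || off > buflen || buflen - off < need)
		return 0;
	memcpy(buf + off, &hdr, sizeof(hdr));
	memset(buf + off + sizeof(hdr), 0, sketch_align8(size));
	if (size)
		memcpy(buf + off + sizeof(hdr), handle, size);
	return off + need;
}

/* Parse a handle at buf[off]; returns the offset of what follows it. */
static size_t sketch_get_handle(const unsigned char *buf, size_t buflen,
				size_t off, struct sketch_handle_hdr *hdr,
				const unsigned char **handle)
{
	if (off > buflen || buflen - off < sizeof(*hdr))
		return 0;
	memcpy(hdr, buf + off, sizeof(*hdr));
	if (hdr->size > FUSE_MAX_HANDLE_SZ)
		return 0;
	if (buflen - off - sizeof(*hdr) < sketch_align8(hdr->size))
		return 0;
	*handle = buf + off + sizeof(*hdr);
	return off + sizeof(*hdr) + sketch_align8(hdr->size);
}
```

The parser only needs the negotiated maximum as an upper bound; a
passthrough server with a small handle would pay for its real size on each
reply rather than 128 bytes.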


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 22:24                               ` Bernd Schubert
@ 2025-11-05 22:42                                 ` Darrick J. Wong
  2025-11-05 22:48                                   ` Bernd Schubert
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2025-11-05 22:42 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen

On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote:
> 
> 
> On 11/4/25 14:10, Amir Goldstein wrote:
> > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
> >>
> >> On Tue, Sep 16 2025, Amir Goldstein wrote:
> >>
> >>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>>
> >>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> >>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 9/15/25 09:07, Amir Goldstein wrote:
> >>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote:
> >>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>>>>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>>>>>>>>>>>>> aren't totally crazy.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data
> >>>>>>>>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's
> >>>>>>>>>>>>>> potentially going to be out of sync, right?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <echoing what we said on the ext4 call this morning>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests
> >>>>>>>>>>>>> that were pending at the time.  It might be the case that you have to
> >>>>>>>>>>>>> do that in the reverse order; I only know enough about the design of fuse
> >>>>>>>>>>>>> to suspect that to be true.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with
> >>>>>>>>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >>>>>>>>>>>> but probably GETATTR is a better option.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, are you currently working on any of this?  Are you implementing this
> >>>>>>>>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >>>>>>>>>>>> look at fuse2fs too.
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >>>>>>>>>>> DDN side.
> >>>>>>>>>>>
> >>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count.
> >>>>>>>>>>> Now inode recovery might be hard, because we currently only have a
> >>>>>>>>>>> 64-bit node-id - which is used by most fuse applications as a memory
> >>>>>>>>>>> pointer.
> >>>>>>>>>>>
> >>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests
> >>>>>>>>>>> with invalid node-IDs, that are cast and might provoke random memory
> >>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or
> >>>>>>>>>>> open_by_handle_at doesn't work well right now.
> >>>>>>>>>>>
> >>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart.
> >>>>>>>>>>> The file handles could be stored into the fuse inode and also used for
> >>>>>>>>>>> NFS export.
> >>>>>>>>>>>
> >>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >>>>>>>>>>> Adding Amir to CC.
> >>>>>>>>>>
> >>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >>>>>>>>>
> >>>>>>>>> Thanks for the reference Amir! I even had been in that thread.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >>>>>>>>>>> Any objections against that?
> >>>>>>>>
> >>>>>>>> What if you actually /can/ reuse a nodeid after a restart?  Consider
> >>>>>>>> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> >>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >>>>>>>> didn't delete it, obviously.
> >>>>>>>
> >>>>>>> FUSE_LOOKUP_HANDLE is a contract.
> >>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign
> >>>>>>> this contract, otherwise there is no way for client to know that the
> >>>>>>> nodeids are persistent.
> >>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> >>>>>>> API trivial.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I suppose you could just ask for refreshed stat information and either
> >>>>>>>> the server gives it to you and the fuse_inode lives; or the server
> >>>>>>>> returns ENOENT and then we mark it bad.  But I'd have to see code
> >>>>>>>> patches to form a real opinion.
> >>>>>>>>
> >>>>>>>
> >>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> >>>>>>> where fuse_instance_id can be its start time or random number.
> >>>>>>> for auto invalidate, or maybe the fuse_instance_id should be
> >>>>>>> a native part of FUSE protocol so that client knows to only invalidate
> >>>>>>> attr cache in case of fuse_instance_id change?
> >>>>>>>
> >>>>>>> In any case, instead of a storm of revalidate messages after
> >>>>>>> server restart, do it lazily on demand.
> >>>>>>
> >>>>>> For a network file system, probably. For fuse4fs or other block
> >>>>>> based file systems, not sure. Darrick has the example of fsck.
> >>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> >>>>>> fuse-server gets restarted, fsck'ed and some files get removed.
> >>>>>> Now reading these inodes would still work - wouldn't it
> >>>>>> be better to invalidate the cache before going into operation
> >>>>>> again?
> >>>>>
> >>>>> Forgive me, I was making a wrong assumption that fuse4fs
> >>>>> was using ext4 filehandle as nodeid, but of course it does not.
> >>>>
> >>>> Well now that you mention it, there /is/ a risk of shenanigans like
> >>>> that.  Consider:
> >>>>
> >>>> 1) fuse4fs mount an ext4 filesystem
> >>>> 2) crash the fuse4fs server
> >>>> <fuse4fs server restart stalls...>
> >>>> 3) e2fsck -fy /dev/XXX deletes inode 17
> >>>> 4) someone else mounts the fs, makes some changes that result in 17
> >>>>    being reallocated, user says "OOOOOPS", unmounts it
> >>>> 5) fuse4fs server finally restarts, and reconnects to the kernel
> >>>>
> >>>> Hey, inode 17 is now a different file!!
> >>>>
> >>>> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> >>>> everything's (potentially) fine because fuse4fs supplied i_generation to
> >>>> the kernel, and fuse_stale_inode will mark it bad if that happens.
> >>>>
> >>>> Hm ok then, at least there's a way out. :)
> >>>>
> >>>
> >>> Right.
> >>>
> >>>>> The reason I made this wrong assumption is because fuse4fs *can*
> >>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> >>>>> which is what my fuse passthrough library [1] does.
> >>>>>
> >>>>> My claim was that although fuse4fs could support safe restart, which
> >>>>> cannot read from recycled inode number with current FUSE protocol,
> >>>>> doing so with FUSE_HANDLE protocol would express a commitment
> >>>>
> >>>> Pardon my naïveté, but what is FUSE_HANDLE?
> >>>>
> >>>> $ git grep -w FUSE_HANDLE fs
> >>>> $
> >>>
> >>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >>>
> >>> Which means to communicate a variable sized "nodeid"
> >>> which can also be declared as an object id that survives server restart.
> >>>
> >>> Basically, the reason that I brought up LOOKUP_HANDLE is to
> >>> properly support NFS export of fuse filesystems.
> >>>
> >>> My incentive was to support a proper fuse server restart/remount/re-export
> >>> with the same fsid in /etc/exports, but this gives us a better starting point
> >>> for fuse server restart/re-connect.
> >>
> >> Sorry for resurrecting (again!) this discussion.  I've been thinking about
> >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
> >> However, I feel there are other operations that will need to return this
> >> new handle.
> >>
> >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
> >> Doesn't this mean that, if the user-space server supports the new
> >> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
> >> request?
> > 
> > Yes, I think that's what it means.
> > 
> >> The same question applies for TMPFILE, LINK, etc.  Or is there
> >> something special about the LOOKUP operation that I'm missing?
> >>
> > 
> > Any command returning fuse_entry_out.
> > 
> > READDIRPLUS, MKNOD, MKDIR, SYMLINK
> 
> Btw, check out <libfuse>/doc/libfuse-operations.txt for these
> things. Double-check it, though; the file was mostly created by AI
> (I just added a correction today). With that it is easy to see the missing
> FUSE_TMPFILE.
> 
> 
> > 
> > fuse_entry_out was extended once and fuse_reply_entry()
> > sends the size of the struct.
> 
> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
> 
> > However fuse_reply_create() sends it with fuse_open_out
> > appended and fuse_add_direntry_plus() does not seem to write
> > record size at all, so server and client will need to agree on the
> > size of fuse_entry_out and this would need to be backward compat.
> > If both server and client declare support for FUSE_LOOKUP_HANDLE
> > it should be fine (?).
> 
> If max_handle size becomes a value in fuse_init_out, server and
> client would use it? I think appended fuse_open_out could just
> follow the dynamic actual size of the handle - code that
> serializes/deserializes the response has to look up the actual
> handle size then. For example I wouldn't know what to put in
> for any of the example/passthrough* file systems as handle size - 
> would need to be 128B, but the actual size will be typically
> much smaller.

name_to_handle_at ?

I guess the problem here is that technically speaking filesystems could
have variable sized handles depending on the file.  Sometimes you encode
just the ino/gen of the child file, but other times you might know the
parent and put that in the handle too.
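
A minimal sketch of what probing could look like, assuming a server
that passes through to a local filesystem: name_to_handle_at(2) called
with handle_bytes set to zero is documented to fail with EOVERFLOW and
report the size it needs.  The helper name probe_handle_size() is made
up for illustration.  The answer only holds for the path that was
probed, which is the variable-size problem described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Ask the kernel how many bytes the handle for `path` needs.
 * Returns the size, or -1 if the filesystem (or a sandbox) refuses.
 * Hypothetical helper, not part of libfuse.
 */
static int probe_handle_size(const char *path)
{
	struct file_handle probe = { .handle_bytes = 0 };
	int mount_id;

	/*
	 * With handle_bytes == 0 the call is expected to fail with
	 * EOVERFLOW and store the required size in handle_bytes.
	 */
	if (name_to_handle_at(AT_FDCWD, path, &probe, &mount_id, 0) == -1 &&
	    errno == EOVERFLOW)
		return (int)probe.handle_bytes;
	return -1;
}
```

Two files on the same filesystem may legitimately return different
sizes here, so a value probed at startup is at best a hint; only
MAX_HANDLE_SZ (128 bytes) is a safe upper bound.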

--D

> 
> Thanks,
> Bernd
> 

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 22:42                                 ` Darrick J. Wong
@ 2025-11-05 22:48                                   ` Bernd Schubert
  2025-11-06  0:21                                     ` Darrick J. Wong
  2025-11-06 10:13                                     ` Amir Goldstein
  0 siblings, 2 replies; 46+ messages in thread
From: Bernd Schubert @ 2025-11-05 22:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen



On 11/5/25 23:42, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote:
>>
>>
>> On 11/4/25 14:10, Amir Goldstein wrote:
>>> [...]
>>>
>>> fuse_entry_out was extended once and fuse_reply_entry()
>>> sends the size of the struct.
>>
>> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
>>
>>> However fuse_reply_create() sends it with fuse_open_out
>>> appended and fuse_add_direntry_plus() does not seem to write
>>> record size at all, so server and client will need to agree on the
>>> size of fuse_entry_out and this would need to be backward compat.
>>> If both server and client declare support for FUSE_LOOKUP_HANDLE
>>> it should be fine (?).
>>
>> If max_handle size becomes a value in fuse_init_out, server and
>> client would use it? I think appended fuse_open_out could just
>> follow the dynamic actual size of the handle - code that
>> serializes/deserializes the response has to look up the actual
>> handle size then. For example I wouldn't know what to put in
>> for any of the example/passthrough* file systems as handle size - 
>> would need to be 128B, but the actual size will be typically
>> much smaller.
> 
> name_to_handle_at ?
> 
> I guess the problem here is that technically speaking filesystems could
> have variable sized handles depending on the file.  Sometimes you encode
> just the ino/gen of the child file, but other times you might know the
> parent and put that in the handle too.

Yeah, I don't think it would be reliable for *all* file systems to use
name_to_handle_at on startup on some example file/directory. At least
not without knowing all the details of the underlying passthrough file
system.


Thanks,
Bernd

* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 22:48                                   ` Bernd Schubert
@ 2025-11-06  0:21                                     ` Darrick J. Wong
  2025-11-06 10:13                                     ` Amir Goldstein
  1 sibling, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-11-06  0:21 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen

On Wed, Nov 05, 2025 at 11:48:21PM +0100, Bernd Schubert wrote:
> 
> 
> On 11/5/25 23:42, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote:
> >>
> >>
> >> On 11/4/25 14:10, Amir Goldstein wrote:
> >>> [...]
> >>>
> >>> fuse_entry_out was extended once and fuse_reply_entry()
> >>> sends the size of the struct.
> >>
> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
> >>
> >>> However fuse_reply_create() sends it with fuse_open_out
> >>> appended and fuse_add_direntry_plus() does not seem to write
> >>> record size at all, so server and client will need to agree on the
> >>> size of fuse_entry_out and this would need to be backward compat.
> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
> >>> it should be fine (?).
> >>
> >> If max_handle size becomes a value in fuse_init_out, server and
> >> client would use it? I think appended fuse_open_out could just
> >> follow the dynamic actual size of the handle - code that
> >> serializes/deserializes the response has to look up the actual
> >> handle size then. For example I wouldn't know what to put in
> >> for any of the example/passthrough* file systems as handle size - 
> >> would need to be 128B, but the actual size will be typically
> >> much smaller.
> > 
> > name_to_handle_at ?
> > 
> > I guess the problem here is that technically speaking filesystems could
> > have variable sized handles depending on the file.  Sometimes you encode
> > just the ino/gen of the child file, but other times you might know the
> > parent and put that in the handle too.
> 
> Yeah, I don't think it would be reliable for *all* file systems to use
> name_to_handle_at on startup on some example file/directory. At least
> not without knowing all the details of the underlying passthrough file
> system.

I think if you can send arbitrarily sized outblobs back to the kernel
then it would be ok for a filesystem to have different handle sizes for
a file, just so long as it doesn't change during the lifetime of a file.
Obviously you couldn't then have a meaningful fs-wide max_handle_size.
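
A minimal sketch of what an arbitrarily sized handle blob could look like on
the wire, assuming a length-prefixed record similar in spirit to struct
file_handle. The demo_* names, the 8-byte padding, and the 128-byte cap are
assumptions made for illustration. They are not part of the FUSE protocol or
of libfuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_HANDLE_SZ 128	/* hypothetical cap, mirrors MAX_HANDLE_SZ */

/* Hypothetical record header: a length-prefixed opaque handle follows it. */
struct demo_handle_hdr {
	uint32_t len;	/* number of handle bytes after the header */
	uint32_t type;	/* server-defined format tag */
};

static_assert(sizeof(struct demo_handle_hdr) == 8, "header must be 8 bytes");

/* Bytes occupied by one record, padded to 8-byte alignment. */
static size_t demo_rec_size(uint32_t len)
{
	return (sizeof(struct demo_handle_hdr) + len + 7) & ~(size_t)7;
}

/* Write one record into buf; returns bytes written, or 0 if it does not fit. */
static size_t demo_put_handle(uint8_t *buf, size_t bufsz, uint32_t type,
			      const void *handle, uint32_t len)
{
	struct demo_handle_hdr hdr = { .len = len, .type = type };
	size_t need = demo_rec_size(len);

	if (len > DEMO_MAX_HANDLE_SZ || need > bufsz)
		return 0;
	memset(buf, 0, need);		/* zero the padding as well */
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), handle, len);
	return need;
}

/* Parse one record; returns bytes consumed, or 0 if the record is malformed. */
static size_t demo_get_handle(const uint8_t *buf, size_t bufsz,
			      struct demo_handle_hdr *hdr, const uint8_t **data)
{
	if (bufsz < sizeof(*hdr))
		return 0;
	memcpy(hdr, buf, sizeof(*hdr));
	if (hdr->len > DEMO_MAX_HANDLE_SZ || demo_rec_size(hdr->len) > bufsz)
		return 0;
	*data = buf + sizeof(*hdr);
	return demo_rec_size(hdr->len);
}
```

Because each record carries its own length, the reader never needs an
fs-wide max_handle_size to find the end of the handle; it only needs an upper
bound for sanity checking.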

--D

> 
> Thanks,
> Bernd
> 


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-05 22:48                                   ` Bernd Schubert
  2025-11-06  0:21                                     ` Darrick J. Wong
@ 2025-11-06 10:13                                     ` Amir Goldstein
  2025-11-06 15:12                                       ` Luis Henriques
  2025-11-06 15:49                                       ` Darrick J. Wong
  1 sibling, 2 replies; 46+ messages in thread
From: Amir Goldstein @ 2025-11-06 10:13 UTC (permalink / raw)
  To: Bernd Schubert, Darrick J. Wong
  Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi,
	linux-fsdevel, linux-kernel, Kevin Chen

[...]

> >>> fuse_entry_out was extended once and fuse_reply_entry()
> >>> sends the size of the struct.
> >>
> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?

Sorry, I meant to say that the reply size is variable.
The size is obviously determined at init time.

> >>
> >>> However fuse_reply_create() sends it with fuse_open_out
> >>> appended and fuse_add_direntry_plus() does not seem to write
> >>> record size at all, so server and client will need to agree on the
> >>> size of fuse_entry_out and this would need to be backward compat.
> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
> >>> it should be fine (?).
> >>
> >> If max_handle size becomes a value in fuse_init_out, server and
> >> client would use it? I think appended fuse_open_out could just
> >> follow the dynamic actual size of the handle - code that
> >> serializes/deserializes the response has to look up the actual
> >> handle size then. For example I wouldn't know what to put in
> >> for any of the example/passthrough* file systems as handle size -
> >> would need to be 128B, but the actual size will be typically
> >> much smaller.
> >
> > name_to_handle_at ?
> >
> > I guess the problem here is that technically speaking filesystems could
> > have variable sized handles depending on the file.  Sometimes you encode
> > just the ino/gen of the child file, but other times you might know the
> > parent and put that in the handle too.
>
> Yeah, I don't think it would be reliable for *all* file systems to use
> name_to_handle_at on startup on some example file/directory. At least
> not without knowing all the details of the underlying passthrough file
> system.
>

Maybe it's not a world-wide general solution, but it is a practical one.

My fuse_passthrough library knows how to detect xfs and ext4 and
knows about the size of their file handles.
https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645

A server could optimize for max_handle_size if it knows it or use
MAX_HANDLE_SZ if it doesn't.

Keep in mind that for the sake of restarting fuse servers (title of this thread)
file handles do not need to be the actual filesystem file handles.
Server can use its own pid as generation and then all inodes get
auto invalidated on server restart.

Not invalidating file handles on server restart, because they are
persistent file handles, is an optimization.

LOOKUP_HANDLE still needs to provide the inode+gen of the parent
which LOOKUP currently does not.

I did not understand why Darrick's suggestion of a flag saying that ino+gen
suffices is any different than max_handle_size = 12 and using the
standard FILEID_INO64_GEN in that case?
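
A minimal sketch of the 12-byte case, assuming a handle made of a 64-bit inode
number plus a 32-bit generation, with the server pid used as the generation so
that handles from a previous server instance stop matching. The demo_* names
and the layout are assumptions for illustration, not an existing kernel or
libfuse structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical 12-byte handle: 64-bit inode number + 32-bit generation. */
struct demo_ino_gen {
	uint64_t ino;
	uint32_t gen;
} __attribute__((packed));

static_assert(sizeof(struct demo_ino_gen) == 12, "handle must be 12 bytes");

/* Build a handle whose generation is the pid of the running server. */
static struct demo_ino_gen demo_make_handle(uint64_t ino, pid_t server_pid)
{
	struct demo_ino_gen h = { .ino = ino, .gen = (uint32_t)server_pid };

	return h;
}

/* A handle is current only if it was issued by this server instance. */
static bool demo_handle_is_current(const void *blob, size_t len,
				   pid_t server_pid)
{
	struct demo_ino_gen h;

	if (len != sizeof(h))
		return false;
	memcpy(&h, blob, sizeof(h));
	return h.gen == (uint32_t)server_pid;
}
```

A server would pass getpid() as server_pid. Since pids can be reused, a
restarted server could in principle get the same pid again, so a real
implementation may prefer a boot-time random value or a persistent counter.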

Thanks,
Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 10:13                                     ` Amir Goldstein
@ 2025-11-06 15:12                                       ` Luis Henriques
  2025-11-06 15:58                                         ` Luis Henriques
  2025-11-06 15:49                                       ` Darrick J. Wong
  1 sibling, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-11-06 15:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Darrick J. Wong, Bernd Schubert,
	Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel,
	Kevin Chen

On Thu, Nov 06 2025, Amir Goldstein wrote:

> [...]
>
>> >>> fuse_entry_out was extended once and fuse_reply_entry()
>> >>> sends the size of the struct.
>> >>
>> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
>
> Sorry, I meant to say that the reply size is variable.
> The size is obviously determined at init time.
>
>> >>
>> >>> However fuse_reply_create() sends it with fuse_open_out
>> >>> appended and fuse_add_direntry_plus() does not seem to write
>> >>> record size at all, so server and client will need to agree on the
>> >>> size of fuse_entry_out and this would need to be backward compat.
>> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
>> >>> it should be fine (?).
>> >>
>> >> If max_handle size becomes a value in fuse_init_out, server and
>> >> client would use it? I think appended fuse_open_out could just
>> >> follow the dynamic actual size of the handle - code that
>> >> serializes/deserializes the response has to look up the actual
>> >> handle size then. For example I wouldn't know what to put in
>> >> for any of the example/passthrough* file systems as handle size -
>> >> would need to be 128B, but the actual size will be typically
>> >> much smaller.
>> >
>> > name_to_handle_at ?
>> >
>> > I guess the problem here is that technically speaking filesystems could
>> > have variable sized handles depending on the file.  Sometimes you encode
>> > just the ino/gen of the child file, but other times you might know the
>> > parent and put that in the handle too.
>>
>> Yeah, I don't think it would be reliable for *all* file systems to use
>> name_to_handle_at on startup on some example file/directory. At least
>> not without knowing all the details of the underlying passthrough file
>> system.
>>
>
> Maybe it's not a world-wide general solution, but it is a practical one.
>
> My fuse_passthrough library knows how to detect xfs and ext4 and
> knows about the size of their file handles.
> https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645
>
> A server could optimize for max_handle_size if it knows it or use
> MAX_HANDLE_SZ if it doesn't.
>
> Keep in mind that for the sake of restarting fuse servers (title of this thread)
> file handles do not need to be the actual filesystem file handles.
> Server can use its own pid as generation and then all inodes get
> auto invalidated on server restart.
>
> Not invalidating file handles on server restart, because the file handles
> are persistent file handles is an optimization.
>
> LOOKUP_HANDLE still needs to provide the inode+gen of the parent
> which LOOKUP currently does not.

One additional complication I just realised is that FUSE_LOOKUP already
uses up all 3 in_args.

So, my initial plan of having FUSE_LOOKUP_HANDLE use a structure similar
to FUSE_LOOKUP, with the additional parent handle passed to the server
through the in_args, needs a different solution.

(Anyway, I'll need to read through the whole thread(s) again to better
digest all the information.)

Cheers,
-- 
Luís


>
> I did not understand why Darrick's suggestion of a flag that ino+gen
> suffice is any different then max_handle_size = 12 and using the
> standard FILEID_INO64_GEN in that case?
>
> Thanks,
> Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 15:12                                       ` Luis Henriques
@ 2025-11-06 15:58                                         ` Luis Henriques
  0 siblings, 0 replies; 46+ messages in thread
From: Luis Henriques @ 2025-11-06 15:58 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Darrick J. Wong, Bernd Schubert,
	Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel,
	Kevin Chen

On Thu, Nov 06 2025, Luis Henriques wrote:

> On Thu, Nov 06 2025, Amir Goldstein wrote:
>
>> [...]
>>
>>> >>> fuse_entry_out was extended once and fuse_reply_entry()
>>> >>> sends the size of the struct.
>>> >>
>>> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
>>
>> Sorry, I meant to say that the reply size is variable.
>> The size is obviously determined at init time.
>>
>>> >>
>>> >>> However fuse_reply_create() sends it with fuse_open_out
>>> >>> appended and fuse_add_direntry_plus() does not seem to write
>>> >>> record size at all, so server and client will need to agree on the
>>> >>> size of fuse_entry_out and this would need to be backward compat.
>>> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
>>> >>> it should be fine (?).
>>> >>
>>> >> If max_handle size becomes a value in fuse_init_out, server and
>>> >> client would use it? I think appended fuse_open_out could just
>>> >> follow the dynamic actual size of the handle - code that
>>> >> serializes/deserializes the response has to look up the actual
>>> >> handle size then. For example I wouldn't know what to put in
>>> >> for any of the example/passthrough* file systems as handle size -
>>> >> would need to be 128B, but the actual size will be typically
>>> >> much smaller.
>>> >
>>> > name_to_handle_at ?
>>> >
>>> > I guess the problem here is that technically speaking filesystems could
>>> > have variable sized handles depending on the file.  Sometimes you encode
>>> > just the ino/gen of the child file, but other times you might know the
>>> > parent and put that in the handle too.
>>>
>>> Yeah, I don't think it would be reliable for *all* file systems to use
>>> name_to_handle_at on startup on some example file/directory. At least
>>> not without knowing all the details of the underlying passthrough file
>>> system.
>>>
>>
>> Maybe it's not a world-wide general solution, but it is a practical one.
>>
>> My fuse_passthrough library knows how to detect xfs and ext4 and
>> knows about the size of their file handles.
>> https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645
>>
>> A server could optimize for max_handle_size if it knows it or use
>> MAX_HANDLE_SZ if it doesn't.
>>
>> Keep in mind that for the sake of restarting fuse servers (title of this thread)
>> file handles do not need to be the actual filesystem file handles.
>> Server can use its own pid as generation and then all inodes get
>> auto invalidated on server restart.
>>
>> Not invalidating file handles on server restart, because the file handles
>> are persistent file handles is an optimization.
>>
>> LOOKUP_HANDLE still needs to provide the inode+gen of the parent
>> which LOOKUP currently does not.
>
> One additional complication I just realised is that FUSE_LOOKUP already
> uses up all the 3 in_args.

Ok, ignore me.  We can have 4 in_args, not 3.

Cheers
-- 
Luís

> So, my initial plan of having FUSE_LOOKUP_HANDLE using a similar structure
> to FUSE_LOOKUP, with the additional parent handle passed to the server
> through the in_args needs a different solution.
>
> (Anyway, I'll need to read through the whole thread(s) again to better
> digest all the information.)
>
> Cheers,
> -- 
> Luís
>
>
>>
>> I did not understand why Darrick's suggestion of a flag that ino+gen
>> suffice is any different then max_handle_size = 12 and using the
>> standard FILEID_INO64_GEN in that case?
>>
>> Thanks,
>> Amir.



* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 10:13                                     ` Amir Goldstein
  2025-11-06 15:12                                       ` Luis Henriques
@ 2025-11-06 15:49                                       ` Darrick J. Wong
  2025-11-06 16:08                                         ` Stef Bon
  2025-11-06 16:11                                         ` Amir Goldstein
  1 sibling, 2 replies; 46+ messages in thread
From: Darrick J. Wong @ 2025-11-06 15:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Luis Henriques, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen

On Thu, Nov 06, 2025 at 11:13:01AM +0100, Amir Goldstein wrote:
> [...]
> 
> > >>> fuse_entry_out was extended once and fuse_reply_entry()
> > >>> sends the size of the struct.
> > >>
> > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
> 
> Sorry, I meant to say that the reply size is variable.
> The size is obviously determined at init time.
> 
> > >>
> > >>> However fuse_reply_create() sends it with fuse_open_out
> > >>> appended and fuse_add_direntry_plus() does not seem to write
> > >>> record size at all, so server and client will need to agree on the
> > >>> size of fuse_entry_out and this would need to be backward compat.
> > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
> > >>> it should be fine (?).
> > >>
> > >> If max_handle size becomes a value in fuse_init_out, server and
> > >> client would use it? I think appended fuse_open_out could just
> > >> follow the dynamic actual size of the handle - code that
> > >> serializes/deserializes the response has to look up the actual
> > >> handle size then. For example I wouldn't know what to put in
> > >> for any of the example/passthrough* file systems as handle size -
> > >> would need to be 128B, but the actual size will be typically
> > >> much smaller.
> > >
> > > name_to_handle_at ?
> > >
> > > I guess the problem here is that technically speaking filesystems could
> > > have variable sized handles depending on the file.  Sometimes you encode
> > > just the ino/gen of the child file, but other times you might know the
> > > parent and put that in the handle too.
> >
> > Yeah, I don't think it would be reliable for *all* file systems to use
> > name_to_handle_at on startup on some example file/directory. At least
> > not without knowing all the details of the underlying passthrough file
> > system.
> >
> 
> Maybe it's not a world-wide general solution, but it is a practical one.
> 
> My fuse_passthrough library knows how to detect xfs and ext4 and
> knows about the size of their file handles.
> https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645
> 
> A server could optimize for max_handle_size if it knows it or use
> MAX_HANDLE_SZ if it doesn't.
> 
> Keep in mind that for the sake of restarting fuse servers (title of this thread)
> file handles do not need to be the actual filesystem file handles.
> Server can use its own pid as generation and then all inodes get
> auto invalidated on server restart.
> 
> Not invalidating file handles on server restart, because the file handles
> are persistent file handles is an optimization.
> 
> LOOKUP_HANDLE still needs to provide the inode+gen of the parent
> which LOOKUP currently does not.
> 
> I did not understand why Darrick's suggestion of a flag that ino+gen
> suffice is any different then max_handle_size = 12 and using the
> standard FILEID_INO64_GEN in that case?

Technically speaking, a 12-byte handle could contain anything.  Maybe
you have a u32 volumeid, inumber, and generation, whereas the flag that
I was mumbling about would specify the handle format as well.
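
A minimal sketch of why the size alone does not identify the format, assuming
two hypothetical 12-byte layouts. The same bytes decode to different inode
numbers depending on which layout is assumed, so something (a flag, or a
handle type tag) has to name the format. All demo_* names are assumptions for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HANDLE_SZ 12

/* Hypothetical format tags, playing the role of the flag / handle type. */
enum demo_fmt {
	DEMO_FMT_INO64_GEN = 1,		/* u64 ino, u32 gen */
	DEMO_FMT_VOL_INO32_GEN = 2,	/* u32 volumeid, u32 ino, u32 gen */
};

struct demo_ino64_gen {
	uint64_t ino;
	uint32_t gen;
} __attribute__((packed));

struct demo_vol_ino_gen {
	uint32_t volumeid;
	uint32_t ino;
	uint32_t gen;
};

static_assert(sizeof(struct demo_ino64_gen) == DEMO_HANDLE_SZ, "12 bytes");
static_assert(sizeof(struct demo_vol_ino_gen) == DEMO_HANDLE_SZ, "12 bytes");

/* Extract the inode number; returns 0 on success, -1 for an unknown format. */
static int demo_decode_ino(enum demo_fmt fmt,
			   const uint8_t blob[DEMO_HANDLE_SZ], uint64_t *ino)
{
	switch (fmt) {
	case DEMO_FMT_INO64_GEN: {
		struct demo_ino64_gen h;

		memcpy(&h, blob, sizeof(h));
		*ino = h.ino;
		return 0;
	}
	case DEMO_FMT_VOL_INO32_GEN: {
		struct demo_vol_ino_gen h;

		memcpy(&h, blob, sizeof(h));
		*ino = h.ino;
		return 0;
	}
	default:
		return -1;
	}
}
```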

Speaking of which: should file handles be exporting volume ids for the
filesystem (btrfs) that supports it?

--D

> Thanks,
> Amir.


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 15:49                                       ` Darrick J. Wong
@ 2025-11-06 16:08                                         ` Stef Bon
  2025-11-07  9:25                                           ` Luis Henriques
  2025-11-06 16:11                                         ` Amir Goldstein
  1 sibling, 1 reply; 46+ messages in thread
From: Stef Bon @ 2025-11-06 16:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Bernd Schubert, Luis Henriques, Bernd Schubert,
	Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel,
	Kevin Chen

Hi,

Does a lookup using a handle have to be implemented in the kernel?

I've written a FUSE fs for sftp using SSH as transport, where the
lookup call normally has to create a path (relative to the root of the
sftp) and send that to the remote server.
It saves the creation of this path if there is a handle available.
When doing an opendir, this is normally followed by a lookup for every
dentry (sftp does not support readdirplus). Now in this case there is
a handle available (the one used by opendir, or one created with
open), so the fuse daemon I wrote uses that to proceed (and so does not
create a path).

So it can also go in userspace.

Stef


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 16:08                                         ` Stef Bon
@ 2025-11-07  9:25                                           ` Luis Henriques
  2025-11-10  8:20                                             ` Stef Bon
  0 siblings, 1 reply; 46+ messages in thread
From: Luis Henriques @ 2025-11-07  9:25 UTC (permalink / raw)
  To: Stef Bon
  Cc: Darrick J. Wong, Amir Goldstein, Bernd Schubert, Bernd Schubert,
	Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel,
	Kevin Chen

Hi Stef,

On Thu, Nov 06 2025, Stef Bon wrote:

> Hi,
>
> is implementing a lookup using a handle to be in the kernel?

What we're talking about here is a new FUSE operation, FUSE_LOOKUP_HANDLE.
The scope is mostly related to server restartability: being able to
restart a FUSE server without unmounting the file system.  But other
use cases are also relevant (e.g. NFS exports).

Just in case you missed it, here's a link to the full discussion:

https://lore.kernel.org/all/8734afp0ct.fsf@igalia.com/

and to an older discussion, also relevant:

https://lore.kernel.org/all/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/

Cheers,
-- 
Luís

> I've written a FUSE fs for sftp using SSH as transport, where the
> lookup call normally has to create a path (relative to the root of the
> sftp) and send that to the remote server.
> It saves the creation of this path if there is a handle available.
> When doing an opendir, this is normally followed by a lookup for every
> dentry. (sftp does not support readdirplus) Now in this case there is
> a handle available (the one used by opendir, or one created with
> open), so the fuse daemon I wrote used that to proceed. (and so not
> create a path).
>
> So it can also go in userspace.
>
> Stef
>



* Re: [RFC] Another take at restarting FUSE servers
  2025-11-07  9:25                                           ` Luis Henriques
@ 2025-11-10  8:20                                             ` Stef Bon
  0 siblings, 0 replies; 46+ messages in thread
From: Stef Bon @ 2025-11-10  8:20 UTC (permalink / raw)
  To: Luis Henriques; +Cc: linux-fsdevel

Hi,

I see, this has to do with the name-to-handle calls, which provide clients
with references to fs objects that remain valid across a restart, right?

Stef


* Re: [RFC] Another take at restarting FUSE servers
  2025-11-06 15:49                                       ` Darrick J. Wong
  2025-11-06 16:08                                         ` Stef Bon
@ 2025-11-06 16:11                                         ` Amir Goldstein
  1 sibling, 0 replies; 46+ messages in thread
From: Amir Goldstein @ 2025-11-06 16:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, Luis Henriques, Bernd Schubert, Theodore Ts'o,
	Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen

On Thu, Nov 6, 2025 at 4:49 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Nov 06, 2025 at 11:13:01AM +0100, Amir Goldstein wrote:
> > [...]
> >
> > > >>> fuse_entry_out was extended once and fuse_reply_entry()
> > > >>> sends the size of the struct.
> > > >>
> > > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
> >
> > Sorry, I meant to say that the reply size is variable.
> > The size is obviously determined at init time.
> >
> > > >>
> > > >>> However fuse_reply_create() sends it with fuse_open_out
> > > >>> appended and fuse_add_direntry_plus() does not seem to write
> > > >>> record size at all, so server and client will need to agree on the
> > > >>> size of fuse_entry_out and this would need to be backward compat.
> > > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
> > > >>> it should be fine (?).
> > > >>
> > > >> If max_handle size becomes a value in fuse_init_out, server and
> > > >> client would use it? I think appended fuse_open_out could just
> > > >> follow the dynamic actual size of the handle - code that
> > > >> serializes/deserializes the response has to look up the actual
> > > >> handle size then. For example I wouldn't know what to put in
> > > >> for any of the example/passthrough* file systems as handle size -
> > > >> would need to be 128B, but the actual size will be typically
> > > >> much smaller.
> > > >
> > > > name_to_handle_at ?
> > > >
> > > > I guess the problem here is that technically speaking filesystems could
> > > > have variable sized handles depending on the file.  Sometimes you encode
> > > > just the ino/gen of the child file, but other times you might know the
> > > > parent and put that in the handle too.
> > >
> > > Yeah, I don't think it would be reliable for *all* file systems to use
> > > name_to_handle_at on startup on some example file/directory. At least
> > > not without knowing all the details of the underlying passthrough file
> > > system.
> > >
> >
> > Maybe it's not a world-wide general solution, but it is a practical one.
> >
> > My fuse_passthrough library knows how to detect xfs and ext4 and
> > knows about the size of their file handles.
> > https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645
> >
> > A server could optimize for max_handle_size if it knows it or use
> > MAX_HANDLE_SZ if it doesn't.
> >
> > Keep in mind that for the sake of restarting fuse servers (title of this thread)
> > file handles do not need to be the actual filesystem file handles.
> > Server can use its own pid as generation and then all inodes get
> > auto invalidated on server restart.
> >
> > Not invalidating file handles on server restart, because the file handles
> > are persistent file handles is an optimization.
> >
> > LOOKUP_HANDLE still needs to provide the inode+gen of the parent
> > which LOOKUP currently does not.
> >
> > I did not understand why Darrick's suggestion of a flag that ino+gen
> > suffice is any different then max_handle_size = 12 and using the
> > standard FILEID_INO64_GEN in that case?
>
> Technically speaking, a 12-byte handle could contain anything.  Maybe
> you have a u32 volumeid, inumber, and generation, whereas the flag that
> I was mumbling about would specify the handle format as well.
>
> Speaking of which: should file handles be exporting volume ids for the
> filesystem (btrfs) that supports it?
>

File handles are opaque, so the server can put whatever it wants in them;
it does not need to use the native fs file handles (in the case of a
passthrough fs or an iomap fs).

Take struct ovl_fh for example, the format of file handles that overlayfs
exports to NFS encapsulates the underlying fs uuid and file handle.

Note that when exporting such a fuse filesystem to NFS, it is still the
responsibility of the exporter to specify an explicit fsid identifier in
/etc/exports for this fuse server type/instance and then the file handles
generated by this server are expected to be unique in the scope of this
NFS export. Not sure how much of this is relevant for the use case
of restarting a fuse server.

Thanks,
Amir.


end of thread, other threads:[~2025-11-10  8:21 UTC | newest]

Thread overview: 46+ messages
2025-07-29 13:56 [RFC] Another take at restarting FUSE servers Luis Henriques
2025-07-29 23:38 ` Darrick J. Wong
2025-07-30 14:04   ` Luis Henriques
2025-07-31 11:33     ` Christian Brauner
2025-07-31 12:23       ` Luis Henriques
2025-07-31 17:29       ` Darrick J. Wong
2025-08-04  8:45         ` Christian Brauner
2025-08-12 19:28           ` Darrick J. Wong
2025-07-31 13:04   ` Theodore Ts'o
2025-07-31 17:38     ` Darrick J. Wong
2025-08-01 10:15       ` Luis Henriques
2025-08-11 15:43         ` Darrick J. Wong
2025-08-13 13:14           ` Luis Henriques
2025-09-12 10:31         ` Bernd Schubert
2025-09-12 11:41           ` Amir Goldstein
2025-09-12 12:29             ` Bernd Schubert
2025-09-12 14:58               ` Darrick J. Wong
2025-09-12 15:20                 ` Bernd Schubert
2025-09-15  4:43                   ` Darrick J. Wong
2025-09-15  7:07                 ` Amir Goldstein
2025-09-15  8:27                   ` Bernd Schubert
2025-09-15  8:41                     ` Amir Goldstein
2025-09-16  2:53                       ` Darrick J. Wong
2025-09-16  7:59                         ` Amir Goldstein
2025-09-18 17:50                           ` Darrick J. Wong
2025-11-04 11:40                           ` Luis Henriques
2025-11-04 13:10                             ` Amir Goldstein
2025-11-04 14:52                               ` Luis Henriques
2025-11-05 10:21                                 ` Amir Goldstein
2025-11-05 11:50                                   ` Luis Henriques
2025-11-05 15:30                                     ` Amir Goldstein
2025-11-05 21:38                                       ` Darrick J. Wong
2025-11-05 21:46                                         ` Bernd Schubert
2025-11-05 22:06                                           ` Bernd Schubert
2025-11-05 22:24                               ` Bernd Schubert
2025-11-05 22:42                                 ` Darrick J. Wong
2025-11-05 22:48                                   ` Bernd Schubert
2025-11-06  0:21                                     ` Darrick J. Wong
2025-11-06 10:13                                     ` Amir Goldstein
2025-11-06 15:12                                       ` Luis Henriques
2025-11-06 15:58                                         ` Luis Henriques
2025-11-06 15:49                                       ` Darrick J. Wong
2025-11-06 16:08                                         ` Stef Bon
2025-11-07  9:25                                           ` Luis Henriques
2025-11-10  8:20                                             ` Stef Bon
2025-11-06 16:11                                         ` Amir Goldstein
