* [RFC] Another take at restarting FUSE servers @ 2025-07-29 13:56 Luis Henriques 2025-07-29 23:38 ` Darrick J. Wong 0 siblings, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-07-29 13:56 UTC (permalink / raw) To: Miklos Szeredi, Bernd Schubert; +Cc: linux-fsdevel, linux-kernel Hi! I know this has been discussed several times in several places, and the recent(ish) addition of NOTIFY_RESEND is an important step towards being able to restart a user-space FUSE server. While looking at how to restart a server that uses the libfuse lowlevel API, I've created an RFC pull request [1] to understand whether adding support for this operation would be something acceptable in the project. The PR doesn't do anything sophisticated, it simply hacks into the opaque libfuse data structures so that a server could set some of the sessions' fields. So, a FUSE server simply has to save the /dev/fuse file descriptor and pass it to libfuse while recovering, after a restart or a crash. The mentioned NOTIFY_RESEND should be used so that no requests are lost, of course. And there are probably other data structures that user-space file systems will have to keep track as well, so that everything can be restored. (The parameters set in the INIT phase, for example.) But, from the discussion with Bernd in the PR, one of the things that would be good to have is for the kernel to send back to user-space the information about the inodes it already knows about. I have been playing with this idea with a patch that simply sends out LOOKUPs for each of these inodes. This could be done through a new NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the already existing NOTIFY_RESEND. Anyway, before spending any more time with this, I wanted to ask whether this is something that could be acceptable in the kernel, if people think a different approach should be followed, or if I'm simply trying to solve the wrong problem. Thanks in advance for any feedback on this. [1] https://github.com/libfuse/libfuse/pull/1219 Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-29 13:56 [RFC] Another take at restarting FUSE servers Luis Henriques @ 2025-07-29 23:38 ` Darrick J. Wong 2025-07-30 14:04 ` Luis Henriques 2025-07-31 13:04 ` Theodore Ts'o 0 siblings, 2 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-07-29 23:38 UTC (permalink / raw) To: Luis Henriques Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: > Hi! > > I know this has been discussed several times in several places, and the > recent(ish) addition of NOTIFY_RESEND is an important step towards being > able to restart a user-space FUSE server. > > While looking at how to restart a server that uses the libfuse lowlevel > API, I've created an RFC pull request [1] to understand whether adding > support for this operation would be something acceptable in the project. Just speaking for fuse2fs here -- that would be kinda nifty if libfuse could restart itself. It's unclear if doing so will actually enable us to clear the condition that caused the failure in the first place, but I suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts aren't totally crazy. > The PR doesn't do anything sophisticated, it simply hacks into the opaque > libfuse data structures so that a server could set some of the sessions' > fields. > > So, a FUSE server simply has to save the /dev/fuse file descriptor and > pass it to libfuse while recovering, after a restart or a crash. The > mentioned NOTIFY_RESEND should be used so that no requests are lost, of > course. And there are probably other data structures that user-space file > systems will have to keep track as well, so that everything can be > restored. (The parameters set in the INIT phase, for example.) Yeah, I don't know how that would work in practice. Would the kernel send back the old connection flags and whatnot via some sort of FUSE_REINIT request, and the fuse server can either decide that it will try to recover, or just bail out? > But, from the discussion with Bernd in the PR, one of the things that > would be good to have is for the kernel to send back to user-space the > information about the inodes it already knows about. > > I have been playing with this idea with a patch that simply sends out > LOOKUPs for each of these inodes. This could be done through a new > NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the > already existing NOTIFY_RESEND. I have no idea if NOTIFY_RESEND already does this, but you'd probably want to purge all the unreferenced dentries/inodes to reduce the amount of re-querying. I gather that any fuse server that wants to reboot itself would either have to persist what the nodeids map to, or otherwise stabilize them? For example, fuse2fs could set the nodeid to match the ext2 inode numbers. Then reconnecting them wouldn't be too hard. > Anyway, before spending any more time with this, I wanted to ask whether > this is something that could be acceptable in the kernel, if people think > a different approach should be followed, or if I'm simply trying to solve > the wrong problem. > > Thanks in advance for any feedback on this. > > [1] https://github.com/libfuse/libfuse/pull/1219 Who calls fuse_session_reinitialize() ? --D > Cheers, > -- > Luís > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-29 23:38 ` Darrick J. Wong @ 2025-07-30 14:04 ` Luis Henriques 2025-07-31 11:33 ` Christian Brauner 2025-07-31 13:04 ` Theodore Ts'o 1 sibling, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-07-30 14:04 UTC (permalink / raw) To: Darrick J. Wong Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel Hi Darrick, On Tue, Jul 29 2025, Darrick J. Wong wrote: > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: >> Hi! >> >> I know this has been discussed several times in several places, and the >> recent(ish) addition of NOTIFY_RESEND is an important step towards being >> able to restart a user-space FUSE server. >> >> While looking at how to restart a server that uses the libfuse lowlevel >> API, I've created an RFC pull request [1] to understand whether adding >> support for this operation would be something acceptable in the project. > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > could restart itself. It's unclear if doing so will actually enable us > to clear the condition that caused the failure in the first place, but I > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > aren't totally crazy. Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do the restart itself. Instead, it simply adds some visibility into the opaque data structures so that a FUSE server could re-initialise a session without having to go through a full remount. But sure, there are other things that could be added to the library as well. For example, in my current experiments, the FUSE server needs start some sort of "file descriptor server" to keep the fd alive for the restart. This daemon could be optionally provided in libfuse itself, which could also be used to store all sorts of blobs needed by the file system after recovery is done. >> The PR doesn't do anything sophisticated, it simply hacks into the opaque >> libfuse data structures so that a server could set some of the sessions' >> fields. >> >> So, a FUSE server simply has to save the /dev/fuse file descriptor and >> pass it to libfuse while recovering, after a restart or a crash. The >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of >> course. And there are probably other data structures that user-space file >> systems will have to keep track as well, so that everything can be >> restored. (The parameters set in the INIT phase, for example.) > > Yeah, I don't know how that would work in practice. Would the kernel > send back the old connection flags and whatnot via some sort of > FUSE_REINIT request, and the fuse server can either decide that it will > try to recover, or just bail out? That would be an option. But my current idea would be that the server would need to store those somewhere and simply assume they are still OK after reconnecting. The kernel wouldn't need to know the user-space was replaced by another server, potentially different, after an upgrade for example. Right now, AFAIU, restarting a FUSE server *can* be done without any help from the kernel side, as long as the fd is kept alive. The NOTIFY_RESEND is used only for resending FUSE requests for which the kernel is currently waiting replies for. So, for example if the kernel sends a FUSE_READ to user-space and the server crashes while trying to serve it, the kernel will still be waiting for that reply. However, a new server trying to recover from the crash will have no way to know that. And this is where the NOTIFY_RESEND is useful. >> But, from the discussion with Bernd in the PR, one of the things that >> would be good to have is for the kernel to send back to user-space the >> information about the inodes it already knows about. >> >> I have been playing with this idea with a patch that simply sends out >> LOOKUPs for each of these inodes. This could be done through a new >> NOTIFY_RESEND_INODES, or maybe it could be an extra operation added to the >> already existing NOTIFY_RESEND. > > I have no idea if NOTIFY_RESEND already does this, but you'd probably > want to purge all the unreferenced dentries/inodes to reduce the amount > of re-querying. No, NOTIFY_RESEND doesn't purge any of those; currently it simply resend all the requests. > I gather that any fuse server that wants to reboot itself would either > have to persist what the nodeids map to, or otherwise stabilize them? > For example, fuse2fs could set the nodeid to match the ext2 inode > numbers. Then reconnecting them wouldn't be too hard. Right, that's my understanding as well -- restarting a server requires stable nodeids. IIRC most (all?) examples shipped with libfuse can't be restarted because they cast a pointer (the memory address to some sort of inode data struct) and use that as the nodeid. >> Anyway, before spending any more time with this, I wanted to ask whether >> this is something that could be acceptable in the kernel, if people think >> a different approach should be followed, or if I'm simply trying to solve >> the wrong problem. >> >> Thanks in advance for any feedback on this. >> >> [1] https://github.com/libfuse/libfuse/pull/1219 > > Who calls fuse_session_reinitialize() ? Ah! Good question! So, my idea was that a FUSE server would do something like this: fuse_session_new() if (do_recovery) { get_old_fd() fuse_session_reinitialize() fuse_lowlevel_notify_resend() } else fuse_session_mount() fuse_daemonize() fuse_session_loop_mt() Anyway, my initial concerns with restartability started because it is currently not possible to restart a server that uses libfuse without hacking into it's internal data structures. The idea of resending all LOOKUPs just came from the discussion in the PR. Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-30 14:04 ` Luis Henriques @ 2025-07-31 11:33 ` Christian Brauner 2025-07-31 12:23 ` Luis Henriques 2025-07-31 17:29 ` Darrick J. Wong 0 siblings, 2 replies; 46+ messages in thread From: Christian Brauner @ 2025-07-31 11:33 UTC (permalink / raw) To: Luis Henriques Cc: Darrick J. Wong, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote: > Hi Darrick, > > On Tue, Jul 29 2025, Darrick J. Wong wrote: > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: > >> Hi! > >> > >> I know this has been discussed several times in several places, and the > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being > >> able to restart a user-space FUSE server. > >> > >> While looking at how to restart a server that uses the libfuse lowlevel > >> API, I've created an RFC pull request [1] to understand whether adding > >> support for this operation would be something acceptable in the project. > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > could restart itself. It's unclear if doing so will actually enable us > > to clear the condition that caused the failure in the first place, but I > > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > aren't totally crazy. > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do > the restart itself. Instead, it simply adds some visibility into the > opaque data structures so that a FUSE server could re-initialise a session > without having to go through a full remount. > > But sure, there are other things that could be added to the library as > well. For example, in my current experiments, the FUSE server needs start > some sort of "file descriptor server" to keep the fd alive for the > restart. This daemon could be optionally provided in libfuse itself, > which could also be used to store all sorts of blobs needed by the file > system after recovery is done. Fwiw, for most use-cases you really just want to use systemd's file descriptor store to persist the /dev/fuse connection: https://systemd.io/FILE_DESCRIPTOR_STORE/ > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque > >> libfuse data structures so that a server could set some of the sessions' > >> fields. > >> > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and > >> pass it to libfuse while recovering, after a restart or a crash. The > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of > >> course. And there are probably other data structures that user-space file > >> systems will have to keep track as well, so that everything can be > >> restored. (The parameters set in the INIT phase, for example.) > > > > Yeah, I don't know how that would work in practice. Would the kernel > > send back the old connection flags and whatnot via some sort of > > FUSE_REINIT request, and the fuse server can either decide that it will > > try to recover, or just bail out? > > That would be an option. But my current idea would be that the server > would need to store those somewhere and simply assume they are still OK The fdstore currently allows to associate a name with a file descriptor in the fdstore. That name would allow you to associate the options with the fuse connection. However, I would not rule it out that additional metadata could be attached to file descriptors in the fdstore if that's something that's needed. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-31 11:33 ` Christian Brauner @ 2025-07-31 12:23 ` Luis Henriques 2025-07-31 17:29 ` Darrick J. Wong 1 sibling, 0 replies; 46+ messages in thread From: Luis Henriques @ 2025-07-31 12:23 UTC (permalink / raw) To: Christian Brauner Cc: Darrick J. Wong, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Thu, Jul 31 2025, Christian Brauner wrote: > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote: >> Hi Darrick, >> >> On Tue, Jul 29 2025, Darrick J. Wong wrote: >> >> > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: >> >> Hi! >> >> >> >> I know this has been discussed several times in several places, and the >> >> recent(ish) addition of NOTIFY_RESEND is an important step towards being >> >> able to restart a user-space FUSE server. >> >> >> >> While looking at how to restart a server that uses the libfuse lowlevel >> >> API, I've created an RFC pull request [1] to understand whether adding >> >> support for this operation would be something acceptable in the project. >> > >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >> > could restart itself. It's unclear if doing so will actually enable us >> > to clear the condition that caused the failure in the first place, but I >> > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >> > aren't totally crazy. >> >> Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do >> the restart itself. Instead, it simply adds some visibility into the >> opaque data structures so that a FUSE server could re-initialise a session >> without having to go through a full remount. >> >> But sure, there are other things that could be added to the library as >> well. For example, in my current experiments, the FUSE server needs start >> some sort of "file descriptor server" to keep the fd alive for the >> restart. This daemon could be optionally provided in libfuse itself, >> which could also be used to store all sorts of blobs needed by the file >> system after recovery is done. > > Fwiw, for most use-cases you really just want to use systemd's file > descriptor store to persist the /dev/fuse connection: > https://systemd.io/FILE_DESCRIPTOR_STORE/ Thank you, Christian. I guess I should have mentioned systemd's fdstore here. In fact, I knew about it, but in my experiments I decided not to use it because it's trivial to keep the fd alive[1] (and also because my test environment doesn't run systemd). But still, any eventual libfuse support could still include the interface with fdstore for that. [1] Obviously "it's trivial" for my experiments. Doing it in a secure way is probably a bit more challenging. Cheers, -- Luís > >> >> >> The PR doesn't do anything sophisticated, it simply hacks into the opaque >> >> libfuse data structures so that a server could set some of the sessions' >> >> fields. >> >> >> >> So, a FUSE server simply has to save the /dev/fuse file descriptor and >> >> pass it to libfuse while recovering, after a restart or a crash. The >> >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of >> >> course. And there are probably other data structures that user-space file >> >> systems will have to keep track as well, so that everything can be >> >> restored. (The parameters set in the INIT phase, for example.) >> > >> > Yeah, I don't know how that would work in practice. Would the kernel >> > send back the old connection flags and whatnot via some sort of >> > FUSE_REINIT request, and the fuse server can either decide that it will >> > try to recover, or just bail out? >> >> That would be an option. But my current idea would be that the server >> would need to store those somewhere and simply assume they are still OK > > The fdstore currently allows to associate a name with a file descriptor > in the fdstore. That name would allow you to associate the options with > the fuse connection. However, I would not rule it out that additional > metadata could be attached to file descriptors in the fdstore if that's > something that's needed. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-31 11:33 ` Christian Brauner 2025-07-31 12:23 ` Luis Henriques @ 2025-07-31 17:29 ` Darrick J. Wong 2025-08-04 8:45 ` Christian Brauner 1 sibling, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-07-31 17:29 UTC (permalink / raw) To: Christian Brauner Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote: > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote: > > Hi Darrick, > > > > On Tue, Jul 29 2025, Darrick J. Wong wrote: > > > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: > > >> Hi! > > >> > > >> I know this has been discussed several times in several places, and the > > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being > > >> able to restart a user-space FUSE server. > > >> > > >> While looking at how to restart a server that uses the libfuse lowlevel > > >> API, I've created an RFC pull request [1] to understand whether adding > > >> support for this operation would be something acceptable in the project. > > > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > > could restart itself. It's unclear if doing so will actually enable us > > > to clear the condition that caused the failure in the first place, but I > > > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > > aren't totally crazy. > > > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do > > the restart itself. Instead, it simply adds some visibility into the > > opaque data structures so that a FUSE server could re-initialise a session > > without having to go through a full remount. > > > > But sure, there are other things that could be added to the library as > > well. For example, in my current experiments, the FUSE server needs start > > some sort of "file descriptor server" to keep the fd alive for the > > restart. This daemon could be optionally provided in libfuse itself, > > which could also be used to store all sorts of blobs needed by the file > > system after recovery is done. > > Fwiw, for most use-cases you really just want to use systemd's file > descriptor store to persist the /dev/fuse connection: > https://systemd.io/FILE_DESCRIPTOR_STORE/ Very nice! This is exactly what I was looking for to handle the initial setup, so I'm glad I don't have to go design a protocol around that. > > > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque > > >> libfuse data structures so that a server could set some of the sessions' > > >> fields. > > >> > > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and > > >> pass it to libfuse while recovering, after a restart or a crash. The > > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of > > >> course. And there are probably other data structures that user-space file > > >> systems will have to keep track as well, so that everything can be > > >> restored. (The parameters set in the INIT phase, for example.) > > > > > > Yeah, I don't know how that would work in practice. Would the kernel > > > send back the old connection flags and whatnot via some sort of > > > FUSE_REINIT request, and the fuse server can either decide that it will > > > try to recover, or just bail out? > > > > That would be an option. But my current idea would be that the server > > would need to store those somewhere and simply assume they are still OK > > The fdstore currently allows to associate a name with a file descriptor > in the fdstore. That name would allow you to associate the options with > the fuse connection. However, I would not rule it out that additional > metadata could be attached to file descriptors in the fdstore if that's > something that's needed. Names are useful, I'd at least want "fusedev", "fsopen", and "device". If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in the store as 'journal_dev'?" Then it just has to wait until the fd shows up, and it can continue with the mount process. Though the "device" argument needn't be a path, so to be fully general mountfsd and the fuse server would have to handshake that as well. --D ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-31 17:29 ` Darrick J. Wong @ 2025-08-04 8:45 ` Christian Brauner 2025-08-12 19:28 ` Darrick J. Wong 0 siblings, 1 reply; 46+ messages in thread From: Christian Brauner @ 2025-08-04 8:45 UTC (permalink / raw) To: Darrick J. Wong Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Thu, Jul 31, 2025 at 10:29:46AM -0700, Darrick J. Wong wrote: > On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote: > > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote: > > > Hi Darrick, > > > > > > On Tue, Jul 29 2025, Darrick J. Wong wrote: > > > > > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: > > > >> Hi! > > > >> > > > >> I know this has been discussed several times in several places, and the > > > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being > > > >> able to restart a user-space FUSE server. > > > >> > > > >> While looking at how to restart a server that uses the libfuse lowlevel > > > >> API, I've created an RFC pull request [1] to understand whether adding > > > >> support for this operation would be something acceptable in the project. > > > > > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > > > could restart itself. It's unclear if doing so will actually enable us > > > > to clear the condition that caused the failure in the first place, but I > > > > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > > > aren't totally crazy. > > > > > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do > > > the restart itself. Instead, it simply adds some visibility into the > > > opaque data structures so that a FUSE server could re-initialise a session > > > without having to go through a full remount. > > > > > > But sure, there are other things that could be added to the library as > > > well. For example, in my current experiments, the FUSE server needs start > > > some sort of "file descriptor server" to keep the fd alive for the > > > restart. This daemon could be optionally provided in libfuse itself, > > > which could also be used to store all sorts of blobs needed by the file > > > system after recovery is done. > > > > Fwiw, for most use-cases you really just want to use systemd's file > > descriptor store to persist the /dev/fuse connection: > > https://systemd.io/FILE_DESCRIPTOR_STORE/ > > Very nice! This is exactly what I was looking for to handle the initial > setup, so I'm glad I don't have to go design a protocol around that. > > > > > > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque > > > >> libfuse data structures so that a server could set some of the sessions' > > > >> fields. > > > >> > > > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and > > > >> pass it to libfuse while recovering, after a restart or a crash. The > > > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of > > > >> course. And there are probably other data structures that user-space file > > > >> systems will have to keep track as well, so that everything can be > > > >> restored. (The parameters set in the INIT phase, for example.) > > > > > > > > Yeah, I don't know how that would work in practice. Would the kernel > > > > send back the old connection flags and whatnot via some sort of > > > > FUSE_REINIT request, and the fuse server can either decide that it will > > > > try to recover, or just bail out? > > > > > > That would be an option. But my current idea would be that the server > > > would need to store those somewhere and simply assume they are still OK > > > > The fdstore currently allows to associate a name with a file descriptor > > in the fdstore. That name would allow you to associate the options with > > the fuse connection. However, I would not rule it out that additional > > metadata could be attached to file descriptors in the fdstore if that's > > something that's needed. > > Names are useful, I'd at least want "fusedev", "fsopen", and "device". > > If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to > be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in > the store as 'journal_dev'?" Then it just has to wait until the fd shows > up, and it can continue with the mount process. > > Though the "device" argument needn't be a path, so to be fully general > mountfsd and the fuse server would have to handshake that as well. Fwiw, to attach arbitrary metadata to a file descriptor the easiest thing to do would be to stash both a (fuse server) file descriptor and then also a memfd via memfd_create() that e.g., can contain all the server options that you want to store. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-08-04 8:45 ` Christian Brauner @ 2025-08-12 19:28 ` Darrick J. Wong 0 siblings, 0 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-08-12 19:28 UTC (permalink / raw) To: Christian Brauner Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Mon, Aug 04, 2025 at 10:45:44AM +0200, Christian Brauner wrote: > On Thu, Jul 31, 2025 at 10:29:46AM -0700, Darrick J. Wong wrote: > > On Thu, Jul 31, 2025 at 01:33:09PM +0200, Christian Brauner wrote: > > > On Wed, Jul 30, 2025 at 03:04:00PM +0100, Luis Henriques wrote: > > > > Hi Darrick, > > > > > > > > On Tue, Jul 29 2025, Darrick J. Wong wrote: > > > > > > > > > On Tue, Jul 29, 2025 at 02:56:02PM +0100, Luis Henriques wrote: > > > > >> Hi! > > > > >> > > > > >> I know this has been discussed several times in several places, and the > > > > >> recent(ish) addition of NOTIFY_RESEND is an important step towards being > > > > >> able to restart a user-space FUSE server. > > > > >> > > > > >> While looking at how to restart a server that uses the libfuse lowlevel > > > > >> API, I've created an RFC pull request [1] to understand whether adding > > > > >> support for this operation would be something acceptable in the project. > > > > > > > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > > > > could restart itself. It's unclear if doing so will actually enable us > > > > > to clear the condition that caused the failure in the first place, but I > > > > > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > > > > aren't totally crazy. > > > > > > > > Maybe my PR lacks a bit of ambition -- it's goal wasn't to have libfuse do > > > > the restart itself. Instead, it simply adds some visibility into the > > > > opaque data structures so that a FUSE server could re-initialise a session > > > > without having to go through a full remount. > > > > > > > > But sure, there are other things that could be added to the library as > > > > well. For example, in my current experiments, the FUSE server needs start > > > > some sort of "file descriptor server" to keep the fd alive for the > > > > restart. This daemon could be optionally provided in libfuse itself, > > > > which could also be used to store all sorts of blobs needed by the file > > > > system after recovery is done. > > > > > > Fwiw, for most use-cases you really just want to use systemd's file > > > descriptor store to persist the /dev/fuse connection: > > > https://systemd.io/FILE_DESCRIPTOR_STORE/ > > > > Very nice! This is exactly what I was looking for to handle the initial > > setup, so I'm glad I don't have to go design a protocol around that. > > > > > > > > > > >> The PR doesn't do anything sophisticated, it simply hacks into the opaque > > > > >> libfuse data structures so that a server could set some of the sessions' > > > > >> fields. > > > > >> > > > > >> So, a FUSE server simply has to save the /dev/fuse file descriptor and > > > > >> pass it to libfuse while recovering, after a restart or a crash. The > > > > >> mentioned NOTIFY_RESEND should be used so that no requests are lost, of > > > > >> course. And there are probably other data structures that user-space file > > > > >> systems will have to keep track as well, so that everything can be > > > > >> restored. (The parameters set in the INIT phase, for example.) > > > > > > > > > > Yeah, I don't know how that would work in practice. Would the kernel > > > > > send back the old connection flags and whatnot via some sort of > > > > > FUSE_REINIT request, and the fuse server can either decide that it will > > > > > try to recover, or just bail out? > > > > > > > > That would be an option. But my current idea would be that the server > > > > would need to store those somewhere and simply assume they are still OK > > > > > > The fdstore currently allows to associate a name with a file descriptor > > > in the fdstore. That name would allow you to associate the options with > > > the fuse connection. However, I would not rule it out that additional > > > metadata could be attached to file descriptors in the fdstore if that's > > > something that's needed. > > > > Names are useful, I'd at least want "fusedev", "fsopen", and "device". > > > > If someone passed "journal_dev=/dev/sdaX" to fuse2fs then I'd want it to > > be able to tell mountfsd "Hey, can you also open /dev/sdaX and put it in > > the store as 'journal_dev'?" Then it just has to wait until the fd shows > > up, and it can continue with the mount process. > > > > Though the "device" argument needn't be a path, so to be fully general > > mountfsd and the fuse server would have to handshake that as well. > > Fwiw, to attach arbitrary metadata to a file descriptor the easiest > thing to do would be to stash both a (fuse server) file descriptor and > then also a memfd via memfd_create() that e.g., can contain all the > server options that you want to store. <nod> I'll keep that in mind when I get to designing those components. Thanks for the input! (I'm still working on stabiling the new fuse4fs server, it's probably going to be a while yet...) --D ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-29 23:38 ` Darrick J. Wong 2025-07-30 14:04 ` Luis Henriques @ 2025-07-31 13:04 ` Theodore Ts'o 2025-07-31 17:38 ` Darrick J. Wong 1 sibling, 1 reply; 46+ messages in thread From: Theodore Ts'o @ 2025-07-31 13:04 UTC (permalink / raw) To: Darrick J. Wong Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > could restart itself. It's unclear if doing so will actually enable us > to clear the condition that caused the failure in the first place, but I > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > aren't totally crazy. I'm trying to understand what the failure scenario is here. Is this if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what is supposed to happen with respect to open files, metadata and data modifications which were in transit, etc.? Sure, fuse2fs could run e2fsck -fy, but if there are dirty inode on the system, that's going potentally to be out of sync, right? What are the recovery semantics that we hope to be able to provide? - Ted ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-31 13:04 ` Theodore Ts'o @ 2025-07-31 17:38 ` Darrick J. Wong 2025-08-01 10:15 ` Luis Henriques 0 siblings, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-07-31 17:38 UTC (permalink / raw) To: Theodore Ts'o Cc: Luis Henriques, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > > > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > could restart itself. It's unclear if doing so will actually enable us > > to clear the condition that caused the failure in the first place, but I > > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > aren't totally crazy. > > I'm trying to understand what the failure scenario is here. Is this > if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > is supposed to happen with respect to open files, metadata and data > modifications which were in transit, etc.? Sure, fuse2fs could run > e2fsck -fy, but if there are dirty inode on the system, that's going > potentally to be out of sync, right? > > What are the recovery semantics that we hope to be able to provide? <echoing what we said on the ext4 call this morning> With iomap, most of the dirty state is in the kernel, so I think the new fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which would initiate GETATTR requests on all the cached inodes to validate that they still exist; and then resend all the unacknowledged requests that were pending at the time. It might be the case that you have to that in the reverse order; I only know enough about the design of fuse to suspect that to be true. Anyhow once those are complete, I think we can resume operations with the surviving inodes. The ones that fail the GETATTR revalidation are fuse_make_bad'd, which effectively revokes them. All of this of course relies on fuse2fs maintaining as little volatile state of its own as possible. I think that means disabling the block cache in the unix io manager, and if we ever implemented delalloc then either we'd have to save the reservations somewhere or I guess you could immediately syncfs the whole filesystem to try to push all the dirty data to disk before we start allowing new free space allocations for new changes. --D > - Ted > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-07-31 17:38 ` Darrick J. Wong @ 2025-08-01 10:15 ` Luis Henriques 2025-08-11 15:43 ` Darrick J. Wong 2025-09-12 10:31 ` Bernd Schubert 0 siblings, 2 replies; 46+ messages in thread From: Luis Henriques @ 2025-08-01 10:15 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Thu, Jul 31 2025, Darrick J. Wong wrote: > On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >> > >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >> > could restart itself. It's unclear if doing so will actually enable us >> > to clear the condition that caused the failure in the first place, but I >> > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >> > aren't totally crazy. >> >> I'm trying to understand what the failure scenario is here. Is this >> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >> is supposed to happen with respect to open files, metadata and data >> modifications which were in transit, etc.? Sure, fuse2fs could run >> e2fsck -fy, but if there are dirty inode on the system, that's going >> potentally to be out of sync, right? >> >> What are the recovery semantics that we hope to be able to provide? > > <echoing what we said on the ext4 call this morning> > > With iomap, most of the dirty state is in the kernel, so I think the new > fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > would initiate GETATTR requests on all the cached inodes to validate > that they still exist; and then resend all the unacknowledged requests > that were pending at the time. It might be the case that you have to > that in the reverse order; I only know enough about the design of fuse > to suspect that to be true. > > Anyhow once those are complete, I think we can resume operations with > the surviving inodes. The ones that fail the GETATTR revalidation are > fuse_make_bad'd, which effectively revokes them. Ah! Interesting, I have been playing a bit with sending LOOKUP requests, but probably GETATTR is a better option. So, are you currently working on any of this? Are you implementing this new NOTIFY_RESTARTED request? I guess it's time for me to have a closer look at fuse2fs too. Cheers, -- Luís > All of this of course relies on fuse2fs maintaining as little volatile > state of its own as possible. I think that means disabling the block > cache in the unix io manager, and if we ever implemented delalloc then > either we'd have to save the reservations somewhere or I guess you could > immediately syncfs the whole filesystem to try to push all the dirty > data to disk before we start allowing new free space allocations for new > changes. > > --D > >> - Ted >> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-08-01 10:15 ` Luis Henriques @ 2025-08-11 15:43 ` Darrick J. Wong 2025-08-13 13:14 ` Luis Henriques 2025-09-12 10:31 ` Bernd Schubert 1 sibling, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-08-11 15:43 UTC (permalink / raw) To: Luis Henriques Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Fri, Aug 01, 2025 at 11:15:26AM +0100, Luis Henriques wrote: > On Thu, Jul 31 2025, Darrick J. Wong wrote: > > > On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >> > > >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >> > could restart itself. It's unclear if doing so will actually enable us > >> > to clear the condition that caused the failure in the first place, but I > >> > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >> > aren't totally crazy. > >> > >> I'm trying to understand what the failure scenario is here. Is this > >> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >> is supposed to happen with respect to open files, metadata and data > >> modifications which were in transit, etc.? Sure, fuse2fs could run > >> e2fsck -fy, but if there are dirty inode on the system, that's going > >> potentally to be out of sync, right? > >> > >> What are the recovery semantics that we hope to be able to provide? > > > > <echoing what we said on the ext4 call this morning> > > > > With iomap, most of the dirty state is in the kernel, so I think the new > > fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > > would initiate GETATTR requests on all the cached inodes to validate > > that they still exist; and then resend all the unacknowledged requests > > that were pending at the time. It might be the case that you have to > > that in the reverse order; I only know enough about the design of fuse > > to suspect that to be true. > > > > Anyhow once those are complete, I think we can resume operations with > > the surviving inodes. The ones that fail the GETATTR revalidation are > > fuse_make_bad'd, which effectively revokes them. > > Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > but probably GETATTR is a better option. > > So, are you currently working on any of this? Are you implementing this > new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > look at fuse2fs too. Nope, right now I'm concentrating on making sure the fuse/iomap IO path works reliably; and converting fuse2fs to be a lowlevel fuse server. Eliminating all the path walking stuff that the highlevel fuse library does reduces the fstests runtime from 7.9 to 3.5h, and turning on iomap cuts that to 2.2h. --D > Cheers, > -- > Luís > > > All of this of course relies on fuse2fs maintaining as little volatile > > state of its own as possible. I think that means disabling the block > > cache in the unix io manager, and if we ever implemented delalloc then > > either we'd have to save the reservations somewhere or I guess you could > > immediately syncfs the whole filesystem to try to push all the dirty > > data to disk before we start allowing new free space allocations for new > > changes. > > > > --D > > > >> - Ted > >> > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-08-11 15:43 ` Darrick J. Wong @ 2025-08-13 13:14 ` Luis Henriques 0 siblings, 0 replies; 46+ messages in thread From: Luis Henriques @ 2025-08-13 13:14 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel On Mon, Aug 11 2025, Darrick J. Wong wrote: > On Fri, Aug 01, 2025 at 11:15:26AM +0100, Luis Henriques wrote: >> On Thu, Jul 31 2025, Darrick J. Wong wrote: >> >> > On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >> >> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >> >> > >> >> > Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >> >> > could restart itself. It's unclear if doing so will actually enable us >> >> > to clear the condition that caused the failure in the first place, but I >> >> > suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >> >> > aren't totally crazy. >> >> >> >> I'm trying to understand what the failure scenario is here. Is this >> >> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >> >> is supposed to happen with respect to open files, metadata and data >> >> modifications which were in transit, etc.? Sure, fuse2fs could run >> >> e2fsck -fy, but if there are dirty inode on the system, that's going >> >> potentally to be out of sync, right? >> >> >> >> What are the recovery semantics that we hope to be able to provide? >> > >> > <echoing what we said on the ext4 call this morning> >> > >> > With iomap, most of the dirty state is in the kernel, so I think the new >> > fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >> > would initiate GETATTR requests on all the cached inodes to validate >> > that they still exist; and then resend all the unacknowledged requests >> > that were pending at the time. It might be the case that you have to >> > that in the reverse order; I only know enough about the design of fuse >> > to suspect that to be true. >> > >> > Anyhow once those are complete, I think we can resume operations with >> > the surviving inodes. The ones that fail the GETATTR revalidation are >> > fuse_make_bad'd, which effectively revokes them. >> >> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >> but probably GETATTR is a better option. >> >> So, are you currently working on any of this? Are you implementing this >> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >> look at fuse2fs too. > > Nope, right now I'm concentrating on making sure the fuse/iomap IO path > works reliably; and converting fuse2fs to be a lowlevel fuse server. Great, thanks for clarifying. > Eliminating all the path walking stuff that the highlevel fuse library > does reduces the fstests runtime from 7.9 to 3.5h, and turning on iomap > cuts that to 2.2h. Wow! those are quite impressive numbers. Looking forward to look into those fuse2fs improvements! Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-08-01 10:15 ` Luis Henriques 2025-08-11 15:43 ` Darrick J. Wong @ 2025-09-12 10:31 ` Bernd Schubert 2025-09-12 11:41 ` Amir Goldstein 1 sibling, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-09-12 10:31 UTC (permalink / raw) To: Luis Henriques, Darrick J. Wong Cc: Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen, Amir Goldstein On 8/1/25 12:15, Luis Henriques wrote: > On Thu, Jul 31 2025, Darrick J. Wong wrote: > >> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>> >>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>> could restart itself. It's unclear if doing so will actually enable us >>>> to clear the condition that caused the failure in the first place, but I >>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>> aren't totally crazy. >>> >>> I'm trying to understand what the failure scenario is here. Is this >>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>> is supposed to happen with respect to open files, metadata and data >>> modifications which were in transit, etc.? Sure, fuse2fs could run >>> e2fsck -fy, but if there are dirty inode on the system, that's going >>> potentally to be out of sync, right? >>> >>> What are the recovery semantics that we hope to be able to provide? >> >> <echoing what we said on the ext4 call this morning> >> >> With iomap, most of the dirty state is in the kernel, so I think the new >> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >> would initiate GETATTR requests on all the cached inodes to validate >> that they still exist; and then resend all the unacknowledged requests >> that were pending at the time. It might be the case that you have to >> that in the reverse order; I only know enough about the design of fuse >> to suspect that to be true. >> >> Anyhow once those are complete, I think we can resume operations with >> the surviving inodes. The ones that fail the GETATTR revalidation are >> fuse_make_bad'd, which effectively revokes them. > > Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > but probably GETATTR is a better option. > > So, are you currently working on any of this? Are you implementing this > new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > look at fuse2fs too. Sorry for joining the discussion late, I was totally occupied, day and night. Added Kevin to CC, who is going to work on recovery on our DDN side. Issue with GETATTR and LOOKUP is that they need a path, but on fuse server restart we want kernel to recover inodes and their lookup count. Now inode recovery might be hard, because we currently only have a 64-bit node-id - which is used my most fuse application as memory pointer. As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends outstanding requests. And that ends up in most cases in sending requests with invalid node-IDs, that are casted and might provoke random memory access on restart. Kind of the same issue why fuse nfs export or open_by_handle_at doesn't work well right now. So IMHO, what we really want is something like FUSE_LOOKUP_FH, which would not return a 64-bit node ID, but a max 128 byte file handle. And then FUSE_REVALIDATE_FH on server restart. The file handles could be stored into the fuse inode and also used for NFS export. I *think* Amir had a similar idea, but I don't find the link quickly. Adding Amir to CC. Our short term plan is to add something like FUSE_NOTIFY_RESTART, which will iterate over all superblock inodes and mark them with fuse_make_bad. Any objections against that? Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 10:31 ` Bernd Schubert @ 2025-09-12 11:41 ` Amir Goldstein 2025-09-12 12:29 ` Bernd Schubert 0 siblings, 1 reply; 46+ messages in thread From: Amir Goldstein @ 2025-09-12 11:41 UTC (permalink / raw) To: Bernd Schubert Cc: Luis Henriques, Darrick J. Wong, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > On 8/1/25 12:15, Luis Henriques wrote: > > On Thu, Jul 31 2025, Darrick J. Wong wrote: > > > >> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>> > >>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>> could restart itself. It's unclear if doing so will actually enable us > >>>> to clear the condition that caused the failure in the first place, but I > >>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>> aren't totally crazy. > >>> > >>> I'm trying to understand what the failure scenario is here. Is this > >>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>> is supposed to happen with respect to open files, metadata and data > >>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>> potentally to be out of sync, right? > >>> > >>> What are the recovery semantics that we hope to be able to provide? > >> > >> <echoing what we said on the ext4 call this morning> > >> > >> With iomap, most of the dirty state is in the kernel, so I think the new > >> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >> would initiate GETATTR requests on all the cached inodes to validate > >> that they still exist; and then resend all the unacknowledged requests > >> that were pending at the time. It might be the case that you have to > >> that in the reverse order; I only know enough about the design of fuse > >> to suspect that to be true. > >> > >> Anyhow once those are complete, I think we can resume operations with > >> the surviving inodes. The ones that fail the GETATTR revalidation are > >> fuse_make_bad'd, which effectively revokes them. > > > > Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > > but probably GETATTR is a better option. > > > > So, are you currently working on any of this? Are you implementing this > > new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > > look at fuse2fs too. > > Sorry for joining the discussion late, I was totally occupied, day and > night. Added Kevin to CC, who is going to work on recovery on our > DDN side. > > Issue with GETATTR and LOOKUP is that they need a path, but on fuse > server restart we want kernel to recover inodes and their lookup count. > Now inode recovery might be hard, because we currently only have a > 64-bit node-id - which is used my most fuse application as memory > pointer. > > As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > outstanding requests. And that ends up in most cases in sending requests > with invalid node-IDs, that are casted and might provoke random memory > access on restart. Kind of the same issue why fuse nfs export or > open_by_handle_at doesn't work well right now. > > So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > would not return a 64-bit node ID, but a max 128 byte file handle. > And then FUSE_REVALIDATE_FH on server restart. > The file handles could be stored into the fuse inode and also used for > NFS export. > > I *think* Amir had a similar idea, but I don't find the link quickly. > Adding Amir to CC. Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > will iterate over all superblock inodes and mark them with fuse_make_bad. > Any objections against that? IDK, it seems much more ugly than implementing LOOKUP_HANDLE and I am not sure that LOOKUP_HANDLE is that hard to implement, when comparing to this alternative. I mean a restartable server is going to be a new implementation anyway, right? So it makes sense to start with a cleaner and more adequate protocol, does it not? Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 11:41 ` Amir Goldstein @ 2025-09-12 12:29 ` Bernd Schubert 2025-09-12 14:58 ` Darrick J. Wong 0 siblings, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-09-12 12:29 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Henriques, Darrick J. Wong, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On 9/12/25 13:41, Amir Goldstein wrote: > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >> >> >> >> On 8/1/25 12:15, Luis Henriques wrote: >>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >>> >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>>>> >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>>>> could restart itself. It's unclear if doing so will actually enable us >>>>>> to clear the condition that caused the failure in the first place, but I >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>>>> aren't totally crazy. >>>>> >>>>> I'm trying to understand what the failure scenario is here. Is this >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>>>> is supposed to happen with respect to open files, metadata and data >>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >>>>> potentally to be out of sync, right? >>>>> >>>>> What are the recovery semantics that we hope to be able to provide? >>>> >>>> <echoing what we said on the ext4 call this morning> >>>> >>>> With iomap, most of the dirty state is in the kernel, so I think the new >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >>>> would initiate GETATTR requests on all the cached inodes to validate >>>> that they still exist; and then resend all the unacknowledged requests >>>> that were pending at the time. It might be the case that you have to >>>> that in the reverse order; I only know enough about the design of fuse >>>> to suspect that to be true. >>>> >>>> Anyhow once those are complete, I think we can resume operations with >>>> the surviving inodes. The ones that fail the GETATTR revalidation are >>>> fuse_make_bad'd, which effectively revokes them. >>> >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >>> but probably GETATTR is a better option. >>> >>> So, are you currently working on any of this? Are you implementing this >>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >>> look at fuse2fs too. >> >> Sorry for joining the discussion late, I was totally occupied, day and >> night. Added Kevin to CC, who is going to work on recovery on our >> DDN side. >> >> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >> server restart we want kernel to recover inodes and their lookup count. >> Now inode recovery might be hard, because we currently only have a >> 64-bit node-id - which is used my most fuse application as memory >> pointer. >> >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >> outstanding requests. And that ends up in most cases in sending requests >> with invalid node-IDs, that are casted and might provoke random memory >> access on restart. Kind of the same issue why fuse nfs export or >> open_by_handle_at doesn't work well right now. >> >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >> would not return a 64-bit node ID, but a max 128 byte file handle. >> And then FUSE_REVALIDATE_FH on server restart. >> The file handles could be stored into the fuse inode and also used for >> NFS export. >> >> I *think* Amir had a similar idea, but I don't find the link quickly. >> Adding Amir to CC. > > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ Thanks for the reference Amir! I even had been in that thread. > >> >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >> will iterate over all superblock inodes and mark them with fuse_make_bad. >> Any objections against that? > > IDK, it seems much more ugly than implementing LOOKUP_HANDLE > and I am not sure that LOOKUP_HANDLE is that hard to implement, when > comparing to this alternative. > > I mean a restartable server is going to be a new implementation anyway, right? > So it makes sense to start with a cleaner and more adequate protocol, > does it not? Definitely, if we agree on the approach on LOOKUP_HANDLE and using it for recovery, adding that op seems simple. And reading through the thread you had posted above, just the implementation was missing. So let's go ahead to do this approach. Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 12:29 ` Bernd Schubert @ 2025-09-12 14:58 ` Darrick J. Wong 2025-09-12 15:20 ` Bernd Schubert 2025-09-15 7:07 ` Amir Goldstein 0 siblings, 2 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-09-12 14:58 UTC (permalink / raw) To: Bernd Schubert Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > > > On 9/12/25 13:41, Amir Goldstein wrote: > > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >> > >> > >> > >> On 8/1/25 12:15, Luis Henriques wrote: > >>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >>> > >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>>>> > >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>>>> could restart itself. It's unclear if doing so will actually enable us > >>>>>> to clear the condition that caused the failure in the first place, but I > >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>>>> aren't totally crazy. > >>>>> > >>>>> I'm trying to understand what the failure scenario is here. Is this > >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>>>> is supposed to happen with respect to open files, metadata and data > >>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>>>> potentally to be out of sync, right? > >>>>> > >>>>> What are the recovery semantics that we hope to be able to provide? > >>>> > >>>> <echoing what we said on the ext4 call this morning> > >>>> > >>>> With iomap, most of the dirty state is in the kernel, so I think the new > >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >>>> would initiate GETATTR requests on all the cached inodes to validate > >>>> that they still exist; and then resend all the unacknowledged requests > >>>> that were pending at the time. It might be the case that you have to > >>>> that in the reverse order; I only know enough about the design of fuse > >>>> to suspect that to be true. > >>>> > >>>> Anyhow once those are complete, I think we can resume operations with > >>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >>>> fuse_make_bad'd, which effectively revokes them. > >>> > >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >>> but probably GETATTR is a better option. > >>> > >>> So, are you currently working on any of this? Are you implementing this > >>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >>> look at fuse2fs too. > >> > >> Sorry for joining the discussion late, I was totally occupied, day and > >> night. Added Kevin to CC, who is going to work on recovery on our > >> DDN side. > >> > >> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >> server restart we want kernel to recover inodes and their lookup count. > >> Now inode recovery might be hard, because we currently only have a > >> 64-bit node-id - which is used my most fuse application as memory > >> pointer. > >> > >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >> outstanding requests. And that ends up in most cases in sending requests > >> with invalid node-IDs, that are casted and might provoke random memory > >> access on restart. Kind of the same issue why fuse nfs export or > >> open_by_handle_at doesn't work well right now. > >> > >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >> would not return a 64-bit node ID, but a max 128 byte file handle. > >> And then FUSE_REVALIDATE_FH on server restart. > >> The file handles could be stored into the fuse inode and also used for > >> NFS export. > >> > >> I *think* Amir had a similar idea, but I don't find the link quickly. > >> Adding Amir to CC. > > > > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > Thanks for the reference Amir! I even had been in that thread. > > > > >> > >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >> will iterate over all superblock inodes and mark them with fuse_make_bad. > >> Any objections against that? What if you actually /can/ reuse a nodeid after a restart? Consider fuse4fs, where the nodeid is the on-disk inode number. After a restart, you can reconnect the fuse_inode to the ondisk inode, assuming recovery didn't delete it, obviously. I suppose you could just ask for refreshed stat information and either the server gives it to you and the fuse_inode lives; or the server returns ENOENT and then we mark it bad. But I'd have to see code patches to form a real opinion. It's very nice of fuse to have implemented revoke() ;) --D > > IDK, it seems much more ugly than implementing LOOKUP_HANDLE > > and I am not sure that LOOKUP_HANDLE is that hard to implement, when > > comparing to this alternative. > > > > I mean a restartable server is going to be a new implementation anyway, right? > > So it makes sense to start with a cleaner and more adequate protocol, > > does it not? > > Definitely, if we agree on the approach on LOOKUP_HANDLE and using it > for recovery, adding that op seems simple. And reading through the > thread you had posted above, just the implementation was missing. > So let's go ahead to do this approach. > > > Thanks, > Bernd > > > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 14:58 ` Darrick J. Wong @ 2025-09-12 15:20 ` Bernd Schubert 2025-09-15 4:43 ` Darrick J. Wong 2025-09-15 7:07 ` Amir Goldstein 1 sibling, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-09-12 15:20 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On 9/12/25 16:58, Darrick J. Wong wrote: > On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >> >> >> On 9/12/25 13:41, Amir Goldstein wrote: >>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >>>> >>>> >>>> >>>> On 8/1/25 12:15, Luis Henriques wrote: >>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >>>>> >>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>>>>>> >>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>>>>>> could restart itself. It's unclear if doing so will actually enable us >>>>>>>> to clear the condition that caused the failure in the first place, but I >>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>>>>>> aren't totally crazy. >>>>>>> >>>>>>> I'm trying to understand what the failure scenario is here. Is this >>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>>>>>> is supposed to happen with respect to open files, metadata and data >>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >>>>>>> potentally to be out of sync, right? >>>>>>> >>>>>>> What are the recovery semantics that we hope to be able to provide? >>>>>> >>>>>> <echoing what we said on the ext4 call this morning> >>>>>> >>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >>>>>> would initiate GETATTR requests on all the cached inodes to validate >>>>>> that they still exist; and then resend all the unacknowledged requests >>>>>> that were pending at the time. It might be the case that you have to >>>>>> that in the reverse order; I only know enough about the design of fuse >>>>>> to suspect that to be true. >>>>>> >>>>>> Anyhow once those are complete, I think we can resume operations with >>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >>>>>> fuse_make_bad'd, which effectively revokes them. >>>>> >>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >>>>> but probably GETATTR is a better option. >>>>> >>>>> So, are you currently working on any of this? Are you implementing this >>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >>>>> look at fuse2fs too. >>>> >>>> Sorry for joining the discussion late, I was totally occupied, day and >>>> night. Added Kevin to CC, who is going to work on recovery on our >>>> DDN side. >>>> >>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >>>> server restart we want kernel to recover inodes and their lookup count. >>>> Now inode recovery might be hard, because we currently only have a >>>> 64-bit node-id - which is used my most fuse application as memory >>>> pointer. >>>> >>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >>>> outstanding requests. And that ends up in most cases in sending requests >>>> with invalid node-IDs, that are casted and might provoke random memory >>>> access on restart. Kind of the same issue why fuse nfs export or >>>> open_by_handle_at doesn't work well right now. >>>> >>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >>>> would not return a 64-bit node ID, but a max 128 byte file handle. >>>> And then FUSE_REVALIDATE_FH on server restart. >>>> The file handles could be stored into the fuse inode and also used for >>>> NFS export. >>>> >>>> I *think* Amir had a similar idea, but I don't find the link quickly. >>>> Adding Amir to CC. >>> >>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >> >> Thanks for the reference Amir! I even had been in that thread. >> >>> >>>> >>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >>>> Any objections against that? > > What if you actually /can/ reuse a nodeid after a restart? Consider > fuse4fs, where the nodeid is the on-disk inode number. After a restart, > you can reconnect the fuse_inode to the ondisk inode, assuming recovery > didn't delete it, obviously. > > I suppose you could just ask for refreshed stat information and either > the server gives it to you and the fuse_inode lives; or the server > returns ENOENT and then we mark it bad. But I'd have to see code > patches to form a real opinion. > > It's very nice of fuse to have implemented revoke() ;) Assuming you would run with an attr cache timeout equal 0 the existing NOTIFY_RESEND would be enough for fuse4fs? Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 15:20 ` Bernd Schubert @ 2025-09-15 4:43 ` Darrick J. Wong 0 siblings, 0 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-09-15 4:43 UTC (permalink / raw) To: Bernd Schubert Cc: Amir Goldstein, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Fri, Sep 12, 2025 at 05:20:58PM +0200, Bernd Schubert wrote: > > > On 9/12/25 16:58, Darrick J. Wong wrote: > > On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >> > >> > >> On 9/12/25 13:41, Amir Goldstein wrote: > >>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>> > >>>> > >>>> > >>>> On 8/1/25 12:15, Luis Henriques wrote: > >>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >>>>> > >>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>>>>>> > >>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >>>>>>>> to clear the condition that caused the failure in the first place, but I > >>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>>>>>> aren't totally crazy. > >>>>>>> > >>>>>>> I'm trying to understand what the failure scenario is here. Is this > >>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>>>>>> is supposed to happen with respect to open files, metadata and data > >>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>>>>>> potentally to be out of sync, right? > >>>>>>> > >>>>>>> What are the recovery semantics that we hope to be able to provide? > >>>>>> > >>>>>> <echoing what we said on the ext4 call this morning> > >>>>>> > >>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >>>>>> would initiate GETATTR requests on all the cached inodes to validate > >>>>>> that they still exist; and then resend all the unacknowledged requests > >>>>>> that were pending at the time. It might be the case that you have to > >>>>>> that in the reverse order; I only know enough about the design of fuse > >>>>>> to suspect that to be true. > >>>>>> > >>>>>> Anyhow once those are complete, I think we can resume operations with > >>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >>>>>> fuse_make_bad'd, which effectively revokes them. > >>>>> > >>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >>>>> but probably GETATTR is a better option. > >>>>> > >>>>> So, are you currently working on any of this? Are you implementing this > >>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >>>>> look at fuse2fs too. > >>>> > >>>> Sorry for joining the discussion late, I was totally occupied, day and > >>>> night. Added Kevin to CC, who is going to work on recovery on our > >>>> DDN side. > >>>> > >>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >>>> server restart we want kernel to recover inodes and their lookup count. > >>>> Now inode recovery might be hard, because we currently only have a > >>>> 64-bit node-id - which is used my most fuse application as memory > >>>> pointer. > >>>> > >>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >>>> outstanding requests. And that ends up in most cases in sending requests > >>>> with invalid node-IDs, that are casted and might provoke random memory > >>>> access on restart. Kind of the same issue why fuse nfs export or > >>>> open_by_handle_at doesn't work well right now. > >>>> > >>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >>>> And then FUSE_REVALIDATE_FH on server restart. > >>>> The file handles could be stored into the fuse inode and also used for > >>>> NFS export. > >>>> > >>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >>>> Adding Amir to CC. > >>> > >>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >> > >> Thanks for the reference Amir! I even had been in that thread. > >> > >>> > >>>> > >>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >>>> Any objections against that? > > > > What if you actually /can/ reuse a nodeid after a restart? Consider > > fuse4fs, where the nodeid is the on-disk inode number. After a restart, > > you can reconnect the fuse_inode to the ondisk inode, assuming recovery > > didn't delete it, obviously. > > > > I suppose you could just ask for refreshed stat information and either > > the server gives it to you and the fuse_inode lives; or the server > > returns ENOENT and then we mark it bad. But I'd have to see code > > patches to form a real opinion. > > > > It's very nice of fuse to have implemented revoke() ;) > > > Assuming you would run with an attr cache timeout equal 0 the existing > NOTIFY_RESEND would be enough for fuse4fs? That brings up some good questions. Yes, fuse4fs sets an attr cache timeout of 0, but (a) would it actually be useful to set it to a higher value to reduce round trips? And (b) shouldn't a restart trigger a revalidation regardless? --D > > Thanks, > Bernd > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-12 14:58 ` Darrick J. Wong 2025-09-12 15:20 ` Bernd Schubert @ 2025-09-15 7:07 ` Amir Goldstein 2025-09-15 8:27 ` Bernd Schubert 1 sibling, 1 reply; 46+ messages in thread From: Amir Goldstein @ 2025-09-15 7:07 UTC (permalink / raw) To: Darrick J. Wong Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > > > > > > On 9/12/25 13:41, Amir Goldstein wrote: > > > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > > >> > > >> > > >> > > >> On 8/1/25 12:15, Luis Henriques wrote: > > >>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > > >>> > > >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > > >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > >>>>>> > > >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > >>>>>> could restart itself. It's unclear if doing so will actually enable us > > >>>>>> to clear the condition that caused the failure in the first place, but I > > >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > >>>>>> aren't totally crazy. > > >>>>> > > >>>>> I'm trying to understand what the failure scenario is here. Is this > > >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > > >>>>> is supposed to happen with respect to open files, metadata and data > > >>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > > >>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > > >>>>> potentally to be out of sync, right? > > >>>>> > > >>>>> What are the recovery semantics that we hope to be able to provide? > > >>>> > > >>>> <echoing what we said on the ext4 call this morning> > > >>>> > > >>>> With iomap, most of the dirty state is in the kernel, so I think the new > > >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > > >>>> would initiate GETATTR requests on all the cached inodes to validate > > >>>> that they still exist; and then resend all the unacknowledged requests > > >>>> that were pending at the time. It might be the case that you have to > > >>>> that in the reverse order; I only know enough about the design of fuse > > >>>> to suspect that to be true. > > >>>> > > >>>> Anyhow once those are complete, I think we can resume operations with > > >>>> the surviving inodes. The ones that fail the GETATTR revalidation are > > >>>> fuse_make_bad'd, which effectively revokes them. > > >>> > > >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > > >>> but probably GETATTR is a better option. > > >>> > > >>> So, are you currently working on any of this? Are you implementing this > > >>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > > >>> look at fuse2fs too. > > >> > > >> Sorry for joining the discussion late, I was totally occupied, day and > > >> night. Added Kevin to CC, who is going to work on recovery on our > > >> DDN side. > > >> > > >> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > > >> server restart we want kernel to recover inodes and their lookup count. > > >> Now inode recovery might be hard, because we currently only have a > > >> 64-bit node-id - which is used my most fuse application as memory > > >> pointer. > > >> > > >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > > >> outstanding requests. And that ends up in most cases in sending requests > > >> with invalid node-IDs, that are casted and might provoke random memory > > >> access on restart. Kind of the same issue why fuse nfs export or > > >> open_by_handle_at doesn't work well right now. > > >> > > >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > > >> would not return a 64-bit node ID, but a max 128 byte file handle. > > >> And then FUSE_REVALIDATE_FH on server restart. > > >> The file handles could be stored into the fuse inode and also used for > > >> NFS export. > > >> > > >> I *think* Amir had a similar idea, but I don't find the link quickly. > > >> Adding Amir to CC. > > > > > > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > > > Thanks for the reference Amir! I even had been in that thread. > > > > > > > >> > > >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > > >> will iterate over all superblock inodes and mark them with fuse_make_bad. > > >> Any objections against that? > > What if you actually /can/ reuse a nodeid after a restart? Consider > fuse4fs, where the nodeid is the on-disk inode number. After a restart, > you can reconnect the fuse_inode to the ondisk inode, assuming recovery > didn't delete it, obviously. FUSE_LOOKUP_HANDLE is a contract. If fuse4fs can reuse nodeid after restart then by all means, it should sign this contract, otherwise there is no way for client to know that the nodeids are persistent. If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() API trivial. > > I suppose you could just ask for refreshed stat information and either > the server gives it to you and the fuse_inode lives; or the server > returns ENOENT and then we mark it bad. But I'd have to see code > patches to form a real opinion. > You could make fuse4fs_handle := <nodeid:fuse_instance_id> where fuse_instance_id can be its start time or random number. for auto invalidate, or maybe the fuse_instance_id should be a native part of FUSE protocol so that client knows to only invalidate attr cache in case of fuse_instance_id change? In any case, instead of a storm of revalidate messages after server restart, do it lazily on demand. Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-15 7:07 ` Amir Goldstein @ 2025-09-15 8:27 ` Bernd Schubert 2025-09-15 8:41 ` Amir Goldstein 0 siblings, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-09-15 8:27 UTC (permalink / raw) To: Amir Goldstein, Darrick J. Wong Cc: Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On 9/15/25 09:07, Amir Goldstein wrote: > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: >> >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >>> >>> >>> On 9/12/25 13:41, Amir Goldstein wrote: >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >>>>> >>>>> >>>>> >>>>> On 8/1/25 12:15, Luis Henriques wrote: >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >>>>>> >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>>>>>>> >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us >>>>>>>>> to clear the condition that caused the failure in the first place, but I >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>>>>>>> aren't totally crazy. >>>>>>>> >>>>>>>> I'm trying to understand what the failure scenario is here. Is this >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>>>>>>> is supposed to happen with respect to open files, metadata and data >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >>>>>>>> potentally to be out of sync, right? >>>>>>>> >>>>>>>> What are the recovery semantics that we hope to be able to provide? >>>>>>> >>>>>>> <echoing what we said on the ext4 call this morning> >>>>>>> >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >>>>>>> would initiate GETATTR requests on all the cached inodes to validate >>>>>>> that they still exist; and then resend all the unacknowledged requests >>>>>>> that were pending at the time. It might be the case that you have to >>>>>>> that in the reverse order; I only know enough about the design of fuse >>>>>>> to suspect that to be true. >>>>>>> >>>>>>> Anyhow once those are complete, I think we can resume operations with >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >>>>>>> fuse_make_bad'd, which effectively revokes them. >>>>>> >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >>>>>> but probably GETATTR is a better option. >>>>>> >>>>>> So, are you currently working on any of this? Are you implementing this >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >>>>>> look at fuse2fs too. >>>>> >>>>> Sorry for joining the discussion late, I was totally occupied, day and >>>>> night. Added Kevin to CC, who is going to work on recovery on our >>>>> DDN side. >>>>> >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >>>>> server restart we want kernel to recover inodes and their lookup count. >>>>> Now inode recovery might be hard, because we currently only have a >>>>> 64-bit node-id - which is used my most fuse application as memory >>>>> pointer. >>>>> >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >>>>> outstanding requests. And that ends up in most cases in sending requests >>>>> with invalid node-IDs, that are casted and might provoke random memory >>>>> access on restart. Kind of the same issue why fuse nfs export or >>>>> open_by_handle_at doesn't work well right now. >>>>> >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. >>>>> And then FUSE_REVALIDATE_FH on server restart. >>>>> The file handles could be stored into the fuse inode and also used for >>>>> NFS export. >>>>> >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. >>>>> Adding Amir to CC. >>>> >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >>> >>> Thanks for the reference Amir! I even had been in that thread. >>> >>>> >>>>> >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >>>>> Any objections against that? >> >> What if you actually /can/ reuse a nodeid after a restart? Consider >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery >> didn't delete it, obviously. > > FUSE_LOOKUP_HANDLE is a contract. > If fuse4fs can reuse nodeid after restart then by all means, it should sign > this contract, otherwise there is no way for client to know that the > nodeids are persistent. > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > API trivial. > >> >> I suppose you could just ask for refreshed stat information and either >> the server gives it to you and the fuse_inode lives; or the server >> returns ENOENT and then we mark it bad. But I'd have to see code >> patches to form a real opinion. >> > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > where fuse_instance_id can be its start time or random number. > for auto invalidate, or maybe the fuse_instance_id should be > a native part of FUSE protocol so that client knows to only invalidate > attr cache in case of fuse_instance_id change? > > In any case, instead of a storm of revalidate messages after > server restart, do it lazily on demand. For a network file system, probably. For fuse4fs or other block based file systems, not sure. Darrick has the example of fsck. Let's assume fuse4fs runs with attribute and dentry timeouts > 0, fuse-server gets restarted, fsck'ed and some files get removed. Now reading these inodes would still work - wouldn't it be better to invalidate the cache before going into operation again? Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-15 8:27 ` Bernd Schubert @ 2025-09-15 8:41 ` Amir Goldstein 2025-09-16 2:53 ` Darrick J. Wong 0 siblings, 1 reply; 46+ messages in thread From: Amir Goldstein @ 2025-09-15 8:41 UTC (permalink / raw) To: Bernd Schubert Cc: Darrick J. Wong, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > On 9/15/25 09:07, Amir Goldstein wrote: > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > >> > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >>> > >>> > >>> On 9/12/25 13:41, Amir Goldstein wrote: > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >>>>>> > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>>>>>>> > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>>>>>>> aren't totally crazy. > >>>>>>>> > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>>>>>>> is supposed to happen with respect to open files, metadata and data > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>>>>>>> potentally to be out of sync, right? > >>>>>>>> > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > >>>>>>> > >>>>>>> <echoing what we said on the ext4 call this morning> > >>>>>>> > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > >>>>>>> that they still exist; and then resend all the unacknowledged requests > >>>>>>> that were pending at the time. It might be the case that you have to > >>>>>>> that in the reverse order; I only know enough about the design of fuse > >>>>>>> to suspect that to be true. > >>>>>>> > >>>>>>> Anyhow once those are complete, I think we can resume operations with > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >>>>>>> fuse_make_bad'd, which effectively revokes them. > >>>>>> > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >>>>>> but probably GETATTR is a better option. > >>>>>> > >>>>>> So, are you currently working on any of this? Are you implementing this > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >>>>>> look at fuse2fs too. > >>>>> > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > >>>>> DDN side. > >>>>> > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >>>>> server restart we want kernel to recover inodes and their lookup count. > >>>>> Now inode recovery might be hard, because we currently only have a > >>>>> 64-bit node-id - which is used my most fuse application as memory > >>>>> pointer. > >>>>> > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >>>>> outstanding requests. And that ends up in most cases in sending requests > >>>>> with invalid node-IDs, that are casted and might provoke random memory > >>>>> access on restart. Kind of the same issue why fuse nfs export or > >>>>> open_by_handle_at doesn't work well right now. > >>>>> > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >>>>> And then FUSE_REVALIDATE_FH on server restart. > >>>>> The file handles could be stored into the fuse inode and also used for > >>>>> NFS export. > >>>>> > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >>>>> Adding Amir to CC. > >>>> > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >>> > >>> Thanks for the reference Amir! I even had been in that thread. > >>> > >>>> > >>>>> > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >>>>> Any objections against that? > >> > >> What if you actually /can/ reuse a nodeid after a restart? Consider > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > >> didn't delete it, obviously. > > > > FUSE_LOOKUP_HANDLE is a contract. > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > > this contract, otherwise there is no way for client to know that the > > nodeids are persistent. > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > > API trivial. > > > >> > >> I suppose you could just ask for refreshed stat information and either > >> the server gives it to you and the fuse_inode lives; or the server > >> returns ENOENT and then we mark it bad. But I'd have to see code > >> patches to form a real opinion. > >> > > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > > where fuse_instance_id can be its start time or random number. > > for auto invalidate, or maybe the fuse_instance_id should be > > a native part of FUSE protocol so that client knows to only invalidate > > attr cache in case of fuse_instance_id change? > > > > In any case, instead of a storm of revalidate messages after > > server restart, do it lazily on demand. > > For a network file system, probably. For fuse4fs or other block > based file systems, not sure. Darrick has the example of fsck. > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > fuse-server gets restarted, fsck'ed and some files get removed. > Now reading these inodes would still work - wouldn't it > be better to invalidate the cache before going into operation > again? Forgive me, I was making a wrong assumption that fuse4fs was using ext4 filehandle as nodeid, but of course it does not. The reason I made this wrong assumption is because fuse4fs *can* already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol which is what my fuse passthough library [1] does. My claim was that although fuse4fs could support safe restart, which cannot read from recycled inode number with current FUSE protocol, doing so with FUSE_HANDLE protocol would express a commitment to this behavior. Thanks, Amir. [1] https://github.com/amir73il/libfuse/commits/fuse_passthrough ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-15 8:41 ` Amir Goldstein @ 2025-09-16 2:53 ` Darrick J. Wong 2025-09-16 7:59 ` Amir Goldstein 0 siblings, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-09-16 2:53 UTC (permalink / raw) To: Amir Goldstein Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > > > > > On 9/15/25 09:07, Amir Goldstein wrote: > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > > >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > > >>> > > >>> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > > >>>>> > > >>>>> > > >>>>> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > > >>>>>> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > >>>>>>>>> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > >>>>>>>>> aren't totally crazy. > > >>>>>>>> > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > > >>>>>>>> is supposed to happen with respect to open files, metadata and data > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > > >>>>>>>> potentally to be out of sync, right? > > >>>>>>>> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > > >>>>>>> > > >>>>>>> <echoing what we said on the ext4 call this morning> > > >>>>>>> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > > >>>>>>> that they still exist; and then resend all the unacknowledged requests > > >>>>>>> that were pending at the time. It might be the case that you have to > > >>>>>>> that in the reverse order; I only know enough about the design of fuse > > >>>>>>> to suspect that to be true. > > >>>>>>> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > > >>>>>> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > > >>>>>> but probably GETATTR is a better option. > > >>>>>> > > >>>>>> So, are you currently working on any of this? Are you implementing this > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > > >>>>>> look at fuse2fs too. > > >>>>> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > > >>>>> DDN side. > > >>>>> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > > >>>>> server restart we want kernel to recover inodes and their lookup count. > > >>>>> Now inode recovery might be hard, because we currently only have a > > >>>>> 64-bit node-id - which is used my most fuse application as memory > > >>>>> pointer. > > >>>>> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > > >>>>> outstanding requests. And that ends up in most cases in sending requests > > >>>>> with invalid node-IDs, that are casted and might provoke random memory > > >>>>> access on restart. Kind of the same issue why fuse nfs export or > > >>>>> open_by_handle_at doesn't work well right now. > > >>>>> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > > >>>>> The file handles could be stored into the fuse inode and also used for > > >>>>> NFS export. > > >>>>> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > > >>>>> Adding Amir to CC. > > >>>> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > >>> > > >>> Thanks for the reference Amir! I even had been in that thread. > > >>> > > >>>> > > >>>>> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > > >>>>> Any objections against that? > > >> > > >> What if you actually /can/ reuse a nodeid after a restart? Consider > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > > >> didn't delete it, obviously. > > > > > > FUSE_LOOKUP_HANDLE is a contract. > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > > > this contract, otherwise there is no way for client to know that the > > > nodeids are persistent. > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > > > API trivial. > > > > > >> > > >> I suppose you could just ask for refreshed stat information and either > > >> the server gives it to you and the fuse_inode lives; or the server > > >> returns ENOENT and then we mark it bad. But I'd have to see code > > >> patches to form a real opinion. > > >> > > > > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > > > where fuse_instance_id can be its start time or random number. > > > for auto invalidate, or maybe the fuse_instance_id should be > > > a native part of FUSE protocol so that client knows to only invalidate > > > attr cache in case of fuse_instance_id change? > > > > > > In any case, instead of a storm of revalidate messages after > > > server restart, do it lazily on demand. > > > > For a network file system, probably. For fuse4fs or other block > > based file systems, not sure. Darrick has the example of fsck. > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > > fuse-server gets restarted, fsck'ed and some files get removed. > > Now reading these inodes would still work - wouldn't it > > be better to invalidate the cache before going into operation > > again? > > Forgive me, I was making a wrong assumption that fuse4fs > was using ext4 filehandle as nodeid, but of course it does not. Well now that you mention it, there /is/ a risk of shenanigans like that. Consider: 1) fuse4fs mount an ext4 filesystem 2) crash the fuse4fs server <fuse4fs server restart stalls...> 3) e2fsck -fy /dev/XXX deletes inode 17 4) someone else mounts the fs, makes some changes that result in 17 being reallocated, user says "OOOOOPS", unmounts it 5) fuse4fs server finally restarts, and reconnects to the kernel Hey, inode 17 is now a different file!! So maybe the nodeid has to be an actual file handle. Oh wait, no, everything's (potentially) fine because fuse4fs supplied i_generation to the kernel, and fuse_stale_inode will mark it bad if that happens. Hm ok then, at least there's a way out. :) > The reason I made this wrong assumption is because fuse4fs *can* > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > which is what my fuse passthough library [1] does. > > My claim was that although fuse4fs could support safe restart, which > cannot read from recycled inode number with current FUSE protocol, > doing so with FUSE_HANDLE protocol would express a commitment Pardon my naïvete, but what is FUSE_HANDLE? $ git grep -w FUSE_HANDLE fs $ --D > Thanks, > Amir. > > [1] https://github.com/amir73il/libfuse/commits/fuse_passthrough > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-16 2:53 ` Darrick J. Wong @ 2025-09-16 7:59 ` Amir Goldstein 2025-09-18 17:50 ` Darrick J. Wong 2025-11-04 11:40 ` Luis Henriques 0 siblings, 2 replies; 46+ messages in thread From: Amir Goldstein @ 2025-09-16 7:59 UTC (permalink / raw) To: Darrick J. Wong Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > > On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > > > > > > > > > On 9/15/25 09:07, Amir Goldstein wrote: > > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > >> > > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > > > >>> > > > >>> > > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > > > >>>>>> > > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > > >>>>>>>>> > > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > > >>>>>>>>> aren't totally crazy. > > > >>>>>>>> > > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > > > >>>>>>>> is supposed to happen with respect to open files, metadata and data > > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > > > >>>>>>>> potentally to be out of sync, right? > > > >>>>>>>> > > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > > > >>>>>>> > > > >>>>>>> <echoing what we said on the ext4 call this morning> > > > >>>>>>> > > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > > > >>>>>>> that they still exist; and then resend all the unacknowledged requests > > > >>>>>>> that were pending at the time. It might be the case that you have to > > > >>>>>>> that in the reverse order; I only know enough about the design of fuse > > > >>>>>>> to suspect that to be true. > > > >>>>>>> > > > >>>>>>> Anyhow once those are complete, I think we can resume operations with > > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > > > >>>>>> > > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > > > >>>>>> but probably GETATTR is a better option. > > > >>>>>> > > > >>>>>> So, are you currently working on any of this? Are you implementing this > > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > > > >>>>>> look at fuse2fs too. > > > >>>>> > > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > > > >>>>> DDN side. > > > >>>>> > > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > > > >>>>> server restart we want kernel to recover inodes and their lookup count. > > > >>>>> Now inode recovery might be hard, because we currently only have a > > > >>>>> 64-bit node-id - which is used my most fuse application as memory > > > >>>>> pointer. > > > >>>>> > > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > > > >>>>> outstanding requests. And that ends up in most cases in sending requests > > > >>>>> with invalid node-IDs, that are casted and might provoke random memory > > > >>>>> access on restart. Kind of the same issue why fuse nfs export or > > > >>>>> open_by_handle_at doesn't work well right now. > > > >>>>> > > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > > > >>>>> The file handles could be stored into the fuse inode and also used for > > > >>>>> NFS export. > > > >>>>> > > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > > > >>>>> Adding Amir to CC. > > > >>>> > > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > > >>> > > > >>> Thanks for the reference Amir! I even had been in that thread. > > > >>> > > > >>>> > > > >>>>> > > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > > > >>>>> Any objections against that? > > > >> > > > >> What if you actually /can/ reuse a nodeid after a restart? Consider > > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > > > >> didn't delete it, obviously. > > > > > > > > FUSE_LOOKUP_HANDLE is a contract. > > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > > > > this contract, otherwise there is no way for client to know that the > > > > nodeids are persistent. > > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > > > > API trivial. > > > > > > > >> > > > >> I suppose you could just ask for refreshed stat information and either > > > >> the server gives it to you and the fuse_inode lives; or the server > > > >> returns ENOENT and then we mark it bad. But I'd have to see code > > > >> patches to form a real opinion. > > > >> > > > > > > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > > > > where fuse_instance_id can be its start time or random number. > > > > for auto invalidate, or maybe the fuse_instance_id should be > > > > a native part of FUSE protocol so that client knows to only invalidate > > > > attr cache in case of fuse_instance_id change? > > > > > > > > In any case, instead of a storm of revalidate messages after > > > > server restart, do it lazily on demand. > > > > > > For a network file system, probably. For fuse4fs or other block > > > based file systems, not sure. Darrick has the example of fsck. > > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > > > fuse-server gets restarted, fsck'ed and some files get removed. > > > Now reading these inodes would still work - wouldn't it > > > be better to invalidate the cache before going into operation > > > again? > > > > Forgive me, I was making a wrong assumption that fuse4fs > > was using ext4 filehandle as nodeid, but of course it does not. > > Well now that you mention it, there /is/ a risk of shenanigans like > that. Consider: > > 1) fuse4fs mount an ext4 filesystem > 2) crash the fuse4fs server > <fuse4fs server restart stalls...> > 3) e2fsck -fy /dev/XXX deletes inode 17 > 4) someone else mounts the fs, makes some changes that result in 17 > being reallocated, user says "OOOOOPS", unmounts it > 5) fuse4fs server finally restarts, and reconnects to the kernel > > Hey, inode 17 is now a different file!! > > So maybe the nodeid has to be an actual file handle. Oh wait, no, > everything's (potentially) fine because fuse4fs supplied i_generation to > the kernel, and fuse_stale_inode will mark it bad if that happens. > > Hm ok then, at least there's a way out. :) > Right. > > The reason I made this wrong assumption is because fuse4fs *can* > > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > > which is what my fuse passthough library [1] does. > > > > My claim was that although fuse4fs could support safe restart, which > > cannot read from recycled inode number with current FUSE protocol, > > doing so with FUSE_HANDLE protocol would express a commitment > > Pardon my naïvete, but what is FUSE_HANDLE? > > $ git grep -w FUSE_HANDLE fs > $ Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ Which means to communicate a variable sized "nodeid" which can also be declared as an object id that survives server restart. Basically, the reason that I brought up LOOKUP_HANDLE is to properly support NFS export of fuse filesystems. My incentive was to support a proper fuse server restart/remount/re-export with the same fsid in /etc/exports, but this gives us a better starting point for fuse server restart/re-connect. Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-16 7:59 ` Amir Goldstein @ 2025-09-18 17:50 ` Darrick J. Wong 2025-11-04 11:40 ` Luis Henriques 1 sibling, 0 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-09-18 17:50 UTC (permalink / raw) To: Amir Goldstein Cc: Bernd Schubert, Luis Henriques, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Sep 16, 2025 at 09:59:36AM +0200, Amir Goldstein wrote: > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > > > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > > > > > > > > > > > > > On 9/15/25 09:07, Amir Goldstein wrote: > > > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > >> > > > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > > > > >>> > > > > >>> > > > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > > > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > > > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > > > > >>>>>> > > > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > > > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > > > > >>>>>>>>> > > > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > > > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > > > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > > > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > > > > >>>>>>>>> aren't totally crazy. > > > > >>>>>>>> > > > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > > > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > > > > >>>>>>>> is supposed to happen with respect to open files, metadata and data > > > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > > > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > > > > >>>>>>>> potentally to be out of sync, right? > > > > >>>>>>>> > > > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > > > > >>>>>>> > > > > >>>>>>> <echoing what we said on the ext4 call this morning> > > > > >>>>>>> > > > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > > > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > > > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > > > > >>>>>>> that they still exist; and then resend all the unacknowledged requests > > > > >>>>>>> that were pending at the time. It might be the case that you have to > > > > >>>>>>> that in the reverse order; I only know enough about the design of fuse > > > > >>>>>>> to suspect that to be true. > > > > >>>>>>> > > > > >>>>>>> Anyhow once those are complete, I think we can resume operations with > > > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > > > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > > > > >>>>>> > > > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > > > > >>>>>> but probably GETATTR is a better option. > > > > >>>>>> > > > > >>>>>> So, are you currently working on any of this? Are you implementing this > > > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > > > > >>>>>> look at fuse2fs too. > > > > >>>>> > > > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > > > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > > > > >>>>> DDN side. > > > > >>>>> > > > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > > > > >>>>> server restart we want kernel to recover inodes and their lookup count. > > > > >>>>> Now inode recovery might be hard, because we currently only have a > > > > >>>>> 64-bit node-id - which is used my most fuse application as memory > > > > >>>>> pointer. > > > > >>>>> > > > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > > > > >>>>> outstanding requests. And that ends up in most cases in sending requests > > > > >>>>> with invalid node-IDs, that are casted and might provoke random memory > > > > >>>>> access on restart. Kind of the same issue why fuse nfs export or > > > > >>>>> open_by_handle_at doesn't work well right now. > > > > >>>>> > > > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > > > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > > > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > > > > >>>>> The file handles could be stored into the fuse inode and also used for > > > > >>>>> NFS export. > > > > >>>>> > > > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > > > > >>>>> Adding Amir to CC. > > > > >>>> > > > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > > > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > > > >>> > > > > >>> Thanks for the reference Amir! I even had been in that thread. > > > > >>> > > > > >>>> > > > > >>>>> > > > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > > > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > > > > >>>>> Any objections against that? > > > > >> > > > > >> What if you actually /can/ reuse a nodeid after a restart? Consider > > > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > > > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > > > > >> didn't delete it, obviously. > > > > > > > > > > FUSE_LOOKUP_HANDLE is a contract. > > > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > > > > > this contract, otherwise there is no way for client to know that the > > > > > nodeids are persistent. > > > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > > > > > API trivial. > > > > > > > > > >> > > > > >> I suppose you could just ask for refreshed stat information and either > > > > >> the server gives it to you and the fuse_inode lives; or the server > > > > >> returns ENOENT and then we mark it bad. But I'd have to see code > > > > >> patches to form a real opinion. > > > > >> > > > > > > > > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > > > > > where fuse_instance_id can be its start time or random number. > > > > > for auto invalidate, or maybe the fuse_instance_id should be > > > > > a native part of FUSE protocol so that client knows to only invalidate > > > > > attr cache in case of fuse_instance_id change? > > > > > > > > > > In any case, instead of a storm of revalidate messages after > > > > > server restart, do it lazily on demand. > > > > > > > > For a network file system, probably. For fuse4fs or other block > > > > based file systems, not sure. Darrick has the example of fsck. > > > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > > > > fuse-server gets restarted, fsck'ed and some files get removed. > > > > Now reading these inodes would still work - wouldn't it > > > > be better to invalidate the cache before going into operation > > > > again? > > > > > > Forgive me, I was making a wrong assumption that fuse4fs > > > was using ext4 filehandle as nodeid, but of course it does not. > > > > Well now that you mention it, there /is/ a risk of shenanigans like > > that. Consider: > > > > 1) fuse4fs mount an ext4 filesystem > > 2) crash the fuse4fs server > > <fuse4fs server restart stalls...> > > 3) e2fsck -fy /dev/XXX deletes inode 17 > > 4) someone else mounts the fs, makes some changes that result in 17 > > being reallocated, user says "OOOOOPS", unmounts it > > 5) fuse4fs server finally restarts, and reconnects to the kernel > > > > Hey, inode 17 is now a different file!! > > > > So maybe the nodeid has to be an actual file handle. Oh wait, no, > > everything's (potentially) fine because fuse4fs supplied i_generation to > > the kernel, and fuse_stale_inode will mark it bad if that happens. > > > > Hm ok then, at least there's a way out. :) > > > > Right. > > > > The reason I made this wrong assumption is because fuse4fs *can* > > > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > > > which is what my fuse passthough library [1] does. > > > > > > My claim was that although fuse4fs could support safe restart, which > > > cannot read from recycled inode number with current FUSE protocol, > > > doing so with FUSE_HANDLE protocol would express a commitment > > > > Pardon my naïvete, but what is FUSE_HANDLE? > > > > $ git grep -w FUSE_HANDLE fs > > $ > > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > Which means to communicate a variable sized "nodeid" > which can also be declared as an object id that survives server restart. > > Basically, the reason that I brought up LOOKUP_HANDLE is to > properly support NFS export of fuse filesystems. > > My incentive was to support a proper fuse server restart/remount/re-export > with the same fsid in /etc/exports, but this gives us a better starting point > for fuse server restart/re-connect. Ah. I don't think that's necessary for ext4, but probably desirable for fancy filesystems that support things like subvolumes or do weird stuff. --D > Thanks, > Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-09-16 7:59 ` Amir Goldstein 2025-09-18 17:50 ` Darrick J. Wong @ 2025-11-04 11:40 ` Luis Henriques 2025-11-04 13:10 ` Amir Goldstein 1 sibling, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-11-04 11:40 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Sep 16 2025, Amir Goldstein wrote: > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: >> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: >> > > >> > > >> > > >> > > On 9/15/25 09:07, Amir Goldstein wrote: >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: >> > > >> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >> > > >>> >> > > >>> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >> > > >>>>> >> > > >>>>> >> > > >>>>> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >> > > >>>>>> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >> > > >>>>>>>>> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >> > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >> > > >>>>>>>>> aren't totally crazy. >> > > >>>>>>>> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data >> > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >> > > >>>>>>>> potentally to be out of sync, right? >> > > >>>>>>>> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? >> > > >>>>>>> >> > > >>>>>>> <echoing what we said on the ext4 call this morning> >> > > >>>>>>> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests >> > > >>>>>>> that were pending at the time. It might be the case that you have to >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse >> > > >>>>>>> to suspect that to be true. >> > > >>>>>>> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with >> > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them. >> > > >>>>>> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >> > > >>>>>> but probably GETATTR is a better option. >> > > >>>>>> >> > > >>>>>> So, are you currently working on any of this? Are you implementing this >> > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >> > > >>>>>> look at fuse2fs too. >> > > >>>>> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our >> > > >>>>> DDN side. >> > > >>>>> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >> > > >>>>> server restart we want kernel to recover inodes and their lookup count. >> > > >>>>> Now inode recovery might be hard, because we currently only have a >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory >> > > >>>>> pointer. >> > > >>>>> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or >> > > >>>>> open_by_handle_at doesn't work well right now. >> > > >>>>> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart. >> > > >>>>> The file handles could be stored into the fuse inode and also used for >> > > >>>>> NFS export. >> > > >>>>> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. >> > > >>>>> Adding Amir to CC. >> > > >>>> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >> > > >>> >> > > >>> Thanks for the reference Amir! I even had been in that thread. >> > > >>> >> > > >>>> >> > > >>>>> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >> > > >>>>> Any objections against that? >> > > >> >> > > >> What if you actually /can/ reuse a nodeid after a restart? Consider >> > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery >> > > >> didn't delete it, obviously. >> > > > >> > > > FUSE_LOOKUP_HANDLE is a contract. >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign >> > > > this contract, otherwise there is no way for client to know that the >> > > > nodeids are persistent. >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() >> > > > API trivial. >> > > > >> > > >> >> > > >> I suppose you could just ask for refreshed stat information and either >> > > >> the server gives it to you and the fuse_inode lives; or the server >> > > >> returns ENOENT and then we mark it bad. But I'd have to see code >> > > >> patches to form a real opinion. >> > > >> >> > > > >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> >> > > > where fuse_instance_id can be its start time or random number. >> > > > for auto invalidate, or maybe the fuse_instance_id should be >> > > > a native part of FUSE protocol so that client knows to only invalidate >> > > > attr cache in case of fuse_instance_id change? >> > > > >> > > > In any case, instead of a storm of revalidate messages after >> > > > server restart, do it lazily on demand. >> > > >> > > For a network file system, probably. For fuse4fs or other block >> > > based file systems, not sure. Darrick has the example of fsck. >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, >> > > fuse-server gets restarted, fsck'ed and some files get removed. >> > > Now reading these inodes would still work - wouldn't it >> > > be better to invalidate the cache before going into operation >> > > again? >> > >> > Forgive me, I was making a wrong assumption that fuse4fs >> > was using ext4 filehandle as nodeid, but of course it does not. >> >> Well now that you mention it, there /is/ a risk of shenanigans like >> that. Consider: >> >> 1) fuse4fs mount an ext4 filesystem >> 2) crash the fuse4fs server >> <fuse4fs server restart stalls...> >> 3) e2fsck -fy /dev/XXX deletes inode 17 >> 4) someone else mounts the fs, makes some changes that result in 17 >> being reallocated, user says "OOOOOPS", unmounts it >> 5) fuse4fs server finally restarts, and reconnects to the kernel >> >> Hey, inode 17 is now a different file!! >> >> So maybe the nodeid has to be an actual file handle. Oh wait, no, >> everything's (potentially) fine because fuse4fs supplied i_generation to >> the kernel, and fuse_stale_inode will mark it bad if that happens. >> >> Hm ok then, at least there's a way out. :) >> > > Right. > >> > The reason I made this wrong assumption is because fuse4fs *can* >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol >> > which is what my fuse passthough library [1] does. >> > >> > My claim was that although fuse4fs could support safe restart, which >> > cannot read from recycled inode number with current FUSE protocol, >> > doing so with FUSE_HANDLE protocol would express a commitment >> >> Pardon my naïvete, but what is FUSE_HANDLE? >> >> $ git grep -w FUSE_HANDLE fs >> $ > > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > Which means to communicate a variable sized "nodeid" > which can also be declared as an object id that survives server restart. > > Basically, the reason that I brought up LOOKUP_HANDLE is to > properly support NFS export of fuse filesystems. > > My incentive was to support a proper fuse server restart/remount/re-export > with the same fsid in /etc/exports, but this gives us a better starting point > for fuse server restart/re-connect. Sorry for resurrecting (again!) this discussion. I've been thinking about this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. However, I feel there are other operations that will need to return this new handle. For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. Doesn't this means that, if the user-space server supports the new LOOKUP_HANDLE, it should also return an handle in reply to the CREATE request? The same question applies for TMPFILE, LINK, etc. Or is there something special about the LOOKUP operation that I'm missing? Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-04 11:40 ` Luis Henriques @ 2025-11-04 13:10 ` Amir Goldstein 2025-11-04 14:52 ` Luis Henriques 2025-11-05 22:24 ` Bernd Schubert 0 siblings, 2 replies; 46+ messages in thread From: Amir Goldstein @ 2025-11-04 13:10 UTC (permalink / raw) To: Luis Henriques Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: > > On Tue, Sep 16 2025, Amir Goldstein wrote: > > > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > >> > >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > >> > > > >> > > > >> > > > >> > > On 9/15/25 09:07, Amir Goldstein wrote: > >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > >> > > >> > >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >> > > >>> > >> > > >>> > >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >> > > >>>>> > >> > > >>>>> > >> > > >>>>> > >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >> > > >>>>>> > >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >> > > >>>>>>>>> > >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >> > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >> > > >>>>>>>>> aren't totally crazy. > >> > > >>>>>>>> > >> > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data > >> > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >> > > >>>>>>>> potentally to be out of sync, right? > >> > > >>>>>>>> > >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > >> > > >>>>>>> > >> > > >>>>>>> <echoing what we said on the ext4 call this morning> > >> > > >>>>>>> > >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests > >> > > >>>>>>> that were pending at the time. It might be the case that you have to > >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse > >> > > >>>>>>> to suspect that to be true. > >> > > >>>>>>> > >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with > >> > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > >> > > >>>>>> > >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >> > > >>>>>> but probably GETATTR is a better option. > >> > > >>>>>> > >> > > >>>>>> So, are you currently working on any of this? Are you implementing this > >> > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >> > > >>>>>> look at fuse2fs too. > >> > > >>>>> > >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > >> > > >>>>> DDN side. > >> > > >>>>> > >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >> > > >>>>> server restart we want kernel to recover inodes and their lookup count. > >> > > >>>>> Now inode recovery might be hard, because we currently only have a > >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory > >> > > >>>>> pointer. > >> > > >>>>> > >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests > >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory > >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or > >> > > >>>>> open_by_handle_at doesn't work well right now. > >> > > >>>>> > >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > >> > > >>>>> The file handles could be stored into the fuse inode and also used for > >> > > >>>>> NFS export. > >> > > >>>>> > >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >> > > >>>>> Adding Amir to CC. > >> > > >>>> > >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >> > > >>> > >> > > >>> Thanks for the reference Amir! I even had been in that thread. > >> > > >>> > >> > > >>>> > >> > > >>>>> > >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >> > > >>>>> Any objections against that? > >> > > >> > >> > > >> What if you actually /can/ reuse a nodeid after a restart? Consider > >> > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > >> > > >> didn't delete it, obviously. > >> > > > > >> > > > FUSE_LOOKUP_HANDLE is a contract. > >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > >> > > > this contract, otherwise there is no way for client to know that the > >> > > > nodeids are persistent. > >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > >> > > > API trivial. > >> > > > > >> > > >> > >> > > >> I suppose you could just ask for refreshed stat information and either > >> > > >> the server gives it to you and the fuse_inode lives; or the server > >> > > >> returns ENOENT and then we mark it bad. But I'd have to see code > >> > > >> patches to form a real opinion. > >> > > >> > >> > > > > >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > >> > > > where fuse_instance_id can be its start time or random number. > >> > > > for auto invalidate, or maybe the fuse_instance_id should be > >> > > > a native part of FUSE protocol so that client knows to only invalidate > >> > > > attr cache in case of fuse_instance_id change? > >> > > > > >> > > > In any case, instead of a storm of revalidate messages after > >> > > > server restart, do it lazily on demand. > >> > > > >> > > For a network file system, probably. For fuse4fs or other block > >> > > based file systems, not sure. Darrick has the example of fsck. > >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > >> > > fuse-server gets restarted, fsck'ed and some files get removed. > >> > > Now reading these inodes would still work - wouldn't it > >> > > be better to invalidate the cache before going into operation > >> > > again? > >> > > >> > Forgive me, I was making a wrong assumption that fuse4fs > >> > was using ext4 filehandle as nodeid, but of course it does not. > >> > >> Well now that you mention it, there /is/ a risk of shenanigans like > >> that. Consider: > >> > >> 1) fuse4fs mount an ext4 filesystem > >> 2) crash the fuse4fs server > >> <fuse4fs server restart stalls...> > >> 3) e2fsck -fy /dev/XXX deletes inode 17 > >> 4) someone else mounts the fs, makes some changes that result in 17 > >> being reallocated, user says "OOOOOPS", unmounts it > >> 5) fuse4fs server finally restarts, and reconnects to the kernel > >> > >> Hey, inode 17 is now a different file!! > >> > >> So maybe the nodeid has to be an actual file handle. Oh wait, no, > >> everything's (potentially) fine because fuse4fs supplied i_generation to > >> the kernel, and fuse_stale_inode will mark it bad if that happens. > >> > >> Hm ok then, at least there's a way out. :) > >> > > > > Right. > > > >> > The reason I made this wrong assumption is because fuse4fs *can* > >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > >> > which is what my fuse passthough library [1] does. > >> > > >> > My claim was that although fuse4fs could support safe restart, which > >> > cannot read from recycled inode number with current FUSE protocol, > >> > doing so with FUSE_HANDLE protocol would express a commitment > >> > >> Pardon my naïvete, but what is FUSE_HANDLE? > >> > >> $ git grep -w FUSE_HANDLE fs > >> $ > > > > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > > > Which means to communicate a variable sized "nodeid" > > which can also be declared as an object id that survives server restart. > > > > Basically, the reason that I brought up LOOKUP_HANDLE is to > > properly support NFS export of fuse filesystems. > > > > My incentive was to support a proper fuse server restart/remount/re-export > > with the same fsid in /etc/exports, but this gives us a better starting point > > for fuse server restart/re-connect. > > Sorry for resurrecting (again!) this discussion. I've been thinking about > this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. > However, I feel there are other operations that will need to return this > new handle. > > For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. > Doesn't this means that, if the user-space server supports the new > LOOKUP_HANDLE, it should also return an handle in reply to the CREATE > request? Yes, I think that's what it means. > The same question applies for TMPFILE, LINK, etc. Or is there > something special about the LOOKUP operation that I'm missing? > Any command returning fuse_entry_out. READDIRPLUS, MKNOD, MKDIR, SYMLINK fuse_entry_out was extended once and fuse_reply_entry() sends the size of the struct. However fuse_reply_create() sends it with fuse_open_out appended and fuse_add_direntry_plus() does not seem to write record size at all, so server and client will need to agree on the size of fuse_entry_out and this would need to be backward compat. If both server and client declare support for FUSE_LOOKUP_HANDLE it should be fine (?). Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-04 13:10 ` Amir Goldstein @ 2025-11-04 14:52 ` Luis Henriques 2025-11-05 10:21 ` Amir Goldstein 2025-11-05 22:24 ` Bernd Schubert 1 sibling, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-11-04 14:52 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Nov 04 2025, Amir Goldstein wrote: > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: >> >> On Tue, Sep 16 2025, Amir Goldstein wrote: >> >> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: >> >> >> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: >> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: >> >> > > >> >> > > >> >> > > >> >> > > On 9/15/25 09:07, Amir Goldstein wrote: >> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: >> >> > > >> >> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >> >> > > >>> >> >> > > >>> >> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: >> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >> >> > > >>>>> >> >> > > >>>>> >> >> > > >>>>> >> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: >> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >> >> > > >>>>>> >> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >> >> > > >>>>>>>>> >> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >> >> > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us >> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I >> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >> >> > > >>>>>>>>> aren't totally crazy. >> >> > > >>>>>>>> >> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this >> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data >> >> > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >> >> > > >>>>>>>> potentally to be out of sync, right? >> >> > > >>>>>>>> >> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? >> >> > > >>>>>>> >> >> > > >>>>>>> <echoing what we said on the ext4 call this morning> >> >> > > >>>>>>> >> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate >> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests >> >> > > >>>>>>> that were pending at the time. It might be the case that you have to >> >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse >> >> > > >>>>>>> to suspect that to be true. >> >> > > >>>>>>> >> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with >> >> > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them. >> >> > > >>>>>> >> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >> >> > > >>>>>> but probably GETATTR is a better option. >> >> > > >>>>>> >> >> > > >>>>>> So, are you currently working on any of this? Are you implementing this >> >> > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >> >> > > >>>>>> look at fuse2fs too. >> >> > > >>>>> >> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and >> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our >> >> > > >>>>> DDN side. >> >> > > >>>>> >> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count. >> >> > > >>>>> Now inode recovery might be hard, because we currently only have a >> >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory >> >> > > >>>>> pointer. >> >> > > >>>>> >> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests >> >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory >> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or >> >> > > >>>>> open_by_handle_at doesn't work well right now. >> >> > > >>>>> >> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. >> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart. >> >> > > >>>>> The file handles could be stored into the fuse inode and also used for >> >> > > >>>>> NFS export. >> >> > > >>>>> >> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. >> >> > > >>>>> Adding Amir to CC. >> >> > > >>>> >> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >> >> > > >>> >> >> > > >>> Thanks for the reference Amir! I even had been in that thread. >> >> > > >>> >> >> > > >>>> >> >> > > >>>>> >> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >> >> > > >>>>> Any objections against that? >> >> > > >> >> >> > > >> What if you actually /can/ reuse a nodeid after a restart? Consider >> >> > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, >> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery >> >> > > >> didn't delete it, obviously. >> >> > > > >> >> > > > FUSE_LOOKUP_HANDLE is a contract. >> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign >> >> > > > this contract, otherwise there is no way for client to know that the >> >> > > > nodeids are persistent. >> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() >> >> > > > API trivial. >> >> > > > >> >> > > >> >> >> > > >> I suppose you could just ask for refreshed stat information and either >> >> > > >> the server gives it to you and the fuse_inode lives; or the server >> >> > > >> returns ENOENT and then we mark it bad. But I'd have to see code >> >> > > >> patches to form a real opinion. >> >> > > >> >> >> > > > >> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> >> >> > > > where fuse_instance_id can be its start time or random number. >> >> > > > for auto invalidate, or maybe the fuse_instance_id should be >> >> > > > a native part of FUSE protocol so that client knows to only invalidate >> >> > > > attr cache in case of fuse_instance_id change? >> >> > > > >> >> > > > In any case, instead of a storm of revalidate messages after >> >> > > > server restart, do it lazily on demand. >> >> > > >> >> > > For a network file system, probably. For fuse4fs or other block >> >> > > based file systems, not sure. Darrick has the example of fsck. >> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, >> >> > > fuse-server gets restarted, fsck'ed and some files get removed. >> >> > > Now reading these inodes would still work - wouldn't it >> >> > > be better to invalidate the cache before going into operation >> >> > > again? >> >> > >> >> > Forgive me, I was making a wrong assumption that fuse4fs >> >> > was using ext4 filehandle as nodeid, but of course it does not. >> >> >> >> Well now that you mention it, there /is/ a risk of shenanigans like >> >> that. Consider: >> >> >> >> 1) fuse4fs mount an ext4 filesystem >> >> 2) crash the fuse4fs server >> >> <fuse4fs server restart stalls...> >> >> 3) e2fsck -fy /dev/XXX deletes inode 17 >> >> 4) someone else mounts the fs, makes some changes that result in 17 >> >> being reallocated, user says "OOOOOPS", unmounts it >> >> 5) fuse4fs server finally restarts, and reconnects to the kernel >> >> >> >> Hey, inode 17 is now a different file!! >> >> >> >> So maybe the nodeid has to be an actual file handle. Oh wait, no, >> >> everything's (potentially) fine because fuse4fs supplied i_generation to >> >> the kernel, and fuse_stale_inode will mark it bad if that happens. >> >> >> >> Hm ok then, at least there's a way out. :) >> >> >> > >> > Right. >> > >> >> > The reason I made this wrong assumption is because fuse4fs *can* >> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol >> >> > which is what my fuse passthough library [1] does. >> >> > >> >> > My claim was that although fuse4fs could support safe restart, which >> >> > cannot read from recycled inode number with current FUSE protocol, >> >> > doing so with FUSE_HANDLE protocol would express a commitment >> >> >> >> Pardon my naïvete, but what is FUSE_HANDLE? >> >> >> >> $ git grep -w FUSE_HANDLE fs >> >> $ >> > >> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): >> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >> > >> > Which means to communicate a variable sized "nodeid" >> > which can also be declared as an object id that survives server restart. >> > >> > Basically, the reason that I brought up LOOKUP_HANDLE is to >> > properly support NFS export of fuse filesystems. >> > >> > My incentive was to support a proper fuse server restart/remount/re-export >> > with the same fsid in /etc/exports, but this gives us a better starting point >> > for fuse server restart/re-connect. >> >> Sorry for resurrecting (again!) this discussion. I've been thinking about >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. >> However, I feel there are other operations that will need to return this >> new handle. >> >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. >> Doesn't this means that, if the user-space server supports the new >> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE >> request? > > Yes, I think that's what it means. Awesome, thank you for confirming this. >> The same question applies for TMPFILE, LINK, etc. Or is there >> something special about the LOOKUP operation that I'm missing? >> > > Any command returning fuse_entry_out. > > READDIRPLUS, MKNOD, MKDIR, SYMLINK Right, I had this list, but totally missed READDIRPLUS. > fuse_entry_out was extended once and fuse_reply_entry() > sends the size of the struct. So, if I'm understanding you correctly, you're suggesting to extend fuse_entry_out to add the new handle (a 'size' field + the actual handle). That's probably a good idea. I was working towards having the LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would include: - An extra inarg: the parent directory handle. (To be honest, I'm not really sure this would be needed.) - An extra outarg: for the actual handle. With your suggestion, only the extra inarg would be required. > However fuse_reply_create() sends it with fuse_open_out > appended This one should be fine... > and fuse_add_direntry_plus() does not seem to write > record size at all, so server and client will need to agree on the > size of fuse_entry_out and this would need to be backward compat. > If both server and client declare support for FUSE_LOOKUP_HANDLE > it should be fine (?). ... yeah, this could be a bit trickier. But I'll need to go look into it. Thanks a lot for your comments, Amir. I was trying to get an RFC out soon(ish) to get early feedback, hoping to prevent me following wrong paths. Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-04 14:52 ` Luis Henriques @ 2025-11-05 10:21 ` Amir Goldstein 2025-11-05 11:50 ` Luis Henriques 0 siblings, 1 reply; 46+ messages in thread From: Amir Goldstein @ 2025-11-05 10:21 UTC (permalink / raw) To: Luis Henriques Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: > > On Tue, Nov 04 2025, Amir Goldstein wrote: > > > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: > >> > >> On Tue, Sep 16 2025, Amir Goldstein wrote: > >> > >> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > >> >> > >> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > >> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > >> >> > > > >> >> > > > >> >> > > > >> >> > > On 9/15/25 09:07, Amir Goldstein wrote: > >> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > >> >> > > >> > >> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > >> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >> >> > > >>>>> > >> >> > > >>>>> > >> >> > > >>>>> > >> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > >> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >> >> > > >>>>>> > >> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >> >> > > >>>>>>>>> > >> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >> >> > > >>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I > >> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >> >> > > >>>>>>>>> aren't totally crazy. > >> >> > > >>>>>>>> > >> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here. Is this > >> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data > >> >> > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >> >> > > >>>>>>>> potentally to be out of sync, right? > >> >> > > >>>>>>>> > >> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide? > >> >> > > >>>>>>> > >> >> > > >>>>>>> <echoing what we said on the ext4 call this morning> > >> >> > > >>>>>>> > >> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate > >> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests > >> >> > > >>>>>>> that were pending at the time. It might be the case that you have to > >> >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse > >> >> > > >>>>>>> to suspect that to be true. > >> >> > > >>>>>>> > >> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with > >> >> > > >>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > >> >> > > >>>>>> > >> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >> >> > > >>>>>> but probably GETATTR is a better option. > >> >> > > >>>>>> > >> >> > > >>>>>> So, are you currently working on any of this? Are you implementing this > >> >> > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >> >> > > >>>>>> look at fuse2fs too. > >> >> > > >>>>> > >> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and > >> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our > >> >> > > >>>>> DDN side. > >> >> > > >>>>> > >> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count. > >> >> > > >>>>> Now inode recovery might be hard, because we currently only have a > >> >> > > >>>>> 64-bit node-id - which is used my most fuse application as memory > >> >> > > >>>>> pointer. > >> >> > > >>>>> > >> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests > >> >> > > >>>>> with invalid node-IDs, that are casted and might provoke random memory > >> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or > >> >> > > >>>>> open_by_handle_at doesn't work well right now. > >> >> > > >>>>> > >> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > >> >> > > >>>>> The file handles could be stored into the fuse inode and also used for > >> >> > > >>>>> NFS export. > >> >> > > >>>>> > >> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >> >> > > >>>>> Adding Amir to CC. > >> >> > > >>>> > >> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >> >> > > >>> > >> >> > > >>> Thanks for the reference Amir! I even had been in that thread. > >> >> > > >>> > >> >> > > >>>> > >> >> > > >>>>> > >> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >> >> > > >>>>> Any objections against that? > >> >> > > >> > >> >> > > >> What if you actually /can/ reuse a nodeid after a restart? Consider > >> >> > > >> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > >> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > >> >> > > >> didn't delete it, obviously. > >> >> > > > > >> >> > > > FUSE_LOOKUP_HANDLE is a contract. > >> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign > >> >> > > > this contract, otherwise there is no way for client to know that the > >> >> > > > nodeids are persistent. > >> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > >> >> > > > API trivial. > >> >> > > > > >> >> > > >> > >> >> > > >> I suppose you could just ask for refreshed stat information and either > >> >> > > >> the server gives it to you and the fuse_inode lives; or the server > >> >> > > >> returns ENOENT and then we mark it bad. But I'd have to see code > >> >> > > >> patches to form a real opinion. > >> >> > > >> > >> >> > > > > >> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id> > >> >> > > > where fuse_instance_id can be its start time or random number. > >> >> > > > for auto invalidate, or maybe the fuse_instance_id should be > >> >> > > > a native part of FUSE protocol so that client knows to only invalidate > >> >> > > > attr cache in case of fuse_instance_id change? > >> >> > > > > >> >> > > > In any case, instead of a storm of revalidate messages after > >> >> > > > server restart, do it lazily on demand. > >> >> > > > >> >> > > For a network file system, probably. For fuse4fs or other block > >> >> > > based file systems, not sure. Darrick has the example of fsck. > >> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > >> >> > > fuse-server gets restarted, fsck'ed and some files get removed. > >> >> > > Now reading these inodes would still work - wouldn't it > >> >> > > be better to invalidate the cache before going into operation > >> >> > > again? > >> >> > > >> >> > Forgive me, I was making a wrong assumption that fuse4fs > >> >> > was using ext4 filehandle as nodeid, but of course it does not. > >> >> > >> >> Well now that you mention it, there /is/ a risk of shenanigans like > >> >> that. Consider: > >> >> > >> >> 1) fuse4fs mount an ext4 filesystem > >> >> 2) crash the fuse4fs server > >> >> <fuse4fs server restart stalls...> > >> >> 3) e2fsck -fy /dev/XXX deletes inode 17 > >> >> 4) someone else mounts the fs, makes some changes that result in 17 > >> >> being reallocated, user says "OOOOOPS", unmounts it > >> >> 5) fuse4fs server finally restarts, and reconnects to the kernel > >> >> > >> >> Hey, inode 17 is now a different file!! > >> >> > >> >> So maybe the nodeid has to be an actual file handle. Oh wait, no, > >> >> everything's (potentially) fine because fuse4fs supplied i_generation to > >> >> the kernel, and fuse_stale_inode will mark it bad if that happens. > >> >> > >> >> Hm ok then, at least there's a way out. :) > >> >> > >> > > >> > Right. > >> > > >> >> > The reason I made this wrong assumption is because fuse4fs *can* > >> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > >> >> > which is what my fuse passthough library [1] does. > >> >> > > >> >> > My claim was that although fuse4fs could support safe restart, which > >> >> > cannot read from recycled inode number with current FUSE protocol, > >> >> > doing so with FUSE_HANDLE protocol would express a commitment > >> >> > >> >> Pardon my naïvete, but what is FUSE_HANDLE? > >> >> > >> >> $ git grep -w FUSE_HANDLE fs > >> >> $ > >> > > >> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > >> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >> > > >> > Which means to communicate a variable sized "nodeid" > >> > which can also be declared as an object id that survives server restart. > >> > > >> > Basically, the reason that I brought up LOOKUP_HANDLE is to > >> > properly support NFS export of fuse filesystems. > >> > > >> > My incentive was to support a proper fuse server restart/remount/re-export > >> > with the same fsid in /etc/exports, but this gives us a better starting point > >> > for fuse server restart/re-connect. > >> > >> Sorry for resurrecting (again!) this discussion. I've been thinking about > >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. > >> However, I feel there are other operations that will need to return this > >> new handle. > >> > >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. > >> Doesn't this means that, if the user-space server supports the new > >> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE > >> request? > > > > Yes, I think that's what it means. > > Awesome, thank you for confirming this. > > >> The same question applies for TMPFILE, LINK, etc. Or is there > >> something special about the LOOKUP operation that I'm missing? > >> > > > > Any command returning fuse_entry_out. > > > > READDIRPLUS, MKNOD, MKDIR, SYMLINK > > Right, I had this list, but totally missed READDIRPLUS. > > > fuse_entry_out was extended once and fuse_reply_entry() > > sends the size of the struct. > > So, if I'm understanding you correctly, you're suggesting to extend > fuse_entry_out to add the new handle (a 'size' field + the actual handle). Well it depends... There are several ways to do it. I would really like to get Miklos and Bernd's opinion on the preferred way. So far, it looks like the client determines the size of the output args. If we want the server to be able to write a different file handle size per inode that's going to be a bigger challenge. I think it's plenty enough if server and client negotiate a max file handle size and then the client always reserves enough space in the output args buffer. One more thing to ask is what is "the actual handle". If "the actual handle" is the variable sized struct file_handle then the size is already available in the file handle header. If it is not, then I think some sort of type or version of the file handles encoding should be negotiated beyond the max handle size. > That's probably a good idea. I was working towards having the > LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would > include: > > - An extra inarg: the parent directory handle. (To be honest, I'm not > really sure this would be needed.) Yes, I think you need extra inarg. Why would it not be needed? The problem is that you cannot know if the parent node id in the lookup command is stale after server restart. The thing is that the kernel fuse inode will need to store the file handle, much the same as an NFS client stores the file handle provided by the NFS server. FYI, fanotify has an optimized way to store file handles in struct fanotify_fid_event - small file handles are stored inline and larger file handles can use an external buffer. But fuse does not need to support any size of file handles. For first version we could definitely simplify things by limiting the size of supported file handles, because server and client need to negotiate the max file handle size anyway. > - An extra outarg: for the actual handle. > > With your suggestion, only the extra inarg would be required. > Yes, either extra arg or just an extended size of fuse_entry_out negotiated at init time. TBH it seems cleaner to add 2nd outarg to all the commands, but CREATE already has a 2nd arg and 2nd arg does not solve READDIRPLUS. > > However fuse_reply_create() sends it with fuse_open_out > > appended > > This one should be fine... > > > and fuse_add_direntry_plus() does not seem to write > > record size at all, so server and client will need to agree on the > > size of fuse_entry_out and this would need to be backward compat. > > If both server and client declare support for FUSE_LOOKUP_HANDLE > > it should be fine (?). > > ... yeah, this could be a bit trickier. But I'll need to go look into it. > > Thanks a lot for your comments, Amir. I was trying to get an RFC out > soon(ish) to get early feedback, hoping to prevent me following wrong > paths. > Disclaimer, following my advice may well lead you down wrong paths.. Best to wait for confirmation from Miklos and Bernd if you want to have more certainty... Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 10:21 ` Amir Goldstein @ 2025-11-05 11:50 ` Luis Henriques 2025-11-05 15:30 ` Amir Goldstein 0 siblings, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-11-05 11:50 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey Hi Amir, On Wed, Nov 05 2025, Amir Goldstein wrote: > On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: <...> >> > fuse_entry_out was extended once and fuse_reply_entry() >> > sends the size of the struct. >> >> So, if I'm understanding you correctly, you're suggesting to extend >> fuse_entry_out to add the new handle (a 'size' field + the actual handle). > > Well it depends... > > There are several ways to do it. > I would really like to get Miklos and Bernd's opinion on the preferred way. Sure, all feedback is welcome! > So far, it looks like the client determines the size of the output args. > > If we want the server to be able to write a different file handle size > per inode that's going to be a bigger challenge. > > I think it's plenty enough if server and client negotiate a max file handle > size and then the client always reserves enough space in the output > args buffer. > > One more thing to ask is what is "the actual handle". > If "the actual handle" is the variable sized struct file_handle then > the size is already available in the file handle header. Actually, this is exactly what I was trying to mimic for my initial attempt. However, I was not going to do any size negotiation but instead define a maximum size for the handle. See below. > If it is not, then I think some sort of type or version of the file handles > encoding should be negotiated beyond the max handle size. In my initial stab at this I was going to take a very simple approach and hard-code a maximum size for the handle. This would have the advantage of allowing the server to use different sizes for different inodes (though I'm not sure how useful that would be in practice). So, in summary, I would define the new handle like this: /* Same value as MAX_HANDLE_SZ */ #define FUSE_MAX_HANDLE_SZ 128 struct fuse_file_handle { uint32_t size; uint32_t padding; char handle[FUSE_MAX_HANDLE_SZ]; }; and this struct would be included in fuse_entry_out. There's probably a problem with having this (big) fixed size increase to fuse_entry_out, but maybe that could be fixed once I have all the other details sorted out. Hopefully I'm not oversimplifying the problem, skipping the need for negotiating a handle size. >> That's probably a good idea. I was working towards having the >> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would >> include: >> >> - An extra inarg: the parent directory handle. (To be honest, I'm not >> really sure this would be needed.) > > Yes, I think you need extra inarg. > Why would it not be needed? > The problem is that you cannot know if the parent node id in the lookup > command is stale after server restart. Ah, of course. Hence the need for this extra inarg. > The thing is that the kernel fuse inode will need to store the file handle, > much the same as an NFS client stores the file handle provided by the > NFS server. > > FYI, fanotify has an optimized way to store file handles in > struct fanotify_fid_event - small file handles are stored inline > and larger file handles can use an external buffer. > > But fuse does not need to support any size of file handles. > For first version we could definitely simplify things by limiting the size > of supported file handles, because server and client need to negotiate > the max file handle size anyway. I'll definitely need to have a look at how fanotify does that. But I guess that if my simplistic approach with a static array is acceptable for now, I'll stick with it for the initial attempt to implement this, and eventually revisit it later to do something more clever. >> - An extra outarg: for the actual handle. >> >> With your suggestion, only the extra inarg would be required. >> > > Yes, either extra arg or just an extended size of fuse_entry_out > negotiated at init time. > > TBH it seems cleaner to add 2nd outarg to all the commands, > but CREATE already has a 2nd arg and 2nd arg does not solve > READDIRPLUS. Right. I'm more and more convinced that extending fuse_entry_out is the way to go. >> > However fuse_reply_create() sends it with fuse_open_out >> > appended >> >> This one should be fine... >> >> > and fuse_add_direntry_plus() does not seem to write >> > record size at all, so server and client will need to agree on the >> > size of fuse_entry_out and this would need to be backward compat. >> > If both server and client declare support for FUSE_LOOKUP_HANDLE >> > it should be fine (?). >> >> ... yeah, this could be a bit trickier. But I'll need to go look into it. >> >> Thanks a lot for your comments, Amir. I was trying to get an RFC out >> soon(ish) to get early feedback, hoping to prevent me following wrong >> paths. >> > > Disclaimer, following my advice may well lead you down wrong paths.. > Best to wait for confirmation from Miklos and Bernd if you want to have > more certainty... Haha thanks for the warning :-) And again, thanks a lot for your feedback, Amir. Cheers, -- Luís ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 11:50 ` Luis Henriques @ 2025-11-05 15:30 ` Amir Goldstein 2025-11-05 21:38 ` Darrick J. Wong 0 siblings, 1 reply; 46+ messages in thread From: Amir Goldstein @ 2025-11-05 15:30 UTC (permalink / raw) To: Luis Henriques Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote: > > Hi Amir, > > On Wed, Nov 05 2025, Amir Goldstein wrote: > > > On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: > > <...> > > >> > fuse_entry_out was extended once and fuse_reply_entry() > >> > sends the size of the struct. > >> > >> So, if I'm understanding you correctly, you're suggesting to extend > >> fuse_entry_out to add the new handle (a 'size' field + the actual handle). > > > > Well it depends... > > > > There are several ways to do it. > > I would really like to get Miklos and Bernd's opinion on the preferred way. > > Sure, all feedback is welcome! > > > So far, it looks like the client determines the size of the output args. > > > > If we want the server to be able to write a different file handle size > > per inode that's going to be a bigger challenge. > > > > I think it's plenty enough if server and client negotiate a max file handle > > size and then the client always reserves enough space in the output > > args buffer. > > > > One more thing to ask is what is "the actual handle". > > If "the actual handle" is the variable sized struct file_handle then > > the size is already available in the file handle header. > > Actually, this is exactly what I was trying to mimic for my initial > attempt. However, I was not going to do any size negotiation but instead > define a maximum size for the handle. See below. > > > If it is not, then I think some sort of type or version of the file handles > > encoding should be negotiated beyond the max handle size. > > In my initial stab at this I was going to take a very simple approach and > hard-code a maximum size for the handle. This would have the advantage of > allowing the server to use different sizes for different inodes (though > I'm not sure how useful that would be in practice). So, in summary, I > would define the new handle like this: > > /* Same value as MAX_HANDLE_SZ */ > #define FUSE_MAX_HANDLE_SZ 128 > > struct fuse_file_handle { > uint32_t size; > uint32_t padding; I think that the handle type is going to be relevant as well. > char handle[FUSE_MAX_HANDLE_SZ]; > }; > > and this struct would be included in fuse_entry_out. > > There's probably a problem with having this (big) fixed size increase to > fuse_entry_out, but maybe that could be fixed once I have all the other > details sorted out. Hopefully I'm not oversimplifying the problem, > skipping the need for negotiating a handle size. > Maybe this fixed size is reasonable for the first version of FUSE protocol as long as this overhead is NOT added if the server does not opt-in for the feature. IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0, but keep the negotiation protocol extendable to another value later on. > >> That's probably a good idea. I was working towards having the > >> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would > >> include: > >> > >> - An extra inarg: the parent directory handle. (To be honest, I'm not > >> really sure this would be needed.) > > > > Yes, I think you need extra inarg. > > Why would it not be needed? > > The problem is that you cannot know if the parent node id in the lookup > > command is stale after server restart. > > Ah, of course. Hence the need for this extra inarg. > > > The thing is that the kernel fuse inode will need to store the file handle, > > much the same as an NFS client stores the file handle provided by the > > NFS server. > > > > FYI, fanotify has an optimized way to store file handles in > > struct fanotify_fid_event - small file handles are stored inline > > and larger file handles can use an external buffer. > > > > But fuse does not need to support any size of file handles. > > For first version we could definitely simplify things by limiting the size > > of supported file handles, because server and client need to negotiate > > the max file handle size anyway. > > I'll definitely need to have a look at how fanotify does that. But I > guess that if my simplistic approach with a static array is acceptable for > now, I'll stick with it for the initial attempt to implement this, and > eventually revisit it later to do something more clever. > What you proposed is the extension of fuse_entry_out for fuse protocol. My reference to fanotify_fid_event is meant to explain how to encode a file handle in fuse_inode in cache, because the fuse_inode_cachep cannot have variable sized inodes and in most of the cases, a short inline file handle should be enough. Therefore, if you limit the support in the first version to something like FANOTIFY_INLINE_FH_LEN, you can always store the file handle in fuse_inode and postpone support for bigger file handles to later. Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 15:30 ` Amir Goldstein @ 2025-11-05 21:38 ` Darrick J. Wong 2025-11-05 21:46 ` Bernd Schubert 0 siblings, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-11-05 21:38 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, Bernd Schubert, linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote: > On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote: > > > > Hi Amir, > > > > On Wed, Nov 05 2025, Amir Goldstein wrote: > > > > > On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: > > > > <...> > > > > >> > fuse_entry_out was extended once and fuse_reply_entry() > > >> > sends the size of the struct. > > >> > > >> So, if I'm understanding you correctly, you're suggesting to extend > > >> fuse_entry_out to add the new handle (a 'size' field + the actual handle). > > > > > > Well it depends... > > > > > > There are several ways to do it. > > > I would really like to get Miklos and Bernd's opinion on the preferred way. > > > > Sure, all feedback is welcome! > > > > > So far, it looks like the client determines the size of the output args. > > > > > > If we want the server to be able to write a different file handle size > > > per inode that's going to be a bigger challenge. > > > > > > I think it's plenty enough if server and client negotiate a max file handle > > > size and then the client always reserves enough space in the output > > > args buffer. > > > > > > One more thing to ask is what is "the actual handle". > > > If "the actual handle" is the variable sized struct file_handle then > > > the size is already available in the file handle header. > > > > Actually, this is exactly what I was trying to mimic for my initial > > attempt. However, I was not going to do any size negotiation but instead > > define a maximum size for the handle. See below. > > > > > If it is not, then I think some sort of type or version of the file handles > > > encoding should be negotiated beyond the max handle size. > > > > In my initial stab at this I was going to take a very simple approach and > > hard-code a maximum size for the handle. This would have the advantage of > > allowing the server to use different sizes for different inodes (though > > I'm not sure how useful that would be in practice). So, in summary, I > > would define the new handle like this: > > > > /* Same value as MAX_HANDLE_SZ */ > > #define FUSE_MAX_HANDLE_SZ 128 > > > > struct fuse_file_handle { > > uint32_t size; > > uint32_t padding; > > I think that the handle type is going to be relevant as well. > > > char handle[FUSE_MAX_HANDLE_SZ]; > > }; > > > > and this struct would be included in fuse_entry_out. > > > > There's probably a problem with having this (big) fixed size increase to > > fuse_entry_out, but maybe that could be fixed once I have all the other > > details sorted out. Hopefully I'm not oversimplifying the problem, > > skipping the need for negotiating a handle size. > > > > Maybe this fixed size is reasonable for the first version of FUSE protocol > as long as this overhead is NOT added if the server does not opt-in for the > feature. > > IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0, > but keep the negotiation protocol extendable to another value later on. > > > >> That's probably a good idea. I was working towards having the > > >> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would > > >> include: > > >> > > >> - An extra inarg: the parent directory handle. (To be honest, I'm not > > >> really sure this would be needed.) > > > > > > Yes, I think you need extra inarg. > > > Why would it not be needed? > > > The problem is that you cannot know if the parent node id in the lookup > > > command is stale after server restart. > > > > Ah, of course. Hence the need for this extra inarg. > > > > > The thing is that the kernel fuse inode will need to store the file handle, > > > much the same as an NFS client stores the file handle provided by the > > > NFS server. > > > > > > FYI, fanotify has an optimized way to store file handles in > > > struct fanotify_fid_event - small file handles are stored inline > > > and larger file handles can use an external buffer. > > > > > > But fuse does not need to support any size of file handles. > > > For first version we could definitely simplify things by limiting the size > > > of supported file handles, because server and client need to negotiate > > > the max file handle size anyway. > > > > I'll definitely need to have a look at how fanotify does that. But I > > guess that if my simplistic approach with a static array is acceptable for > > now, I'll stick with it for the initial attempt to implement this, and > > eventually revisit it later to do something more clever. > > > > What you proposed is the extension of fuse_entry_out for fuse > protocol. > > My reference to fanotify_fid_event is meant to explain how to encode > a file handle in fuse_inode in cache, because the fuse_inode_cachep > cannot have variable sized inodes and in most of the cases, a short > inline file handle should be enough. > > Therefore, if you limit the support in the first version to something like > FANOTIFY_INLINE_FH_LEN, you can always store the file handle > in fuse_inode and postpone support for bigger file handles to later. I suggest that you also provide a way for the fuse server to tell the kernel that it can construct its own handles from {fuse_inode::nodeid, inode::i_generation} if they want something more efficient than uploading 128b blobs. --D > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 21:38 ` Darrick J. Wong @ 2025-11-05 21:46 ` Bernd Schubert 2025-11-05 22:06 ` Bernd Schubert 0 siblings, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-11-05 21:46 UTC (permalink / raw) To: Darrick J. Wong, Amir Goldstein Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey On 11/5/25 22:38, Darrick J. Wong wrote: > On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote: >> On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote: >>> >>> Hi Amir, >>> >>> On Wed, Nov 05 2025, Amir Goldstein wrote: >>> >>>> On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: >>> >>> <...> >>> >>>>>> fuse_entry_out was extended once and fuse_reply_entry() >>>>>> sends the size of the struct. >>>>> >>>>> So, if I'm understanding you correctly, you're suggesting to extend >>>>> fuse_entry_out to add the new handle (a 'size' field + the actual handle). >>>> >>>> Well it depends... >>>> >>>> There are several ways to do it. >>>> I would really like to get Miklos and Bernd's opinion on the preferred way. >>> >>> Sure, all feedback is welcome! >>> >>>> So far, it looks like the client determines the size of the output args. >>>> >>>> If we want the server to be able to write a different file handle size >>>> per inode that's going to be a bigger challenge. >>>> >>>> I think it's plenty enough if server and client negotiate a max file handle >>>> size and then the client always reserves enough space in the output >>>> args buffer. >>>> >>>> One more thing to ask is what is "the actual handle". >>>> If "the actual handle" is the variable sized struct file_handle then >>>> the size is already available in the file handle header. >>> >>> Actually, this is exactly what I was trying to mimic for my initial >>> attempt. However, I was not going to do any size negotiation but instead >>> define a maximum size for the handle. See below. >>> >>>> If it is not, then I think some sort of type or version of the file handles >>>> encoding should be negotiated beyond the max handle size. >>> >>> In my initial stab at this I was going to take a very simple approach and >>> hard-code a maximum size for the handle. This would have the advantage of >>> allowing the server to use different sizes for different inodes (though >>> I'm not sure how useful that would be in practice). So, in summary, I >>> would define the new handle like this: >>> >>> /* Same value as MAX_HANDLE_SZ */ >>> #define FUSE_MAX_HANDLE_SZ 128 >>> >>> struct fuse_file_handle { >>> uint32_t size; >>> uint32_t padding; >> >> I think that the handle type is going to be relevant as well. >> >>> char handle[FUSE_MAX_HANDLE_SZ]; >>> }; >>> >>> and this struct would be included in fuse_entry_out. >>> >>> There's probably a problem with having this (big) fixed size increase to >>> fuse_entry_out, but maybe that could be fixed once I have all the other >>> details sorted out. Hopefully I'm not oversimplifying the problem, >>> skipping the need for negotiating a handle size. >>> >> >> Maybe this fixed size is reasonable for the first version of FUSE protocol >> as long as this overhead is NOT added if the server does not opt-in for the >> feature. >> >> IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0, >> but keep the negotiation protocol extendable to another value later on. >> >>>>> That's probably a good idea. I was working towards having the >>>>> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would >>>>> include: >>>>> >>>>> - An extra inarg: the parent directory handle. (To be honest, I'm not >>>>> really sure this would be needed.) >>>> >>>> Yes, I think you need extra inarg. >>>> Why would it not be needed? >>>> The problem is that you cannot know if the parent node id in the lookup >>>> command is stale after server restart. >>> >>> Ah, of course. Hence the need for this extra inarg. >>> >>>> The thing is that the kernel fuse inode will need to store the file handle, >>>> much the same as an NFS client stores the file handle provided by the >>>> NFS server. >>>> >>>> FYI, fanotify has an optimized way to store file handles in >>>> struct fanotify_fid_event - small file handles are stored inline >>>> and larger file handles can use an external buffer. >>>> >>>> But fuse does not need to support any size of file handles. >>>> For first version we could definitely simplify things by limiting the size >>>> of supported file handles, because server and client need to negotiate >>>> the max file handle size anyway. >>> >>> I'll definitely need to have a look at how fanotify does that. But I >>> guess that if my simplistic approach with a static array is acceptable for >>> now, I'll stick with it for the initial attempt to implement this, and >>> eventually revisit it later to do something more clever. >>> >> >> What you proposed is the extension of fuse_entry_out for fuse >> protocol. >> >> My reference to fanotify_fid_event is meant to explain how to encode >> a file handle in fuse_inode in cache, because the fuse_inode_cachep >> cannot have variable sized inodes and in most of the cases, a short >> inline file handle should be enough. >> >> Therefore, if you limit the support in the first version to something like >> FANOTIFY_INLINE_FH_LEN, you can always store the file handle >> in fuse_inode and postpone support for bigger file handles to later. > > I suggest that you also provide a way for the fuse server to tell the > kernel that it can construct its own handles from {fuse_inode::nodeid, > inode::i_generation} if they want something more efficient than > uploading 128b blobs. Isn't that covered by handle size defined in FUSE_INIT reply? I.e. handle size would be 0B in this case? Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 21:46 ` Bernd Schubert @ 2025-11-05 22:06 ` Bernd Schubert 0 siblings, 0 replies; 46+ messages in thread From: Bernd Schubert @ 2025-11-05 22:06 UTC (permalink / raw) To: Bernd Schubert, Darrick J. Wong, Amir Goldstein Cc: Luis Henriques, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen, Matt Harvey On 11/5/25 22:46, Bernd Schubert wrote: > > > On 11/5/25 22:38, Darrick J. Wong wrote: >> On Wed, Nov 05, 2025 at 04:30:51PM +0100, Amir Goldstein wrote: >>> On Wed, Nov 5, 2025 at 12:50 PM Luis Henriques <luis@igalia.com> wrote: >>>> >>>> Hi Amir, >>>> >>>> On Wed, Nov 05 2025, Amir Goldstein wrote: >>>> >>>>> On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@igalia.com> wrote: >>>> >>>> <...> >>>> >>>>>>> fuse_entry_out was extended once and fuse_reply_entry() >>>>>>> sends the size of the struct. >>>>>> >>>>>> So, if I'm understanding you correctly, you're suggesting to extend >>>>>> fuse_entry_out to add the new handle (a 'size' field + the actual handle). >>>>> >>>>> Well it depends... >>>>> >>>>> There are several ways to do it. >>>>> I would really like to get Miklos and Bernd's opinion on the preferred way. >>>> >>>> Sure, all feedback is welcome! >>>> >>>>> So far, it looks like the client determines the size of the output args. >>>>> >>>>> If we want the server to be able to write a different file handle size >>>>> per inode that's going to be a bigger challenge. >>>>> >>>>> I think it's plenty enough if server and client negotiate a max file handle >>>>> size and then the client always reserves enough space in the output >>>>> args buffer. >>>>> >>>>> One more thing to ask is what is "the actual handle". >>>>> If "the actual handle" is the variable sized struct file_handle then >>>>> the size is already available in the file handle header. >>>> >>>> Actually, this is exactly what I was trying to mimic for my initial >>>> attempt. However, I was not going to do any size negotiation but instead >>>> define a maximum size for the handle. See below. >>>> >>>>> If it is not, then I think some sort of type or version of the file handles >>>>> encoding should be negotiated beyond the max handle size. >>>> >>>> In my initial stab at this I was going to take a very simple approach and >>>> hard-code a maximum size for the handle. This would have the advantage of >>>> allowing the server to use different sizes for different inodes (though >>>> I'm not sure how useful that would be in practice). So, in summary, I >>>> would define the new handle like this: >>>> >>>> /* Same value as MAX_HANDLE_SZ */ >>>> #define FUSE_MAX_HANDLE_SZ 128 >>>> >>>> struct fuse_file_handle { >>>> uint32_t size; >>>> uint32_t padding; >>> >>> I think that the handle type is going to be relevant as well. >>> >>>> char handle[FUSE_MAX_HANDLE_SZ]; >>>> }; >>>> >>>> and this struct would be included in fuse_entry_out. >>>> >>>> There's probably a problem with having this (big) fixed size increase to >>>> fuse_entry_out, but maybe that could be fixed once I have all the other >>>> details sorted out. Hopefully I'm not oversimplifying the problem, >>>> skipping the need for negotiating a handle size. >>>> >>> >>> Maybe this fixed size is reasonable for the first version of FUSE protocol >>> as long as this overhead is NOT added if the server does not opt-in for the >>> feature. >>> >>> IOW, allow the server to negotiate FUSE_MAX_HANDLE_SZ or 0, >>> but keep the negotiation protocol extendable to another value later on. >>> >>>>>> That's probably a good idea. I was working towards having the >>>>>> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would >>>>>> include: >>>>>> >>>>>> - An extra inarg: the parent directory handle. (To be honest, I'm not >>>>>> really sure this would be needed.) >>>>> >>>>> Yes, I think you need extra inarg. >>>>> Why would it not be needed? >>>>> The problem is that you cannot know if the parent node id in the lookup >>>>> command is stale after server restart. >>>> >>>> Ah, of course. Hence the need for this extra inarg. >>>> >>>>> The thing is that the kernel fuse inode will need to store the file handle, >>>>> much the same as an NFS client stores the file handle provided by the >>>>> NFS server. >>>>> >>>>> FYI, fanotify has an optimized way to store file handles in >>>>> struct fanotify_fid_event - small file handles are stored inline >>>>> and larger file handles can use an external buffer. >>>>> >>>>> But fuse does not need to support any size of file handles. >>>>> For first version we could definitely simplify things by limiting the size >>>>> of supported file handles, because server and client need to negotiate >>>>> the max file handle size anyway. >>>> >>>> I'll definitely need to have a look at how fanotify does that. But I >>>> guess that if my simplistic approach with a static array is acceptable for >>>> now, I'll stick with it for the initial attempt to implement this, and >>>> eventually revisit it later to do something more clever. >>>> >>> >>> What you proposed is the extension of fuse_entry_out for fuse >>> protocol. >>> >>> My reference to fanotify_fid_event is meant to explain how to encode >>> a file handle in fuse_inode in cache, because the fuse_inode_cachep >>> cannot have variable sized inodes and in most of the cases, a short >>> inline file handle should be enough. >>> >>> Therefore, if you limit the support in the first version to something like >>> FANOTIFY_INLINE_FH_LEN, you can always store the file handle >>> in fuse_inode and postpone support for bigger file handles to later. >> >> I suggest that you also provide a way for the fuse server to tell the >> kernel that it can construct its own handles from {fuse_inode::nodeid, >> inode::i_generation} if they want something more efficient than >> uploading 128b blobs. > > Isn't that covered by handle size defined in FUSE_INIT reply? I.e. > handle size would be 0B in this case? Sorry my fault, yeah, this needs a special flag. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-04 13:10 ` Amir Goldstein 2025-11-04 14:52 ` Luis Henriques @ 2025-11-05 22:24 ` Bernd Schubert 2025-11-05 22:42 ` Darrick J. Wong 1 sibling, 1 reply; 46+ messages in thread From: Bernd Schubert @ 2025-11-05 22:24 UTC (permalink / raw) To: Amir Goldstein, Luis Henriques Cc: Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On 11/4/25 14:10, Amir Goldstein wrote: > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: >> >> On Tue, Sep 16 2025, Amir Goldstein wrote: >> >>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: >>>> >>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: >>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 9/15/25 09:07, Amir Goldstein wrote: >>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: >>>>>>>> >>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote: >>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote: >>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>>>>>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us >>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I >>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>>>>>>>>>>>>> aren't totally crazy. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here. Is this >>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data >>>>>>>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >>>>>>>>>>>>>> potentally to be out of sync, right? >>>>>>>>>>>>>> >>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide? >>>>>>>>>>>>> >>>>>>>>>>>>> <echoing what we said on the ext4 call this morning> >>>>>>>>>>>>> >>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate >>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests >>>>>>>>>>>>> that were pending at the time. It might be the case that you have to >>>>>>>>>>>>> that in the reverse order; I only know enough about the design of fuse >>>>>>>>>>>>> to suspect that to be true. >>>>>>>>>>>>> >>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with >>>>>>>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them. >>>>>>>>>>>> >>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >>>>>>>>>>>> but probably GETATTR is a better option. >>>>>>>>>>>> >>>>>>>>>>>> So, are you currently working on any of this? Are you implementing this >>>>>>>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >>>>>>>>>>>> look at fuse2fs too. >>>>>>>>>>> >>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and >>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our >>>>>>>>>>> DDN side. >>>>>>>>>>> >>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count. >>>>>>>>>>> Now inode recovery might be hard, because we currently only have a >>>>>>>>>>> 64-bit node-id - which is used my most fuse application as memory >>>>>>>>>>> pointer. >>>>>>>>>>> >>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests >>>>>>>>>>> with invalid node-IDs, that are casted and might provoke random memory >>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or >>>>>>>>>>> open_by_handle_at doesn't work well right now. >>>>>>>>>>> >>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle. >>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart. >>>>>>>>>>> The file handles could be stored into the fuse inode and also used for >>>>>>>>>>> NFS export. >>>>>>>>>>> >>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly. >>>>>>>>>>> Adding Amir to CC. >>>>>>>>>> >>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >>>>>>>>> >>>>>>>>> Thanks for the reference Amir! I even had been in that thread. >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >>>>>>>>>>> Any objections against that? >>>>>>>> >>>>>>>> What if you actually /can/ reuse a nodeid after a restart? Consider >>>>>>>> fuse4fs, where the nodeid is the on-disk inode number. After a restart, >>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery >>>>>>>> didn't delete it, obviously. >>>>>>> >>>>>>> FUSE_LOOKUP_HANDLE is a contract. >>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign >>>>>>> this contract, otherwise there is no way for client to know that the >>>>>>> nodeids are persistent. >>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() >>>>>>> API trivial. >>>>>>> >>>>>>>> >>>>>>>> I suppose you could just ask for refreshed stat information and either >>>>>>>> the server gives it to you and the fuse_inode lives; or the server >>>>>>>> returns ENOENT and then we mark it bad. But I'd have to see code >>>>>>>> patches to form a real opinion. >>>>>>>> >>>>>>> >>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id> >>>>>>> where fuse_instance_id can be its start time or random number. >>>>>>> for auto invalidate, or maybe the fuse_instance_id should be >>>>>>> a native part of FUSE protocol so that client knows to only invalidate >>>>>>> attr cache in case of fuse_instance_id change? >>>>>>> >>>>>>> In any case, instead of a storm of revalidate messages after >>>>>>> server restart, do it lazily on demand. >>>>>> >>>>>> For a network file system, probably. For fuse4fs or other block >>>>>> based file systems, not sure. Darrick has the example of fsck. >>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0, >>>>>> fuse-server gets restarted, fsck'ed and some files get removed. >>>>>> Now reading these inodes would still work - wouldn't it >>>>>> be better to invalidate the cache before going into operation >>>>>> again? >>>>> >>>>> Forgive me, I was making a wrong assumption that fuse4fs >>>>> was using ext4 filehandle as nodeid, but of course it does not. >>>> >>>> Well now that you mention it, there /is/ a risk of shenanigans like >>>> that. Consider: >>>> >>>> 1) fuse4fs mount an ext4 filesystem >>>> 2) crash the fuse4fs server >>>> <fuse4fs server restart stalls...> >>>> 3) e2fsck -fy /dev/XXX deletes inode 17 >>>> 4) someone else mounts the fs, makes some changes that result in 17 >>>> being reallocated, user says "OOOOOPS", unmounts it >>>> 5) fuse4fs server finally restarts, and reconnects to the kernel >>>> >>>> Hey, inode 17 is now a different file!! >>>> >>>> So maybe the nodeid has to be an actual file handle. Oh wait, no, >>>> everything's (potentially) fine because fuse4fs supplied i_generation to >>>> the kernel, and fuse_stale_inode will mark it bad if that happens. >>>> >>>> Hm ok then, at least there's a way out. :) >>>> >>> >>> Right. >>> >>>>> The reason I made this wrong assumption is because fuse4fs *can* >>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol >>>>> which is what my fuse passthough library [1] does. >>>>> >>>>> My claim was that although fuse4fs could support safe restart, which >>>>> cannot read from recycled inode number with current FUSE protocol, >>>>> doing so with FUSE_HANDLE protocol would express a commitment >>>> >>>> Pardon my naïvete, but what is FUSE_HANDLE? >>>> >>>> $ git grep -w FUSE_HANDLE fs >>>> $ >>> >>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >>> >>> Which means to communicate a variable sized "nodeid" >>> which can also be declared as an object id that survives server restart. >>> >>> Basically, the reason that I brought up LOOKUP_HANDLE is to >>> properly support NFS export of fuse filesystems. >>> >>> My incentive was to support a proper fuse server restart/remount/re-export >>> with the same fsid in /etc/exports, but this gives us a better starting point >>> for fuse server restart/re-connect. >> >> Sorry for resurrecting (again!) this discussion. I've been thinking about >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. >> However, I feel there are other operations that will need to return this >> new handle. >> >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. >> Doesn't this means that, if the user-space server supports the new >> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE >> request? > > Yes, I think that's what it means. > >> The same question applies for TMPFILE, LINK, etc. Or is there >> something special about the LOOKUP operation that I'm missing? >> > > Any command returning fuse_entry_out. > > READDIRPLUS, MKNOD, MKDIR, SYMLINK Btw, checkout out <libfuse>/doc/libfuse-operations.txt for these things. With double checking, though, the file was mostly created by AI (just added a correction today). With that easy to see the missing FUSE_TMPFILE. > > fuse_entry_out was extended once and fuse_reply_entry() > sends the size of the struct. Sorry, I'm confused. Where does fuse_reply_entry() send the size? > However fuse_reply_create() sends it with fuse_open_out > appended and fuse_add_direntry_plus() does not seem to write > record size at all, so server and client will need to agree on the > size of fuse_entry_out and this would need to be backward compat. > If both server and client declare support for FUSE_LOOKUP_HANDLE > it should be fine (?). If max_handle size becomes a value in fuse_init_out, server and client would use it? I think appended fuse_open_out could just follow the dynamic actual size of the handle - code that serializes/deserializes the response has to look up the actual handle size then. For example I wouldn't know what to put in for any of the example/passthrough* file systems as handle size - would need to be 128B, but the actual size will be typically much smaller. Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 22:24 ` Bernd Schubert @ 2025-11-05 22:42 ` Darrick J. Wong 2025-11-05 22:48 ` Bernd Schubert 0 siblings, 1 reply; 46+ messages in thread From: Darrick J. Wong @ 2025-11-05 22:42 UTC (permalink / raw) To: Bernd Schubert Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote: > > > On 11/4/25 14:10, Amir Goldstein wrote: > > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: > >> > >> On Tue, Sep 16 2025, Amir Goldstein wrote: > >> > >>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > >>>> > >>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > >>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 9/15/25 09:07, Amir Goldstein wrote: > >>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > >>>>>>>> > >>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote: > >>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote: > >>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>>>>>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I > >>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>>>>>>>>>>>>> aren't totally crazy. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here. Is this > >>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data > >>>>>>>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>>>>>>>>>>>>> potentally to be out of sync, right? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide? > >>>>>>>>>>>>> > >>>>>>>>>>>>> <echoing what we said on the ext4 call this morning> > >>>>>>>>>>>>> > >>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate > >>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests > >>>>>>>>>>>>> that were pending at the time. It might be the case that you have to > >>>>>>>>>>>>> that in the reverse order; I only know enough about the design of fuse > >>>>>>>>>>>>> to suspect that to be true. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with > >>>>>>>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them. > >>>>>>>>>>>> > >>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >>>>>>>>>>>> but probably GETATTR is a better option. > >>>>>>>>>>>> > >>>>>>>>>>>> So, are you currently working on any of this? Are you implementing this > >>>>>>>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >>>>>>>>>>>> look at fuse2fs too. > >>>>>>>>>>> > >>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and > >>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our > >>>>>>>>>>> DDN side. > >>>>>>>>>>> > >>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count. > >>>>>>>>>>> Now inode recovery might be hard, because we currently only have a > >>>>>>>>>>> 64-bit node-id - which is used my most fuse application as memory > >>>>>>>>>>> pointer. > >>>>>>>>>>> > >>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests > >>>>>>>>>>> with invalid node-IDs, that are casted and might provoke random memory > >>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or > >>>>>>>>>>> open_by_handle_at doesn't work well right now. > >>>>>>>>>>> > >>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart. > >>>>>>>>>>> The file handles could be stored into the fuse inode and also used for > >>>>>>>>>>> NFS export. > >>>>>>>>>>> > >>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >>>>>>>>>>> Adding Amir to CC. > >>>>>>>>>> > >>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >>>>>>>>> > >>>>>>>>> Thanks for the reference Amir! I even had been in that thread. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >>>>>>>>>>> Any objections against that? > >>>>>>>> > >>>>>>>> What if you actually /can/ reuse a nodeid after a restart? Consider > >>>>>>>> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > >>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > >>>>>>>> didn't delete it, obviously. > >>>>>>> > >>>>>>> FUSE_LOOKUP_HANDLE is a contract. > >>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign > >>>>>>> this contract, otherwise there is no way for client to know that the > >>>>>>> nodeids are persistent. > >>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > >>>>>>> API trivial. > >>>>>>> > >>>>>>>> > >>>>>>>> I suppose you could just ask for refreshed stat information and either > >>>>>>>> the server gives it to you and the fuse_inode lives; or the server > >>>>>>>> returns ENOENT and then we mark it bad. But I'd have to see code > >>>>>>>> patches to form a real opinion. > >>>>>>>> > >>>>>>> > >>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id> > >>>>>>> where fuse_instance_id can be its start time or random number. > >>>>>>> for auto invalidate, or maybe the fuse_instance_id should be > >>>>>>> a native part of FUSE protocol so that client knows to only invalidate > >>>>>>> attr cache in case of fuse_instance_id change? > >>>>>>> > >>>>>>> In any case, instead of a storm of revalidate messages after > >>>>>>> server restart, do it lazily on demand. > >>>>>> > >>>>>> For a network file system, probably. For fuse4fs or other block > >>>>>> based file systems, not sure. Darrick has the example of fsck. > >>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > >>>>>> fuse-server gets restarted, fsck'ed and some files get removed. > >>>>>> Now reading these inodes would still work - wouldn't it > >>>>>> be better to invalidate the cache before going into operation > >>>>>> again? > >>>>> > >>>>> Forgive me, I was making a wrong assumption that fuse4fs > >>>>> was using ext4 filehandle as nodeid, but of course it does not. > >>>> > >>>> Well now that you mention it, there /is/ a risk of shenanigans like > >>>> that. Consider: > >>>> > >>>> 1) fuse4fs mount an ext4 filesystem > >>>> 2) crash the fuse4fs server > >>>> <fuse4fs server restart stalls...> > >>>> 3) e2fsck -fy /dev/XXX deletes inode 17 > >>>> 4) someone else mounts the fs, makes some changes that result in 17 > >>>> being reallocated, user says "OOOOOPS", unmounts it > >>>> 5) fuse4fs server finally restarts, and reconnects to the kernel > >>>> > >>>> Hey, inode 17 is now a different file!! > >>>> > >>>> So maybe the nodeid has to be an actual file handle. Oh wait, no, > >>>> everything's (potentially) fine because fuse4fs supplied i_generation to > >>>> the kernel, and fuse_stale_inode will mark it bad if that happens. > >>>> > >>>> Hm ok then, at least there's a way out. :) > >>>> > >>> > >>> Right. > >>> > >>>>> The reason I made this wrong assumption is because fuse4fs *can* > >>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > >>>>> which is what my fuse passthough library [1] does. > >>>>> > >>>>> My claim was that although fuse4fs could support safe restart, which > >>>>> cannot read from recycled inode number with current FUSE protocol, > >>>>> doing so with FUSE_HANDLE protocol would express a commitment > >>>> > >>>> Pardon my naïvete, but what is FUSE_HANDLE? > >>>> > >>>> $ git grep -w FUSE_HANDLE fs > >>>> $ > >>> > >>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > >>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >>> > >>> Which means to communicate a variable sized "nodeid" > >>> which can also be declared as an object id that survives server restart. > >>> > >>> Basically, the reason that I brought up LOOKUP_HANDLE is to > >>> properly support NFS export of fuse filesystems. > >>> > >>> My incentive was to support a proper fuse server restart/remount/re-export > >>> with the same fsid in /etc/exports, but this gives us a better starting point > >>> for fuse server restart/re-connect. > >> > >> Sorry for resurrecting (again!) this discussion. I've been thinking about > >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. > >> However, I feel there are other operations that will need to return this > >> new handle. > >> > >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. > >> Doesn't this means that, if the user-space server supports the new > >> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE > >> request? > > > > Yes, I think that's what it means. > > > >> The same question applies for TMPFILE, LINK, etc. Or is there > >> something special about the LOOKUP operation that I'm missing? > >> > > > > Any command returning fuse_entry_out. > > > > READDIRPLUS, MKNOD, MKDIR, SYMLINK > > Btw, checkout out <libfuse>/doc/libfuse-operations.txt for these > things. With double checking, though, the file was mostly created by AI > (just added a correction today). With that easy to see the missing > FUSE_TMPFILE. > > > > > > fuse_entry_out was extended once and fuse_reply_entry() > > sends the size of the struct. > > Sorry, I'm confused. Where does fuse_reply_entry() send the size? > > > However fuse_reply_create() sends it with fuse_open_out > > appended and fuse_add_direntry_plus() does not seem to write > > record size at all, so server and client will need to agree on the > > size of fuse_entry_out and this would need to be backward compat. > > If both server and client declare support for FUSE_LOOKUP_HANDLE > > it should be fine (?). > > If max_handle size becomes a value in fuse_init_out, server and > client would use it? I think appended fuse_open_out could just > follow the dynamic actual size of the handle - code that > serializes/deserializes the response has to look up the actual > handle size then. For example I wouldn't know what to put in > for any of the example/passthrough* file systems as handle size - > would need to be 128B, but the actual size will be typically > much smaller. name_to_handle_at ? I guess the problem here is that technically speaking filesystems could have variable sized handles depending on the file. Sometimes you encode just the ino/gen of the child file, but other times you might know the parent and put that in the handle too. --D > > Thanks, > Bernd > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 22:42 ` Darrick J. Wong @ 2025-11-05 22:48 ` Bernd Schubert 2025-11-06 0:21 ` Darrick J. Wong 2025-11-06 10:13 ` Amir Goldstein 0 siblings, 2 replies; 46+ messages in thread From: Bernd Schubert @ 2025-11-05 22:48 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On 11/5/25 23:42, Darrick J. Wong wrote: > On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote: >> >> >> On 11/4/25 14:10, Amir Goldstein wrote: >>> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: >>>> >>>> On Tue, Sep 16 2025, Amir Goldstein wrote: >>>> >>>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: >>>>>> >>>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: >>>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 9/15/25 09:07, Amir Goldstein wrote: >>>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: >>>>>>>>>> >>>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote: >>>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote: >>>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: >>>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse >>>>>>>>>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us >>>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I >>>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts >>>>>>>>>>>>>>>>> aren't totally crazy. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here. Is this >>>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what >>>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data >>>>>>>>>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run >>>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going >>>>>>>>>>>>>>>> potentally to be out of sync, right? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new >>>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which >>>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate >>>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests >>>>>>>>>>>>>>> that were pending at the time. It might be the case that you have to >>>>>>>>>>>>>>> that in the reverse order; I only know enough about the design of fuse >>>>>>>>>>>>>>> to suspect that to be true. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with >>>>>>>>>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are >>>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, >>>>>>>>>>>>>> but probably GETATTR is a better option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So, are you currently working on any of this? Are you implementing this >>>>>>>>>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer >>>>>>>>>>>>>> look at fuse2fs too. >>>>>>>>>>>>> >>>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and >>>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our >>>>>>>>>>>>> DDN side. >>>>>>>>>>>>> >>>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse >>>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count. >>>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a >>>>>>>>>>>>> 64-bit node-id - which is used my most fuse application as memory >>>>>>>>>>>>> pointer. >>>>>>>>>>>>> >>>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends >>>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests >>>>>>>>>>>>> with invalid node-IDs, that are casted and might provoke random memory >>>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or >>>>>>>>>>>>> open_by_handle_at doesn't work well right now. >>>>>>>>>>>>> >>>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which >>>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle. >>>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart. >>>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for >>>>>>>>>>>>> NFS export. >>>>>>>>>>>>> >>>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly. >>>>>>>>>>>>> Adding Amir to CC. >>>>>>>>>>>> >>>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: >>>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >>>>>>>>>>> >>>>>>>>>>> Thanks for the reference Amir! I even had been in that thread. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which >>>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. >>>>>>>>>>>>> Any objections against that? >>>>>>>>>> >>>>>>>>>> What if you actually /can/ reuse a nodeid after a restart? Consider >>>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number. After a restart, >>>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery >>>>>>>>>> didn't delete it, obviously. >>>>>>>>> >>>>>>>>> FUSE_LOOKUP_HANDLE is a contract. >>>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign >>>>>>>>> this contract, otherwise there is no way for client to know that the >>>>>>>>> nodeids are persistent. >>>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() >>>>>>>>> API trivial. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> I suppose you could just ask for refreshed stat information and either >>>>>>>>>> the server gives it to you and the fuse_inode lives; or the server >>>>>>>>>> returns ENOENT and then we mark it bad. But I'd have to see code >>>>>>>>>> patches to form a real opinion. >>>>>>>>>> >>>>>>>>> >>>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id> >>>>>>>>> where fuse_instance_id can be its start time or random number. >>>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be >>>>>>>>> a native part of FUSE protocol so that client knows to only invalidate >>>>>>>>> attr cache in case of fuse_instance_id change? >>>>>>>>> >>>>>>>>> In any case, instead of a storm of revalidate messages after >>>>>>>>> server restart, do it lazily on demand. >>>>>>>> >>>>>>>> For a network file system, probably. For fuse4fs or other block >>>>>>>> based file systems, not sure. Darrick has the example of fsck. >>>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0, >>>>>>>> fuse-server gets restarted, fsck'ed and some files get removed. >>>>>>>> Now reading these inodes would still work - wouldn't it >>>>>>>> be better to invalidate the cache before going into operation >>>>>>>> again? >>>>>>> >>>>>>> Forgive me, I was making a wrong assumption that fuse4fs >>>>>>> was using ext4 filehandle as nodeid, but of course it does not. >>>>>> >>>>>> Well now that you mention it, there /is/ a risk of shenanigans like >>>>>> that. Consider: >>>>>> >>>>>> 1) fuse4fs mount an ext4 filesystem >>>>>> 2) crash the fuse4fs server >>>>>> <fuse4fs server restart stalls...> >>>>>> 3) e2fsck -fy /dev/XXX deletes inode 17 >>>>>> 4) someone else mounts the fs, makes some changes that result in 17 >>>>>> being reallocated, user says "OOOOOPS", unmounts it >>>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel >>>>>> >>>>>> Hey, inode 17 is now a different file!! >>>>>> >>>>>> So maybe the nodeid has to be an actual file handle. Oh wait, no, >>>>>> everything's (potentially) fine because fuse4fs supplied i_generation to >>>>>> the kernel, and fuse_stale_inode will mark it bad if that happens. >>>>>> >>>>>> Hm ok then, at least there's a way out. :) >>>>>> >>>>> >>>>> Right. >>>>> >>>>>>> The reason I made this wrong assumption is because fuse4fs *can* >>>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol >>>>>>> which is what my fuse passthough library [1] does. >>>>>>> >>>>>>> My claim was that although fuse4fs could support safe restart, which >>>>>>> cannot read from recycled inode number with current FUSE protocol, >>>>>>> doing so with FUSE_HANDLE protocol would express a commitment >>>>>> >>>>>> Pardon my naïvete, but what is FUSE_HANDLE? >>>>>> >>>>>> $ git grep -w FUSE_HANDLE fs >>>>>> $ >>>>> >>>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): >>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ >>>>> >>>>> Which means to communicate a variable sized "nodeid" >>>>> which can also be declared as an object id that survives server restart. >>>>> >>>>> Basically, the reason that I brought up LOOKUP_HANDLE is to >>>>> properly support NFS export of fuse filesystems. >>>>> >>>>> My incentive was to support a proper fuse server restart/remount/re-export >>>>> with the same fsid in /etc/exports, but this gives us a better starting point >>>>> for fuse server restart/re-connect. >>>> >>>> Sorry for resurrecting (again!) this discussion. I've been thinking about >>>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. >>>> However, I feel there are other operations that will need to return this >>>> new handle. >>>> >>>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. >>>> Doesn't this means that, if the user-space server supports the new >>>> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE >>>> request? >>> >>> Yes, I think that's what it means. >>> >>>> The same question applies for TMPFILE, LINK, etc. Or is there >>>> something special about the LOOKUP operation that I'm missing? >>>> >>> >>> Any command returning fuse_entry_out. >>> >>> READDIRPLUS, MKNOD, MKDIR, SYMLINK >> >> Btw, checkout out <libfuse>/doc/libfuse-operations.txt for these >> things. With double checking, though, the file was mostly created by AI >> (just added a correction today). With that easy to see the missing >> FUSE_TMPFILE. >> >> >>> >>> fuse_entry_out was extended once and fuse_reply_entry() >>> sends the size of the struct. >> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? >> >>> However fuse_reply_create() sends it with fuse_open_out >>> appended and fuse_add_direntry_plus() does not seem to write >>> record size at all, so server and client will need to agree on the >>> size of fuse_entry_out and this would need to be backward compat. >>> If both server and client declare support for FUSE_LOOKUP_HANDLE >>> it should be fine (?). >> >> If max_handle size becomes a value in fuse_init_out, server and >> client would use it? I think appended fuse_open_out could just >> follow the dynamic actual size of the handle - code that >> serializes/deserializes the response has to look up the actual >> handle size then. For example I wouldn't know what to put in >> for any of the example/passthrough* file systems as handle size - >> would need to be 128B, but the actual size will be typically >> much smaller. > > name_to_handle_at ? > > I guess the problem here is that technically speaking filesystems could > have variable sized handles depending on the file. Sometimes you encode > just the ino/gen of the child file, but other times you might know the > parent and put that in the handle too. Yeah, I don't think it would be reliable for *all* file systems to use name_to_handle_at on startup on some example file/directory. At least not without knowing all the details of the underlying passthrough file system. Thanks, Bernd ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 22:48 ` Bernd Schubert @ 2025-11-06 0:21 ` Darrick J. Wong 2025-11-06 10:13 ` Amir Goldstein 1 sibling, 0 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-11-06 0:21 UTC (permalink / raw) To: Bernd Schubert Cc: Amir Goldstein, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Wed, Nov 05, 2025 at 11:48:21PM +0100, Bernd Schubert wrote: > > > On 11/5/25 23:42, Darrick J. Wong wrote: > > On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote: > >> > >> > >> On 11/4/25 14:10, Amir Goldstein wrote: > >>> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote: > >>>> > >>>> On Tue, Sep 16 2025, Amir Goldstein wrote: > >>>> > >>>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote: > >>>>>> > >>>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > >>>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 9/15/25 09:07, Amir Goldstein wrote: > >>>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote: > >>>>>>>>>> > >>>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote: > >>>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote: > >>>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote: > >>>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse > >>>>>>>>>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us > >>>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I > >>>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts > >>>>>>>>>>>>>>>>> aren't totally crazy. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here. Is this > >>>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what > >>>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data > >>>>>>>>>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run > >>>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going > >>>>>>>>>>>>>>>> potentally to be out of sync, right? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new > >>>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which > >>>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate > >>>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests > >>>>>>>>>>>>>>> that were pending at the time. It might be the case that you have to > >>>>>>>>>>>>>>> that in the reverse order; I only know enough about the design of fuse > >>>>>>>>>>>>>>> to suspect that to be true. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with > >>>>>>>>>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are > >>>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests, > >>>>>>>>>>>>>> but probably GETATTR is a better option. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So, are you currently working on any of this? Are you implementing this > >>>>>>>>>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer > >>>>>>>>>>>>>> look at fuse2fs too. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and > >>>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our > >>>>>>>>>>>>> DDN side. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse > >>>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count. > >>>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a > >>>>>>>>>>>>> 64-bit node-id - which is used my most fuse application as memory > >>>>>>>>>>>>> pointer. > >>>>>>>>>>>>> > >>>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends > >>>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests > >>>>>>>>>>>>> with invalid node-IDs, that are casted and might provoke random memory > >>>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or > >>>>>>>>>>>>> open_by_handle_at doesn't work well right now. > >>>>>>>>>>>>> > >>>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which > >>>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle. > >>>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart. > >>>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for > >>>>>>>>>>>>> NFS export. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly. > >>>>>>>>>>>>> Adding Amir to CC. > >>>>>>>>>>>> > >>>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread: > >>>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >>>>>>>>>>> > >>>>>>>>>>> Thanks for the reference Amir! I even had been in that thread. > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which > >>>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad. > >>>>>>>>>>>>> Any objections against that? > >>>>>>>>>> > >>>>>>>>>> What if you actually /can/ reuse a nodeid after a restart? Consider > >>>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number. After a restart, > >>>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery > >>>>>>>>>> didn't delete it, obviously. > >>>>>>>>> > >>>>>>>>> FUSE_LOOKUP_HANDLE is a contract. > >>>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign > >>>>>>>>> this contract, otherwise there is no way for client to know that the > >>>>>>>>> nodeids are persistent. > >>>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle() > >>>>>>>>> API trivial. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I suppose you could just ask for refreshed stat information and either > >>>>>>>>>> the server gives it to you and the fuse_inode lives; or the server > >>>>>>>>>> returns ENOENT and then we mark it bad. But I'd have to see code > >>>>>>>>>> patches to form a real opinion. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id> > >>>>>>>>> where fuse_instance_id can be its start time or random number. > >>>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be > >>>>>>>>> a native part of FUSE protocol so that client knows to only invalidate > >>>>>>>>> attr cache in case of fuse_instance_id change? > >>>>>>>>> > >>>>>>>>> In any case, instead of a storm of revalidate messages after > >>>>>>>>> server restart, do it lazily on demand. > >>>>>>>> > >>>>>>>> For a network file system, probably. For fuse4fs or other block > >>>>>>>> based file systems, not sure. Darrick has the example of fsck. > >>>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > >>>>>>>> fuse-server gets restarted, fsck'ed and some files get removed. > >>>>>>>> Now reading these inodes would still work - wouldn't it > >>>>>>>> be better to invalidate the cache before going into operation > >>>>>>>> again? > >>>>>>> > >>>>>>> Forgive me, I was making a wrong assumption that fuse4fs > >>>>>>> was using ext4 filehandle as nodeid, but of course it does not. > >>>>>> > >>>>>> Well now that you mention it, there /is/ a risk of shenanigans like > >>>>>> that. Consider: > >>>>>> > >>>>>> 1) fuse4fs mount an ext4 filesystem > >>>>>> 2) crash the fuse4fs server > >>>>>> <fuse4fs server restart stalls...> > >>>>>> 3) e2fsck -fy /dev/XXX deletes inode 17 > >>>>>> 4) someone else mounts the fs, makes some changes that result in 17 > >>>>>> being reallocated, user says "OOOOOPS", unmounts it > >>>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel > >>>>>> > >>>>>> Hey, inode 17 is now a different file!! > >>>>>> > >>>>>> So maybe the nodeid has to be an actual file handle. Oh wait, no, > >>>>>> everything's (potentially) fine because fuse4fs supplied i_generation to > >>>>>> the kernel, and fuse_stale_inode will mark it bad if that happens. > >>>>>> > >>>>>> Hm ok then, at least there's a way out. :) > >>>>>> > >>>>> > >>>>> Right. > >>>>> > >>>>>>> The reason I made this wrong assumption is because fuse4fs *can* > >>>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol > >>>>>>> which is what my fuse passthough library [1] does. > >>>>>>> > >>>>>>> My claim was that although fuse4fs could support safe restart, which > >>>>>>> cannot read from recycled inode number with current FUSE protocol, > >>>>>>> doing so with FUSE_HANDLE protocol would express a commitment > >>>>>> > >>>>>> Pardon my naïvete, but what is FUSE_HANDLE? > >>>>>> > >>>>>> $ git grep -w FUSE_HANDLE fs > >>>>>> $ > >>>>> > >>>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > >>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >>>>> > >>>>> Which means to communicate a variable sized "nodeid" > >>>>> which can also be declared as an object id that survives server restart. > >>>>> > >>>>> Basically, the reason that I brought up LOOKUP_HANDLE is to > >>>>> properly support NFS export of fuse filesystems. > >>>>> > >>>>> My incentive was to support a proper fuse server restart/remount/re-export > >>>>> with the same fsid in /etc/exports, but this gives us a better starting point > >>>>> for fuse server restart/re-connect. > >>>> > >>>> Sorry for resurrecting (again!) this discussion. I've been thinking about > >>>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation. > >>>> However, I feel there are other operations that will need to return this > >>>> new handle. > >>>> > >>>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. > >>>> Doesn't this means that, if the user-space server supports the new > >>>> LOOKUP_HANDLE, it should also return an handle in reply to the CREATE > >>>> request? > >>> > >>> Yes, I think that's what it means. > >>> > >>>> The same question applies for TMPFILE, LINK, etc. Or is there > >>>> something special about the LOOKUP operation that I'm missing? > >>>> > >>> > >>> Any command returning fuse_entry_out. > >>> > >>> READDIRPLUS, MKNOD, MKDIR, SYMLINK > >> > >> Btw, checkout out <libfuse>/doc/libfuse-operations.txt for these > >> things. With double checking, though, the file was mostly created by AI > >> (just added a correction today). With that easy to see the missing > >> FUSE_TMPFILE. > >> > >> > >>> > >>> fuse_entry_out was extended once and fuse_reply_entry() > >>> sends the size of the struct. > >> > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? > >> > >>> However fuse_reply_create() sends it with fuse_open_out > >>> appended and fuse_add_direntry_plus() does not seem to write > >>> record size at all, so server and client will need to agree on the > >>> size of fuse_entry_out and this would need to be backward compat. > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE > >>> it should be fine (?). > >> > >> If max_handle size becomes a value in fuse_init_out, server and > >> client would use it? I think appended fuse_open_out could just > >> follow the dynamic actual size of the handle - code that > >> serializes/deserializes the response has to look up the actual > >> handle size then. For example I wouldn't know what to put in > >> for any of the example/passthrough* file systems as handle size - > >> would need to be 128B, but the actual size will be typically > >> much smaller. > > > > name_to_handle_at ? > > > > I guess the problem here is that technically speaking filesystems could > > have variable sized handles depending on the file. Sometimes you encode > > just the ino/gen of the child file, but other times you might know the > > parent and put that in the handle too. > > Yeah, I don't think it would be reliable for *all* file systems to use > name_to_handle_at on startup on some example file/directory. At least > not without knowing all the details of the underlying passthrough file > system. I think if you can send arbitrarily sized outblobs back to the kernel then it would be ok for a filesystem to have different handle sizes for a file, just so long as it doesn't change during the lifetime of a file. Obviously you couldn't then have a meaningful fs-wide max_handle_size. --D > > Thanks, > Bernd > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-05 22:48 ` Bernd Schubert 2025-11-06 0:21 ` Darrick J. Wong @ 2025-11-06 10:13 ` Amir Goldstein 2025-11-06 15:12 ` Luis Henriques 2025-11-06 15:49 ` Darrick J. Wong 1 sibling, 2 replies; 46+ messages in thread From: Amir Goldstein @ 2025-11-06 10:13 UTC (permalink / raw) To: Bernd Schubert, Darrick J. Wong Cc: Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen [...] > >>> fuse_entry_out was extended once and fuse_reply_entry() > >>> sends the size of the struct. > >> > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? Sorry, I meant to say that the reply size is variable. The size is obviously determined at init time. > >> > >>> However fuse_reply_create() sends it with fuse_open_out > >>> appended and fuse_add_direntry_plus() does not seem to write > >>> record size at all, so server and client will need to agree on the > >>> size of fuse_entry_out and this would need to be backward compat. > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE > >>> it should be fine (?). > >> > >> If max_handle size becomes a value in fuse_init_out, server and > >> client would use it? I think appended fuse_open_out could just > >> follow the dynamic actual size of the handle - code that > >> serializes/deserializes the response has to look up the actual > >> handle size then. For example I wouldn't know what to put in > >> for any of the example/passthrough* file systems as handle size - > >> would need to be 128B, but the actual size will be typically > >> much smaller. > > > > name_to_handle_at ? > > > > I guess the problem here is that technically speaking filesystems could > > have variable sized handles depending on the file. Sometimes you encode > > just the ino/gen of the child file, but other times you might know the > > parent and put that in the handle too. > > Yeah, I don't think it would be reliable for *all* file systems to use > name_to_handle_at on startup on some example file/directory. At least > not without knowing all the details of the underlying passthrough file > system. > Maybe it's not a world-wide general solution, but it is a practical one. My fuse_passthrough library knows how to detect xfs and ext4 and knows about the size of their file handles. https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645 A server could optimize for max_handle_size if it knows it or use MAX_HANDLE_SZ if it doesn't. Keep in mind that for the sake of restarting fuse servers (title of this thread) file handles do not need to be the actual filesystem file handles. Server can use its own pid as generation and then all inodes get auto invalidated on server restart. Not invalidating file handles on server restart, because the file handles are persistent file handles is an optimization. LOOKUP_HANDLE still needs to provide the inode+gen of the parent which LOOKUP currently does not. I did not understand why Darrick's suggestion of a flag that ino+gen suffice is any different then max_handle_size = 12 and using the standard FILEID_INO64_GEN in that case? Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 10:13 ` Amir Goldstein @ 2025-11-06 15:12 ` Luis Henriques 2025-11-06 15:58 ` Luis Henriques 2025-11-06 15:49 ` Darrick J. Wong 1 sibling, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-11-06 15:12 UTC (permalink / raw) To: Amir Goldstein Cc: Bernd Schubert, Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Thu, Nov 06 2025, Amir Goldstein wrote: > [...] > >> >>> fuse_entry_out was extended once and fuse_reply_entry() >> >>> sends the size of the struct. >> >> >> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? > > Sorry, I meant to say that the reply size is variable. > The size is obviously determined at init time. > >> >> >> >>> However fuse_reply_create() sends it with fuse_open_out >> >>> appended and fuse_add_direntry_plus() does not seem to write >> >>> record size at all, so server and client will need to agree on the >> >>> size of fuse_entry_out and this would need to be backward compat. >> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE >> >>> it should be fine (?). >> >> >> >> If max_handle size becomes a value in fuse_init_out, server and >> >> client would use it? I think appended fuse_open_out could just >> >> follow the dynamic actual size of the handle - code that >> >> serializes/deserializes the response has to look up the actual >> >> handle size then. For example I wouldn't know what to put in >> >> for any of the example/passthrough* file systems as handle size - >> >> would need to be 128B, but the actual size will be typically >> >> much smaller. >> > >> > name_to_handle_at ? >> > >> > I guess the problem here is that technically speaking filesystems could >> > have variable sized handles depending on the file. Sometimes you encode >> > just the ino/gen of the child file, but other times you might know the >> > parent and put that in the handle too. >> >> Yeah, I don't think it would be reliable for *all* file systems to use >> name_to_handle_at on startup on some example file/directory. At least >> not without knowing all the details of the underlying passthrough file >> system. >> > > Maybe it's not a world-wide general solution, but it is a practical one. > > My fuse_passthrough library knows how to detect xfs and ext4 and > knows about the size of their file handles. > https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645 > > A server could optimize for max_handle_size if it knows it or use > MAX_HANDLE_SZ if it doesn't. > > Keep in mind that for the sake of restarting fuse servers (title of this thread) > file handles do not need to be the actual filesystem file handles. > Server can use its own pid as generation and then all inodes get > auto invalidated on server restart. > > Not invalidating file handles on server restart, because the file handles > are persistent file handles is an optimization. > > LOOKUP_HANDLE still needs to provide the inode+gen of the parent > which LOOKUP currently does not. One additional complication I just realised is that FUSE_LOOKUP already uses up all the 3 in_args. So, my initial plan of having FUSE_LOOKUP_HANDLE using a similar structure to FUSE_LOOKUP, with the additional parent handle passed to the server through the in_args needs a different solution. (Anyway, I'll need to read through the whole thread(s) again to better digest all the information.) Cheers, -- Luís > > I did not understand why Darrick's suggestion of a flag that ino+gen > suffice is any different then max_handle_size = 12 and using the > standard FILEID_INO64_GEN in that case? > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 15:12 ` Luis Henriques @ 2025-11-06 15:58 ` Luis Henriques 0 siblings, 0 replies; 46+ messages in thread From: Luis Henriques @ 2025-11-06 15:58 UTC (permalink / raw) To: Amir Goldstein Cc: Bernd Schubert, Darrick J. Wong, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Thu, Nov 06 2025, Luis Henriques wrote: > On Thu, Nov 06 2025, Amir Goldstein wrote: > >> [...] >> >>> >>> fuse_entry_out was extended once and fuse_reply_entry() >>> >>> sends the size of the struct. >>> >> >>> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? >> >> Sorry, I meant to say that the reply size is variable. >> The size is obviously determined at init time. >> >>> >> >>> >>> However fuse_reply_create() sends it with fuse_open_out >>> >>> appended and fuse_add_direntry_plus() does not seem to write >>> >>> record size at all, so server and client will need to agree on the >>> >>> size of fuse_entry_out and this would need to be backward compat. >>> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE >>> >>> it should be fine (?). >>> >> >>> >> If max_handle size becomes a value in fuse_init_out, server and >>> >> client would use it? I think appended fuse_open_out could just >>> >> follow the dynamic actual size of the handle - code that >>> >> serializes/deserializes the response has to look up the actual >>> >> handle size then. For example I wouldn't know what to put in >>> >> for any of the example/passthrough* file systems as handle size - >>> >> would need to be 128B, but the actual size will be typically >>> >> much smaller. >>> > >>> > name_to_handle_at ? >>> > >>> > I guess the problem here is that technically speaking filesystems could >>> > have variable sized handles depending on the file. Sometimes you encode >>> > just the ino/gen of the child file, but other times you might know the >>> > parent and put that in the handle too. >>> >>> Yeah, I don't think it would be reliable for *all* file systems to use >>> name_to_handle_at on startup on some example file/directory. At least >>> not without knowing all the details of the underlying passthrough file >>> system. >>> >> >> Maybe it's not a world-wide general solution, but it is a practical one. >> >> My fuse_passthrough library knows how to detect xfs and ext4 and >> knows about the size of their file handles. >> https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645 >> >> A server could optimize for max_handle_size if it knows it or use >> MAX_HANDLE_SZ if it doesn't. >> >> Keep in mind that for the sake of restarting fuse servers (title of this thread) >> file handles do not need to be the actual filesystem file handles. >> Server can use its own pid as generation and then all inodes get >> auto invalidated on server restart. >> >> Not invalidating file handles on server restart, because the file handles >> are persistent file handles is an optimization. >> >> LOOKUP_HANDLE still needs to provide the inode+gen of the parent >> which LOOKUP currently does not. > > One additional complication I just realised is that FUSE_LOOKUP already > uses up all the 3 in_args. Ok, ignore me. We can have 4 in_args, not 3. Cheers -- Luís > So, my initial plan of having FUSE_LOOKUP_HANDLE using a similar structure > to FUSE_LOOKUP, with the additional parent handle passed to the server > through the in_args needs a different solution. > > (Anyway, I'll need to read through the whole thread(s) again to better > digest all the information.) > > Cheers, > -- > Luís > > >> >> I did not understand why Darrick's suggestion of a flag that ino+gen >> suffice is any different then max_handle_size = 12 and using the >> standard FILEID_INO64_GEN in that case? >> >> Thanks, >> Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 10:13 ` Amir Goldstein 2025-11-06 15:12 ` Luis Henriques @ 2025-11-06 15:49 ` Darrick J. Wong 2025-11-06 16:08 ` Stef Bon 2025-11-06 16:11 ` Amir Goldstein 1 sibling, 2 replies; 46+ messages in thread From: Darrick J. Wong @ 2025-11-06 15:49 UTC (permalink / raw) To: Amir Goldstein Cc: Bernd Schubert, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Thu, Nov 06, 2025 at 11:13:01AM +0100, Amir Goldstein wrote: > [...] > > > >>> fuse_entry_out was extended once and fuse_reply_entry() > > >>> sends the size of the struct. > > >> > > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? > > Sorry, I meant to say that the reply size is variable. > The size is obviously determined at init time. > > > >> > > >>> However fuse_reply_create() sends it with fuse_open_out > > >>> appended and fuse_add_direntry_plus() does not seem to write > > >>> record size at all, so server and client will need to agree on the > > >>> size of fuse_entry_out and this would need to be backward compat. > > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE > > >>> it should be fine (?). > > >> > > >> If max_handle size becomes a value in fuse_init_out, server and > > >> client would use it? I think appended fuse_open_out could just > > >> follow the dynamic actual size of the handle - code that > > >> serializes/deserializes the response has to look up the actual > > >> handle size then. For example I wouldn't know what to put in > > >> for any of the example/passthrough* file systems as handle size - > > >> would need to be 128B, but the actual size will be typically > > >> much smaller. > > > > > > name_to_handle_at ? > > > > > > I guess the problem here is that technically speaking filesystems could > > > have variable sized handles depending on the file. Sometimes you encode > > > just the ino/gen of the child file, but other times you might know the > > > parent and put that in the handle too. > > > > Yeah, I don't think it would be reliable for *all* file systems to use > > name_to_handle_at on startup on some example file/directory. At least > > not without knowing all the details of the underlying passthrough file > > system. > > > > Maybe it's not a world-wide general solution, but it is a practical one. > > My fuse_passthrough library knows how to detect xfs and ext4 and > knows about the size of their file handles. > https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645 > > A server could optimize for max_handle_size if it knows it or use > MAX_HANDLE_SZ if it doesn't. > > Keep in mind that for the sake of restarting fuse servers (title of this thread) > file handles do not need to be the actual filesystem file handles. > Server can use its own pid as generation and then all inodes get > auto invalidated on server restart. > > Not invalidating file handles on server restart, because the file handles > are persistent file handles is an optimization. > > LOOKUP_HANDLE still needs to provide the inode+gen of the parent > which LOOKUP currently does not. > > I did not understand why Darrick's suggestion of a flag that ino+gen > suffice is any different then max_handle_size = 12 and using the > standard FILEID_INO64_GEN in that case? Technically speaking, a 12-byte handle could contain anything. Maybe you have a u32 volumeid, inumber, and generation, whereas the flag that I was mumbling about would specify the handle format as well. Speaking of which: should file handles be exporting volume ids for the filesystem (btrfs) that supports it? --D > Thanks, > Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 15:49 ` Darrick J. Wong @ 2025-11-06 16:08 ` Stef Bon 2025-11-07 9:25 ` Luis Henriques 2025-11-06 16:11 ` Amir Goldstein 1 sibling, 1 reply; 46+ messages in thread From: Stef Bon @ 2025-11-06 16:08 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Bernd Schubert, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen Hi, is implementing a lookup using a handle to be in the kernel? I've written a FUSE fs for sftp using SSH as transport, where the lookup call normally has to create a path (relative to the root of the sftp) and send that to the remote server. It saves the creation of this path if there is a handle available. When doing an opendir, this is normally followed by a lookup for every dentry. (sftp does not support readdirplus) Now in this case there is a handle available (the one used by opendir, or one created with open), so the fuse daemon I wrote used that to proceed. (and so not create a path). So it can also go in userspace. Stef ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 16:08 ` Stef Bon @ 2025-11-07 9:25 ` Luis Henriques 2025-11-10 8:20 ` Stef Bon 0 siblings, 1 reply; 46+ messages in thread From: Luis Henriques @ 2025-11-07 9:25 UTC (permalink / raw) To: Stef Bon Cc: Darrick J. Wong, Amir Goldstein, Bernd Schubert, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen Hi Stef, On Thu, Nov 06 2025, Stef Bon wrote: > Hi, > > is implementing a lookup using a handle to be in the kernel? What we're talking here is a new FUSE operation, FUSE_LOOKUP_HANDLE. The scope here is mostly related to servers restartability: being able to restart a FUSE server without unmounting the file system. But other scopes are also relevant (e.g. NFS exports). Just in case you missed it, here's a link to the full discussion: https://lore.kernel.org/all/8734afp0ct.fsf@igalia.com/ and to an older discussion, also relevant: https://lore.kernel.org/all/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ Cheers, -- Luís > I've written a FUSE fs for sftp using SSH as transport, where the > lookup call normally has to create a path (relative to the root of the > sftp) and send that to the remote server. > It saves the creation of this path if there is a handle available. > When doing an opendir, this is normally followed by a lookup for every > dentry. (sftp does not support readdirplus) Now in this case there is > a handle available (the one used by opendir, or one created with > open), so the fuse daemon I wrote used that to proceed. (and so not > create a path). > > So it can also go in userspace. > > Stef > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-07 9:25 ` Luis Henriques @ 2025-11-10 8:20 ` Stef Bon 0 siblings, 0 replies; 46+ messages in thread From: Stef Bon @ 2025-11-10 8:20 UTC (permalink / raw) To: Luis Henriques; +Cc: linux-fsdevel Hi, I see this has to do with the name to handle calls to provide clients a reference to fs objects which remain valid over a restart right? Stef ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Another take at restarting FUSE servers 2025-11-06 15:49 ` Darrick J. Wong 2025-11-06 16:08 ` Stef Bon @ 2025-11-06 16:11 ` Amir Goldstein 1 sibling, 0 replies; 46+ messages in thread From: Amir Goldstein @ 2025-11-06 16:11 UTC (permalink / raw) To: Darrick J. Wong Cc: Bernd Schubert, Luis Henriques, Bernd Schubert, Theodore Ts'o, Miklos Szeredi, linux-fsdevel, linux-kernel, Kevin Chen On Thu, Nov 6, 2025 at 4:49 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Thu, Nov 06, 2025 at 11:13:01AM +0100, Amir Goldstein wrote: > > [...] > > > > > >>> fuse_entry_out was extended once and fuse_reply_entry() > > > >>> sends the size of the struct. > > > >> > > > >> Sorry, I'm confused. Where does fuse_reply_entry() send the size? > > > > Sorry, I meant to say that the reply size is variable. > > The size is obviously determined at init time. > > > > > >> > > > >>> However fuse_reply_create() sends it with fuse_open_out > > > >>> appended and fuse_add_direntry_plus() does not seem to write > > > >>> record size at all, so server and client will need to agree on the > > > >>> size of fuse_entry_out and this would need to be backward compat. > > > >>> If both server and client declare support for FUSE_LOOKUP_HANDLE > > > >>> it should be fine (?). > > > >> > > > >> If max_handle size becomes a value in fuse_init_out, server and > > > >> client would use it? I think appended fuse_open_out could just > > > >> follow the dynamic actual size of the handle - code that > > > >> serializes/deserializes the response has to look up the actual > > > >> handle size then. For example I wouldn't know what to put in > > > >> for any of the example/passthrough* file systems as handle size - > > > >> would need to be 128B, but the actual size will be typically > > > >> much smaller. > > > > > > > > name_to_handle_at ? > > > > > > > > I guess the problem here is that technically speaking filesystems could > > > > have variable sized handles depending on the file. Sometimes you encode > > > > just the ino/gen of the child file, but other times you might know the > > > > parent and put that in the handle too. > > > > > > Yeah, I don't think it would be reliable for *all* file systems to use > > > name_to_handle_at on startup on some example file/directory. At least > > > not without knowing all the details of the underlying passthrough file > > > system. > > > > > > > Maybe it's not a world-wide general solution, but it is a practical one. > > > > My fuse_passthrough library knows how to detect xfs and ext4 and > > knows about the size of their file handles. > > https://github.com/amir73il/libfuse/blob/fuse_passthrough/passthrough/fuse_passthrough.cpp#L645 > > > > A server could optimize for max_handle_size if it knows it or use > > MAX_HANDLE_SZ if it doesn't. > > > > Keep in mind that for the sake of restarting fuse servers (title of this thread) > > file handles do not need to be the actual filesystem file handles. > > Server can use its own pid as generation and then all inodes get > > auto invalidated on server restart. > > > > Not invalidating file handles on server restart, because the file handles > > are persistent file handles is an optimization. > > > > LOOKUP_HANDLE still needs to provide the inode+gen of the parent > > which LOOKUP currently does not. > > > > I did not understand why Darrick's suggestion of a flag that ino+gen > > suffice is any different then max_handle_size = 12 and using the > > standard FILEID_INO64_GEN in that case? > > Technically speaking, a 12-byte handle could contain anything. Maybe > you have a u32 volumeid, inumber, and generation, whereas the flag that > I was mumbling about would specify the handle format as well. > > Speaking of which: should file handles be exporting volume ids for the > filesystem (btrfs) that supports it? > file handles are opaque so the server can put whatever server wants in them it does not need to put the native fs file handles (in case of passthrough fs or in case of iomap fs). Take struct ovl_fh for example, the format of file handles that overlayfs exports to NFS encapsulates the underlying fs uuid and file handle. Note that when exporting such a fuse filesystem to NFS, it is still the responsibility of the exporter to specify an explicit fsid identifier in /etc/exports for this fuse server type/instance and then the file handles generated by this server are expected to be unique in the scope of this NFS export. Not sure how much of this is relevant for the use case of restarting a fuse server. Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2025-11-10 8:21 UTC | newest] Thread overview: 46+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-29 13:56 [RFC] Another take at restarting FUSE servers Luis Henriques 2025-07-29 23:38 ` Darrick J. Wong 2025-07-30 14:04 ` Luis Henriques 2025-07-31 11:33 ` Christian Brauner 2025-07-31 12:23 ` Luis Henriques 2025-07-31 17:29 ` Darrick J. Wong 2025-08-04 8:45 ` Christian Brauner 2025-08-12 19:28 ` Darrick J. Wong 2025-07-31 13:04 ` Theodore Ts'o 2025-07-31 17:38 ` Darrick J. Wong 2025-08-01 10:15 ` Luis Henriques 2025-08-11 15:43 ` Darrick J. Wong 2025-08-13 13:14 ` Luis Henriques 2025-09-12 10:31 ` Bernd Schubert 2025-09-12 11:41 ` Amir Goldstein 2025-09-12 12:29 ` Bernd Schubert 2025-09-12 14:58 ` Darrick J. Wong 2025-09-12 15:20 ` Bernd Schubert 2025-09-15 4:43 ` Darrick J. Wong 2025-09-15 7:07 ` Amir Goldstein 2025-09-15 8:27 ` Bernd Schubert 2025-09-15 8:41 ` Amir Goldstein 2025-09-16 2:53 ` Darrick J. Wong 2025-09-16 7:59 ` Amir Goldstein 2025-09-18 17:50 ` Darrick J. Wong 2025-11-04 11:40 ` Luis Henriques 2025-11-04 13:10 ` Amir Goldstein 2025-11-04 14:52 ` Luis Henriques 2025-11-05 10:21 ` Amir Goldstein 2025-11-05 11:50 ` Luis Henriques 2025-11-05 15:30 ` Amir Goldstein 2025-11-05 21:38 ` Darrick J. Wong 2025-11-05 21:46 ` Bernd Schubert 2025-11-05 22:06 ` Bernd Schubert 2025-11-05 22:24 ` Bernd Schubert 2025-11-05 22:42 ` Darrick J. Wong 2025-11-05 22:48 ` Bernd Schubert 2025-11-06 0:21 ` Darrick J. Wong 2025-11-06 10:13 ` Amir Goldstein 2025-11-06 15:12 ` Luis Henriques 2025-11-06 15:58 ` Luis Henriques 2025-11-06 15:49 ` Darrick J. Wong 2025-11-06 16:08 ` Stef Bon 2025-11-07 9:25 ` Luis Henriques 2025-11-10 8:20 ` Stef Bon 2025-11-06 16:11 ` Amir Goldstein
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).