Re: [RFC] Another take at restarting FUSE servers


From: Bernd Schubert <bschubert@ddn.com>
To: Amir Goldstein <amir73il@gmail.com>, Luis Henriques <luis@igalia.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Bernd Schubert <bernd@bsbernd.com>, Theodore Ts'o <tytso@mit.edu>,
	Miklos Szeredi <miklos@szeredi.hu>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Kevin Chen <kchen@ddn.com>
Subject: Re: [RFC] Another take at restarting FUSE servers
Date: Wed, 5 Nov 2025 23:24:01 +0100
Message-ID: <7ee1e308-c58c-45a0-8ded-6694feae097f@ddn.com>
In-Reply-To: <CAOQ4uxg7b0mupCVaouPXPGNN=Ji2XceeceUf8L6pW8+vq3uOMQ@mail.gmail.com>



On 11/4/25 14:10, Amir Goldstein wrote:
> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@igalia.com> wrote:
>>
>> On Tue, Sep 16 2025, Amir Goldstein wrote:
>>
>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>>>
>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/15/25 09:07, Amir Goldstein wrote:
>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote:
>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>>>>>>>>> aren't totally crazy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>>>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's
>>>>>>>>>>>>>> potentially going to be out of sync, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>>>>>>>>
>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>>>>>>>>
>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>>>>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>>>>>>>>> do that in the reverse order; I only know enough about the design of fuse
>>>>>>>>>>>>> to suspect that to be true.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>>>>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>>>>>>>>
>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>>>>>>>>> but probably GETATTR is a better option.
>>>>>>>>>>>>
>>>>>>>>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>>>>>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>>>>>>>>> look at fuse2fs too.
>>>>>>>>>>>
>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>>>>>>>>> DDN side.
>>>>>>>>>>>
>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count.
>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a
>>>>>>>>>>> 64-bit node-id - which is used by most fuse applications as a memory
>>>>>>>>>>> pointer.
>>>>>>>>>>>
>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests
>>>>>>>>>>> with invalid node-IDs, that are cast and might provoke random memory
>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>>>>>>>>> open_by_handle_at doesn't work well right now.
>>>>>>>>>>>
>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for
>>>>>>>>>>> NFS export.
>>>>>>>>>>>
>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>>>>>>>>> Adding Amir to CC.
>>>>>>>>>>
>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>>>>>>>
>>>>>>>>> Thanks for the reference Amir! I even had been in that thread.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>>>>>>>>> Any objections against that?
>>>>>>>>
>>>>>>>> What if you actually /can/ reuse a nodeid after a restart?  Consider
>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>>>>>>>> didn't delete it, obviously.
>>>>>>>
>>>>>>> FUSE_LOOKUP_HANDLE is a contract.
>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign
>>>>>>> this contract, otherwise there is no way for client to know that the
>>>>>>> nodeids are persistent.
>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
>>>>>>> API trivial.
>>>>>>>
>>>>>>>>
>>>>>>>> I suppose you could just ask for refreshed stat information and either
>>>>>>>> the server gives it to you and the fuse_inode lives; or the server
>>>>>>>> returns ENOENT and then we mark it bad.  But I'd have to see code
>>>>>>>> patches to form a real opinion.
>>>>>>>>
>>>>>>>
>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
>>>>>>> where fuse_instance_id can be its start time or random number.
>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be
>>>>>>> a native part of FUSE protocol so that client knows to only invalidate
>>>>>>> attr cache in case of fuse_instance_id change?
>>>>>>>
>>>>>>> In any case, instead of a storm of revalidate messages after
>>>>>>> server restart, do it lazily on demand.
>>>>>>
>>>>>> For a network file system, probably. For fuse4fs or other block
>>>>>> based file systems, not sure. Darrick has the example of fsck.
>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
>>>>>> fuse-server gets restarted, fsck'ed and some files get removed.
>>>>>> Now reading these inodes would still work - wouldn't it
>>>>>> be better to invalidate the cache before going into operation
>>>>>> again?
>>>>>
>>>>> Forgive me, I was making a wrong assumption that fuse4fs
>>>>> was using ext4 filehandle as nodeid, but of course it does not.
>>>>
>>>> Well now that you mention it, there /is/ a risk of shenanigans like
>>>> that.  Consider:
>>>>
>>>> 1) fuse4fs mount an ext4 filesystem
>>>> 2) crash the fuse4fs server
>>>> <fuse4fs server restart stalls...>
>>>> 3) e2fsck -fy /dev/XXX deletes inode 17
>>>> 4) someone else mounts the fs, makes some changes that result in 17
>>>>    being reallocated, user says "OOOOOPS", unmounts it
>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel
>>>>
>>>> Hey, inode 17 is now a different file!!
>>>>
>>>> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
>>>> everything's (potentially) fine because fuse4fs supplied i_generation to
>>>> the kernel, and fuse_stale_inode will mark it bad if that happens.
>>>>
>>>> Hm ok then, at least there's a way out. :)
>>>>
>>>
>>> Right.
>>>
>>>>> The reason I made this wrong assumption is because fuse4fs *can*
>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
>>>>> which is what my fuse passthrough library [1] does.
>>>>>
>>>>> My claim was that although fuse4fs could support safe restart, which
>>>>> cannot read from recycled inode number with current FUSE protocol,
>>>>> doing so with FUSE_HANDLE protocol would express a commitment
>>>>
>>>> Pardon my naïvete, but what is FUSE_HANDLE?
>>>>
>>>> $ git grep -w FUSE_HANDLE fs
>>>> $
>>>
>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>
>>> Which means to communicate a variable sized "nodeid"
>>> which can also be declared as an object id that survives server restart.
>>>
>>> Basically, the reason that I brought up LOOKUP_HANDLE is to
>>> properly support NFS export of fuse filesystems.
>>>
>>> My incentive was to support a proper fuse server restart/remount/re-export
>>> with the same fsid in /etc/exports, but this gives us a better starting point
>>> for fuse server restart/re-connect.
>>
>> Sorry for resurrecting (again!) this discussion.  I've been thinking about
>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
>> However, I feel there are other operations that will need to return this
>> new handle.
>>
>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
>> Doesn't this mean that, if the user-space server supports the new
>> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
>> request?
> 
> Yes, I think that's what it means.
> 
>> The same question applies for TMPFILE, LINK, etc.  Or is there
>> something special about the LOOKUP operation that I'm missing?
>>
> 
> Any command returning fuse_entry_out.
> 
> READDIRPLUS, MKNOD, MKDIR, SYMLINK

Btw, check out <libfuse>/doc/libfuse-operations.txt for these
things. Do double check it, though: the file was mostly created by AI
(I just added a correction today). With that, it is easy to spot the
missing FUSE_TMPFILE.
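
To make the list concrete, here is a rough sketch of which replies would
have to carry a handle, using only the commands named in this thread.
The enum, the table and sketch_reply_kind_of() are made-up names for
illustration; none of this is kernel or libfuse code, and the grouping
should be double checked against fuse.h.

```c
#include <string.h>

/* Hypothetical classification, for illustration only.
 * Not a kernel or libfuse API. */
enum sketch_reply_kind {
	SKETCH_NO_ENTRY = 0,     /* reply carries no fuse_entry_out */
	SKETCH_ENTRY,            /* reply is a plain fuse_entry_out */
	SKETCH_ENTRY_PLUS_OPEN,  /* fuse_entry_out followed by fuse_open_out */
	SKETCH_ENTRY_PER_DIRENT, /* one fuse_entry_out per directory entry */
};

struct sketch_op {
	const char *name;
	enum sketch_reply_kind kind;
};

/* Commands mentioned in this thread as returning a nodeid. */
static const struct sketch_op sketch_ops[] = {
	{ "FUSE_LOOKUP",      SKETCH_ENTRY },
	{ "FUSE_MKNOD",       SKETCH_ENTRY },
	{ "FUSE_MKDIR",       SKETCH_ENTRY },
	{ "FUSE_SYMLINK",     SKETCH_ENTRY },
	{ "FUSE_LINK",        SKETCH_ENTRY },
	{ "FUSE_CREATE",      SKETCH_ENTRY_PLUS_OPEN },
	{ "FUSE_TMPFILE",     SKETCH_ENTRY_PLUS_OPEN },
	{ "FUSE_READDIRPLUS", SKETCH_ENTRY_PER_DIRENT },
};

/* Look up how a command's reply embeds the entry; anything not in the
 * table is treated as not returning an entry at all. */
static enum sketch_reply_kind sketch_reply_kind_of(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_ops) / sizeof(sketch_ops[0]); i++)
		if (strcmp(sketch_ops[i].name, name) == 0)
			return sketch_ops[i].kind;
	return SKETCH_NO_ENTRY;
}
```

Every row other than SKETCH_NO_ENTRY is a place where server and client
would have to agree on where the handle goes.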


> 
> fuse_entry_out was extended once and fuse_reply_entry()
> sends the size of the struct.

Sorry, I'm confused. Where does fuse_reply_entry() send the size?

> However fuse_reply_create() sends it with fuse_open_out
> appended and fuse_add_direntry_plus() does not seem to write
> record size at all, so server and client will need to agree on the
> size of fuse_entry_out and this would need to be backward compat.
> If both server and client declare support for FUSE_LOOKUP_HANDLE
> it should be fine (?).

If the maximum handle size becomes a value in fuse_init_out, server and
client could both use it? I think the appended fuse_open_out could just
follow the actual, dynamic size of the handle; the code that
serializes/deserializes the response then has to look up the actual
handle size. For example, I wouldn't know what to put in as handle size
for any of the example/passthrough* file systems: it would need to
be 128B, but the actual size will typically be much smaller.
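
Roughly what I have in mind, as a sketch only. All struct and function
names below are invented for illustration, the structs are small
stand-ins rather than the real fuse_entry_out/fuse_open_out, and the
8-byte padding of the handle is my assumption, not an agreed wire
format. max_handle stands for the value that would be negotiated at
init time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for fuse_entry_out (hypothetical, much reduced). */
struct sketch_entry_hdr {
	uint64_t nodeid;
	uint64_t generation;
};

/* Hypothetical header telling the reader the actual handle length. */
struct sketch_handle_hdr {
	uint32_t size;    /* actual handle length in bytes */
	uint32_t padding;
};

/* Stand-in for fuse_open_out (hypothetical, much reduced). */
struct sketch_open {
	uint64_t fh;
	uint32_t open_flags;
	uint32_t padding;
};

/* Assumption: the handle is padded so that what follows stays 8-byte aligned. */
#define SKETCH_ALIGN8(x) (((size_t)(x) + 7u) & ~(size_t)7u)

/* Serialize entry + variable-size handle + open part.
 * Returns the number of bytes written, or 0 if the handle exceeds
 * max_handle or the buffer is too small. */
static size_t sketch_pack_create(void *buf, size_t buflen, uint32_t max_handle,
				 const struct sketch_entry_hdr *e,
				 const void *handle, uint32_t hlen,
				 const struct sketch_open *o)
{
	struct sketch_handle_hdr hh = { .size = hlen, .padding = 0 };
	size_t hpad = SKETCH_ALIGN8(hlen);
	size_t total = sizeof(*e) + sizeof(hh) + hpad + sizeof(*o);
	unsigned char *p = buf;

	if (hlen > max_handle || total > buflen)
		return 0;

	memcpy(p, e, sizeof(*e));
	p += sizeof(*e);
	memcpy(p, &hh, sizeof(hh));
	p += sizeof(hh);
	memset(p, 0, hpad);            /* zero the padding bytes */
	if (hlen)
		memcpy(p, handle, hlen);
	p += hpad;
	memcpy(p, o, sizeof(*o));
	return total;
}

/* Deserializing side: the open part is found by reading the actual
 * handle size, not by assuming the maximum.
 * Returns its offset, or 0 if the buffer is malformed. */
static size_t sketch_open_offset(const void *buf, size_t buflen,
				 uint32_t max_handle)
{
	struct sketch_handle_hdr hh;
	size_t off = sizeof(struct sketch_entry_hdr);

	if (buflen < off + sizeof(hh))
		return 0;
	memcpy(&hh, (const unsigned char *)buf + off, sizeof(hh));
	if (hh.size > max_handle)
		return 0;
	off += sizeof(hh) + SKETCH_ALIGN8(hh.size);
	if (buflen < off + sizeof(struct sketch_open))
		return 0;
	return off;
}
```

With a 12-byte handle this reply is 56 bytes, instead of reserving the
full 128 bytes for every entry.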


Thanks,
Bernd

