From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-ej1-f52.google.com (mail-ej1-f52.google.com [209.85.218.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F00C2F39C1 for ; Tue, 4 Nov 2025 13:10:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.218.52 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762261848; cv=none; b=Wqn5L1m5J66i6m7yLWnxbX56wJHm6t+3Tg2DHEWuPtkDFQjZLmoTvDOQa2FwzzB7i0I15tfU+AOkmlCEAYwYisFDzN4xHjOJbh3UiY8gglPFtn3RIaZJj8+ZVLE84xHNPqze5KBoAW6+s6DNPvPVHH4J+OcONl/ZSn6vUxCMqbg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762261848; c=relaxed/simple; bh=VhkhzmMcv/smNo1A6jxBdqBtG3BBvfXHku9Ydct+gvs=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=Tsg9ocIp0pQdevvkMwKDbLV7AYGD14+RZebhORv3ZkUoCPP+jqG65yO/HZdH/30KH1Ctlm+gwMjLtmdb0jl8bLpFdl8QsUZ0ZgnW2IIvlN4m6Jc15OnFVRxviGIFsBg2NXSqymuYVQgbTILtfPwEiLLtFCOOlTx0fpYsJUpRe3Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=O3k5xJIu; arc=none smtp.client-ip=209.85.218.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="O3k5xJIu" Received: by mail-ej1-f52.google.com with SMTP id a640c23a62f3a-b6d3340dc2aso1328635266b.0 for ; Tue, 04 Nov 2025 05:10:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1762261844; x=1762866644; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=3eb6Fhw+jGRkyT5hVdQ02opkzhktDn8toH3DgzCieb8=; b=O3k5xJIu+oHsP1/3R52d84SbcFpm0IY+zXzKz2FzhE6FUvvAn5ULHBU/OZuLwpuFNO yrsiC5XevLasctWeQbMRLglZNTGuN9C2JxB/5zZzU3snjIos0Wg1v7ldxC3R9tPCR+Nd 3lVmZ0x1WAleZoxU7vUjuZPOTpBR8I1GE2iTigfrgje1TQlqpON0End6ENqg359M0Xnn QhddCguZGTjSwxurswyEaYOuQfqeelUBOC42g5+yi80eS+ixKIRjBwoRzbGz9uivipw8 aPmgXIirLV3IRl+9DyqiHR54WC6Xw+xa/ASUnonuvWl1W289hGEk403NveV0wlmlpmzE 7i3Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762261844; x=1762866644; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=3eb6Fhw+jGRkyT5hVdQ02opkzhktDn8toH3DgzCieb8=; b=hZ7MmDTz4NoYfK6/vlwD6gbqrRlQjZT8qHAAgbIESH3dXKXpsLzHHzEI4GGMl//x1k siQuIahbf7LqSKNf2EF8jhHeXaOSfD7ouUqEbSl3LIZtYIN+RUe5wCFuxcgJWY915dcw wTt0emDNYzUAYAO2M2OWj16LlHG1sG6sH4eK6GSAoOX66NRQ5JWzJQ92JGoX1Nvmeke9 dyANJU9/Drje7hONUhjHJWADQM6hhg0+LtbaEEYrHpKB2VVXGFAAKHQEaAsi51Q2GOgZ 9+z1O2nR2VacapjIRy1hJrXvgAApffcGdNdTJ4UiF8LMCmWdUqfhNDbQQLpKxtJknuol MCmg== X-Forwarded-Encrypted: i=1; AJvYcCWYIBLIhD0MvLSAjiLIq5ruudZJoa5FNUSBSSNZTVLhi8U4fi0ZAQzIWDxLC+Ir1yOKOAQbJWZeDLBtyO+o@vger.kernel.org X-Gm-Message-State: AOJu0YwzyDlz/8VBoFkhUeIIwYAP0Dougx9XAGDzpT+7sloicuX+rG7x 8T6qpQbuyet7uYuXktZxMl+upEV1qjCeJTDVejR2bdOxMZfyCjr0zFz7bqHb/aB5TzDYbTRyxLP rr8xaJCV1TvSw77+Hc2dHerryRx0GcckQBh4PEVW6AQ== X-Gm-Gg: ASbGncsqMYlM/jznLT1gmqx8eiUPZtKWznbTwj2hcnW52S2wph3lXgUd0myQv+sbyyj ZphOsiBqZCH9Llzsj/Qf3iJx9zL9jAytgLImV+HjW4IBQmmyFw9FkvJXAbWTn5TcDFCA3ABLTl2 E+4YroeskHUI4tIechPOwETY2qJ4hphP136KfX2FPf1f/bE8hJtCIHKoYxWaDWQTjdHY9pnBa3c ARMaTv2cI2+zDo9X3XJZO0p2pJp+AJd5Oa2nBsxhG5nInQpFhfmt3f4nR8Y2jFbJ/gwYLmLHDyJ 0xedmMDR3rIHpKu1tcM= X-Google-Smtp-Source: AGHT+IH/2pUqmrN7VPU+SV+aH7vEEkumO/C5X2+je2g5UclR15/R3ZrjFYbk+o6g4UsV/+K6ojEY6+j2fKHiandD2bA= X-Received: by 2002:a17:907:dac:b0:b70:aebe:2ef7 with SMTP id a640c23a62f3a-b7215bd8863mr352109666b.14.1762261844201; Tue, 04 Nov 2025 05:10:44 -0800 (PST) Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <20250731130458.GE273706@mit.edu> <20250731173858.GE2672029@frogsfrogsfrogs> <8734abgxfl.fsf@igalia.com> <39818613-c10b-4ed2-b596-23b70c749af1@bsbernd.com> <2e57be4f-e61b-4a37-832d-14bdea315126@bsbernd.com> <20250912145857.GQ8117@frogsfrogsfrogs> <2e1db15f-b2b1-487f-9f42-44dc7480b2e2@bsbernd.com> <20250916025341.GO1587915@frogsfrogsfrogs> <87ldkm6n5o.fsf@wotan.olymp> In-Reply-To: <87ldkm6n5o.fsf@wotan.olymp> From: Amir Goldstein Date: Tue, 4 Nov 2025 14:10:32 +0100 X-Gm-Features: AWmQ_bly3rdIxKic-xmn_OHbdNdph66vGTwjjJ4W0ydit2BdioRiiUqo7v_TbMk Message-ID: Subject: Re: [RFC] Another take at restarting FUSE servers To: Luis Henriques Cc: "Darrick J. Wong" , Bernd Schubert , "Theodore Ts'o" , Miklos Szeredi , Bernd Schubert , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Kevin Chen Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Nov 4, 2025 at 12:40=E2=80=AFPM Luis Henriques wr= ote: > > On Tue, Sep 16 2025, Amir Goldstein wrote: > > > On Tue, Sep 16, 2025 at 4:53=E2=80=AFAM Darrick J. Wong wrote: > >> > >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote: > >> > On Mon, Sep 15, 2025 at 10:27=E2=80=AFAM Bernd Schubert wrote: > >> > > > >> > > > >> > > > >> > > On 9/15/25 09:07, Amir Goldstein wrote: > >> > > > On Fri, Sep 12, 2025 at 4:58=E2=80=AFPM Darrick J. Wong wrote: > >> > > >> > >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote: > >> > > >>> > >> > > >>> > >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote: > >> > > >>>> On Fri, Sep 12, 2025 at 12:31=E2=80=AFPM Bernd Schubert wrote: > >> > > >>>>> > >> > > >>>>> > >> > > >>>>> > >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote: > >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote: > >> > > >>>>>> > >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wr= ote: > >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong= wrote: > >> > > >>>>>>>>> > >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda ni= fty if libfuse > >> > > >>>>>>>>> could restart itself. It's unclear if doing so will act= ually enable us > >> > > >>>>>>>>> to clear the condition that caused the failure in the fi= rst place, but I > >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So mayb= e restarts > >> > > >>>>>>>>> aren't totally crazy. > >> > > >>>>>>>> > >> > > >>>>>>>> I'm trying to understand what the failure scenario is her= e. Is this > >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?= If so, what > >> > > >>>>>>>> is supposed to happen with respect to open files, metadat= a and data > >> > > >>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs= could run > >> > > >>>>>>>> e2fsck -fy, but if there are dirty inode on the system, t= hat's going > >> > > >>>>>>>> potentally to be out of sync, right? > >> > > >>>>>>>> > >> > > >>>>>>>> What are the recovery semantics that we hope to be able t= o provide? > >> > > >>>>>>> > >> > > >>>>>>> > >> > > >>>>>>> > >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I= think the new > >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RE= STARTED, which > >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes t= o validate > >> > > >>>>>>> that they still exist; and then resend all the unacknowled= ged requests > >> > > >>>>>>> that were pending at the time. It might be the case that = you have to > >> > > >>>>>>> that in the reverse order; I only know enough about the de= sign of fuse > >> > > >>>>>>> to suspect that to be true. > >> > > >>>>>>> > >> > > >>>>>>> Anyhow once those are complete, I think we can resume oper= ations with > >> > > >>>>>>> the surviving inodes. The ones that fail the GETATTR reva= lidation are > >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them. > >> > > >>>>>> > >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOO= KUP requests, > >> > > >>>>>> but probably GETATTR is a better option. > >> > > >>>>>> > >> > > >>>>>> So, are you currently working on any of this? Are you impl= ementing this > >> > > >>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to = have a closer > >> > > >>>>>> look at fuse2fs too. > >> > > >>>>> > >> > > >>>>> Sorry for joining the discussion late, I was totally occupie= d, day and > >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery o= n our > >> > > >>>>> DDN side. > >> > > >>>>> > >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but = on fuse > >> > > >>>>> server restart we want kernel to recover inodes and their lo= okup count. > >> > > >>>>> Now inode recovery might be hard, because we currently only = have a > >> > > >>>>> 64-bit node-id - which is used my most fuse application as m= emory > >> > > >>>>> pointer. > >> > > >>>>> > >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it j= ust re-sends > >> > > >>>>> outstanding requests. And that ends up in most cases in send= ing requests > >> > > >>>>> with invalid node-IDs, that are casted and might provoke ran= dom memory > >> > > >>>>> access on restart. Kind of the same issue why fuse nfs expor= t or > >> > > >>>>> open_by_handle_at doesn't work well right now. > >> > > >>>>> > >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_F= H, which > >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file h= andle. > >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart. > >> > > >>>>> The file handles could be stored into the fuse inode and als= o used for > >> > > >>>>> NFS export. > >> > > >>>>> > >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link= quickly. > >> > > >>>>> Adding Amir to CC. > >> > > >>>> > >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this roll= ing thread: > >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quB= aTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > >> > > >>> > >> > > >>> Thanks for the reference Amir! I even had been in that thread. > >> > > >>> > >> > > >>>> > >> > > >>>>> > >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RES= TART, which > >> > > >>>>> will iterate over all superblock inodes and mark them with f= use_make_bad. > >> > > >>>>> Any objections against that? > >> > > >> > >> > > >> What if you actually /can/ reuse a nodeid after a restart? Con= sider > >> > > >> fuse4fs, where the nodeid is the on-disk inode number. After a= restart, > >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming = recovery > >> > > >> didn't delete it, obviously. > >> > > > > >> > > > FUSE_LOOKUP_HANDLE is a contract. > >> > > > If fuse4fs can reuse nodeid after restart then by all means, it = should sign > >> > > > this contract, otherwise there is no way for client to know that= the > >> > > > nodeids are persistent. > >> > > > If fuse4fs_handle :=3D nodeid, that will make implementing the l= ookup_handle() > >> > > > API trivial. > >> > > > > >> > > >> > >> > > >> I suppose you could just ask for refreshed stat information and= either > >> > > >> the server gives it to you and the fuse_inode lives; or the ser= ver > >> > > >> returns ENOENT and then we mark it bad. But I'd have to see co= de > >> > > >> patches to form a real opinion. > >> > > >> > >> > > > > >> > > > You could make fuse4fs_handle :=3D > >> > > > where fuse_instance_id can be its start time or random number. > >> > > > for auto invalidate, or maybe the fuse_instance_id should be > >> > > > a native part of FUSE protocol so that client knows to only inva= lidate > >> > > > attr cache in case of fuse_instance_id change? > >> > > > > >> > > > In any case, instead of a storm of revalidate messages after > >> > > > server restart, do it lazily on demand. > >> > > > >> > > For a network file system, probably. For fuse4fs or other block > >> > > based file systems, not sure. Darrick has the example of fsck. > >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0, > >> > > fuse-server gets restarted, fsck'ed and some files get removed. > >> > > Now reading these inodes would still work - wouldn't it > >> > > be better to invalidate the cache before going into operation > >> > > again? > >> > > >> > Forgive me, I was making a wrong assumption that fuse4fs > >> > was using ext4 filehandle as nodeid, but of course it does not. > >> > >> Well now that you mention it, there /is/ a risk of shenanigans like > >> that. Consider: > >> > >> 1) fuse4fs mount an ext4 filesystem > >> 2) crash the fuse4fs server > >> > >> 3) e2fsck -fy /dev/XXX deletes inode 17 > >> 4) someone else mounts the fs, makes some changes that result in 17 > >> being reallocated, user says "OOOOOPS", unmounts it > >> 5) fuse4fs server finally restarts, and reconnects to the kernel > >> > >> Hey, inode 17 is now a different file!! > >> > >> So maybe the nodeid has to be an actual file handle. Oh wait, no, > >> everything's (potentially) fine because fuse4fs supplied i_generation = to > >> the kernel, and fuse_stale_inode will mark it bad if that happens. > >> > >> Hm ok then, at least there's a way out. :) > >> > > > > Right. > > > >> > The reason I made this wrong assumption is because fuse4fs *can* > >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE p= rotocol > >> > which is what my fuse passthough library [1] does. > >> > > >> > My claim was that although fuse4fs could support safe restart, which > >> > cannot read from recycled inode number with current FUSE protocol, > >> > doing so with FUSE_HANDLE protocol would express a commitment > >> > >> Pardon my na=C3=AFvete, but what is FUSE_HANDLE? > >> > >> $ git grep -w FUSE_HANDLE fs > >> $ > > > > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE): > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W= _4yUY4L2JRAEKxEwOQ@mail.gmail.com/ > > > > Which means to communicate a variable sized "nodeid" > > which can also be declared as an object id that survives server restart= . > > > > Basically, the reason that I brought up LOOKUP_HANDLE is to > > properly support NFS export of fuse filesystems. > > > > My incentive was to support a proper fuse server restart/remount/re-exp= ort > > with the same fsid in /etc/exports, but this gives us a better starting= point > > for fuse server restart/re-connect. > > Sorry for resurrecting (again!) this discussion. I've been thinking abou= t > this, and trying to get some initial RFC for this LOOKUP_HANDLE operation= . > However, I feel there are other operations that will need to return this > new handle. > > For example, the FUSE_CREATE (for atomic_open) also returns a nodeid. > Doesn't this means that, if the user-space server supports the new > LOOKUP_HANDLE, it should also return an handle in reply to the CREATE > request? Yes, I think that's what it means. > The same question applies for TMPFILE, LINK, etc. Or is there > something special about the LOOKUP operation that I'm missing? > Any command returning fuse_entry_out. READDIRPLUS, MKNOD, MKDIR, SYMLINK fuse_entry_out was extended once and fuse_reply_entry() sends the size of the struct. However fuse_reply_create() sends it with fuse_open_out appended and fuse_add_direntry_plus() does not seem to write record size at all, so server and client will need to agree on the size of fuse_entry_out and this would need to be backward compat. If both server and client declare support for FUSE_LOOKUP_HANDLE it should be fine (?). Thanks, Amir.