Date: Mon, 2 Mar 2026 20:57:55 -0800
From: "Darrick J. Wong"
To: Joanne Koong
Cc: John Groves, Amir Goldstein, Miklos Szeredi,
	"f-pc@lists.linux-foundation.org",
	"linux-fsdevel@vger.kernel.org", Bernd Schubert,
	Luis Henriques, Horst Birthelmer
Subject: Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
Message-ID: <20260303045755.GN13829@frogsfrogsfrogs>
References: <20260204190649.GB7693@frogsfrogsfrogs>
	<0100019c2bdca8b7-b1760667-a4e6-4a52-b976-9f039e65b976-000000@email.amazonses.com>
	<20260206055247.GF7693@frogsfrogsfrogs>
	<20260221003756.GD11076@frogsfrogsfrogs>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Thu, Feb 26, 2026 at 12:21:43PM -0800, Joanne Koong wrote:
> On Fri, Feb 20, 2026 at 4:37 PM Darrick J. Wong wrote:
> >
> > On Wed, Feb 11, 2026 at 08:46:26PM -0800, Joanne Koong wrote:
> > > On Fri, Feb 6, 2026 at 4:22 PM Joanne Koong wrote:
> > > >
> > > > On Fri, Feb 6, 2026 at 12:48 PM John Groves wrote:
> > > > >
> > > > > On 26/02/05 09:52PM, Darrick J. Wong wrote:
> > > > > > On Thu, Feb 05, 2026 at 10:27:52AM +0100, Amir Goldstein wrote:
> > > > > > > On Thu, Feb 5, 2026 at 4:33 AM John Groves wrote:
> > > > > > > >
> > > > > > > > On 26/02/04 11:06AM, Darrick J. Wong wrote:
> > > > > > > >
> > > > > > > > [ ... ]
> > > > > > > > >
> > > > > > > > > - famfs: export distributed memory
> > > > > > > > >
> > > > > > > > > This has been, uh, hanging out for an extraordinarily long time.
> > > > > > > >
> > > > > > > > Um, *yeah*.
> > > > > > > > Although a significant part of that time was on me, because
> > > > > > > > getting it ported into fuse was kinda hard, my users and I are hoping we
> > > > > > > > can get this upstreamed fairly soon now. I'm hoping that after the 6.19
> > > > > > > > merge window dust settles we can negotiate any needed changes etc. and
> > > > > > > > shoot for the 7.0 merge window.
> > > > > >
> > > > > > I think we've all missed getting merged for 7.0 since 6.19 will be
> > > > > > released in 3 days. :/
> > > > > >
> > > > > > (Granted most of the maintainers I know are /much/ less conservative
> > > > > > than I was about the schedule)
> > > > >
> > > > > Doh - right you are...
> > > > >
> > > > > >
> > > > > > > I think that the work on famfs is setting an example, and I very much
> > > > > > > hope it will be a good example, of how improving existing infrastructure
> > > > > > > (FUSE) is a better contribution than adding another fs to the pile.
> > > > > >
> > > > > > Yeah. Joanne and I spent a couple of days this week coprogramming a
> > > > > > prototype of a way for famfs to create BPF programs to handle
> > > > > > INTERLEAVED_EXTENT files. We might be ready to show that off in a
> > > > > > couple of weeks, and that might be a way to clear up the
> > > > > > GET_FMAP/IOMAP_BEGIN logjam at last.
> > > > >
> > > > > I'd love to learn more about this; happy to do a call if that's a
> > > > > good way to get me briefed.
> > > > >
> > > > > I [generally but not specifically] understand how this could avoid
> > > > > GET_FMAP, but not GET_DAXDEV.
> > > > >
> > > > > But I'm not sure it could (or should) avoid dax_iomap_rw() and
> > > > > dax_iomap_fault(). The thing is that those call my begin() function
> > > > > to resolve an offset in a file to an offset on a daxdev, and then
> > > > > dax completes the fault or memcpy.
> > > > > In that dance, famfs never knows
> > > > > the kernel address of the memory at all (also true of xfs in fs-dax
> > > > > mode, unless that's changed fairly recently). I think that's a pretty
> > > > > decent interface all in all.
> > > > >
> > > > > Also: dunno whether y'all have looked at the dax patches in the famfs
> > > > > series, but the solution to working with Alistair's folio-ification
> > > > > and cleanup of the dax layer (which set me back months) was to create
> > > > > drivers/dax/fsdev.c, which, when bound to a daxdev in place of
> > > > > drivers/dax/device.c, configures folios & pages compatibly with
> > > > > fs-dax. So I kinda think I need the dax_iomap* interface.
> > > > >
> > > > > As usual, if I'm overlooking something let me know...
> > > >
> > > > Hi John,
> > > >
> > > > The conversation started [1] on Darrick's containerization patchset
> > > > about using bpf to a) avoid extra requests / context switching for
> > > > ->iomap_begin and ->iomap_end calls and b) offload what would
> > > > otherwise have to be hard-coded kernel logic into userspace, which
> > > > gives userspace more flexibility / control with updating the logic and
> > > > is less of a maintenance burden for fuse. There was some musing [2]
> > > > about whether with bpf infrastructure added, it would allow famfs to
> > > > move all famfs-specific logic to userspace/bpf.
> > > >
> > > > I agree that it makes sense for famfs to go through dax iomap
> > > > interfaces. imo it seems cleanest if fuse has a generic iomap
> > > > interface with iomap dax going through that plumbing, and any
> > > > famfs-specific logic that would be needed beyond that (eg computing
> > > > the interleaved mappings) being moved to custom famfs bpf programs. I
> > > > started trying to implement this yesterday afternoon because I wanted
> > > > to make sure it would actually be doable for the famfs logic before
> > > > bringing it up and I didn't want to derail your project.
> > > > So far I only
> > > > have the general iomap interface for fuse added with dax operations
> > > > going through dax_iomap* and haven't tried out integrating the famfs
> > > > GET_FMAP/GET_DAXDEV bpf program part yet but I'm planning/hoping to
> > > > get to that early next week. The work I did with Darrick this week was
> > > > on getting a server's bpf programs hooked up to fuse through bpf links
> > > > and Darrick has fleshed that out and gotten that working now. If it
> > > > turns out famfs can go through a generic iomap fuse plumbing layer,
> > > > I'd be curious to hear your thoughts on which approach you'd prefer.
> > >
> > > I put together a quick prototype to test this out - this is what it
> > > looks like with fuse having a generic iomap interface that supports
> > > dax [1], and the famfs custom logic moved to a bpf program [2]. I
> >
> > The bpf maps that you've used to upload per-inode data into the kernel
> > is a /much/ cleaner method than custom-compiling C into BPF at runtime!
> > You can statically compile the BPF object code into the fuse server,
> > which means that (a) you can take advantage of the bpftool skeletons,
> > and (b) you can in theory vendor-sign the BPF code if and when that
> > becomes a requirement.
> >
> > I think that's way better than having to put vmlinux.h and
> > fuse_iomap_bpf.h on the deployed system. Though there's one hitch in
> > example/Makefile:
> >
> > vmlinux.h:
> > 	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > $@
> >
> > The build system isn't necessarily running the same kernel as the deploy
> > images. It might be for Meta, but it's not unheard of for our build
> > system to be running (say) OL10+UEK8 kernel, but the build target is OL8
> > and UEK7.
> >
> > There doesn't seem to be any standardization across distros for where a
> > vmlinux.h file might be found. Fedora puts it under
> > /usr/src/$unamestuf, Debian puts it in /usr/include/$gcc_triple, and I
> > guess SUSE doesn't ship it at all?
> >
> > That's going to be a headache for deployment as I've been muttering for
> > a couple of weeks now. :(
>
> I don't think this is an issue because bpf does dynamic btf-based
> relocations (CO-RE) at load time [1]. On the target machine, when
> libbpf loads the bpf object it will read the machine's btf and patch
> any offsets in bytecode and load the fixed-up version into the kernel.
> All that's needed on the target machine for CO-RE is
> CONFIG_DEBUG_INFO_BTF=y which is enabled by default on mainstream
> distributions. I think this addresses the deployment headache you've
> been running into?

Not really -- CO-RE does indeed work quite nicely to smooth over layout
changes in C structures between a BPF program and the kernel it's being
loaded into (thanks, whoever came up with that!) but the problem I have
is how you /get/ those definitions into clang in the first place.

I was under the impression from many of the bpf examples that you're
supposed to #include a distro-provided "vmlinux.h", but there doesn't
seem to be a standard way to find that file. Most -dev packages provide
a pkgconfig file that gives you the appropriate CFLAGS/LDFLAGS to add,
but apparently this is not the case for BPF...?

Perhaps it's the case that distro packages that are building BPF
programs simply add a build dependency on the package providing
vmlinux.h (e.g. Build-Depends: linux-bpf-dev on Debian) and patch in
"CFLAGS=-I/some/path" as needed?

I suppose for a dynamically generated and compiled BPF program, one
could just "bpftool skel" the /sys/kernel/btf files, capture the output,
and "#include " the results. Honestly that sounds better than trusting
some weird system package.

But maybe dynamic compilation is a totally stupid idea. I did grow up
in the era of mshtml email wreaking havoc, after all...
--D

> Thanks,
> Joanne
>
> [1] https://docs.ebpf.io/concepts/core/
>
> >
> > Maybe we could reduce the fuse-iomap bpf definitions to use only
> > cardinal types and the types that iomap itself defines. That might not
> > be too hard right now because bpf functions reuse structures from
> > include/uapi/fuse.h, which currently use uint{8,16,32,64}_t. It'll get
> > harder if that __uintXX_t -> __uXX transition actually happens.
> >
> > But getting back to the famfs bpf stuff, I think doing the interleaved
> > mappings via BPF gives the famfs server a lot more flexibility in terms
> > of what it can do when future hardware arrives with even weirder
> > configurations.
> >
> > --D
> >
> > > didn't change much, I just moved around your famfs code to the bpf
> > > side. The kernel side changes are in [3] and the libfuse changes are
> > > in [4].
> > >
> > > For testing out the prototype, I hooked it up to passthrough_hp to
> > > test running the bpf program and verify that it is able to find the
> > > extent from the bpf map. In my opinion, this makes the fuse side
> > > infrastructure cleaner and more extendable for other servers that will
> > > want to go through dax iomap in the future, but I think this also has
> > > a few benefits for famfs. Instead of needing to issue a FUSE_GET_FMAP
> > > request after a file is opened, the server can directly populate the
> > > metadata map from userspace with the mapping info when it processes
> > > the FUSE_OPEN request, which gets rid of the roundtrip cost. The
> > > server can dynamically update the metadata at any time from userspace
> > > if the mapping info needs to change in the future. For setting up the
> > > daxdevs, I moved your logic to the init side, where the server passes
> > > the daxdev info upfront through an IOMAP_CONFIG exchange with the
> > > kernel initializing the daxdevs based off that info.
> > > I think this will
> > > also make deploying future updates for famfs easier, as updating the
> > > logic won't need to go through the upstream kernel mailing list
> > > process and deploying updates won't require a new kernel release.
> > >
> > > These are just my two cents based on my (cursory) understanding of
> > > famfs. Just wanted to float this alternative approach in case it's
> > > useful.
> > >
> > > Thanks,
> > > Joanne
> > >
> > > [1] https://github.com/joannekoong/linux/commit/b8f9d284a6955391f00f576d890e1c1ccc943cfd
> > > [2] https://github.com/joannekoong/libfuse/commit/444fa27fa9fd2118a0dc332933197faf9bbf25aa
> > > [3] https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
> > > [4] https://github.com/joannekoong/libfuse/commits/famfs_bpf/
> > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#md1b8003a109760d8ee1d5397e053673c1978ed4d
> > > > [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1bxhw2u0qwjw0dJPGdmxEXbcEyKn-=iFrszqof2c8wGCA@mail.gmail.com/t/#u
> > > >
> > > > >
> > > > > Regards,
> > > > > John
> > > > >
> > > >
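For anyone trying to picture what "computing the interleaved mappings" amounts to, here is a rough sketch in plain userspace C of generic round-robin striping arithmetic: a file offset is resolved to a strip index, an offset within that strip, and the number of contiguous bytes left in the current chunk. This is only an illustration under stated assumptions. It is not the famfs fmap format, not the INTERLEAVED_EXTENT layout, and not the code from the prototype branches; the names stripe_layout, stripe_target and stripe_resolve are hypothetical, and a real bpf program would fetch the per-inode layout from a bpf map and would not use assert().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one interleaved (striped) extent. */
struct stripe_layout {
	uint64_t chunk_size;	/* bytes placed on one strip before moving on */
	uint32_t nstrips;	/* number of backing strips (e.g. devices) */
};

/* Hypothetical result of resolving a file offset. */
struct stripe_target {
	uint32_t strip;		/* index of the strip holding the byte */
	uint64_t strip_offset;	/* byte offset within that strip */
	uint64_t contig_len;	/* bytes left before the chunk ends */
};

/*
 * Resolve a file offset under plain round-robin striping: chunk N of the
 * file lives on strip (N % nstrips), at chunk slot (N / nstrips) there.
 */
static struct stripe_target
stripe_resolve(const struct stripe_layout *l, uint64_t file_off)
{
	struct stripe_target t;
	uint64_t chunk, chunk_off;

	assert(l->chunk_size > 0 && l->nstrips > 0);

	chunk = file_off / l->chunk_size;	/* which chunk of the file */
	chunk_off = file_off % l->chunk_size;	/* position inside that chunk */

	t.strip = (uint32_t)(chunk % l->nstrips);
	t.strip_offset = (chunk / l->nstrips) * l->chunk_size + chunk_off;
	t.contig_len = l->chunk_size - chunk_off;
	return t;
}
```

With chunk_size = 4096 and nstrips = 3, file offset 12388 falls in chunk 3, which wraps back to strip 0 at strip offset 4196 with 3996 contiguous bytes remaining. A mapping callback would report at most contig_len bytes per call and be invoked again for the next chunk.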