* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory [not found] ` <55e3d9f6-50d2-48c0-b7e3-fb1c144cf3e8@linux.alibaba.com> @ 2026-01-26 17:38 ` Cong Wang 2026-01-26 19:16 ` Matthew Wilcox 0 siblings, 1 reply; 7+ messages in thread From: Cong Wang @ 2026-01-26 17:38 UTC (permalink / raw) To: Gao Xiang; +Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel Hi Xiang, On Sun, Jan 25, 2026 at 8:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > Hi Cong, > > On 2026/1/25 01:10, Cong Wang wrote: > > Hello, > > > > I would like to introduce DAXFS, a simple read-only filesystem > > designed to operate directly on shared physical memory via DAX > > (Direct Access). > > > > Unlike ramfs or tmpfs, which operate within the kernel’s page cache > > and result in fragmented, per-instance memory allocation, DAXFS > > provides a mechanism for zero-copy reads from contiguous memory > > regions. It bypasses the traditional block I/O stack, buffer heads, > > and page cache entirely. > > > > Key Features > > - Zero-Copy Efficiency: File reads resolve to direct memory loads, > > eliminating page cache duplication and CPU-driven copies. > > - True Physical Sharing: By mapping a contiguous physical address or a > > dma-buf, multiple kernel instances or containers can share the same > > physical pages. > > - Hardware Integration: Supports mounting memory exported by GPUs, > > FPGAs, or CXL devices via the dma-buf API. > > - Simplicity: Uses a self-contained, read-only image format with no > > runtime allocation or complex device management. > > > > Primary Use Cases > > - Multikernel Environments: Sharing a common Docker image across > > independent kernel instances via shared memory. > > - CXL Memory Pooling: Accessing read-only data across multiple hosts > > without network I/O. > > - Container Rootfs Sharing: Using a single DAXFS base image for > > multiple containers (via OverlayFS) to save physical RAM.
> > - Accelerator Data: Zero-copy access to model weights or lookup tables > > stored in device memory. > > Actually, EROFS DAX is already used this way by various users, > including all the usage above. > > Could you explain why EROFS doesn't suit your use cases? EROFS does not support direct physical memory operations. As you mentioned, it relies on other layers like ramdax to function in these scenarios. I have looked into ramdax, and it does not seem suitable for the multikernel use case. Specifically, the trailing 128K label area is shared across multiple kernels, which would cause significant issues. For reference: 87 dimm->label_area = memremap(start + size - LABEL_AREA_SIZE, 88 LABEL_AREA_SIZE, MEMREMAP_WB); ... 154 static int ramdax_set_config_data(struct nvdimm *nvdimm, int buf_len, 155 struct nd_cmd_set_config_hdr *cmd) 156 { 157 struct ramdax_dimm *dimm = nvdimm_provider_data(nvdimm); 158 159 if (sizeof(*cmd) > buf_len) 160 return -EINVAL; 161 if (struct_size(cmd, in_buf, cmd->in_length) > buf_len) 162 return -EINVAL; 163 if (size_add(cmd->in_offset, cmd->in_length) > LABEL_AREA_SIZE) 164 return -EINVAL; 165 166 memcpy(dimm->label_area + cmd->in_offset, cmd->in_buf, cmd->in_length); 167 168 return 0; 169 } Not to mention other cases like GPU/SmartNIC etc. If you are interested in adding multikernel support to EROFS, here is the codebase you could start with: https://github.com/multikernel/linux. PR is always welcome. Thanks, Cong Wang ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-26 17:38 ` [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory Cong Wang @ 2026-01-26 19:16 ` Matthew Wilcox 2026-01-26 19:48 ` Cong Wang 0 siblings, 1 reply; 7+ messages in thread From: Matthew Wilcox @ 2026-01-26 19:16 UTC (permalink / raw) To: Cong Wang; +Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote: > If you are interested in adding multikernel support to EROFS, here is > the codebase you could start with: > https://github.com/multikernel/linux. PR is always welcome. I think the onus is rather the other way around. Adding a new filesystem to Linux has a high bar to clear because it becomes a maintenance burden to the rest of us. Convince us that what you're doing here *can't* be done better by modifying erofs. Before I saw the email from Gao Xiang, I was also going to suggest that using erofs would be a better idea than supporting your own filesystem. Writing a new filesystem is a lot of fun. Supporting a new filesystem and making it production-quality is a whole lot of pain. It's much better if you can leverage other people's work. That's why DAX is a support layer for filesystems rather than its own filesystem. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-26 19:16 ` Matthew Wilcox @ 2026-01-26 19:48 ` Cong Wang 2026-01-26 20:13 ` Gao Xiang 2026-01-26 20:40 ` Matthew Wilcox 0 siblings, 2 replies; 7+ messages in thread From: Cong Wang @ 2026-01-26 19:48 UTC (permalink / raw) To: Matthew Wilcox Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel On Mon, Jan 26, 2026 at 11:16 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote: > > If you are interested in adding multikernel support to EROFS, here is > > the codebase you could start with: > > https://github.com/multikernel/linux. PR is always welcome. > > I think the onus is rather the other way around. Adding a new filesystem > to Linux has a high bar to clear because it becomes a maintenance burden > to the rest of us. Convince us that what you're doing here *can't* > be done better by modifying erofs. > > Before I saw the email from Gao Xiang, I was also going to suggest that > using erofs would be a better idea than supporting your own filesystem. > Writing a new filesystem is a lot of fun. Supporting a new filesystem > and making it production-quality is a whole lot of pain. It's much > better if you can leverage other people's work. That's why DAX is a > support layer for filesystems rather than its own filesystem. Great question. The core reason is that multikernel assumes little to no compatibility. Specifically for this scenario, struct inode is not compatible. This could rule out a lot of existing filesystems, except read-only ones. Now back to EROFS, it is still based on a block device, which itself can't be shared among different kernels. ramdax is actually a perfect example here, its label_area can't be shared among different kernels.
Let's take one step back: even if we really could share a device with multiple kernels, the memory footprint still could not be shared. With DAX + EROFS, we would still get: 1) Each kernel creates its own DAX mappings 2) And faults pages independently There is no cross-kernel page sharing accounting. I hope this makes sense. Regards, Cong ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-26 19:48 ` Cong Wang @ 2026-01-26 20:13 ` Gao Xiang 2026-01-26 20:40 ` Matthew Wilcox 1 sibling, 0 replies; 7+ messages in thread From: Gao Xiang @ 2026-01-26 20:13 UTC (permalink / raw) To: Cong Wang, Matthew Wilcox Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel On 2026/1/27 03:48, Cong Wang wrote: > On Mon, Jan 26, 2026 at 11:16 AM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote: >>> If you are interested in adding multikernel support to EROFS, here is >>> the codebase you could start with: >>> https://github.com/multikernel/linux. PR is always welcome. >> >> I think the onus is rather the other way around. Adding a new filesystem >> to Linux has a high bar to clear because it becomes a maintenance burden >> to the rest of us. Convince us that what you're doing here *can't* >> be done better by modifying erofs. >> >> Before I saw the email from Gao Xiang, I was also going to suggest that >> using erofs would be a better idea than supporting your own filesystem. >> Writing a new filesystem is a lot of fun. Supporting a new filesystem >> and making it production-quality is a whole lot of pain. It's much >> better if you can leverage other people's work. That's why DAX is a >> support layer for filesystems rather than its own filesystem. > > Great question. > > The core reason is multikernel assumes little to none compatibility. > > Specifically for this scenario, struct inode is not compatible. This > could rule out a lot of existing filesystems, except read-only ones. I don't quite get the point here, assuming you know filesystems. > > Now back to EROFS, it is still based on a block device, which > itself can't be shared among different kernels. ramdax is actually > a perfect example here, its label_area can't be shared among > different kernels. 
> > Let's take one step back: even if we really could share a device > with multiple kernels, it still could not share the memory footprint, > with DAX + EROFS, we would still get: > 1) Each kernel creates its own DAX mappings > 2) And faults pages independently > > There is no cross-kernel page sharing accounting. > > I hope this makes sense. No, the EROFS on-disk format is designed for any backend, so you could use this format backed by: 1) raw block device 2) file 3) a pure ramdaxfs (it's still WIP) Why? Because an ordinary container image user doesn't assume a fs tied to a particular type of device, especially for golden image usage. You cannot say: oh, I built an image, but you can only use it for ramdax; oh, it's backed by a file on a block device, so you have to convert it to another format first. The EROFS on-disk format should allow for _all device backends_. At a quick glance at your code, it seems much too premature and inefficient, because subdirectories are just like a link chain; maybe that is only somewhat reasonable for ramdax usage, but it's still _not_ cache-friendly. The reason it doesn't work for you is that _multikernel_ isn't an official upstream requirement; all upstream virtualization users directly use virtio-pmem now. I think for the upstream kernels, you'd need to make multikernel an official upstream requirement first; then there will be drivers for you to do multikernel ramdax, rather than the raw usage of 1) memremap and 2) vmf_insert_mixed in the filesystem drivers. I do think those are a _red line_ for any new filesystem driver (as opposed to the legacy cramfs MTD XIP code). Anyway, I really think your current use cases have already been covered by EROFS for many years. Thanks, Gao Xiang > > Regards, > Cong ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-26 19:48 ` Cong Wang 2026-01-26 20:13 ` Gao Xiang @ 2026-01-26 20:40 ` Matthew Wilcox 2026-01-27 0:02 ` Cong Wang 1 sibling, 1 reply; 7+ messages in thread From: Matthew Wilcox @ 2026-01-26 20:40 UTC (permalink / raw) To: Cong Wang; +Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote: > Specifically for this scenario, struct inode is not compatible. This > could rule out a lot of existing filesystems, except read-only ones. I don't think you understand that there's a difference between *on disk* inode and *in core* inode. Compare and contrast struct ext2_inode and struct inode. > Now back to EROFS, it is still based on a block device, which > itself can't be shared among different kernels. ramdax is actually > a perfect example here, its label_area can't be shared among > different kernels. > > Let's take one step back: even if we really could share a device > with multiple kernels, it still could not share the memory footprint, > with DAX + EROFS, we would still get: > 1) Each kernel creates its own DAX mappings > 2) And faults pages independently > > There is no cross-kernel page sharing accounting. > > I hope this makes sense. No, it doesn't. I'm not suggesting that you use erofs unchanged, I'm suggesting that you modify erofs to support your needs. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-26 20:40 ` Matthew Wilcox @ 2026-01-27 0:02 ` Cong Wang 2026-01-27 0:55 ` Gao Xiang 0 siblings, 1 reply; 7+ messages in thread From: Cong Wang @ 2026-01-27 0:02 UTC (permalink / raw) To: Matthew Wilcox Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote: > > Specifically for this scenario, struct inode is not compatible. This > > could rule out a lot of existing filesystems, except read-only ones. > > I don't think you understand that there's a difference between *on disk* > inode and *in core* inode. Compare and contrast struct ext2_inode and > struct inode. > > > Now back to EROFS, it is still based on a block device, which > > itself can't be shared among different kernels. ramdax is actually > > a perfect example here, its label_area can't be shared among > > different kernels. > > > > Let's take one step back: even if we really could share a device > > with multiple kernels, it still could not share the memory footprint, > > with DAX + EROFS, we would still get: > > 1) Each kernel creates its own DAX mappings > > 2) And faults pages independently > > > > There is no cross-kernel page sharing accounting. > > > > I hope this makes sense. > > No, it doesn't. I'm not suggesting that you use erofs unchanged, I'm > suggesting that you modify erofs to support your needs. I just tried: https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3 Unfortunately, the multi-kernel derivation is still there and probably hard to eliminate without re-architecting EROFS; here is why: DAXFS Inode (line 202-216): struct daxfs_base_inode { __le32 ino; __le32 mode; ... __le64 size; __le64 data_offset; /* ← INTRINSIC: stored directly in inode */ ...
}; DAXFS Read Path: // Pseudocode - what DAXFS does void *data = base + inode->data_offset + file_offset; copy_to_iter(data, len, to); // DONE. No metadata parsing, no derivation. EROFS Read Path: // What EROFS does (even in memory mode) struct erofs_map_blocks map = { .m_la = pos }; erofs_map_blocks(inode, &map); // ← DERIVES physical address // Inside erofs_map_blocks(): // - Check inode layout type (compact? extended? chunk-indexed?) // - For chunk-indexed: walk chunk table // - For plain: compute from inode // - Handle inline data, holes, compression... src = base + map.m_pa; Please let me know if I missed anything here. The speculative branching support is also harder for EROFS; please see my updated README here: https://github.com/multikernel/daxfs/blob/main/README.md (Skip to the Branching section.) Thanks. Cong Wang ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory 2026-01-27 0:02 ` Cong Wang @ 2026-01-27 0:55 ` Gao Xiang 0 siblings, 0 replies; 7+ messages in thread From: Gao Xiang @ 2026-01-27 0:55 UTC (permalink / raw) To: Cong Wang, Matthew Wilcox Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel On 2026/1/27 08:02, Cong Wang wrote: > On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote: >>> Specifically for this scenario, struct inode is not compatible. This >>> could rule out a lot of existing filesystems, except read-only ones. >> >> I don't think you understand that there's a difference between *on disk* >> inode and *in core* inode. Compare and contrast struct ext2_inode and >> struct inode. >> >>> Now back to EROFS, it is still based on a block device, which >>> itself can't be shared among different kernels. ramdax is actually >>> a perfect example here, its label_area can't be shared among >>> different kernels. >>> >>> Let's take one step back: even if we really could share a device >>> with multiple kernels, it still could not share the memory footprint, >>> with DAX + EROFS, we would still get: >>> 1) Each kernel creates its own DAX mappings >>> 2) And faults pages independently >>> >>> There is no cross-kernel page sharing accounting. >>> >>> I hope this makes sense. >> >> No, it doesn't. I'm not suggesting that you use erofs unchanged, I'm >> suggesting that you modify erofs to support your needs. > > I just tried: > https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3 > > Unfortunately, the multi-kernel derivation is still there and probably > hard to eliminate without re-architecturing EROFS, here is why: > > DAXFS Inode (line 202-216): > > struct daxfs_base_inode { > __le32 ino; > __le32 mode; > ... > __le64 size; > __le64 data_offset; /* ← INTRINSIC: stored directly in inode > */ > ... 
> }; > > DAXFS Read Path: > // Pseudocode - what DAXFS does > void *data = base + inode->data_offset + file_offset; > copy_to_iter(data, len, to); > // DONE. No metadata parsing, no derivation. Then how do you handle memory-mapped cases? Your inode->data_offset still needs to be PAGE_SIZE-aligned, no? What happens with an image with unaligned data offsets? And why bother with copy_to_iter in your filesystem itself rather than using the upstream DAX infrastructure? Also, where do you handle a malicious `child_ino` if sub-directories can generate a loop (from your on-disk design)? How does it deal with hardlinks, btw? > > EROFS Read Path: > // What EROFS does (even in memory mode) > struct erofs_map_blocks map = { .m_la = pos }; > erofs_map_blocks(inode, &map); // ← DERIVES physical address > // Inside erofs_map_blocks(): > // - Check inode layout type (compact? extended? > chunk-indexed?) > // - For chunk-indexed: walk chunk table > // - For plain: compute from inode > // - Handle inline data, holes, compression... > src = base + map.m_pa; > > Please let me know if I miss anything here. Your expression above is very vague, so I don't know how to react to it. I basically would like to say that your basic use case just needs plain EROFS inodes (both the compact & extended on-disk core inodes have a raw_blkaddr, and raw_blkaddr * PAGE_SIZE is what you called `inode->data_offset`). You could just ignore the EROFS compressed layout, since it needs to use the page cache for those inodes even with EROFS FSDAX, and your "DAXFS" doesn't deal with compression. Also, the expression above seems to be partially generated by AI, while I have to write more reasonable words myself; it seems unfair for me to reply in this thread. > > Also, the speculative branching support is also harder for EROFS, > please see my updated README here: > https://github.com/multikernel/daxfs/blob/main/README.md > (Skip to the Branching section.)
I also would like to discuss new use cases like "shared-memory DAX filesystem for AI agents", but my proposal is to redirect the whole write traffic into another filesystem (either a tmpfs or a real disk fs) and, when agents need to snapshot, generate a new read-only layer for memory sharing. The reason is that I really would like to keep the core EROFS format straightforward even for untrusted remote image usage. Also, on a second quick glance at your CoW approach, it just seems nonsense to a real filesystem developer. Anyway, it's not on me to prove your use cases; it's on you to convince people that they cannot be implemented with an existing fs plus enhancements. If upstreaming is your interest, file an LSFMMBPF topic to present your use cases for discussion, and I would like to join. If your interest is not upstreaming, please ignore all my replies. Thanks, Gao Xiang > > Thanks. > Cong Wang ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-01-27 0:55 UTC | newest]
Thread overview: 7+ messages
[not found] <CAGHCLaREA4xzP7CkJrpqu4C=PKw_3GppOUPWZKn0Fxom_3Z9Qw@mail.gmail.com>
[not found] ` <55e3d9f6-50d2-48c0-b7e3-fb1c144cf3e8@linux.alibaba.com>
2026-01-26 17:38 ` [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory Cong Wang
2026-01-26 19:16 ` Matthew Wilcox
2026-01-26 19:48 ` Cong Wang
2026-01-26 20:13 ` Gao Xiang
2026-01-26 20:40 ` Matthew Wilcox
2026-01-27 0:02 ` Cong Wang
2026-01-27 0:55 ` Gao Xiang