public inbox for multikernel@lists.linux.dev
* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
       [not found] ` <55e3d9f6-50d2-48c0-b7e3-fb1c144cf3e8@linux.alibaba.com>
@ 2026-01-26 17:38   ` Cong Wang
  2026-01-26 19:16     ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Cong Wang @ 2026-01-26 17:38 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel

Hi Xiang,

On Sun, Jan 25, 2026 at 8:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Cong,
>
> On 2026/1/25 01:10, Cong Wang wrote:
> > Hello,
> >
> > I would like to introduce DAXFS, a simple read-only filesystem
> > designed to operate directly on shared physical memory via DAX
> > (Direct Access).
> >
> > Unlike ramfs or tmpfs, which operate within the kernel’s page cache
> > and result in fragmented, per-instance memory allocation, DAXFS
> > provides a mechanism for zero-copy reads from contiguous memory
> > regions. It bypasses the traditional block I/O stack, buffer heads,
> > and page cache entirely.
> >
> > Key Features
> > - Zero-Copy Efficiency: File reads resolve to direct memory loads,
> > eliminating page cache duplication and CPU-driven copies.
> > - True Physical Sharing: By mapping a contiguous physical address or a
> > dma-buf, multiple kernel instances or containers can share the same
> > physical pages.
> > - Hardware Integration: Supports mounting memory exported by GPUs,
> > FPGAs, or CXL devices via the dma-buf API.
> > - Simplicity: Uses a self-contained, read-only image format with no
> > runtime allocation or complex device management.
> >
> > Primary Use Cases
> > - Multikernel Environments: Sharing a common Docker image across
> > independent kernel instances via shared memory.
> > - CXL Memory Pooling: Accessing read-only data across multiple hosts
> > without network I/O.
> > - Container Rootfs Sharing: Using a single DAXFS base image for
> > multiple containers (via OverlayFS) to save physical RAM.
> > - Accelerator Data: Zero-copy access to model weights or lookup tables
> > stored in device memory.
>
> Actually, EROFS DAX has already been used this way by various users,
> including for all of the usage above.
>
> Could you explain why EROFS doesn't suit your use cases?

EROFS does not support direct physical memory operations. As you
mentioned, it relies on other layers like ramdax to function in these
scenarios.

I have looked into ramdax, and it does not seem suitable for the
multikernel use case. Specifically, the trailing 128K label area is
shared across multiple kernels, which would cause significant issues.
For reference:

	dimm->label_area = memremap(start + size - LABEL_AREA_SIZE,
				    LABEL_AREA_SIZE, MEMREMAP_WB);
...
static int ramdax_set_config_data(struct nvdimm *nvdimm, int buf_len,
				  struct nd_cmd_set_config_hdr *cmd)
{
	struct ramdax_dimm *dimm = nvdimm_provider_data(nvdimm);

	if (sizeof(*cmd) > buf_len)
		return -EINVAL;
	if (struct_size(cmd, in_buf, cmd->in_length) > buf_len)
		return -EINVAL;
	if (size_add(cmd->in_offset, cmd->in_length) > LABEL_AREA_SIZE)
		return -EINVAL;

	memcpy(dimm->label_area + cmd->in_offset, cmd->in_buf, cmd->in_length);

	return 0;
}

Not to mention other cases like GPU/SmartNIC devices.

If you are interested in adding multikernel support to EROFS, here is
the codebase you could start with:
https://github.com/multikernel/linux. PR is always welcome.

Thanks,
Cong Wang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-26 17:38   ` [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory Cong Wang
@ 2026-01-26 19:16     ` Matthew Wilcox
  2026-01-26 19:48       ` Cong Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2026-01-26 19:16 UTC (permalink / raw)
  To: Cong Wang; +Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel

On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote:
> If you are interested in adding multikernel support to EROFS, here is
> the codebase you could start with:
> https://github.com/multikernel/linux. PR is always welcome.

I think the onus is rather the other way around.  Adding a new filesystem
to Linux has a high bar to clear because it becomes a maintenance burden
to the rest of us.  Convince us that what you're doing here *can't*
be done better by modifying erofs.

Before I saw the email from Gao Xiang, I was also going to suggest that
using erofs would be a better idea than supporting your own filesystem.
Writing a new filesystem is a lot of fun.  Supporting a new filesystem
and making it production-quality is a whole lot of pain.  It's much
better if you can leverage other people's work.  That's why DAX is a
support layer for filesystems rather than its own filesystem.


* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-26 19:16     ` Matthew Wilcox
@ 2026-01-26 19:48       ` Cong Wang
  2026-01-26 20:13         ` Gao Xiang
  2026-01-26 20:40         ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Cong Wang @ 2026-01-26 19:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel

On Mon, Jan 26, 2026 at 11:16 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote:
> > If you are interested in adding multikernel support to EROFS, here is
> > the codebase you could start with:
> > https://github.com/multikernel/linux. PR is always welcome.
>
> I think the onus is rather the other way around.  Adding a new filesystem
> to Linux has a high bar to clear because it becomes a maintenance burden
> to the rest of us.  Convince us that what you're doing here *can't*
> be done better by modifying erofs.
>
> Before I saw the email from Gao Xiang, I was also going to suggest that
> using erofs would be a better idea than supporting your own filesystem.
> Writing a new filesystem is a lot of fun.  Supporting a new filesystem
> and making it production-quality is a whole lot of pain.  It's much
> better if you can leverage other people's work.  That's why DAX is a
> support layer for filesystems rather than its own filesystem.

Great question.

The core reason is that multikernel assumes little to no compatibility
between kernel instances.

Specifically for this scenario, struct inode is not compatible. This
rules out a lot of existing filesystems, except read-only ones.

Now back to EROFS: it is still based on a block device, which
itself can't be shared among different kernels. ramdax is actually
a perfect example here: its label_area can't be shared among
different kernels.

Let's take one step back: even if we really could share a device
among multiple kernels, the memory footprint still could not be
shared. With DAX + EROFS, we would still get:
1) Each kernel creates its own DAX mappings
2) And faults pages independently

There is no cross-kernel page sharing accounting.

I hope this makes sense.

Regards,
Cong


* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-26 19:48       ` Cong Wang
@ 2026-01-26 20:13         ` Gao Xiang
  2026-01-26 20:40         ` Matthew Wilcox
  1 sibling, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2026-01-26 20:13 UTC (permalink / raw)
  To: Cong Wang, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel



On 2026/1/27 03:48, Cong Wang wrote:
> On Mon, Jan 26, 2026 at 11:16 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Mon, Jan 26, 2026 at 09:38:23AM -0800, Cong Wang wrote:
>>> If you are interested in adding multikernel support to EROFS, here is
>>> the codebase you could start with:
>>> https://github.com/multikernel/linux. PR is always welcome.
>>
>> I think the onus is rather the other way around.  Adding a new filesystem
>> to Linux has a high bar to clear because it becomes a maintenance burden
>> to the rest of us.  Convince us that what you're doing here *can't*
>> be done better by modifying erofs.
>>
>> Before I saw the email from Gao Xiang, I was also going to suggest that
>> using erofs would be a better idea than supporting your own filesystem.
>> Writing a new filesystem is a lot of fun.  Supporting a new filesystem
>> and making it production-quality is a whole lot of pain.  It's much
>> better if you can leverage other people's work.  That's why DAX is a
>> support layer for filesystems rather than its own filesystem.
> 
> Great question.
> 
> The core reason is multikernel assumes little to none compatibility.
> 
> Specifically for this scenario, struct inode is not compatible. This
> could rule out a lot of existing filesystems, except read-only ones.

I don't quite get the point here, assuming you know filesystems.

> 
> Now back to EROFS, it is still based on a block device, which
> itself can't be shared among different kernels. ramdax is actually
> a perfect example here, its label_area can't be shared among
> different kernels.
> 
> Let's take one step back: even if we really could share a device
> with multiple kernels, it still could not share the memory footprint,
> with DAX + EROFS, we would still get:
> 1) Each kernel creates its own DAX mappings
> 2) And faults pages independently
> 
> There is no cross-kernel page sharing accounting.
> 
> I hope this makes sense.

No, the EROFS on-disk format is designed for any backend, so you
could use this format backed by:
  1) a raw block device
  2) a file
  3) a pure ramdaxfs (still WIP)

Why? Because an ordinary container image user doesn't assume a
filesystem image is tied to a particular type of device, especially
for golden image usage.

You cannot say: oh, I built an image, but you can only use it for
ramdax; or, it's backed by a file on a block device, so you have to
convert it to another format first:

  The EROFS on-disk format should allow for _all device backends_.

At a quick glance at your code, it seems quite premature and
inefficient, because subdirectories are chained like a linked list.
Maybe that is somewhat reasonable for ramdax usage, but it's still
_not_ cache-friendly.

The reason it doesn't work for you is that _multikernel_ isn't an
official upstream requirement; all upstream virtualization users
directly use virtio-pmem now.

For the upstream kernels, I think you'd need to make multikernel
an official upstream requirement first; then there will be drivers
for you to do multikernel ramdax, rather than the raw usage of
  1) memremap
  2) vmf_insert_mixed

in the filesystem drivers. I do think these are a _red line_ for
any new filesystem driver (legacy cramfs MTD XIP code aside).

Anyway, I really think your current use cases have already been
covered by EROFS for many years.

Thanks,
Gao Xiang

> 
> Regards,
> Cong



* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-26 19:48       ` Cong Wang
  2026-01-26 20:13         ` Gao Xiang
@ 2026-01-26 20:40         ` Matthew Wilcox
  2026-01-27  0:02           ` Cong Wang
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2026-01-26 20:40 UTC (permalink / raw)
  To: Cong Wang; +Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel

On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote:
> Specifically for this scenario, struct inode is not compatible. This
> could rule out a lot of existing filesystems, except read-only ones.

I don't think you understand that there's a difference between *on disk*
inode and *in core* inode.  Compare and contrast struct ext2_inode and
struct inode.

> Now back to EROFS, it is still based on a block device, which
> itself can't be shared among different kernels. ramdax is actually
> a perfect example here, its label_area can't be shared among
> different kernels.
> 
> Let's take one step back: even if we really could share a device
> with multiple kernels, it still could not share the memory footprint,
> with DAX + EROFS, we would still get:
> 1) Each kernel creates its own DAX mappings
> 2) And faults pages independently
> 
> There is no cross-kernel page sharing accounting.
> 
> I hope this makes sense.

No, it doesn't.  I'm not suggesting that you use erofs unchanged, I'm
suggesting that you modify erofs to support your needs.


* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-26 20:40         ` Matthew Wilcox
@ 2026-01-27  0:02           ` Cong Wang
  2026-01-27  0:55             ` Gao Xiang
  0 siblings, 1 reply; 7+ messages in thread
From: Cong Wang @ 2026-01-27  0:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, linux-fsdevel, linux-kernel, Cong Wang, multikernel

On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote:
> > Specifically for this scenario, struct inode is not compatible. This
> > could rule out a lot of existing filesystems, except read-only ones.
>
> I don't think you understand that there's a difference between *on disk*
> inode and *in core* inode.  Compare and contrast struct ext2_inode and
> struct inode.
>
> > Now back to EROFS, it is still based on a block device, which
> > itself can't be shared among different kernels. ramdax is actually
> > a perfect example here, its label_area can't be shared among
> > different kernels.
> >
> > Let's take one step back: even if we really could share a device
> > with multiple kernels, it still could not share the memory footprint,
> > with DAX + EROFS, we would still get:
> > 1) Each kernel creates its own DAX mappings
> > 2) And faults pages independently
> >
> > There is no cross-kernel page sharing accounting.
> >
> > I hope this makes sense.
>
> No, it doesn't.  I'm not suggesting that you use erofs unchanged, I'm
> suggesting that you modify erofs to support your needs.

I just tried:
https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3

Unfortunately, the derivation step is still there and probably
hard to eliminate without re-architecting EROFS; here is why:

  DAXFS Inode (lines 202-216):

  struct daxfs_base_inode {
      __le32 ino;
      __le32 mode;
      ...
      __le64 size;
      __le64 data_offset;    /* ← INTRINSIC: stored directly in inode */
      ...
  };

 DAXFS Read Path:
  // Pseudocode - what DAXFS does
  void *data = base + inode->data_offset + file_offset;
  copy_to_iter(data, len, to);
  // DONE. No metadata parsing, no derivation.

 EROFS Read Path:
  // What EROFS does (even in memory mode)
  struct erofs_map_blocks map = { .m_la = pos };
  erofs_map_blocks(inode, &map);  // ← DERIVES physical address
      // Inside erofs_map_blocks():
      //   - Check inode layout type (compact? extended? chunk-indexed?)
      //   - For chunk-indexed: walk chunk table
      //   - For plain: compute from inode
      //   - Handle inline data, holes, compression...
  src = base + map.m_pa;

Please let me know if I missed anything here.

Also, speculative branching support is harder for EROFS; please
see my updated README here:
https://github.com/multikernel/daxfs/blob/main/README.md
(Skip to the Branching section.)

Thanks.
Cong Wang


* Re: [ANNOUNCE] DAXFS: A zero-copy, dmabuf-friendly filesystem for shared memory
  2026-01-27  0:02           ` Cong Wang
@ 2026-01-27  0:55             ` Gao Xiang
  0 siblings, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2026-01-27  0:55 UTC (permalink / raw)
  To: Cong Wang, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, Cong Wang, multikernel



On 2026/1/27 08:02, Cong Wang wrote:
> On Mon, Jan 26, 2026 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Mon, Jan 26, 2026 at 11:48:20AM -0800, Cong Wang wrote:
>>> Specifically for this scenario, struct inode is not compatible. This
>>> could rule out a lot of existing filesystems, except read-only ones.
>>
>> I don't think you understand that there's a difference between *on disk*
>> inode and *in core* inode.  Compare and contrast struct ext2_inode and
>> struct inode.
>>
>>> Now back to EROFS, it is still based on a block device, which
>>> itself can't be shared among different kernels. ramdax is actually
>>> a perfect example here, its label_area can't be shared among
>>> different kernels.
>>>
>>> Let's take one step back: even if we really could share a device
>>> with multiple kernels, it still could not share the memory footprint,
>>> with DAX + EROFS, we would still get:
>>> 1) Each kernel creates its own DAX mappings
>>> 2) And faults pages independently
>>>
>>> There is no cross-kernel page sharing accounting.
>>>
>>> I hope this makes sense.
>>
>> No, it doesn't.  I'm not suggesting that you use erofs unchanged, I'm
>> suggesting that you modify erofs to support your needs.
> 
> I just tried:
> https://github.com/multikernel/linux/commit/a6dc3351e78fc2028e4ca0ea02e781ca0bfefea3
> 
> Unfortunately, the multi-kernel derivation is still there and probably
> hard to eliminate without re-architecturing EROFS, here is why:
> 
>    DAXFS Inode (line 202-216):
> 
>    struct daxfs_base_inode {
>        __le32 ino;
>        __le32 mode;
>        ...
>        __le64 size;
>        __le64 data_offset;    /* ← INTRINSIC: stored directly in inode */
>        ...
>    };
> 
>   DAXFS Read Path:
>    // Pseudocode - what DAXFS does
>    void *data = base + inode->data_offset + file_offset;
>    copy_to_iter(data, len, to);
>    // DONE. No metadata parsing, no derivation.

Then how do you handle memory-mapped cases? Your
inode->data_offset still needs to be PAGE_SIZE aligned, no?

What happens with an image that has unaligned data offsets?

And why bother with copy_to_iter in your filesystem itself
rather than using the upstream DAX infrastructure?

Also, where do you handle a malicious `child_ino` if
sub-directories can form a loop (given your on-disk
design)? How does it deal with hardlinks, btw?

> 
>   EROFS Read Path:
>    // What EROFS does (even in memory mode)
>    struct erofs_map_blocks map = { .m_la = pos };
>    erofs_map_blocks(inode, &map);  // ← DERIVES physical address
>        // Inside erofs_map_blocks():
>        //   - Check inode layout type (compact? extended? chunk-indexed?)
>        //   - For chunk-indexed: walk chunk table
>        //   - For plain: compute from inode
>        //   - Handle inline data, holes, compression...
>    src = base + map.m_pa;
> 
> Please let me know if I miss anything here.

Your description above is very vague, so I don't know how
to respond to it.

I basically would like to say that your basic use case just
needs plain EROFS inodes (both the compact & extended on-disk
core inodes have a raw_blkaddr, and raw_blkaddr * PAGE_SIZE
is what you called `inode->data_offset`).

You could just ignore the EROFS compressed layout, since
it needs to use the page cache for those inodes even with
EROFS FSDAX, and your "DAXFS" doesn't deal with
compression.

Also, the text above seems to be partially generated by
AI, while I have to write more reasonable words myself;
it seems unfair for me to reply in this thread.

> 
> Also, the speculative branching support is also harder for EROFS,
> please see my updated README here:
> https://github.com/multikernel/daxfs/blob/main/README.md
> (Skip to the Branching section.)

I also would like to discuss new use cases like a
"shared-memory DAX filesystem for AI agents", but my
proposal is to redirect the whole write traffic into
another filesystem (either a tmpfs or a real disk fs) and,
when agents need to snapshot, generate a new read-only
layer for memory sharing. The reason is that I really
would like to keep the core EROFS format straightforward
even for untrusted remote image usage.

Also, from a second quick glance at your CoW approach, it
just seems like nonsense to a real filesystem developer.
Anyway, it's not on me to prove things here; it's on you
to convince people that your use cases cannot be covered
by an existing fs with enhancements.

If upstreaming is your interest, file an LSF/MM/BPF topic to
present your use cases for discussion, and I would like
to join the discussion. If your interest is not
upstreaming, please ignore all my replies.

Thanks,
Gao Xiang

> 
> Thanks.
> Cong Wang


