* [PATCH v15 1/9] fs: Export alloc_empty_backing_file
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type` Hongbo Li
` (8 subsequent siblings)
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
There is no need to open a nonexistent real file when a backing file
is not actually backed by one (e.g., EROFS page cache sharing does not
need to reopen a typical real file).
Therefore, export the alloc_empty_backing_file() helper so that
filesystems can set up a backing file dynamically without opening a
real file. This is particularly useful for obtaining the correct @path
and @inode when calling file_user_path() and file_user_inode().
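For context, the intended usage pattern looks roughly as follows. This is a
trimmed, hedged sketch of erofs_ishare_file_open() from patch 5 of this
series; the function name example_open() is illustrative and error handling
is abbreviated:

```c
/* Sketch only: modeled on erofs_ishare_file_open() later in this series. */
static int example_open(struct inode *shared, struct file *file)
{
	struct file *realfile;

	/* No real file is opened; the backing file starts out empty. */
	realfile = alloc_empty_backing_file(O_RDONLY | O_NOATIME,
					    current_cred());
	if (IS_ERR(realfile))
		return PTR_ERR(realfile);

	/*
	 * Record the user-visible path so that file_user_path() and
	 * file_user_inode() report the original file, not the shared one.
	 */
	path_get(&file->f_path);
	backing_file_set_user_path(realfile, &file->f_path);

	file->private_data = realfile;
	return 0;
}
```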
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
---
fs/file_table.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..476edfe7d8f5 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -308,6 +308,7 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
ff->file.f_mode |= FMODE_BACKING | FMODE_NOACCOUNT;
return &ff->file;
}
+EXPORT_SYMBOL_GPL(alloc_empty_backing_file);
/**
* file_init_path - initialize a 'struct file' based on path
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
2026-01-16 9:55 ` [PATCH v15 1/9] fs: Export alloc_empty_backing_file Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 15:38 ` Christoph Hellwig
2026-01-16 9:55 ` [PATCH v15 3/9] erofs: support user-defined fingerprint name Hongbo Li
` (7 subsequent siblings)
9 siblings, 1 reply; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Gao Xiang <hsiangkao@linux.alibaba.com>
- Move `struct erofs_anon_fs_type` to super.c and expose it
in preparation for the upcoming page cache sharing feature;
- Remove the `.owner` field, as these mounts are all internal and
fully managed by EROFS. Retaining `.owner` would unnecessarily
increment the module reference count, preventing the EROFS kernel
module from being unloaded.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/fscache.c | 13 -------------
fs/erofs/internal.h | 2 ++
fs/erofs/super.c | 14 ++++++++++++++
3 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 7a346e20f7b7..f4937b025038 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -3,7 +3,6 @@
* Copyright (C) 2022, Alibaba Cloud
* Copyright (C) 2022, Bytedance Inc. All rights reserved.
*/
-#include <linux/pseudo_fs.h>
#include <linux/fscache.h>
#include "internal.h"
@@ -13,18 +12,6 @@ static LIST_HEAD(erofs_domain_list);
static LIST_HEAD(erofs_domain_cookies_list);
static struct vfsmount *erofs_pseudo_mnt;
-static int erofs_anon_init_fs_context(struct fs_context *fc)
-{
- return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type erofs_anon_fs_type = {
- .owner = THIS_MODULE,
- .name = "pseudo_erofs",
- .init_fs_context = erofs_anon_init_fs_context,
- .kill_sb = kill_anon_super,
-};
-
struct erofs_fscache_io {
struct netfs_cache_resources cres;
struct iov_iter iter;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index f7f622836198..98fe652aea33 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -188,6 +188,8 @@ static inline bool erofs_is_fileio_mode(struct erofs_sb_info *sbi)
return IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) && sbi->dif0.file;
}
+extern struct file_system_type erofs_anon_fs_type;
+
static inline bool erofs_is_fscache_mode(struct super_block *sb)
{
return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) &&
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..f18f43b78fca 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -11,6 +11,7 @@
#include <linux/fs_parser.h>
#include <linux/exportfs.h>
#include <linux/backing-dev.h>
+#include <linux/pseudo_fs.h>
#include "xattr.h"
#define CREATE_TRACE_POINTS
@@ -936,6 +937,19 @@ static struct file_system_type erofs_fs_type = {
};
MODULE_ALIAS_FS("erofs");
+#if defined(CONFIG_EROFS_FS_ONDEMAND)
+static int erofs_anon_init_fs_context(struct fs_context *fc)
+{
+ return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
+}
+
+struct file_system_type erofs_anon_fs_type = {
+ .name = "pseudo_erofs",
+ .init_fs_context = erofs_anon_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+#endif
+
static int __init erofs_module_init(void)
{
int err;
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-16 9:55 ` [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type` Hongbo Li
@ 2026-01-16 15:38 ` Christoph Hellwig
2026-01-19 1:34 ` Hongbo Li
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:38 UTC (permalink / raw)
To: Hongbo Li
Cc: hsiangkao, chao, brauner, djwong, amir73il, hch, linux-fsdevel,
linux-erofs, linux-kernel
> +#if defined(CONFIG_EROFS_FS_ONDEMAND)
Normally this would just use #ifdef.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-16 15:38 ` Christoph Hellwig
@ 2026-01-19 1:34 ` Hongbo Li
2026-01-19 1:44 ` Gao Xiang
0 siblings, 1 reply; 46+ messages in thread
From: Hongbo Li @ 2026-01-19 1:34 UTC (permalink / raw)
To: hsiangkao
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Christoph Hellwig
Hi, Xiang
On 2026/1/16 23:38, Christoph Hellwig wrote:
>> +#if defined(CONFIG_EROFS_FS_ONDEMAND)
>
> Normally this would just use #ifdef.
>
How about using #ifdef for all of them? I checked and there are only
three places in total, and all of them are related to
FS_PAGE_CACHE_SHARE or FS_ONDEMAND config macro.
Thanks,
Hongbo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-19 1:34 ` Hongbo Li
@ 2026-01-19 1:44 ` Gao Xiang
2026-01-19 2:23 ` Hongbo Li
2026-01-19 7:28 ` Christoph Hellwig
0 siblings, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 1:44 UTC (permalink / raw)
To: Hongbo Li
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Christoph Hellwig
On 2026/1/19 09:34, Hongbo Li wrote:
> Hi, Xiang
>
> On 2026/1/16 23:38, Christoph Hellwig wrote:
>>> +#if defined(CONFIG_EROFS_FS_ONDEMAND)
>>
>> Normally this would just use #ifdef.
>>
> How about using #ifdef for all of them? I checked and there are only three places in total, and all of them are related to FS_PAGE_CACHE_SHARE or FS_ONDEMAND config macro.
I'm fine with most cases (including here).
But I'm not sure if there is a case as `#if defined() || defined()`,
it seems it cannot be simply replaced with `#ifdef`.
Thanks,
Gao Xiang
>
> Thanks,
> Hongbo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-19 1:44 ` Gao Xiang
@ 2026-01-19 2:23 ` Hongbo Li
2026-01-19 7:28 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-19 2:23 UTC (permalink / raw)
To: Gao Xiang, Christoph Hellwig
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel
Hi,
On 2026/1/19 9:44, Gao Xiang wrote:
>
>
> On 2026/1/19 09:34, Hongbo Li wrote:
>> Hi, Xiang
>>
>> On 2026/1/16 23:38, Christoph Hellwig wrote:
>>>> +#if defined(CONFIG_EROFS_FS_ONDEMAND)
>>>
>>> Normally this would just use #ifdef.
>>>
>> How about using #ifdef for all of them? I checked and there are only
>> three places in total, and all of them are related to
>> FS_PAGE_CACHE_SHARE or FS_ONDEMAND config macro.
>
> I'm fine with most cases (including here).
>
> But I'm not sure if there is a case as `#if defined() || defined()`,
> it seems it cannot be simply replaced with `#ifdef`.
>
Yeah, we cannot replace it in this case, so I will keep it here because
it will be changed into `#if defined() || defined()` in the following
steps. In other places I will keep using this style, such as:
```
#if defined(CONFIG_EROFS_FS_PAGE_CACHE_SHARE)
set_opt(&sbi->opt, INODE_SHARE);
#else
...
```
Thanks,
Hongbo
> Thanks,
> Gao Xiang
>
>>
>> Thanks,
>> Hongbo
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type`
2026-01-19 1:44 ` Gao Xiang
2026-01-19 2:23 ` Hongbo Li
@ 2026-01-19 7:28 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-19 7:28 UTC (permalink / raw)
To: Gao Xiang
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel, Christoph Hellwig
On Mon, Jan 19, 2026 at 09:44:59AM +0800, Gao Xiang wrote:
> But I'm not sure if there is a case as `#if defined() || defined()`,
> it seems it cannot be simply replaced with `#ifdef`.
They can't. If you have multiple tests combined using operators
you need to use #if and defined().
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v15 3/9] erofs: support user-defined fingerprint name
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
2026-01-16 9:55 ` [PATCH v15 1/9] fs: Export alloc_empty_backing_file Hongbo Li
2026-01-16 9:55 ` [PATCH v15 2/9] erofs: decouple `struct erofs_anon_fs_type` Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 4/9] erofs: support domain-specific page cache share Hongbo Li
` (6 subsequent siblings)
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
When creating an EROFS image, users can specify the fingerprint name.
This prepares for the upcoming inode page cache sharing feature.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/Kconfig | 9 +++++++++
fs/erofs/erofs_fs.h | 5 +++--
fs/erofs/internal.h | 2 ++
fs/erofs/super.c | 9 +++++++++
fs/erofs/xattr.c | 13 +++++++++++++
5 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index d81f3318417d..b71f2a8074fe 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -194,3 +194,12 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
at higher priority.
If unsure, say N.
+
+config EROFS_FS_PAGE_CACHE_SHARE
+ bool "EROFS page cache share support (experimental)"
+ depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
+ help
+ This enables page cache sharing among inodes with identical
+ content fingerprints on the same machine.
+
+ If unsure, say N.
diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index e24268acdd62..b30a74d307c5 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -17,7 +17,7 @@
#define EROFS_FEATURE_COMPAT_XATTR_FILTER 0x00000004
#define EROFS_FEATURE_COMPAT_SHARED_EA_IN_METABOX 0x00000008
#define EROFS_FEATURE_COMPAT_PLAIN_XATTR_PFX 0x00000010
-
+#define EROFS_FEATURE_COMPAT_ISHARE_XATTRS 0x00000020
/*
* Any bits that aren't in EROFS_ALL_FEATURE_INCOMPAT should
@@ -83,7 +83,8 @@ struct erofs_super_block {
__le32 xattr_prefix_start; /* start of long xattr prefixes */
__le64 packed_nid; /* nid of the special packed inode */
__u8 xattr_filter_reserved; /* reserved for xattr name filter */
- __u8 reserved[3];
+ __u8 ishare_xattr_prefix_id;
+ __u8 reserved[2];
__le32 build_time; /* seconds added to epoch for mkfs time */
__le64 rootnid_8b; /* (48BIT on) nid of root directory */
__le64 reserved2;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 98fe652aea33..ec79e8b44d3b 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -134,6 +134,7 @@ struct erofs_sb_info {
u32 xattr_blkaddr;
u32 xattr_prefix_start;
u8 xattr_prefix_count;
+ u8 ishare_xattr_prefix_id;
struct erofs_xattr_prefix_item *xattr_prefixes;
unsigned int xattr_filter_reserved;
#endif
@@ -238,6 +239,7 @@ EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)
EROFS_FEATURE_FUNCS(xattr_filter, compat, COMPAT_XATTR_FILTER)
EROFS_FEATURE_FUNCS(shared_ea_in_metabox, compat, COMPAT_SHARED_EA_IN_METABOX)
EROFS_FEATURE_FUNCS(plain_xattr_pfx, compat, COMPAT_PLAIN_XATTR_PFX)
+EROFS_FEATURE_FUNCS(ishare_xattrs, compat, COMPAT_ISHARE_XATTRS)
static inline u64 erofs_nid_to_ino64(struct erofs_sb_info *sbi, erofs_nid_t nid)
{
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index f18f43b78fca..dca1445f6c92 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -320,6 +320,15 @@ static int erofs_read_superblock(struct super_block *sb)
sbi->xattr_prefix_start = le32_to_cpu(dsb->xattr_prefix_start);
sbi->xattr_prefix_count = dsb->xattr_prefix_count;
sbi->xattr_filter_reserved = dsb->xattr_filter_reserved;
+ if (erofs_sb_has_ishare_xattrs(sbi)) {
+ if (dsb->ishare_xattr_prefix_id >= sbi->xattr_prefix_count) {
+ erofs_err(sb, "invalid ishare xattr prefix id %u",
+ dsb->ishare_xattr_prefix_id);
+ ret = -EFSCORRUPTED;
+ goto out;
+ }
+ sbi->ishare_xattr_prefix_id = dsb->ishare_xattr_prefix_id;
+ }
#endif
sbi->islotbits = ilog2(sizeof(struct erofs_inode_compact));
if (erofs_sb_has_48bit(sbi) && dsb->rootnid_8b) {
diff --git a/fs/erofs/xattr.c b/fs/erofs/xattr.c
index 396536d9a862..ae61f20cb861 100644
--- a/fs/erofs/xattr.c
+++ b/fs/erofs/xattr.c
@@ -519,6 +519,19 @@ int erofs_xattr_prefixes_init(struct super_block *sb)
}
erofs_put_metabuf(&buf);
+ if (!ret && erofs_sb_has_ishare_xattrs(sbi)) {
+ struct erofs_xattr_prefix_item *pf = pfs + sbi->ishare_xattr_prefix_id;
+ struct erofs_xattr_long_prefix *newpfx;
+
+ newpfx = krealloc(pf->prefix,
+ sizeof(*newpfx) + pf->infix_len + 1, GFP_KERNEL);
+ if (newpfx) {
+ newpfx->infix[pf->infix_len] = '\0';
+ pf->prefix = newpfx;
+ } else {
+ ret = -ENOMEM;
+ }
+ }
sbi->xattr_prefixes = pfs;
if (ret)
erofs_xattr_prefixes_cleanup(sb);
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 4/9] erofs: support domain-specific page cache share
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (2 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 3/9] erofs: support user-defined fingerprint name Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 5/9] erofs: introduce the page cache share feature Hongbo Li
` (5 subsequent siblings)
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
Only files in the same domain will share the page cache. Also adjust
the sysfs-related naming in preparation for the upcoming page cache
sharing feature.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/super.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index dca1445f6c92..960da62636ad 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -524,6 +524,8 @@ static int erofs_fc_parse_param(struct fs_context *fc,
if (!sbi->fsid)
return -ENOMEM;
break;
+#endif
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_PAGE_CACHE_SHARE)
case Opt_domain_id:
kfree(sbi->domain_id);
sbi->domain_id = kstrdup(param->string, GFP_KERNEL);
@@ -624,7 +626,7 @@ static void erofs_set_sysfs_name(struct super_block *sb)
{
struct erofs_sb_info *sbi = EROFS_SB(sb);
- if (sbi->domain_id)
+ if (sbi->domain_id && !erofs_sb_has_ishare_xattrs(sbi))
super_set_sysfs_name_generic(sb, "%s,%s", sbi->domain_id,
sbi->fsid);
else if (sbi->fsid)
@@ -1054,12 +1056,10 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",dax=never");
if (erofs_is_fileio_mode(sbi) && test_opt(opt, DIRECT_IO))
seq_puts(seq, ",directio");
-#ifdef CONFIG_EROFS_FS_ONDEMAND
if (sbi->fsid)
seq_printf(seq, ",fsid=%s", sbi->fsid);
if (sbi->domain_id)
seq_printf(seq, ",domain_id=%s", sbi->domain_id);
-#endif
if (sbi->dif0.fsoff)
seq_printf(seq, ",fsoffset=%llu", sbi->dif0.fsoff);
return 0;
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (3 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 4/9] erofs: support domain-specific page cache share Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 15:46 ` Christoph Hellwig
2026-01-20 14:19 ` Gao Xiang
2026-01-16 9:55 ` [PATCH v15 6/9] erofs: pass inode to trace_erofs_read_folio Hongbo Li
` (4 subsequent siblings)
9 siblings, 2 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
Currently, reading files with different paths (or names) but the same
content will consume multiple copies of the page cache, even if the
content of these page caches is the same. For example, reading
identical files (e.g., *.so files) from two different minor versions of
container images will cost multiple copies of the same page cache,
since different containers have different mount points. Therefore,
sharing the page cache for files with the same content can save memory.
This introduces the page cache sharing feature in erofs. It allocates
a deduplicated inode and uses its page cache as the shared one. Reads
of files with identical content are ultimately routed to the page cache
of the deduplicated inode. In this way, a single page cache satisfies
multiple read requests for different files with the same content.
The new inode_share mount option enables the page cache sharing mode
at mount time.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
Documentation/filesystems/erofs.rst | 5 +
fs/erofs/Makefile | 1 +
fs/erofs/inode.c | 24 +----
fs/erofs/internal.h | 57 ++++++++++
fs/erofs/ishare.c | 161 ++++++++++++++++++++++++++++
fs/erofs/super.c | 56 +++++++++-
fs/erofs/xattr.c | 34 ++++++
fs/erofs/xattr.h | 3 +
8 files changed, 316 insertions(+), 25 deletions(-)
create mode 100644 fs/erofs/ishare.c
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index 08194f194b94..27d3caa3c73c 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -128,7 +128,12 @@ device=%s Specify a path to an extra device to be used together.
fsid=%s Specify a filesystem image ID for Fscache back-end.
domain_id=%s Specify a domain ID in fscache mode so that different images
with the same blobs under a given domain ID can share storage.
+ Also used for inode page sharing mode which defines a sharing
+ domain.
fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
+inode_share Enable inode page sharing for this filesystem. Inodes with
+ identical content within the same domain ID can share the
+ page cache.
=================== =========================================================
Sysfs Entries
diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 549abc424763..a80e1762b607 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
+erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += ishare.o
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index bce98c845a18..202cbbb4eada 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -203,7 +203,6 @@ static int erofs_read_inode(struct inode *inode)
static int erofs_fill_inode(struct inode *inode)
{
- struct erofs_inode *vi = EROFS_I(inode);
int err;
trace_erofs_fill_inode(inode);
@@ -235,28 +234,7 @@ static int erofs_fill_inode(struct inode *inode)
}
mapping_set_large_folios(inode->i_mapping);
- if (erofs_inode_is_data_compressed(vi->datalayout)) {
-#ifdef CONFIG_EROFS_FS_ZIP
- DO_ONCE_LITE_IF(inode->i_blkbits != PAGE_SHIFT,
- erofs_info, inode->i_sb,
- "EXPERIMENTAL EROFS subpage compressed block support in use. Use at your own risk!");
- inode->i_mapping->a_ops = &z_erofs_aops;
-#else
- err = -EOPNOTSUPP;
-#endif
- } else {
- inode->i_mapping->a_ops = &erofs_aops;
-#ifdef CONFIG_EROFS_FS_ONDEMAND
- if (erofs_is_fscache_mode(inode->i_sb))
- inode->i_mapping->a_ops = &erofs_fscache_access_aops;
-#endif
-#ifdef CONFIG_EROFS_FS_BACKED_BY_FILE
- if (erofs_is_fileio_mode(EROFS_SB(inode->i_sb)))
- inode->i_mapping->a_ops = &erofs_fileio_aops;
-#endif
- }
-
- return err;
+ return erofs_inode_set_aops(inode, inode, false);
}
/*
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index ec79e8b44d3b..15945e3308b8 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -179,6 +179,7 @@ struct erofs_sb_info {
#define EROFS_MOUNT_DAX_ALWAYS 0x00000040
#define EROFS_MOUNT_DAX_NEVER 0x00000080
#define EROFS_MOUNT_DIRECT_IO 0x00000100
+#define EROFS_MOUNT_INODE_SHARE 0x00000200
#define clear_opt(opt, option) ((opt)->mount_opt &= ~EROFS_MOUNT_##option)
#define set_opt(opt, option) ((opt)->mount_opt |= EROFS_MOUNT_##option)
@@ -269,6 +270,11 @@ static inline u64 erofs_nid_to_ino64(struct erofs_sb_info *sbi, erofs_nid_t nid)
/* default readahead size of directories */
#define EROFS_DIR_RA_BYTES 16384
+struct erofs_inode_fingerprint {
+ u8 *opaque;
+ int size;
+};
+
struct erofs_inode {
erofs_nid_t nid;
@@ -304,6 +310,18 @@ struct erofs_inode {
};
#endif /* CONFIG_EROFS_FS_ZIP */
};
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct list_head ishare_list;
+ union {
+ /* for each anon shared inode */
+ struct {
+ struct erofs_inode_fingerprint fingerprint;
+ spinlock_t ishare_lock;
+ };
+ /* for each real inode */
+ struct inode *sharedinode;
+ };
+#endif
/* the corresponding vfs inode */
struct inode vfs_inode;
};
@@ -410,6 +428,7 @@ extern const struct inode_operations erofs_dir_iops;
extern const struct file_operations erofs_file_fops;
extern const struct file_operations erofs_dir_fops;
+extern const struct file_operations erofs_ishare_fops;
extern const struct iomap_ops z_erofs_iomap_report_ops;
@@ -455,6 +474,32 @@ static inline void *erofs_vm_map_ram(struct page **pages, unsigned int count)
return NULL;
}
+static inline int erofs_inode_set_aops(struct inode *inode,
+ struct inode *realinode, bool no_fscache)
+{
+ if (erofs_inode_is_data_compressed(EROFS_I(realinode)->datalayout)) {
+#ifdef CONFIG_EROFS_FS_ZIP
+ DO_ONCE_LITE_IF(realinode->i_blkbits != PAGE_SHIFT,
+ erofs_info, realinode->i_sb,
+ "EXPERIMENTAL EROFS subpage compressed block support in use. Use at your own risk!");
+ inode->i_mapping->a_ops = &z_erofs_aops;
+#else
+ return -EOPNOTSUPP;
+#endif
+ } else {
+ inode->i_mapping->a_ops = &erofs_aops;
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+ if (!no_fscache && erofs_is_fscache_mode(realinode->i_sb))
+ inode->i_mapping->a_ops = &erofs_fscache_access_aops;
+#endif
+#ifdef CONFIG_EROFS_FS_BACKED_BY_FILE
+ if (erofs_is_fileio_mode(EROFS_SB(realinode->i_sb)))
+ inode->i_mapping->a_ops = &erofs_fileio_aops;
+#endif
+ }
+ return 0;
+}
+
int erofs_register_sysfs(struct super_block *sb);
void erofs_unregister_sysfs(struct super_block *sb);
int __init erofs_init_sysfs(void);
@@ -541,6 +586,18 @@ static inline struct bio *erofs_fscache_bio_alloc(struct erofs_map_dev *mdev) {
static inline void erofs_fscache_submit_bio(struct bio *bio) {}
#endif
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+int __init erofs_init_ishare(void);
+void erofs_exit_ishare(void);
+bool erofs_ishare_fill_inode(struct inode *inode);
+void erofs_ishare_free_inode(struct inode *inode);
+#else
+static inline int erofs_init_ishare(void) { return 0; }
+static inline void erofs_exit_ishare(void) {}
+static inline bool erofs_ishare_fill_inode(struct inode *inode) { return false; }
+static inline void erofs_ishare_free_inode(struct inode *inode) {}
+#endif
+
long erofs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
long erofs_compat_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg);
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
new file mode 100644
index 000000000000..6b710c935afb
--- /dev/null
+++ b/fs/erofs/ishare.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024, Alibaba Cloud
+ */
+#include <linux/xxhash.h>
+#include <linux/mount.h>
+#include "internal.h"
+#include "xattr.h"
+
+#include "../internal.h"
+
+static struct vfsmount *erofs_ishare_mnt;
+
+static int erofs_ishare_iget5_eq(struct inode *inode, void *data)
+{
+ struct erofs_inode_fingerprint *fp1 = &EROFS_I(inode)->fingerprint;
+ struct erofs_inode_fingerprint *fp2 = data;
+
+ return fp1->size == fp2->size &&
+ !memcmp(fp1->opaque, fp2->opaque, fp2->size);
+}
+
+static int erofs_ishare_iget5_set(struct inode *inode, void *data)
+{
+ struct erofs_inode *vi = EROFS_I(inode);
+
+ vi->fingerprint = *(struct erofs_inode_fingerprint *)data;
+ INIT_LIST_HEAD(&vi->ishare_list);
+ spin_lock_init(&vi->ishare_lock);
+ return 0;
+}
+
+bool erofs_ishare_fill_inode(struct inode *inode)
+{
+ struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);
+ struct erofs_inode *vi = EROFS_I(inode);
+ struct erofs_inode_fingerprint fp;
+ struct inode *sharedinode;
+ unsigned long hash;
+
+ if (erofs_xattr_fill_inode_fingerprint(&fp, inode, sbi->domain_id))
+ return false;
+ hash = xxh32(fp.opaque, fp.size, 0);
+ sharedinode = iget5_locked(erofs_ishare_mnt->mnt_sb, hash,
+ erofs_ishare_iget5_eq, erofs_ishare_iget5_set,
+ &fp);
+ if (!sharedinode) {
+ kfree(fp.opaque);
+ return false;
+ }
+
+ if (inode_state_read_once(sharedinode) & I_NEW) {
+ if (erofs_inode_set_aops(sharedinode, inode, true)) {
+ iget_failed(sharedinode);
+ kfree(fp.opaque);
+ return false;
+ }
+ sharedinode->i_mode = vi->vfs_inode.i_mode;
+ sharedinode->i_size = vi->vfs_inode.i_size;
+ unlock_new_inode(sharedinode);
+ } else {
+ kfree(fp.opaque);
+ }
+ vi->sharedinode = sharedinode;
+ INIT_LIST_HEAD(&vi->ishare_list);
+ spin_lock(&EROFS_I(sharedinode)->ishare_lock);
+ list_add(&vi->ishare_list, &EROFS_I(sharedinode)->ishare_list);
+ spin_unlock(&EROFS_I(sharedinode)->ishare_lock);
+ return true;
+}
+
+void erofs_ishare_free_inode(struct inode *inode)
+{
+ struct erofs_inode *vi = EROFS_I(inode);
+ struct inode *sharedinode = vi->sharedinode;
+
+ if (!sharedinode)
+ return;
+ spin_lock(&EROFS_I(sharedinode)->ishare_lock);
+ list_del(&vi->ishare_list);
+ spin_unlock(&EROFS_I(sharedinode)->ishare_lock);
+ iput(sharedinode);
+ vi->sharedinode = NULL;
+}
+
+static int erofs_ishare_file_open(struct inode *inode, struct file *file)
+{
+ struct inode *sharedinode = EROFS_I(inode)->sharedinode;
+ struct file *realfile;
+
+ if (file->f_flags & O_DIRECT)
+ return -EINVAL;
+ realfile = alloc_empty_backing_file(O_RDONLY|O_NOATIME, current_cred());
+ if (IS_ERR(realfile))
+ return PTR_ERR(realfile);
+ ihold(sharedinode);
+ realfile->f_op = &erofs_file_fops;
+ realfile->f_inode = sharedinode;
+ realfile->f_mapping = sharedinode->i_mapping;
+ path_get(&file->f_path);
+ backing_file_set_user_path(realfile, &file->f_path);
+
+ file_ra_state_init(&realfile->f_ra, file->f_mapping);
+ realfile->private_data = EROFS_I(inode);
+ file->private_data = realfile;
+ return 0;
+}
+
+static int erofs_ishare_file_release(struct inode *inode, struct file *file)
+{
+ struct file *realfile = file->private_data;
+
+ iput(realfile->f_inode);
+ fput(realfile);
+ file->private_data = NULL;
+ return 0;
+}
+
+static ssize_t erofs_ishare_file_read_iter(struct kiocb *iocb,
+ struct iov_iter *to)
+{
+ struct file *realfile = iocb->ki_filp->private_data;
+ struct kiocb dedup_iocb;
+ ssize_t nread;
+
+ if (!iov_iter_count(to))
+ return 0;
+ kiocb_clone(&dedup_iocb, iocb, realfile);
+ nread = filemap_read(&dedup_iocb, to, 0);
+ iocb->ki_pos = dedup_iocb.ki_pos;
+ return nread;
+}
+
+static int erofs_ishare_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file *realfile = file->private_data;
+
+ vma_set_file(vma, realfile);
+ return generic_file_readonly_mmap(file, vma);
+}
+
+const struct file_operations erofs_ishare_fops = {
+ .open = erofs_ishare_file_open,
+ .llseek = generic_file_llseek,
+ .read_iter = erofs_ishare_file_read_iter,
+ .mmap = erofs_ishare_mmap,
+ .release = erofs_ishare_file_release,
+ .get_unmapped_area = thp_get_unmapped_area,
+ .splice_read = filemap_splice_read,
+};
+
+int __init erofs_init_ishare(void)
+{
+ erofs_ishare_mnt = kern_mount(&erofs_anon_fs_type);
+ return PTR_ERR_OR_ZERO(erofs_ishare_mnt);
+}
+
+void erofs_exit_ishare(void)
+{
+ kern_unmount(erofs_ishare_mnt);
+}
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 960da62636ad..1f2b8732b29e 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -396,6 +396,7 @@ static void erofs_default_options(struct erofs_sb_info *sbi)
enum {
Opt_user_xattr, Opt_acl, Opt_cache_strategy, Opt_dax, Opt_dax_enum,
Opt_device, Opt_fsid, Opt_domain_id, Opt_directio, Opt_fsoffset,
+ Opt_inode_share,
};
static const struct constant_table erofs_param_cache_strategy[] = {
@@ -423,6 +424,7 @@ static const struct fs_parameter_spec erofs_fs_parameters[] = {
fsparam_string("domain_id", Opt_domain_id),
fsparam_flag_no("directio", Opt_directio),
fsparam_u64("fsoffset", Opt_fsoffset),
+ fsparam_flag("inode_share", Opt_inode_share),
{}
};
@@ -551,6 +553,13 @@ static int erofs_fc_parse_param(struct fs_context *fc,
case Opt_fsoffset:
sbi->dif0.fsoff = result.uint_64;
break;
+ case Opt_inode_share:
+#if defined(CONFIG_EROFS_FS_PAGE_CACHE_SHARE)
+ set_opt(&sbi->opt, INODE_SHARE);
+#else
+ errorfc(fc, "%s option not supported", erofs_fs_parameters[opt].name);
+#endif
+ break;
}
return 0;
}
@@ -649,6 +658,11 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_op = &erofs_sops;
+ if (test_opt(&sbi->opt, DAX_ALWAYS) && test_opt(&sbi->opt, INODE_SHARE)) {
+ errorfc(fc, "FSDAX is not allowed when inode_share is on");
+ return -EINVAL;
+ }
+
sbi->blkszbits = PAGE_SHIFT;
if (!sb->s_bdev) {
/*
@@ -719,6 +733,12 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
erofs_info(sb, "unsupported blocksize for DAX");
clear_opt(&sbi->opt, DAX_ALWAYS);
}
+ if (test_opt(&sbi->opt, INODE_SHARE) && !erofs_sb_has_ishare_xattrs(sbi)) {
+ erofs_info(sb, "on-disk ishare xattrs not found. Turning off inode_share.");
+ clear_opt(&sbi->opt, INODE_SHARE);
+ }
+ if (test_opt(&sbi->opt, INODE_SHARE))
+ erofs_info(sb, "EXPERIMENTAL EROFS page cache share support in use. Use at your own risk!");
sb->s_time_gran = 1;
sb->s_xattr = erofs_xattr_handlers;
@@ -948,10 +968,32 @@ static struct file_system_type erofs_fs_type = {
};
MODULE_ALIAS_FS("erofs");
-#if defined(CONFIG_EROFS_FS_ONDEMAND)
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_PAGE_CACHE_SHARE)
+static void erofs_free_anon_inode(struct inode *inode)
+{
+ struct erofs_inode *vi = EROFS_I(inode);
+
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ kfree(vi->fingerprint.opaque);
+#endif
+ kmem_cache_free(erofs_inode_cachep, vi);
+}
+
+static const struct super_operations erofs_anon_sops = {
+ .alloc_inode = erofs_alloc_inode,
+ .drop_inode = inode_just_drop,
+ .free_inode = erofs_free_anon_inode,
+};
+
static int erofs_anon_init_fs_context(struct fs_context *fc)
{
- return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
+ struct pseudo_fs_context *ctx;
+
+ ctx = init_pseudo(fc, EROFS_SUPER_MAGIC);
+ if (!ctx)
+ return -ENOMEM;
+ ctx->ops = &erofs_anon_sops;
+ return 0;
}
struct file_system_type erofs_anon_fs_type = {
@@ -986,6 +1028,10 @@ static int __init erofs_module_init(void)
if (err)
goto sysfs_err;
+ err = erofs_init_ishare();
+ if (err)
+ goto ishare_err;
+
err = register_filesystem(&erofs_fs_type);
if (err)
goto fs_err;
@@ -993,6 +1039,8 @@ static int __init erofs_module_init(void)
return 0;
fs_err:
+ erofs_exit_ishare();
+ishare_err:
erofs_exit_sysfs();
sysfs_err:
z_erofs_exit_subsystem();
@@ -1010,6 +1058,7 @@ static void __exit erofs_module_exit(void)
/* Ensure all RCU free inodes / pclusters are safe to be destroyed. */
rcu_barrier();
+ erofs_exit_ishare();
erofs_exit_sysfs();
z_erofs_exit_subsystem();
erofs_exit_shrinker();
@@ -1062,6 +1111,8 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",domain_id=%s", sbi->domain_id);
if (sbi->dif0.fsoff)
seq_printf(seq, ",fsoffset=%llu", sbi->dif0.fsoff);
+ if (test_opt(opt, INODE_SHARE))
+ seq_puts(seq, ",inode_share");
return 0;
}
@@ -1072,6 +1123,7 @@ static void erofs_evict_inode(struct inode *inode)
dax_break_layout_final(inode);
#endif
+ erofs_ishare_free_inode(inode);
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
}
diff --git a/fs/erofs/xattr.c b/fs/erofs/xattr.c
index ae61f20cb861..e1709059d3cc 100644
--- a/fs/erofs/xattr.c
+++ b/fs/erofs/xattr.c
@@ -577,3 +577,37 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu)
return acl;
}
#endif
+
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+int erofs_xattr_fill_inode_fingerprint(struct erofs_inode_fingerprint *fp,
+ struct inode *inode, const char *domain_id)
+{
+ struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);
+ struct erofs_xattr_prefix_item *prefix;
+ const char *infix;
+ int valuelen, base_index;
+
+ if (!test_opt(&sbi->opt, INODE_SHARE))
+ return -EOPNOTSUPP;
+ if (!sbi->xattr_prefixes)
+ return -EINVAL;
+ prefix = sbi->xattr_prefixes + sbi->ishare_xattr_prefix_id;
+ infix = prefix->prefix->infix;
+ base_index = prefix->prefix->base_index;
+ valuelen = erofs_getxattr(inode, base_index, infix, NULL, 0);
+ if (valuelen <= 0 || valuelen > (1 << sbi->blkszbits))
+ return -EFSCORRUPTED;
+ fp->size = valuelen + (domain_id ? strlen(domain_id) : 0);
+ fp->opaque = kmalloc(fp->size, GFP_KERNEL);
+ if (!fp->opaque)
+ return -ENOMEM;
+ if (valuelen != erofs_getxattr(inode, base_index, infix,
+ fp->opaque, valuelen)) {
+ kfree(fp->opaque);
+ fp->opaque = NULL;
+ return -EFSCORRUPTED;
+ }
+ memcpy(fp->opaque + valuelen, domain_id, fp->size - valuelen);
+ return 0;
+}
+#endif
diff --git a/fs/erofs/xattr.h b/fs/erofs/xattr.h
index 6317caa8413e..bf75a580b8f1 100644
--- a/fs/erofs/xattr.h
+++ b/fs/erofs/xattr.h
@@ -67,4 +67,7 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu);
#define erofs_get_acl (NULL)
#endif
+int erofs_xattr_fill_inode_fingerprint(struct erofs_inode_fingerprint *fp,
+ struct inode *inode, const char *domain_id);
+
#endif
--
2.22.0
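[Editor's note] The fingerprint assembly in erofs_xattr_fill_inode_fingerprint() above boils down to reading the xattr value and appending the optional domain_id. A minimal userspace sketch of that concatenation (function name and error handling are illustrative, not part of the patch):

```python
def build_fingerprint(xattr_value, domain_id=None):
    """Model of the kernel helper: the opaque fingerprint blob is just the
    raw xattr value with the optional domain_id string appended."""
    if not xattr_value:
        # the kernel returns -EFSCORRUPTED for a missing/empty xattr value
        raise ValueError("bad fingerprint xattr")
    suffix = domain_id.encode() if domain_id else b""
    return xattr_value + suffix
```

Appending the domain_id keeps fingerprints from different trust domains distinct even when the raw xattr values collide.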
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-16 9:55 ` [PATCH v15 5/9] erofs: introduce the page cache share feature Hongbo Li
@ 2026-01-16 15:46 ` Christoph Hellwig
2026-01-16 16:21 ` Gao Xiang
2026-01-20 12:29 ` Hongbo Li
2026-01-20 14:19 ` Gao Xiang
1 sibling, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:46 UTC (permalink / raw)
To: Hongbo Li
Cc: hsiangkao, chao, brauner, djwong, amir73il, hch, linux-fsdevel,
linux-erofs, linux-kernel
I don't really understand the fingerprint idea. Files with the
same content will point to the same physical disk blocks, so that
should be a much better indicator than a fingerprint? Also how does
the fingerprint guarantee uniqueness? Is it a cryptographically
secure hash? In here it just seems like an opaque blob.
> +static inline int erofs_inode_set_aops(struct inode *inode,
> + struct inode *realinode, bool no_fscache)
Factoring this out first would be a nice little prep patch.
Also it would probably be much cleaner using IS_ENABLED.
> +static int erofs_ishare_file_open(struct inode *inode, struct file *file)
> +{
> + struct inode *sharedinode = EROFS_I(inode)->sharedinode;
Ok, it looks like this allocates a separate backing file and inode.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-16 15:46 ` Christoph Hellwig
@ 2026-01-16 16:21 ` Gao Xiang
2026-01-19 7:29 ` Christoph Hellwig
2026-01-20 12:29 ` Hongbo Li
1 sibling, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-16 16:21 UTC (permalink / raw)
To: Christoph Hellwig, Hongbo Li
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel
Hi Christoph,
On 2026/1/16 23:46, Christoph Hellwig wrote:
> I don't really understand the fingerprint idea. Files with the
> same content will point to the same physical disk blocks, so that
> should be a much better indicator than a finger print? Also how does
Page cache sharing should apply to different EROFS
filesystem images on the same machine too, so the
physical disk block number idea cannot be applied
to this.
> the fingerprint guarantee uniqueness? Is it a cryptographically
> secure hash? In here it just seems like an opaque blob.
Yes, typically it can be a secure hash like sha256,
but it really depends on how users choose to use it.
This feature is enabled _only_ when a dedicated mount
option is used, and should be enabled by privileged
mounters; it's up to the privileged mounters to
guarantee the fingerprint is correct (usually guaranteed
by signatures from image builders, since images will be
signed).
Also, different fingerprints can be isolated by domain
ids, so that inodes under different domain ids are not shared.
>
>> +static inline int erofs_inode_set_aops(struct inode *inode,
>> + struct inode *realinode, bool no_fscache)
>
> Factoring this out first would be a nice little prep patch.
> Also it would probably be much cleaner using IS_ENABLED.
>
>> +static int erofs_ishare_file_open(struct inode *inode, struct file *file)
>> +{
>> + struct inode *sharedinode = EROFS_I(inode)->sharedinode;
>
> Ok, it looks like this allocates a separate backing file and inode.
Yes.
Thanks,
Gao Xiang
2026-01-16 16:21 ` Gao Xiang
@ 2026-01-19 7:29 ` Christoph Hellwig
2026-01-19 7:53 ` Gao Xiang
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-19 7:29 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, brauner, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel
On Sat, Jan 17, 2026 at 12:21:16AM +0800, Gao Xiang wrote:
> Hi Christoph,
>
> On 2026/1/16 23:46, Christoph Hellwig wrote:
>> I don't really understand the fingerprint idea. Files with the
>> same content will point to the same physical disk blocks, so that
>> should be a much better indicator than a finger print? Also how does
>
> Page cache sharing should apply to different EROFS
> filesystem images on the same machine too, so the
> physical disk block number idea cannot be applied
> to this.
Oh. That's kinda unexpected and adds another twist to the whole scheme.
So in that case the on-disk data actually is duplicated in each image
and then de-duplicated in memory only? Ewwww...
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 7:29 ` Christoph Hellwig
@ 2026-01-19 7:53 ` Gao Xiang
2026-01-19 8:12 ` Gao Xiang
2026-01-19 8:32 ` Christoph Hellwig
0 siblings, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 7:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/19 15:29, Christoph Hellwig wrote:
> On Sat, Jan 17, 2026 at 12:21:16AM +0800, Gao Xiang wrote:
>> Hi Christoph,
>>
>> On 2026/1/16 23:46, Christoph Hellwig wrote:
>>> I don't really understand the fingerprint idea. Files with the
>>> same content will point to the same physical disk blocks, so that
>>> should be a much better indicator than a finger print? Also how does
>>
>> Page cache sharing should apply to different EROFS
>> filesystem images on the same machine too, so the
>> physical disk block number idea cannot be applied
>> to this.
>
> Oh. That's kinda unexpected and adds another twist to the whole scheme.
> So in that case the on-disk data actually is duplicated in each image
> and then de-duplicated in memory only? Ewwww...
On-disk deduplication is decoupled from this feature:
- EROFS can share the same blocks in blobs (multiple
devices) among different images, so that on-disk data
can be shared by referring to the same blobs;
- On-disk data won't be deduplicated in an image; if
reflink is enabled for backing fses, userspace mounters
can trigger background GCs to deduplicate the identical
blocks.
I just tried to say that EROFS doesn't limit what
the real meaning of `fingerprint` is (it can be a
serialized integer number defined by a specific image
publisher, for example, or a specific secure hash;
currently, "mkfs.erofs" generates a sha256 for each
file), but leaves that to the image builders:
1) if `fingerprint` is distributed as on-disk part of
signed images, as I said, it could be shared within a
trusted domain_id (usually the same image builder) --
that is the top priority thing using dmverity;
Or
2) If `fingerprint` is not distributed in the image
or images are untrusted (e.g. unknown signatures),
image fetchers can scan each inode in the golden
images to generate an extra minimal EROFS
metadata-only image with local calculated
`fingerprint` too, which is much similar to the
current ostree way (parse remote files and calculate
digests).
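[Editor's note] For case 2) above, an image fetcher would compute per-file digests itself, much as mkfs.erofs does with sha256. A hedged sketch of that step (the helper name is illustrative; only the use of sha256 comes from the thread):

```python
import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    """Hash a file's contents in fixed-size chunks, the way an image
    builder might compute a per-file fingerprint (mkfs.erofs uses sha256)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()  # 32 raw bytes, stored as the fingerprint xattr value
```

Two files with identical contents then yield identical fingerprints, which is the property the page cache sharing lookup relies on.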
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 7:53 ` Gao Xiang
@ 2026-01-19 8:12 ` Gao Xiang
2026-01-19 8:32 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 8:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/19 15:53, Gao Xiang wrote:
>
>
> On 2026/1/19 15:29, Christoph Hellwig wrote:
>> On Sat, Jan 17, 2026 at 12:21:16AM +0800, Gao Xiang wrote:
>>> Hi Christoph,
>>>
>>> On 2026/1/16 23:46, Christoph Hellwig wrote:
>>>> I don't really understand the fingerprint idea. Files with the
>>>> same content will point to the same physical disk blocks, so that
>>>> should be a much better indicator than a finger print? Also how does
>>>
>>> Page cache sharing should apply to different EROFS
>>> filesystem images on the same machine too, so the
>>> physical disk block number idea cannot be applied
>>> to this.
>>
>> Oh. That's kinda unexpected and adds another twist to the whole scheme.
>> So in that case the on-disk data actually is duplicated in each image
>> and then de-duplicated in memory only? Ewwww...
>
> On-disk deduplication is decoupled from this feature:
Of course, first of all:
- Data within a single EROFS image is deduplicated
(for example, erofs supports extent-based
chunks);
>
> - EROFS can share the same blocks in blobs (multiple
> devices) among different images, so that on-disk data
This way is like docker layers: common data/layers
can be kept in separate blobs;
> can be shared by refering the same blobs;
Both deduplication ways above will be applied to the
golden images which will be transferred on the wire.
>
> - On-disk data won't be deduplicated in image if reflink
> is enabled for backing fses, userspace mounters can
> trigger background GCs to deduplicate the identical
> blocks.
And this way is applied at runtime if the underlying
fs supports reflink.
>
> I just tried to say EROFS doesn't limit what's
> the real meaning of `fingerprint` (they can be serialized
> integer numbers for example defined by a specific image
> publisher, or a specific secure hash. Currently,
> "mkfs.erofs" will generate sha256 for each files), but
> left them to the image builders:
>
>
> 1) if `fingerprint` is distributed as on-disk part of
> signed images, as I said, it could be shared within a
> trusted domain_id (usually the same image builder) --
> that is the top priority thing using dmverity;
>
> Or
>
> 2) If `fingerprint` is not distributed in the image
> or images are untrusted (e.g. unknown signatures),
> image fetchers can scan each inode in the golden
> images to generate an extra minimal EROFS
> metadata-only image with local calculated
> `fingerprint` too, which is much similar to the
> current ostree way (parse remote files and calculate
> digests).
>
> Thanks,
> Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 7:53 ` Gao Xiang
2026-01-19 8:12 ` Gao Xiang
@ 2026-01-19 8:32 ` Christoph Hellwig
2026-01-19 8:52 ` Gao Xiang
1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-19 8:32 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, brauner, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel
On Mon, Jan 19, 2026 at 03:53:21PM +0800, Gao Xiang wrote:
> I just tried to say EROFS doesn't limit what's
> the real meaning of `fingerprint` (they can be serialized
> integer numbers for example defined by a specific image
> publisher, or a specific secure hash. Currently,
> "mkfs.erofs" will generate sha256 for each files), but
> left them to the image builders:
To me this sounds pretty scary, as we have code in the kernel's trust
domain that heavily depends on arbitrary userspace policy decisions.
Similarly the sharing of blocks between different file system
instances opens a lot of questions about trust boundaries and life
time rules. I don't really have good answers, but writing up the
lifetime and threat models would really help.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 8:32 ` Christoph Hellwig
@ 2026-01-19 8:52 ` Gao Xiang
2026-01-19 9:22 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 8:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/19 16:32, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 03:53:21PM +0800, Gao Xiang wrote:
>> I just tried to say EROFS doesn't limit what's
>> the real meaning of `fingerprint` (they can be serialized
>> integer numbers for example defined by a specific image
>> publisher, or a specific secure hash. Currently,
>> "mkfs.erofs" will generate sha256 for each files), but
>> left them to the image builders:
>
> To me this sounds pretty scary, as we have code in the kernel's trust
> domain that heavily depends on arbitrary userspace policy decisions.
For example, overlayfs metacopy can also point to
arbitrary files; what's the difference between them?
https://docs.kernel.org/filesystems/overlayfs.html#metadata-only-copy-up
By using metacopy, overlayfs can access arbitrary files
as long as the metacopy has the pointer, so it should
be privileged, which is similar to this feature.
>
> Similarly the sharing of blocks between different file system
> instances opens a lot of questions about trust boundaries and life
> time rules. I don't really have good answers, but writing up the
Could you give more details about these? You
raised the questions, but I have no idea where the
threats really come from.
As for the lifetime: the blobs themselves are immutable
files, so what do the lifetime rules mean?
And how do you define trust boundaries? You mean users
have no right to access the data?
I think it's similar: for blockdevice-based filesystems,
you mount the filesystem with a given source, and the
mounter should have permission to access it.
For multiple-blob EROFS filesystems, you mount the
filesystem with multiple data sources, and the mounters
should have permission to access the block devices
and/or backing files too.
I don't quite get the point.
Thanks,
Gao Xiang
> lifetime and threat models would really help.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 8:52 ` Gao Xiang
@ 2026-01-19 9:22 ` Christoph Hellwig
2026-01-19 9:38 ` Gao Xiang
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-19 9:22 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, brauner, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel
On Mon, Jan 19, 2026 at 04:52:54PM +0800, Gao Xiang wrote:
>> To me this sounds pretty scary, as we have code in the kernel's trust
>> domain that heavily depends on arbitrary userspace policy decisions.
>
> For example, overlayfs metacopy can also points to
> arbitary files, what's the difference between them?
> https://docs.kernel.org/filesystems/overlayfs.html#metadata-only-copy-up
>
> By using metacopy, overlayfs can access arbitary files
> as long as the metacopy has the pointer, so it should
> be a priviledged stuff, which is similar to this feature.
Sounds scary too. But overlayfs' job is to combine underlying files, so
it is expected. I think it's the mix of erofs being a disk-based file
system, and reaching out beyond the device(s) assigned to the file system
instance, that makes me feel rather uneasy.
>>
>> Similarly the sharing of blocks between different file system
>> instances opens a lot of questions about trust boundaries and life
>> time rules. I don't really have good answers, but writing up the
>
> Could you give more details about the these? Since you
> raised the questions but I have no idea what the threats
> really come from.
Right now by default we don't allow any unprivileged mounts. Now
if people think that, say, erofs is safe enough and opt into that,
it needs to be clear what the boundaries of that are. For a file
system limited to a single block device, the boundaries are
pretty clear. For file systems reaching out to the entire system
(or some kind of domain), the scope is much wider.
> As for the lifetime: The blob itself are immutable files,
> what the lifetime rules means?
What happens if the blob gets removed, intentionally or accidentally?
> And how do you define trust boundaries? You mean users
> have no right to access the data?
>
> I think it's similar: for blockdevice-based filesystems,
> you mount the filesystem with a given source, and it
> should have permission to the mounter.
Yes.
> For multiple-blob EROFS filesystems, you mount the
> filesystem with multiple data sources, and the blockdevices
> and/or backed files should have permission to the
> mounters too.
And what prevents others from modifying them, or from sneaking
in unexpected data, including unexpected comparison blobs?
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 9:22 ` Christoph Hellwig
@ 2026-01-19 9:38 ` Gao Xiang
2026-01-19 9:53 ` Gao Xiang
2026-01-20 3:07 ` Gao Xiang
0 siblings, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 9:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/19 17:22, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 04:52:54PM +0800, Gao Xiang wrote:
>>> To me this sounds pretty scary, as we have code in the kernel's trust
>>> domain that heavily depends on arbitrary userspace policy decisions.
>>
>> For example, overlayfs metacopy can also points to
>> arbitary files, what's the difference between them?
>> https://docs.kernel.org/filesystems/overlayfs.html#metadata-only-copy-up
>>
>> By using metacopy, overlayfs can access arbitary files
>> as long as the metacopy has the pointer, so it should
>> be a priviledged stuff, which is similar to this feature.
>
> Sounds scary too. But overlayfs' job is to combine underlying files, so
> it is expected. I think it's the mix of erofs being a disk based file
But you could still point to arbitrary page cache
if metacopy is used.
> system, and reaching out beyond the device(s) assigned to the file system
> instance that makes me feel rather uneasy.
You mean the page cache can be shared from other
filesystems even when not backed by these devices/files?
I admit yes, it could be different: but that
is why the new "inode_share" and "domain_id"
mount options are used.
I think they should be regarded as a single super
filesystem if "domain_id" is the same: from the
security perspective, much like subvolumes of
a single super filesystem.
And mounting a new filesystem within a "domain_id"
can be regarded as importing data into the super
"domain_id" filesystem, and I think only trusted
data within the single domain can be mounted/shared.
>
>>>
>>> Similarly the sharing of blocks between different file system
>>> instances opens a lot of questions about trust boundaries and life
>>> time rules. I don't really have good answers, but writing up the
>>
>> Could you give more details about the these? Since you
>> raised the questions but I have no idea what the threats
>> really come from.
>
> Right now by default we don't allow any unprivileged mounts. Now
> if people thing that say erofs is safe enough and opt into that,
> it needs to be clear what the boundaries of that are. For a file
> system limited to a single block device that boundaries are
> pretty clear. For file systems reaching out to the entire system
> (or some kind of domain), the scope is much wider.
Why would multiple devices differ for an immutable
fs? No filesystem instance can change the primary or
external devices/blobs. All data is immutable.
>
>> As for the lifetime: The blob itself are immutable files,
>> what the lifetime rules means?
>
> What happens if the blob gets removed, intentionally or accidentally?
The extra device/blob references are held during
the whole mount lifetime, much like the primary
(block) device.
And EROFS is an immutable filesystem, so
inner blocks within a blob won't go away
due to some fs instance either.
>
>> And how do you define trust boundaries? You mean users
>> have no right to access the data?
>>
>> I think it's similar: for blockdevice-based filesystems,
>> you mount the filesystem with a given source, and it
>> should have permission to the mounter.
>
> Yes.
>
>> For multiple-blob EROFS filesystems, you mount the
>> filesystem with multiple data sources, and the blockdevices
>> and/or backed files should have permission to the
>> mounters too.
>
> And what prevents other from modifying them, or sneaking
> unexpected data including unexpected comparison blobs in?
I don't think it's different from filesystems with a
single device.
First, EROFS instances never modify any underlying
devices/blobs:
If you say some other program modifies the device data: yes,
it can be changed externally, but I think it's just like
trusted FUSE daemons; an untrusted FUSE daemon can return
arbitrary (meta)data at random times too.
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 9:38 ` Gao Xiang
@ 2026-01-19 9:53 ` Gao Xiang
2026-01-20 3:07 ` Gao Xiang
1 sibling, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-19 9:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/19 17:38, Gao Xiang wrote:
>
>
> On 2026/1/19 17:22, Christoph Hellwig wrote:
>> On Mon, Jan 19, 2026 at 04:52:54PM +0800, Gao Xiang wrote:
>>>> To me this sounds pretty scary, as we have code in the kernel's trust
>>>> domain that heavily depends on arbitrary userspace policy decisions.
>>>
>>> For example, overlayfs metacopy can also points to
>>> arbitary files, what's the difference between them?
>>> https://docs.kernel.org/filesystems/overlayfs.html#metadata-only-copy-up
>>>
>>> By using metacopy, overlayfs can access arbitary files
>>> as long as the metacopy has the pointer, so it should
>>> be a priviledged stuff, which is similar to this feature.
>>
>> Sounds scary too. But overlayfs' job is to combine underlying files, so
>> it is expected. I think it's the mix of erofs being a disk based file
>
> But you still could point to an arbitary page cache
> if metacopy is used.
>
>> system, and reaching out beyond the device(s) assigned to the file system
>> instance that makes me feel rather uneasy.
>
> You mean the page cache can be shared from other
> filesystems even not backed by these devices/files?
>
> I admitted yes, there could be different: but that
> is why new mount options "inode_share" and the
> "domain_id" mount option are used.
>
> I think they should be regarded as a single super
> filesystem if "domain_id" is the same: From the
> security perspective much like subvolumes of
> a single super filesystem.
>
> And mounting a new filesystem within a "domain_id"
> can be regard as importing data into the super
> "domain_id" filesystem, and I think only trusted
> data within the single domain can be mounted/shared.
>
>>
>>>>
>>>> Similarly the sharing of blocks between different file system
>>>> instances opens a lot of questions about trust boundaries and life
>>>> time rules. I don't really have good answers, but writing up the
>>>
>>> Could you give more details about the these? Since you
>>> raised the questions but I have no idea what the threats
>>> really come from.
>>
>> Right now by default we don't allow any unprivileged mounts. Now
>> if people thing that say erofs is safe enough and opt into that,
>> it needs to be clear what the boundaries of that are. For a file
>> system limited to a single block device that boundaries are
>> pretty clear. For file systems reaching out to the entire system
>> (or some kind of domain), the scope is much wider.
btw, I think it would indeed be helpful to nail down the boundaries
(even from on-disk formats and runtime features).
But I have to clarify that a single EROFS filesystem instance won't
have access to random block devices or files.
The backing device or files are specified by users explicitly when
mounting, like:
mount -odevice=blob1,device=blob2,...,device=blobn-1 blob0 mnt
And these devices / files are all opened at mount time;
no more than that.
May I ask the difference between one device/file and a group of
given devices/files? Especially for immutable usage.
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-19 9:38 ` Gao Xiang
2026-01-19 9:53 ` Gao Xiang
@ 2026-01-20 3:07 ` Gao Xiang
2026-01-20 6:52 ` Christoph Hellwig
1 sibling, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-20 3:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Linus Torvalds, Christian Brauner, oliver.yang
Hi Christoph,
Sorry I didn't phrase things clearly earlier, but I'd still
like to explain the whole idea, as this feature is clearly
useful for containerization. I hope we can reach agreement
on the page cache sharing feature: Christian agreed on this
feature (and I hope still):
https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
First, let's separate this feature from mounting in user
namespaces (i.e., unprivileged mounts), because this feature
is designed specifically for privileged mounts.
The EROFS page cache sharing feature stems from a current
limitation in the page cache: a file-based folio cannot be
shared across different inode mappings (or across different
page indices within the same mapping; if this limitation
were resolved, we could implement a finer-grained page
cache sharing mechanism at the folio level). As you may
know, this patchset dates back to 2023, and as of 2026, I
still see no indication that the page cache infra will
change.
So let's face the reality: this feature introduces
on-disk xattrs called "fingerprints". Since they're
just xattrs, the EROFS on-disk format remains unchanged.
A new compat feature bit in the superblock indicates
whether an EROFS image contains such xattrs.
=====
In short: no on-disk format changes are required for
page cache sharing -- only xattrs attached to inodes
in the EROFS image.
Even if finer-grained page cache sharing is implemented
many years later, existing images will remain
compatible, as we can simply ignore those xattrs.
=====
At runtime, the feature is explicitly enabled via a new
mount option: `inode_share`, which is intended only for
privileged mounters. A `domain_id` must also be specified
to define a trusted domain. This means:
- For regular EROFS mounts (without `inode_share`;
default), no page cache sharing happens for those
images;
- For mounts with `inode_share`, page cache sharing is
allowed only among mounts with the same `domain_id`.
The `domain_id` can be thought of as defining a federated
super-filesystem: data for a unique "fingerprint" (e.g.,
a secure hash or UUID) may come from any of the
participating filesystems, but there is only one shared
page cache copy.
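[Editor's note] The sharing rule described here can be sketched as a lookup keyed by (domain_id, fingerprint). All names below are illustrative, not the kernel implementation:

```python
class SharedInodeCache:
    """Toy model of the sharing rule: one shared backing inode per
    (domain_id, fingerprint) pair. Mounts with different domain_ids
    never share, mirroring the isolation described above."""

    def __init__(self):
        self._shared = {}

    def lookup_or_create(self, domain_id, fingerprint):
        key = (domain_id, fingerprint)
        if key not in self._shared:
            # stand-in for allocating a shared anonymous backing inode
            self._shared[key] = object()
        return self._shared[key]
```

Two mounts in the same domain presenting the same fingerprint get the same backing object (hence one page cache copy); a different domain_id always yields a separate one.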
EROFS is an immutable, image-based golden filesystem: its
(meta)data is generated entirely in userspace. I consider
it a special class of disk filesystem, so traditional
assumptions about generic read-write filesystems don't
always apply; and an image filesystem (especially for
containers) can also have unique features, driven by image
use cases, compared with typical local filesystems.
As for unprivileged mounts, that is another story (clearly
there are different features, at least at runtime). First,
I think no one argues about whether mounting in userspace
is useful for containers: I do agree it should have a formal
written threat model in advance. While I'm not a security
expert per se, I'll draft one later separately.
My rough thoughts are:
- Let's not focus entirely on random human bugs,
because every practical subsystem has bugs;
the whole threat model focuses on the system design, and
less code doesn't by itself mean anything (it can be buggy
or even have a system design flaw)
- EROFS only accesses the (meta)data from the source blobs
specified at mount time, even with multi-device support:
mount -t erofs -odevice=[blob],device=[blob],... [source]
An EROFS mount instance never accesses data beyond those
blobs. Moreover, EROFS holds reference counts on these
blobs for the entire lifetime of the mounted filesystem
(so even if a blob is deleted, it remains accessible as
an orphan/deleted inode).
- As a strictly immutable filesystem, EROFS never writes to
underlying blobs/devices, and thus by design avoids the
complicated space allocation, deallocation, reverse mapping,
or journaling/writeback consistency issues of writable
filesystems like ext4, XFS, or BTRFS. However, that doesn't
mean EROFS cannot withstand random (meta)data changes from
external users modifying blobs directly.
- External users can modify underlying blobs/devices only when
they have permission to those blobs/devices, so there is no
privilege escalation risk; so I think "sneaking in
unexpected data" isn't meaningful here -- you need proper
permissions to alter the source blobs;
So then the only question is whether EROFS's on-disk design
can safely handle arbitrary (even fuzzed) external
modifications. I believe it can, because EROFS doesn't
have any redundant metadata for space allocation,
reverse mapping, or journaling like EXT4, XFS, or BTRFS.
Thus, it avoids the kinds of severe inconsistency bugs
seen in generic read-write filesystems; if you claim
corruption or inconsistency, you should define the
corruption first. Almost all severe inconsistency issues
elsewhere cannot even arise as inconsistencies in the
EROFS on-disk design itself; also see:
- Of course, unprivileged kernel EROFS mounts should start
from a minimal core on-disk format, typically the following:
https://erofs.docs.kernel.org/en/latest/core_ondisk.html
I'll clarify this together with the full security model
later if this feature really gets developed;
- In the end, I don't think various wild non-technical
assumptions make any sense in forming the correct design
of unprivileged mounts. If a real security threat exists, it
should first have a potential attack path written in words
(even in theory), but I can't identify any practical one
based on the design in my mind.
All in all, I'm open to hearing and discussing any potential
threat or valid argument and finding the final answers, but
I do think we should keep the discussion technical rather
than purely about policy as in the previous related threads.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 3:07 ` Gao Xiang
@ 2026-01-20 6:52 ` Christoph Hellwig
2026-01-20 7:19 ` Gao Xiang
2026-01-20 13:40 ` Christian Brauner
0 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-20 6:52 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel, Linus Torvalds,
Christian Brauner, oliver.yang
On Tue, Jan 20, 2026 at 11:07:48AM +0800, Gao Xiang wrote:
>
> Hi Christoph,
>
> Sorry I didn't phrase things clearly earlier, but I'd still
> like to explain the whole idea, as this feature is clearly
> useful for containerization. I hope we can reach agreement
> on the page cache sharing feature: Christian agreed on this
> feature (and I hope still):
>
> https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
He has to ultimately decide. I do have an uneasy feeling about this.
It's not super informed, only as much as I can keep up, and I'm not the
one in charge, but I hope it is helpful to share my perspective.
> First, let's separate this feature from mounting in user
> namespaces (i.e., unprivileged mounts), because this feature
> is designed specifically for privileged mounts.
Ok.
> The EROFS page cache sharing feature stems from a current
> limitation in the page cache: a file-based folio cannot be
> shared across different inode mappings (or the different
> page index within the same mapping; If this limitation
> were resolved, we could implement a finer-grained page
> cache sharing mechanism at the folio level). As you may
> know, this patchset dates back to 2023,
I didn't..
> and as of 2026; I
> still see no indication that the page cache infra will
> change.
It will be very hard to change unless we move to physical indexing of
the page cache, which has all kinds of downsides.
> So let's face reality: this feature introduces
> on-disk xattrs called "fingerprints." --- Since they're
> just xattrs, the EROFS on-disk format remains unchanged.
I think the concept of using a backing file of some sort for the shared
pagecache (which I have no problem with at all), vs the imprecise
selection through a free-form fingerprint, are quite different aspects
that could easily be separated. I.e., one could easily imagine using
the data path approach based purely on exact file system metadata.
But that would of course not work with multiple images, which I think
is a key feature here if I'm reading between the lines correctly.
> - Let's not focus entirely on random human bugs,
> because I think every practical subsystem has bugs;
> the whole threat model focuses on the system design, and
> less code doesn't mean anything (it can still be buggy
> or even have a system design flaw)
Yes, threats through malicious actors are much more interesting
here.
> - EROFS only accesses the (meta)data from the source blobs
> specified at mount time, even with multi-device support:
>
> mount -t erofs -odevice=[blob],device=[blob],... [source]
That is an important part that wasn't fully clear to me.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 6:52 ` Christoph Hellwig
@ 2026-01-20 7:19 ` Gao Xiang
2026-01-22 8:33 ` Christoph Hellwig
2026-01-20 13:40 ` Christian Brauner
1 sibling, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-20 7:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Linus Torvalds, Christian Brauner, oliver.yang
Hi,
Thanks for the reply.
On 2026/1/20 14:52, Christoph Hellwig wrote:
> On Tue, Jan 20, 2026 at 11:07:48AM +0800, Gao Xiang wrote:
>>
>> Hi Christoph,
>>
>> Sorry I didn't phrase things clearly earlier, but I'd still
>> like to explain the whole idea, as this feature is clearly
>> useful for containerization. I hope we can reach agreement
>> on the page cache sharing feature: Christian agreed on this
>> feature (and I hope still):
>>
>> https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
>
> He has to ultimately decide. I do have an uneasy feeling about this.
> It's not super informed, only as much as I can keep up, and I'm not the
> one in charge, but I hope it is helpful to share my perspective.
>
>> First, let's separate this feature from mounting in user
>> namespaces (i.e., unprivileged mounts), because this feature
>> is designed specifically for privileged mounts.
>
> Ok.
>
>> The EROFS page cache sharing feature stems from a current
>> limitation in the page cache: a file-based folio cannot be
>> shared across different inode mappings (or the different
>> page index within the same mapping; If this limitation
>> were resolved, we could implement a finer-grained page
>> cache sharing mechanism at the folio level). As you may
>> know, this patchset dates back to 2023,
>
> I didn't..
>
>> and as of 2026; I
>> still see no indication that the page cache infra will
>> change.
>
> It will be very hard to change unless we move to physical indexing of
> the page cache, which has all kinds of downsides.
I'm not sure if it's really needed: I think the final
folio adaptation plan is that folios can be dynamically
allocated? Then why not keep multiple folios for one piece
of physical memory, since folios are not order-0 anymore.
Using physical indexing sounds really inflexible on my
side, and it can even be regarded as a regression for me.
>
>> So let's face reality: this feature introduces
>> on-disk xattrs called "fingerprints." --- Since they're
>> just xattrs, the EROFS on-disk format remains unchanged.
>
> I think the concept of using a backing file of some sort for the shared
> pagecache (which I have no problem with at all), vs the imprecise
In that way (actually Jingbo worked on that approach in 2023),
we would have to keep the shared data physically contiguous and
even uncompressed, which cannot work for most cases.
On the other side, I do think the `fingerprint` design
is much like persistent NFS file handles in some aspects
(I don't want to equate it to that concept, but it's
very similar) for a single trusted domain: we have to
deal with multiple filesystem sources and mark them in a
unique way within a domain.
> selection through a free-form fingerprint, are quite different aspects
> that could easily be separated. I.e., one could easily imagine using
> the data path approach based purely on exact file system metadata.
> But that would of course not work with multiple images, which I think
> is a key feature here if I'm reading between the lines correctly.
EROFS works as golden immutable images; in particular,
remote filesystem images can and will only be used without
any modification.
So we have to deal with multiple filesystems on the same
machine; otherwise, _hardlinks_ within a single filesystem
could resolve most page cache sharing issues, but that is
not our intention.
>
>> - Let's not focus entirely on random human bugs,
>> because I think every practical subsystem has bugs;
>> the whole threat model focuses on the system design, and
>> less code doesn't mean anything (it can still be buggy
>> or even have a system design flaw)
>
> Yes, threats through malicious actors are much more interesting
> here.
Yes, otherwise we fall into endless, meaningless Rust and
code-line comparisons without any useful discussion of the
real system design.
>
>> - EROFS only accesses the (meta)data from the source blobs
>> specified at mount time, even with multi-device support:
>>
>> mount -t erofs -odevice=[blob],device=[blob],... [source]
>
> That is an important part that wasn't fully clear to me.
Okay,
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 7:19 ` Gao Xiang
@ 2026-01-22 8:33 ` Christoph Hellwig
2026-01-22 8:40 ` Gao Xiang
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-22 8:33 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel, Linus Torvalds,
Christian Brauner, oliver.yang
On Tue, Jan 20, 2026 at 03:19:21PM +0800, Gao Xiang wrote:
>> It will be very hard to change unless we move to physical indexing of
>> the page cache, which has all kinds of downside.s
>
> I'm not sure if it's really needed: I think the final
> folio adaptation plan is that folios can be dynamically
> allocated? Then why not keep multiple folios for one piece
> of physical memory, since folios are not order-0 anymore.
Having multiple folios for the same piece of memory can't work,
as we'd have unsynchronized state.
> Using physical indexing sounds really inflexible on my
> side, and it can even be regarded as a regression for me.
I'm absolutely not arguing for that..
>>> So let's face reality: this feature introduces
>>> on-disk xattrs called "fingerprints." --- Since they're
>>> just xattrs, the EROFS on-disk format remains unchanged.
>>
>> I think the concept of using a backing file of some sort for the shared
>> pagecache (which I have no problem with at all), vs the imprecise
>
> In that way (actually Jingbo worked on that approach in 2023),
> we would have to keep the shared data physically contiguous and
> even uncompressed, which cannot work for most cases.
Why does that matter?
> On the other side, I do think the `fingerprint` design
> is much like persistent NFS file handles in some aspects
> (I don't want to equate it to that concept, but it's
> very similar) for a single trusted domain: we have to
> deal with multiple filesystem sources and mark them in a
> unique way within a domain.
I don't really think they are similar in any way.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-22 8:33 ` Christoph Hellwig
@ 2026-01-22 8:40 ` Gao Xiang
2026-01-23 5:39 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-22 8:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Linus Torvalds, Christian Brauner, oliver.yang
On 2026/1/22 16:33, Christoph Hellwig wrote:
> On Tue, Jan 20, 2026 at 03:19:21PM +0800, Gao Xiang wrote:
>>> It will be very hard to change unless we move to physical indexing of
>>> the page cache, which has all kinds of downsides.
>>
>> I'm not sure if it's really needed: I think the final
>> folio adaptation plan is that folios can be dynamically
>> allocated? Then why not keep multiple folios for one piece
>> of physical memory, since folios are not order-0 anymore.
>
> Having multiple folios for the same piece of memory can't work,
> as we'd have unsynchronized state.
Why not just keep the shared state in a unique place,
with just the mapping + indexing kept separate?
Anyway, that is just a wild thought; I will not dig
into it.
>
>> Using physical indexing sounds really inflexible on my
>> side, and it can even be regarded as a regression for me.
>
> I'm absolutely not arguing for that..
>
>>>> So let's face reality: this feature introduces
>>>> on-disk xattrs called "fingerprints." --- Since they're
>>>> just xattrs, the EROFS on-disk format remains unchanged.
>>>
>>> I think the concept of using a backing file of some sort for the shared
>>> pagecache (which I have no problem with at all), vs the imprecise
>>
>> In that way (actually Jingbo worked on that approach in 2023),
>> we would have to keep the shared data physically contiguous and
>> even uncompressed, which cannot work for most cases.
>
> Why does that matter?
Sorry then, I think I don't get the point, but we really
need this for complete page cache sharing on a
single physical machine.
>
>> On the other side, I do think the `fingerprint` design
>> is much like persistent NFS file handles in some aspects
>> (I don't want to equate it to that concept, but it's
>> very similar) for a single trusted domain: we have to
>> deal with multiple filesystem sources and mark them in a
>> unique way within a domain.
>
> I don't really think they are similar in any way.
Why are they not similar? You still need persistent IDs
in inodes across multiple fses; if there were
content-addressable immutable filesystems working on
inodes, they could just use inode hashes as file handles
instead of inode numbers + generations.
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-22 8:40 ` Gao Xiang
@ 2026-01-23 5:39 ` Christoph Hellwig
2026-01-23 5:58 ` Gao Xiang
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-23 5:39 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Hongbo Li, chao, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel, Linus Torvalds,
Christian Brauner, oliver.yang
On Thu, Jan 22, 2026 at 04:40:56PM +0800, Gao Xiang wrote:
>> Having multiple folios for the same piece of memory can't work,
>> as we'd have unsynchronized state.
>
> Why not just keep the shared state in a unique place,
> with just the mapping + indexing kept separate?
That would not just require allocating the folios dynamically, but most
importantly splitting it up. We'd then also need to find a way to chain
the folio_link structures from the main folio. I'm not going to say this
might not happen, but it feels very far out there and might have all
kinds of issues.
>>>> I think the concept of using a backing file of some sort for the shared
>>>> pagecache (which I have no problem with at all), vs the imprecise
>>>
>>> In that way (actually Jingbo worked on that approach in 2023),
>>> we would have to keep the shared data physically contiguous and
>>> even uncompressed, which cannot work for most cases.
>>
>> Why does that matter?
>
> Sorry then, I think I don't get the point, but we really
> need this for complete page cache sharing on a
> single physical machine.
Why do you need physically contiguous space to share it that way?
>>
>>> On the other side, I do think the `fingerprint` design
>>> is much like persistent NFS file handles in some aspects
>>> (I don't want to equate it to that concept, but it's
>>> very similar) for a single trusted domain: we have to
>>> deal with multiple filesystem sources and mark them in a
>>> unique way within a domain.
>>
>> I don't really think they are similar in any way.
>
> Why are they not similar? You still need persistent IDs
> in inodes across multiple fses; if there were
> content-addressable immutable filesystems working on
> inodes, they could just use inode hashes as file handles
> instead of inode numbers + generations.
Sure, if they are well defined, cryptographically secure hashes. But
that's different from file handles, which don't address content at all,
but are just a handle to a given file that bypasses the path lookup.
>
> Thanks,
> Gao Xiang
---end quoted text---
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-23 5:39 ` Christoph Hellwig
@ 2026-01-23 5:58 ` Gao Xiang
0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-23 5:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hongbo Li, chao, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Linus Torvalds, Christian Brauner, oliver.yang
On 2026/1/23 13:39, Christoph Hellwig wrote:
> On Thu, Jan 22, 2026 at 04:40:56PM +0800, Gao Xiang wrote:
>>> Having multiple folios for the same piece of memory can't work,
>>> as we'd have unsynchronized state.
>>
>> Why not just keep the shared state in a unique place,
>> with just the mapping + indexing kept separate?
>
> That would not just require allocating the folios dynamically, but most
> importantly splitting it up. We'd then also need to find a way to chain
> the folio_link structures from the main folio. I'm not going to say this
> might not happen, but it feels very far out there and might have all
> kinds of issues.
I can see the way, but at least I don't have the resources,
and I'm not even sure it will happen in the foreseeable
future, so that is why we will not wait for per-folio
sharing anymore (memory is already becoming $$$$$$..).
>
>>>>> I think the concept of using a backing file of some sort for the shared
>>>>> pagecache (which I have no problem with at all), vs the imprecise
>>>>
>>>> In that way (actually Jingbo worked on that approach in 2023),
>>>> we would have to keep the shared data physically contiguous and
>>>> even uncompressed, which cannot work for most cases.
>>>
>>> Why does that matter?
>>
>> Sorry then, I think I don't get the point, but we really
>> need this for complete page cache sharing on a
>> single physical machine.
>
> Why do you need physically contiguous space to share it that way?
Yes, it won't be necessary, but the main goal is to share
various different filesystem images with consensus per-inode
content-addressable IDs, either secure hashes or per-inode
UUIDs. I still think it's very useful, considering that
finer-grained page cache sharing can only exist in our heads
for now, so I will go on using this approach for everyone
to save memory (considering AI needs too much memory and
memory is becoming more expensive).
>
>>>
>>>> On the other side, I do think the `fingerprint` design
>>>> is much like persistent NFS file handles in some aspects
>>>> (I don't want to equate it to that concept, but it's
>>>> very similar) for a single trusted domain: we have to
>>>> deal with multiple filesystem sources and mark them in a
>>>> unique way within a domain.
>>>
>>> I don't really think they are similar in any way.
>>
>> Why are they not similar? You still need persistent IDs
>> in inodes across multiple fses; if there were
>> content-addressable immutable filesystems working on
>> inodes, they could just use inode hashes as file handles
>> instead of inode numbers + generations.
>
> Sure, if they are well defined, cryptographically secure hashes. But
EROFS is a golden image filesystem generated purely in
userspace; vendors will use secure hashes or
per-vendor-generated per-inode UUIDs.
> that's different from file handles, which don't address content at all,
> but are just a handle to a given file that bypasses the path lookup.
I agree; that is why I once said _somewhat_ similar. For
content-addressable filesystems, of course they could use
simplified secure hashes as file handles in some form.
Thanks,
Gao Xiang
>
>>
>> Thanks,
>> Gao Xiang
> ---end quoted text---
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 6:52 ` Christoph Hellwig
2026-01-20 7:19 ` Gao Xiang
@ 2026-01-20 13:40 ` Christian Brauner
2026-01-20 14:11 ` Gao Xiang
1 sibling, 1 reply; 46+ messages in thread
From: Christian Brauner @ 2026-01-20 13:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Gao Xiang, Hongbo Li, chao, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel, Linus Torvalds, oliver.yang
On Tue, Jan 20, 2026 at 07:52:42AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 20, 2026 at 11:07:48AM +0800, Gao Xiang wrote:
> >
> > Hi Christoph,
> >
> > Sorry I didn't phrase things clearly earlier, but I'd still
> > like to explain the whole idea, as this feature is clearly
> > useful for containerization. I hope we can reach agreement
> > on the page cache sharing feature: Christian agreed on this
> > feature (and I hope still):
> >
> > https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
>
> He has to ultimately decide. I do have an uneasy feeling about this.
> It's not super informed, only as much as I can keep up, and I'm not the
> one in charge, but I hope it is helpful to share my perspective.
It always is helpful, Christoph! I appreciate your input.
I'm fine with this feature. But as I've said in person: I still oppose
making any block-based filesystem mountable in unprivileged containers
without any sort of trust mechanism.
I am however open in the future for block devices protected by dm-verity
with the root hash signed by a sufficiently trusted key to be mountable
in unprivileged containers.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 13:40 ` Christian Brauner
@ 2026-01-20 14:11 ` Gao Xiang
0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-20 14:11 UTC (permalink / raw)
To: Christian Brauner, Christoph Hellwig
Cc: Hongbo Li, chao, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel, Linus Torvalds, oliver.yang
Hi Christian,
On 2026/1/20 21:40, Christian Brauner wrote:
> On Tue, Jan 20, 2026 at 07:52:42AM +0100, Christoph Hellwig wrote:
>> On Tue, Jan 20, 2026 at 11:07:48AM +0800, Gao Xiang wrote:
>>>
>>> Hi Christoph,
>>>
>>> Sorry I didn't phrase things clearly earlier, but I'd still
>>> like to explain the whole idea, as this feature is clearly
>>> useful for containerization. I hope we can reach agreement
>>> on the page cache sharing feature: Christian agreed on this
>>> feature (and I hope still):
>>>
>>> https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
>>
>> He has to ultimately decide. I do have an uneasy feeling about this.
>> It's not super informed, only as much as I can keep up, and I'm not the
>> one in charge, but I hope it is helpful to share my perspective.
>
> It always is helpful, Christoph! I appreciate your input.
Thanks, I will raise some extra comments for Hongbo
to address, to make this feature safer.
>
> I'm fine with this feature. But as I've said in person: I still oppose
> making any block-based filesystem mountable in unprivileged containers
> without any sort of trust mechanism.
Nevertheless, since Christoph put this topic on the
community list, I had to repeat my own latest
thoughts on this on the list for reference.
Anyway, some people would just nitpick the words
above as policy: they will re-invent new
non-block-based trick filesystems (but with much odder
kernel-parsed metadata designs) for the kernel community.
Honestly, my own view is that we should find real
threats instead of arbitrary assumptions against different
types of filesystems. The original question is still
what prevents _kernel filesystems with kernel-parsed
metadata_ from being mountable in unprivileged containers.
From my own perspective (in public, without any policy
involved), I think it would be better to get some fair
technical points & concerns, so that we either fully
agree it's a real dead end or really overcome
some barriers, since this feature is indeed useful.
I will not repeat my thoughts again and annoy folks even
further on this topic, but I document them here for reference.
>
> I am however open in the future for block devices protected by dm-verity
> with the root hash signed by a sufficiently trusted key to be mountable
> in unprivileged containers.
Signed images will be a good start; I fully agree.
No one really argues against that, and I believe I've
mentioned the signed-image idea in person to Christoph and
Darrick too.
Thanks,
Gao Xiang
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-16 15:46 ` Christoph Hellwig
2026-01-16 16:21 ` Gao Xiang
@ 2026-01-20 12:29 ` Hongbo Li
2026-01-22 14:48 ` Hongbo Li
1 sibling, 1 reply; 46+ messages in thread
From: Hongbo Li @ 2026-01-20 12:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: hsiangkao, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/16 23:46, Christoph Hellwig wrote:
> I don't really understand the fingerprint idea. Files with the
> same content will point to the same physical disk blocks, so that
> should be a much better indicator than a fingerprint? Also how does
> the fingerprint guarantee uniqueness? Is it a cryptographically
> secure hash? In here it just seems like an opaque blob.
>
>> +static inline int erofs_inode_set_aops(struct inode *inode,
>> + struct inode *realinode, bool no_fscache)
>
> Factoring this out first would be a nice little prep patch.
> Also it would probably be much cleaner using IS_ENABLED.
Ok, thanks for reviewing. I will refine it in the next version.
Thanks,
Hongbo
>
>> +static int erofs_ishare_file_open(struct inode *inode, struct file *file)
>> +{
>> + struct inode *sharedinode = EROFS_I(inode)->sharedinode;
>
> Ok, it looks like this allocates a separate backing file and inode.
>
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 12:29 ` Hongbo Li
@ 2026-01-22 14:48 ` Hongbo Li
2026-01-23 6:19 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Hongbo Li @ 2026-01-22 14:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: hsiangkao, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/20 20:29, Hongbo Li wrote:
>
>
> On 2026/1/16 23:46, Christoph Hellwig wrote:
>> I don't really understand the fingerprint idea. Files with the
>> same content will point to the same physical disk blocks, so that
>> should be a much better indicator than a fingerprint? Also how does
>> the fingerprint guarantee uniqueness? Is it a cryptographically
>> secure hash? In here it just seems like an opaque blob.
>>
>>> +static inline int erofs_inode_set_aops(struct inode *inode,
>>> + struct inode *realinode, bool no_fscache)
>>
>> Factoring this out first would be a nice little prep patch.
>> Also it would probably be much cleaner using IS_ENABLED.
>
> Ok, thanks for reviewing. I will refine it in the next version.
Sorry, I overlooked this point. Factoring this out is a good idea, but we
cannot use IS_ENABLED here, because some aops are not visible when the
relevant config macro is not enabled. So I chose to keep this format
and only factor this out.
Thanks,
Hongbo
>
> Thanks,
> Hongbo
>
>>
>>> +static int erofs_ishare_file_open(struct inode *inode, struct file
>>> *file)
>>> +{
>>> + struct inode *sharedinode = EROFS_I(inode)->sharedinode;
>>
>> Ok, it looks like this allocates a separate backing file and inode.
>>
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-22 14:48 ` Hongbo Li
@ 2026-01-23 6:19 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-23 6:19 UTC (permalink / raw)
To: Hongbo Li
Cc: Christoph Hellwig, hsiangkao, chao, brauner, djwong, amir73il,
linux-fsdevel, linux-erofs, linux-kernel
On Thu, Jan 22, 2026 at 10:48:27PM +0800, Hongbo Li wrote:
> Sorry, I overlooked this point. Factoring this out is a good idea, but we
> cannot use IS_ENABLED here, because some aops are not visible when the
> relevant config macro is not enabled. So I chose to keep this format
> and only factor this out.
Is it? If so, just moving the extern outside the ifdef should be
easy enough, but from a quick grep I can't see any such case.
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-16 9:55 ` [PATCH v15 5/9] erofs: introduce the page cache share feature Hongbo Li
2026-01-16 15:46 ` Christoph Hellwig
@ 2026-01-20 14:19 ` Gao Xiang
2026-01-20 14:33 ` Gao Xiang
2026-01-21 1:29 ` Hongbo Li
1 sibling, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-20 14:19 UTC (permalink / raw)
To: Hongbo Li, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel
On 2026/1/16 17:55, Hongbo Li wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>
> Currently, reading files with different paths (or names) but the same
> content will consume multiple copies of the page cache, even if the
> content of these page caches is the same. For example, reading
> identical files (e.g., *.so files) from two different minor versions of
> container images will cost multiple copies of the same page cache,
> since different containers have different mount points. Therefore,
> sharing the page cache for files with the same content can save memory.
>
> This introduces the page cache share feature in erofs. It allocates a
> deduplicated inode and uses its page cache as the shared one. Reads for files
> with identical content will ultimately be routed to the page cache of
> the deduplicated inode. In this way, a single page cache satisfies
> multiple read requests for different files with the same contents.
>
> We introduce the inode_share mount option to enable the page sharing
> mode during mounting.
>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
> Documentation/filesystems/erofs.rst | 5 +
> fs/erofs/Makefile | 1 +
> fs/erofs/inode.c | 24 +----
> fs/erofs/internal.h | 57 ++++++++++
> fs/erofs/ishare.c | 161 ++++++++++++++++++++++++++++
> fs/erofs/super.c | 56 +++++++++-
> fs/erofs/xattr.c | 34 ++++++
> fs/erofs/xattr.h | 3 +
> 8 files changed, 316 insertions(+), 25 deletions(-)
> create mode 100644 fs/erofs/ishare.c
>
> diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
> index 08194f194b94..27d3caa3c73c 100644
> --- a/Documentation/filesystems/erofs.rst
> +++ b/Documentation/filesystems/erofs.rst
> @@ -128,7 +128,12 @@ device=%s Specify a path to an extra device to be used together.
> fsid=%s Specify a filesystem image ID for Fscache back-end.
> domain_id=%s Specify a domain ID in fscache mode so that different images
> with the same blobs under a given domain ID can share storage.
> + Also used for inode page sharing mode which defines a sharing
> + domain.
I think for either the existing fscache mode or the page
cache sharing here, `domain_id` should be treated as
sensitive information, so it'd be helpful to protect it
in a separate patch.
And change the description as below:
Specify a trusted domain ID for fscache mode so that
different images with the same blobs, identified by blob IDs,
can share storage within the same trusted domain.
Also used for different filesystems with inode page sharing
enabled to share page cache within the trusted domain.
> fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
> +inode_share Enable inode page sharing for this filesystem. Inodes with
> + identical content within the same domain ID can share the
> + page cache.
> =================== =========================================================
...
> erofs_exit_shrinker();
> @@ -1062,6 +1111,8 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
> seq_printf(seq, ",domain_id=%s", sbi->domain_id);
I think we shouldn't show `domain_id` to the userspace
entirely.
Also, let's use kfree_sensitive() and no_free_ptr() to
replace the following snippet:
case Opt_domain_id:
kfree(sbi->domain_id); -> kfree_sensitive
sbi->domain_id = kstrdup(param->string, GFP_KERNEL);
-> sbi->domain_id = no_free_ptr(param->string);
if (!sbi->domain_id)
return -ENOMEM;
break;
And replace with kfree_sensitive() for domain_id everywhere.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 14:19 ` Gao Xiang
@ 2026-01-20 14:33 ` Gao Xiang
2026-01-21 1:29 ` Hongbo Li
1 sibling, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-20 14:33 UTC (permalink / raw)
To: Hongbo Li, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel
On 2026/1/20 22:19, Gao Xiang wrote:
>
>
> On 2026/1/16 17:55, Hongbo Li wrote:
>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>
>> Currently, reading files with different paths (or names) but the same
>> content will consume multiple copies of the page cache, even if the
>> content of these page caches is the same. For example, reading
>> identical files (e.g., *.so files) from two different minor versions of
>> container images will cost multiple copies of the same page cache,
>> since different containers have different mount points. Therefore,
>> sharing the page cache for files with the same content can save memory.
>>
>> This introduces the page cache share feature in erofs. It allocates a
>> deduplicated inode and uses its page cache as shared. Reads for files
>> with identical content will ultimately be routed to the page cache of
>> the deduplicated inode. In this way, a single page cache satisfies
>> multiple read requests for different files with the same contents.
>>
>> We introduce the inode_share mount option to enable the page sharing
>> mode during mounting.
>>
>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>> ---
>> Documentation/filesystems/erofs.rst | 5 +
>> fs/erofs/Makefile | 1 +
>> fs/erofs/inode.c | 24 +----
>> fs/erofs/internal.h | 57 ++++++++++
>> fs/erofs/ishare.c | 161 ++++++++++++++++++++++++++++
>> fs/erofs/super.c | 56 +++++++++-
>> fs/erofs/xattr.c | 34 ++++++
>> fs/erofs/xattr.h | 3 +
>> 8 files changed, 316 insertions(+), 25 deletions(-)
>> create mode 100644 fs/erofs/ishare.c
>>
>> diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
>> index 08194f194b94..27d3caa3c73c 100644
>> --- a/Documentation/filesystems/erofs.rst
>> +++ b/Documentation/filesystems/erofs.rst
>> @@ -128,7 +128,12 @@ device=%s Specify a path to an extra device to be used together.
>> fsid=%s Specify a filesystem image ID for Fscache back-end.
>> domain_id=%s Specify a domain ID in fscache mode so that different images
>> with the same blobs under a given domain ID can share storage.
>> + Also used for inode page sharing mode which defines a sharing
>> + domain.
>
> I think for either the existing fscache use or the page
> cache sharing here, `domain_id` should be protected as
> sensitive information, so it'd be helpful to protect it
> in a separate patch.
>
> And change the description as below:
> Specify a trusted domain ID for fscache mode so that
> different images with the same blobs, identified by blob IDs,
> can share storage within the same trusted domain.
> Also used for different filesystems with inode page sharing
> enabled to share page cache within the trusted domain.
>
>
>> fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
>> +inode_share Enable inode page sharing for this filesystem. Inodes with
>> + identical content within the same domain ID can share the
>> + page cache.
>> =================== =========================================================
>
> ...
>
>
>> erofs_exit_shrinker();
>> @@ -1062,6 +1111,8 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
>> seq_printf(seq, ",domain_id=%s", sbi->domain_id);
>
> I think we shouldn't show `domain_id` to the userspace
> entirely.
Maybe don't bother with the deprecated fscache; just make
sure `domain_id` won't be shown in any form for the page
cache sharing feature.
>
> Also, let's use kfree_sensitive() and no_free_ptr() to
> replace the following snippet:
>
> case Opt_domain_id:
> kfree(sbi->domain_id); -> kfree_sensitive
> sbi->domain_id = kstrdup(param->string, GFP_KERNEL);
> -> sbi->domain_id = no_free_ptr(param->string);
> if (!sbi->domain_id)
> return -ENOMEM;
> break;
>
> And replace with kfree_sensitive() for domain_id everywhere.
> Thanks,
> Gao Xiang
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
2026-01-20 14:19 ` Gao Xiang
2026-01-20 14:33 ` Gao Xiang
@ 2026-01-21 1:29 ` Hongbo Li
1 sibling, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-21 1:29 UTC (permalink / raw)
To: Gao Xiang, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel
On 2026/1/20 22:19, Gao Xiang wrote:
>
>
> On 2026/1/16 17:55, Hongbo Li wrote:
>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>
>> Currently, reading files with different paths (or names) but the same
>> content will consume multiple copies of the page cache, even if the
>> content of these page caches is the same. For example, reading
>> identical files (e.g., *.so files) from two different minor versions of
>> container images will cost multiple copies of the same page cache,
>> since different containers have different mount points. Therefore,
>> sharing the page cache for files with the same content can save memory.
>>
>> This introduces the page cache share feature in erofs. It allocates a
>> deduplicated inode and uses its page cache as shared. Reads for files
>> with identical content will ultimately be routed to the page cache of
>> the deduplicated inode. In this way, a single page cache satisfies
>> multiple read requests for different files with the same contents.
>>
>> We introduce the inode_share mount option to enable the page sharing
>> mode during mounting.
>>
>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>> ---
>> Documentation/filesystems/erofs.rst | 5 +
>> fs/erofs/Makefile | 1 +
>> fs/erofs/inode.c | 24 +----
>> fs/erofs/internal.h | 57 ++++++++++
>> fs/erofs/ishare.c | 161 ++++++++++++++++++++++++++++
>> fs/erofs/super.c | 56 +++++++++-
>> fs/erofs/xattr.c | 34 ++++++
>> fs/erofs/xattr.h | 3 +
>> 8 files changed, 316 insertions(+), 25 deletions(-)
>> create mode 100644 fs/erofs/ishare.c
>>
>> diff --git a/Documentation/filesystems/erofs.rst
>> b/Documentation/filesystems/erofs.rst
>> index 08194f194b94..27d3caa3c73c 100644
>> --- a/Documentation/filesystems/erofs.rst
>> +++ b/Documentation/filesystems/erofs.rst
>> @@ -128,7 +128,12 @@ device=%s Specify a path to an extra
>> device to be used together.
>> fsid=%s Specify a filesystem image ID for Fscache
>> back-end.
>> domain_id=%s Specify a domain ID in fscache mode so that
>> different images
>> with the same blobs under a given domain ID
>> can share storage.
>> + Also used for inode page sharing mode which
>> defines a sharing
>> + domain.
>
> I think for either the existing fscache use or the page
> cache sharing here, `domain_id` should be protected as
> sensitive information, so it'd be helpful to protect it
> in a separate patch.
>
> And change the description as below:
> Specify a trusted domain ID for fscache mode
> so that
> different images with the same blobs,
> identified by blob IDs,
> can share storage within the same trusted
> domain.
> Also used for different filesystems with
> inode page sharing
> enabled to share page cache within the
> trusted domain.
>
>
>> fsoffset=%llu Specify block-aligned filesystem offset for
>> the primary device.
>> +inode_share Enable inode page sharing for this
>> filesystem. Inodes with
>> + identical content within the same domain ID
>> can share the
>> + page cache.
>> ===================
>> =========================================================
>
> ...
>
>
>> erofs_exit_shrinker();
>> @@ -1062,6 +1111,8 @@ static int erofs_show_options(struct seq_file
>> *seq, struct dentry *root)
>> seq_printf(seq, ",domain_id=%s", sbi->domain_id);
>
> I think we shouldn't show `domain_id` to the userspace
> entirely.
>
> Also, let's use kfree_sensitive() and no_free_ptr() to
> replace the following snippet:
>
> case Opt_domain_id:
> kfree(sbi->domain_id); -> kfree_sensitive
Ok, kfree_sensitive/no_free_ptr looks good, this way makes domain_id
more reliable.
Thanks,
Hongbo
> sbi->domain_id = kstrdup(param->string, GFP_KERNEL);
> -> sbi->domain_id = no_free_ptr(param->string);
> if (!sbi->domain_id)
> return -ENOMEM;
> break;
>
> And replace with kfree_sensitive() for domain_id everywhere.
>
> Thanks,
> Gao Xiang
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v15 6/9] erofs: pass inode to trace_erofs_read_folio
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (4 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 5/9] erofs: introduce the page cache share feature Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 7/9] erofs: support unencoded inodes for page cache share Hongbo Li
` (3 subsequent siblings)
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
The trace_erofs_read_folio tracepoint accesses inode information through
the folio, but that fails if the real inode is not associated with the
folio (such as for the upcoming page cache sharing case). Therefore,
pass the real inode to it so that the inode information can still be
printed out in that case.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
fs/erofs/data.c | 6 ++----
fs/erofs/fileio.c | 2 +-
fs/erofs/zdata.c | 2 +-
include/trace/events/erofs.h | 10 +++++-----
4 files changed, 9 insertions(+), 11 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 71e23d91123d..ea198defb531 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -385,8 +385,7 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
};
struct erofs_iomap_iter_ctx iter_ctx = {};
- trace_erofs_read_folio(folio, true);
-
+ trace_erofs_read_folio(folio_inode(folio), folio, true);
iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
return 0;
}
@@ -400,8 +399,7 @@ static void erofs_readahead(struct readahead_control *rac)
struct erofs_iomap_iter_ctx iter_ctx = {};
trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
- readahead_count(rac), true);
-
+ readahead_count(rac), true);
iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
}
diff --git a/fs/erofs/fileio.c b/fs/erofs/fileio.c
index 932e8b353ba1..d07dc248d264 100644
--- a/fs/erofs/fileio.c
+++ b/fs/erofs/fileio.c
@@ -161,7 +161,7 @@ static int erofs_fileio_read_folio(struct file *file, struct folio *folio)
struct erofs_fileio io = {};
int err;
- trace_erofs_read_folio(folio, true);
+ trace_erofs_read_folio(folio_inode(folio), folio, true);
err = erofs_fileio_scan_folio(&io, folio);
erofs_fileio_rq_submit(io.rq);
return err;
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 3d31f7840ca0..93ab6a481b64 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1887,7 +1887,7 @@ static int z_erofs_read_folio(struct file *file, struct folio *folio)
Z_EROFS_DEFINE_FRONTEND(f, inode, folio_pos(folio));
int err;
- trace_erofs_read_folio(folio, false);
+ trace_erofs_read_folio(inode, folio, false);
z_erofs_pcluster_readmore(&f, NULL, true);
err = z_erofs_scan_folio(&f, folio, false);
z_erofs_pcluster_readmore(&f, NULL, false);
diff --git a/include/trace/events/erofs.h b/include/trace/events/erofs.h
index dad7360f42f9..def20d06507b 100644
--- a/include/trace/events/erofs.h
+++ b/include/trace/events/erofs.h
@@ -82,9 +82,9 @@ TRACE_EVENT(erofs_fill_inode,
TRACE_EVENT(erofs_read_folio,
- TP_PROTO(struct folio *folio, bool raw),
+ TP_PROTO(struct inode *inode, struct folio *folio, bool raw),
- TP_ARGS(folio, raw),
+ TP_ARGS(inode, folio, raw),
TP_STRUCT__entry(
__field(dev_t, dev )
@@ -96,9 +96,9 @@ TRACE_EVENT(erofs_read_folio,
),
TP_fast_assign(
- __entry->dev = folio->mapping->host->i_sb->s_dev;
- __entry->nid = EROFS_I(folio->mapping->host)->nid;
- __entry->dir = S_ISDIR(folio->mapping->host->i_mode);
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->nid = EROFS_I(inode)->nid;
+ __entry->dir = S_ISDIR(inode->i_mode);
__entry->index = folio->index;
__entry->uptodate = folio_test_uptodate(folio);
__entry->raw = raw;
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 7/9] erofs: support unencoded inodes for page cache share
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (5 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 6/9] erofs: pass inode to trace_erofs_read_folio Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 8/9] erofs: support compressed " Hongbo Li
` (2 subsequent siblings)
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
This patch adds inode page cache sharing functionality for unencoded
files.
I conducted experiments in a container environment. Below is the
memory usage when reading all files in two different minor versions
of several container images:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 241 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 163 | 33% |
+-------------------+------------------+-------------+---------------+
| | No | 872 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 630 | 28% |
+-------------------+------------------+-------------+---------------+
| | No | 2771 | - |
| tensorflow +------------------+-------------+---------------+
| 2.11.0 & 2.11.1 | Yes | 2340 | 16% |
+-------------------+------------------+-------------+---------------+
| | No | 926 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 735 | 21% |
+-------------------+------------------+-------------+---------------+
| | No | 390 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 219 | 44% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 924 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 474 | 49% |
+-------------------+------------------+-------------+---------------+
Additionally, the table below shows the runtime memory usage of the
container:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 35 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 28 | 20% |
+-------------------+------------------+-------------+---------------+
| | No | 149 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 95 | 37% |
+-------------------+------------------+-------------+---------------+
| | No | 1028 | - |
| tensorflow +------------------+-------------+---------------+
| 2.11.0 & 2.11.1 | Yes | 930 | 10% |
+-------------------+------------------+-------------+---------------+
| | No | 155 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 132 | 15% |
+-------------------+------------------+-------------+---------------+
| | No | 25 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 20 | 20% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 186 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 98 | 48% |
+-------------------+------------------+-------------+---------------+
Co-developed-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
fs/erofs/data.c | 32 +++++++++++++++++++++++---------
fs/erofs/fileio.c | 25 ++++++++++++++++---------
fs/erofs/inode.c | 3 ++-
fs/erofs/internal.h | 6 ++++++
fs/erofs/ishare.c | 34 ++++++++++++++++++++++++++++++++++
5 files changed, 81 insertions(+), 19 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index ea198defb531..3a4eb0dececd 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -269,6 +269,7 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
struct erofs_iomap_iter_ctx {
struct page *page;
void *base;
+ struct inode *realinode;
};
static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
@@ -276,14 +277,15 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
{
struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
struct erofs_iomap_iter_ctx *ctx = iter->private;
- struct super_block *sb = inode->i_sb;
+ struct inode *realinode = ctx ? ctx->realinode : inode;
+ struct super_block *sb = realinode->i_sb;
struct erofs_map_blocks map;
struct erofs_map_dev mdev;
int ret;
map.m_la = offset;
map.m_llen = length;
- ret = erofs_map_blocks(inode, &map);
+ ret = erofs_map_blocks(realinode, &map);
if (ret < 0)
return ret;
@@ -296,7 +298,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return 0;
}
- if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(inode)) {
+ if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(realinode)) {
mdev = (struct erofs_map_dev) {
.m_deviceid = map.m_deviceid,
.m_pa = map.m_pa,
@@ -322,7 +324,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
void *ptr;
ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
- erofs_inode_in_metabox(inode));
+ erofs_inode_in_metabox(realinode));
if (IS_ERR(ptr))
return PTR_ERR(ptr);
iomap->inline_data = ptr;
@@ -383,10 +385,15 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
.ops = &iomap_bio_read_ops,
.cur_folio = folio,
};
- struct erofs_iomap_iter_ctx iter_ctx = {};
+ bool need_iput;
+ struct erofs_iomap_iter_ctx iter_ctx = {
+ .realinode = erofs_real_inode(folio_inode(folio), &need_iput),
+ };
- trace_erofs_read_folio(folio_inode(folio), folio, true);
+ trace_erofs_read_folio(iter_ctx.realinode, folio, true);
iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+ if (need_iput)
+ iput(iter_ctx.realinode);
return 0;
}
@@ -396,11 +403,16 @@ static void erofs_readahead(struct readahead_control *rac)
.ops = &iomap_bio_read_ops,
.rac = rac,
};
- struct erofs_iomap_iter_ctx iter_ctx = {};
+ bool need_iput;
+ struct erofs_iomap_iter_ctx iter_ctx = {
+ .realinode = erofs_real_inode(rac->mapping->host, &need_iput),
+ };
- trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
+ trace_erofs_readahead(iter_ctx.realinode, readahead_index(rac),
readahead_count(rac), true);
iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+ if (need_iput)
+ iput(iter_ctx.realinode);
}
static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
@@ -421,7 +433,9 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return dax_iomap_rw(iocb, to, &erofs_iomap_ops);
#endif
if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev) {
- struct erofs_iomap_iter_ctx iter_ctx = {};
+ struct erofs_iomap_iter_ctx iter_ctx = {
+ .realinode = inode,
+ };
return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
NULL, 0, &iter_ctx, 0);
diff --git a/fs/erofs/fileio.c b/fs/erofs/fileio.c
index d07dc248d264..c1d0081609dc 100644
--- a/fs/erofs/fileio.c
+++ b/fs/erofs/fileio.c
@@ -88,9 +88,9 @@ void erofs_fileio_submit_bio(struct bio *bio)
bio));
}
-static int erofs_fileio_scan_folio(struct erofs_fileio *io, struct folio *folio)
+static int erofs_fileio_scan_folio(struct erofs_fileio *io,
+ struct inode *inode, struct folio *folio)
{
- struct inode *inode = folio_inode(folio);
struct erofs_map_blocks *map = &io->map;
unsigned int cur = 0, end = folio_size(folio), len, attached = 0;
loff_t pos = folio_pos(folio), ofs;
@@ -158,31 +158,38 @@ static int erofs_fileio_scan_folio(struct erofs_fileio *io, struct folio *folio)
static int erofs_fileio_read_folio(struct file *file, struct folio *folio)
{
+ bool need_iput;
+ struct inode *realinode = erofs_real_inode(folio_inode(folio), &need_iput);
struct erofs_fileio io = {};
int err;
- trace_erofs_read_folio(folio_inode(folio), folio, true);
- err = erofs_fileio_scan_folio(&io, folio);
+ trace_erofs_read_folio(realinode, folio, true);
+ err = erofs_fileio_scan_folio(&io, realinode, folio);
erofs_fileio_rq_submit(io.rq);
+ if (need_iput)
+ iput(realinode);
return err;
}
static void erofs_fileio_readahead(struct readahead_control *rac)
{
- struct inode *inode = rac->mapping->host;
+ bool need_iput;
+ struct inode *realinode = erofs_real_inode(rac->mapping->host, &need_iput);
struct erofs_fileio io = {};
struct folio *folio;
int err;
- trace_erofs_readahead(inode, readahead_index(rac),
+ trace_erofs_readahead(realinode, readahead_index(rac),
readahead_count(rac), true);
while ((folio = readahead_folio(rac))) {
- err = erofs_fileio_scan_folio(&io, folio);
+ err = erofs_fileio_scan_folio(&io, realinode, folio);
if (err && err != -EINTR)
- erofs_err(inode->i_sb, "readahead error at folio %lu @ nid %llu",
- folio->index, EROFS_I(inode)->nid);
+ erofs_err(realinode->i_sb, "readahead error at folio %lu @ nid %llu",
+ folio->index, EROFS_I(realinode)->nid);
}
erofs_fileio_rq_submit(io.rq);
+ if (need_iput)
+ iput(realinode);
}
const struct address_space_operations erofs_fileio_aops = {
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 202cbbb4eada..d33816cff813 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -213,7 +213,8 @@ static int erofs_fill_inode(struct inode *inode)
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
inode->i_op = &erofs_generic_iops;
- inode->i_fop = &erofs_file_fops;
+ inode->i_fop = erofs_ishare_fill_inode(inode) ?
+ &erofs_ishare_fops : &erofs_file_fops;
break;
case S_IFDIR:
inode->i_op = &erofs_dir_iops;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 15945e3308b8..d38e63e361c1 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -591,11 +591,17 @@ int __init erofs_init_ishare(void);
void erofs_exit_ishare(void);
bool erofs_ishare_fill_inode(struct inode *inode);
void erofs_ishare_free_inode(struct inode *inode);
+struct inode *erofs_real_inode(struct inode *inode, bool *need_iput);
#else
static inline int erofs_init_ishare(void) { return 0; }
static inline void erofs_exit_ishare(void) {}
static inline bool erofs_ishare_fill_inode(struct inode *inode) { return false; }
static inline void erofs_ishare_free_inode(struct inode *inode) {}
+static inline struct inode *erofs_real_inode(struct inode *inode, bool *need_iput)
+{
+ *need_iput = false;
+ return inode;
+}
#endif
long erofs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index 6b710c935afb..affa34ac5b2e 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -11,6 +11,12 @@
static struct vfsmount *erofs_ishare_mnt;
+static inline bool erofs_is_ishare_inode(struct inode *inode)
+{
+ /* assumed FS_ONDEMAND is excluded with FS_PAGE_CACHE_SHARE feature */
+ return inode->i_sb->s_type == &erofs_anon_fs_type;
+}
+
static int erofs_ishare_iget5_eq(struct inode *inode, void *data)
{
struct erofs_inode_fingerprint *fp1 = &EROFS_I(inode)->fingerprint;
@@ -38,6 +44,8 @@ bool erofs_ishare_fill_inode(struct inode *inode)
struct inode *sharedinode;
unsigned long hash;
+ if (erofs_inode_is_data_compressed(vi->datalayout))
+ return false;
if (erofs_xattr_fill_inode_fingerprint(&fp, inode, sbi->domain_id))
return false;
hash = xxh32(fp.opaque, fp.size, 0);
@@ -149,6 +157,32 @@ const struct file_operations erofs_ishare_fops = {
.splice_read = filemap_splice_read,
};
+struct inode *erofs_real_inode(struct inode *inode, bool *need_iput)
+{
+ struct erofs_inode *vi, *vi_share;
+ struct inode *realinode;
+
+ *need_iput = false;
+ if (!erofs_is_ishare_inode(inode))
+ return inode;
+
+ vi_share = EROFS_I(inode);
+ spin_lock(&vi_share->ishare_lock);
+ /* fetch any one as real inode */
+ DBG_BUGON(list_empty(&vi_share->ishare_list));
+ list_for_each_entry(vi, &vi_share->ishare_list, ishare_list) {
+ realinode = igrab(&vi->vfs_inode);
+ if (realinode) {
+ *need_iput = true;
+ break;
+ }
+ }
+ spin_unlock(&vi_share->ishare_lock);
+
+ DBG_BUGON(!realinode);
+ return realinode;
+}
+
int __init erofs_init_ishare(void)
{
erofs_ishare_mnt = kern_mount(&erofs_anon_fs_type);
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 8/9] erofs: support compressed inodes for page cache share
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (6 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 7/9] erofs: support unencoded inodes for page cache share Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 9:55 ` [PATCH v15 9/9] erofs: implement .fadvise " Hongbo Li
2026-01-16 15:36 ` [PATCH v15 0/9] erofs: Introduce page cache sharing feature Christoph Hellwig
9 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
This patch adds page cache sharing functionality for compressed inodes.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/ishare.c | 2 --
fs/erofs/zdata.c | 38 ++++++++++++++++++++++++--------------
2 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index affa34ac5b2e..96679286da95 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -44,8 +44,6 @@ bool erofs_ishare_fill_inode(struct inode *inode)
struct inode *sharedinode;
unsigned long hash;
- if (erofs_inode_is_data_compressed(vi->datalayout))
- return false;
if (erofs_xattr_fill_inode_fingerprint(&fp, inode, sbi->domain_id))
return false;
hash = xxh32(fp.opaque, fp.size, 0);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 93ab6a481b64..59ee9a36d9eb 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -493,7 +493,7 @@ enum z_erofs_pclustermode {
};
struct z_erofs_frontend {
- struct inode *const inode;
+ struct inode *inode, *sharedinode;
struct erofs_map_blocks map;
struct z_erofs_bvec_iter biter;
@@ -508,8 +508,8 @@ struct z_erofs_frontend {
unsigned int icur;
};
-#define Z_EROFS_DEFINE_FRONTEND(fe, i, ho) struct z_erofs_frontend fe = { \
- .inode = i, .head = Z_EROFS_PCLUSTER_TAIL, \
+#define Z_EROFS_DEFINE_FRONTEND(fe, i, si, ho) struct z_erofs_frontend fe = { \
+ .inode = i, .sharedinode = si, .head = Z_EROFS_PCLUSTER_TAIL, \
.mode = Z_EROFS_PCLUSTER_FOLLOWED, .headoffset = ho }
static bool z_erofs_should_alloc_cache(struct z_erofs_frontend *fe)
@@ -1866,7 +1866,7 @@ static void z_erofs_pcluster_readmore(struct z_erofs_frontend *f,
pgoff_t index = cur >> PAGE_SHIFT;
struct folio *folio;
- folio = erofs_grab_folio_nowait(inode->i_mapping, index);
+ folio = erofs_grab_folio_nowait(f->sharedinode->i_mapping, index);
if (!IS_ERR_OR_NULL(folio)) {
if (folio_test_uptodate(folio))
folio_unlock(folio);
@@ -1883,11 +1883,13 @@ static void z_erofs_pcluster_readmore(struct z_erofs_frontend *f,
static int z_erofs_read_folio(struct file *file, struct folio *folio)
{
- struct inode *const inode = folio->mapping->host;
- Z_EROFS_DEFINE_FRONTEND(f, inode, folio_pos(folio));
+ struct inode *sharedinode = folio->mapping->host;
+ bool need_iput;
+ struct inode *realinode = erofs_real_inode(sharedinode, &need_iput);
+ Z_EROFS_DEFINE_FRONTEND(f, realinode, sharedinode, folio_pos(folio));
int err;
- trace_erofs_read_folio(inode, folio, false);
+ trace_erofs_read_folio(realinode, folio, false);
z_erofs_pcluster_readmore(&f, NULL, true);
err = z_erofs_scan_folio(&f, folio, false);
z_erofs_pcluster_readmore(&f, NULL, false);
@@ -1896,23 +1898,28 @@ static int z_erofs_read_folio(struct file *file, struct folio *folio)
/* if some pclusters are ready, need submit them anyway */
err = z_erofs_runqueue(&f, 0) ?: err;
if (err && err != -EINTR)
- erofs_err(inode->i_sb, "read error %d @ %lu of nid %llu",
- err, folio->index, EROFS_I(inode)->nid);
+ erofs_err(realinode->i_sb, "read error %d @ %lu of nid %llu",
+ err, folio->index, EROFS_I(realinode)->nid);
erofs_put_metabuf(&f.map.buf);
erofs_release_pages(&f.pagepool);
+
+ if (need_iput)
+ iput(realinode);
return err;
}
static void z_erofs_readahead(struct readahead_control *rac)
{
- struct inode *const inode = rac->mapping->host;
- Z_EROFS_DEFINE_FRONTEND(f, inode, readahead_pos(rac));
+ struct inode *sharedinode = rac->mapping->host;
+ bool need_iput;
+ struct inode *realinode = erofs_real_inode(sharedinode, &need_iput);
+ Z_EROFS_DEFINE_FRONTEND(f, realinode, sharedinode, readahead_pos(rac));
unsigned int nrpages = readahead_count(rac);
struct folio *head = NULL, *folio;
int err;
- trace_erofs_readahead(inode, readahead_index(rac), nrpages, false);
+ trace_erofs_readahead(realinode, readahead_index(rac), nrpages, false);
z_erofs_pcluster_readmore(&f, rac, true);
while ((folio = readahead_folio(rac))) {
folio->private = head;
@@ -1926,8 +1933,8 @@ static void z_erofs_readahead(struct readahead_control *rac)
err = z_erofs_scan_folio(&f, folio, true);
if (err && err != -EINTR)
- erofs_err(inode->i_sb, "readahead error at folio %lu @ nid %llu",
- folio->index, EROFS_I(inode)->nid);
+ erofs_err(realinode->i_sb, "readahead error at folio %lu @ nid %llu",
+ folio->index, EROFS_I(realinode)->nid);
}
z_erofs_pcluster_readmore(&f, rac, false);
z_erofs_pcluster_end(&f);
@@ -1935,6 +1942,9 @@ static void z_erofs_readahead(struct readahead_control *rac)
(void)z_erofs_runqueue(&f, nrpages);
erofs_put_metabuf(&f.map.buf);
erofs_release_pages(&f.pagepool);
+
+ if (need_iput)
+ iput(realinode);
}
const struct address_space_operations z_erofs_aops = {
--
2.22.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v15 9/9] erofs: implement .fadvise for page cache share
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (7 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 8/9] erofs: support compressed " Hongbo Li
@ 2026-01-16 9:55 ` Hongbo Li
2026-01-16 15:46 ` Christoph Hellwig
2026-01-16 15:36 ` [PATCH v15 0/9] erofs: Introduce page cache sharing feature Christoph Hellwig
9 siblings, 1 reply; 46+ messages in thread
From: Hongbo Li @ 2026-01-16 9:55 UTC (permalink / raw)
To: hsiangkao, chao, brauner
Cc: djwong, amir73il, hch, linux-fsdevel, linux-erofs, linux-kernel,
lihongbo22
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
This patch implements the .fadvise interface for page cache share.
Similar to overlayfs, it drops clean, unused pages by forwarding the
advice through vfs_fadvise().
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/ishare.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index 96679286da95..78242f9d8dde 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -145,6 +145,13 @@ static int erofs_ishare_mmap(struct file *file, struct vm_area_struct *vma)
return generic_file_readonly_mmap(file, vma);
}
+static int erofs_ishare_fadvise(struct file *file, loff_t offset,
+ loff_t len, int advice)
+{
+ return vfs_fadvise((struct file *)file->private_data,
+ offset, len, advice);
+}
+
const struct file_operations erofs_ishare_fops = {
.open = erofs_ishare_file_open,
.llseek = generic_file_llseek,
@@ -153,6 +160,7 @@ const struct file_operations erofs_ishare_fops = {
.release = erofs_ishare_file_release,
.get_unmapped_area = thp_get_unmapped_area,
.splice_read = filemap_splice_read,
+ .fadvise = erofs_ishare_fadvise,
};
struct inode *erofs_real_inode(struct inode *inode, bool *need_iput)
--
2.22.0
* Re: [PATCH v15 9/9] erofs: implement .fadvise for page cache share
2026-01-16 9:55 ` [PATCH v15 9/9] erofs: implement .fadvise " Hongbo Li
@ 2026-01-16 15:46 ` Christoph Hellwig
2026-01-19 1:30 ` Hongbo Li
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:46 UTC (permalink / raw)
To: Hongbo Li
Cc: hsiangkao, chao, brauner, djwong, amir73il, hch, linux-fsdevel,
linux-erofs, linux-kernel
On Fri, Jan 16, 2026 at 09:55:50AM +0000, Hongbo Li wrote:
> +static int erofs_ishare_fadvise(struct file *file, loff_t offset,
> + loff_t len, int advice)
> +{
> + return vfs_fadvise((struct file *)file->private_data,
> + offset, len, advice);
No need to cast a void pointer.
* Re: [PATCH v15 9/9] erofs: implement .fadvise for page cache share
2026-01-16 15:46 ` Christoph Hellwig
@ 2026-01-19 1:30 ` Hongbo Li
0 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-19 1:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: hsiangkao, chao, brauner, djwong, amir73il, linux-fsdevel,
linux-erofs, linux-kernel
On 2026/1/16 23:46, Christoph Hellwig wrote:
> On Fri, Jan 16, 2026 at 09:55:50AM +0000, Hongbo Li wrote:
>> +static int erofs_ishare_fadvise(struct file *file, loff_t offset,
>> + loff_t len, int advice)
>> +{
>> + return vfs_fadvise((struct file *)file->private_data,
>> + offset, len, advice);
>
> No need to cast a void pointer.
>
Thanks, will remove in next.
Thanks,
Hongbo
* Re: [PATCH v15 0/9] erofs: Introduce page cache sharing feature
2026-01-16 9:55 [PATCH v15 0/9] erofs: Introduce page cache sharing feature Hongbo Li
` (8 preceding siblings ...)
2026-01-16 9:55 ` [PATCH v15 9/9] erofs: implement .fadvise " Hongbo Li
@ 2026-01-16 15:36 ` Christoph Hellwig
2026-01-16 16:30 ` Gao Xiang
2026-01-16 16:43 ` Gao Xiang
9 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:36 UTC (permalink / raw)
To: Hongbo Li
Cc: hsiangkao, chao, brauner, djwong, amir73il, hch, linux-fsdevel,
linux-erofs, linux-kernel
Sorry, just getting to this now from my overfull inbox.
On Fri, Jan 16, 2026 at 09:55:41AM +0000, Hongbo Li wrote:
> 2.1. file open & close
> ----------------------
> When the file is opened, the ->private_data field of file A or file B is
> set to point to an internal deduplicated file. When the actual read
> occurs, the page cache of this deduplicated file will be accessed.
So the first opener wins and others point to it? That would lead to
some really annoying lifetime rules. Or you allocate a hidden backing
file and have everyone point to it (the backing_file-related subject
kinda hints at that), which would be much more sensible, but then the
above description would not be correct.
>
> When the file is opened, if the corresponding erofs inode is newly
> created, then perform the following actions:
> 1. add the erofs inode to the backing list of the deduplicated inode;
> 2. increase the reference count of the deduplicated inode.
This, on the other hand, suggests the first-opener approach again?
> Assuming the deduplication inode's page cache is PGCache_dedup, there
What is PGCache_dedup?
> Iomap and the layers below will involve disk I/O operations. As
> described in 2.1, the deduplicated inode itself is not bound to a
> specific device. The deduplicated inode will select an erofs inode from
> the backing list (by default, the first one) to complete the
> corresponding iomap operation.
What happens for mmap I/O where folio->mapping is kinda important?
Also do you have a git tree for the whole feature?
* Re: [PATCH v15 0/9] erofs: Introduce page cache sharing feature
2026-01-16 15:36 ` [PATCH v15 0/9] erofs: Introduce page cache sharing feature Christoph Hellwig
@ 2026-01-16 16:30 ` Gao Xiang
2026-01-16 16:43 ` Gao Xiang
1 sibling, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2026-01-16 16:30 UTC (permalink / raw)
To: Christoph Hellwig, Hongbo Li
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel
On 2026/1/16 23:36, Christoph Hellwig wrote:
> Sorry, just getting to this from my overful inbox by now.
>
> On Fri, Jan 16, 2026 at 09:55:41AM +0000, Hongbo Li wrote:
>> 2.1. file open & close
>> ----------------------
>> When the file is opened, the ->private_data field of file A or file B is
>> set to point to an internal deduplicated file. When the actual read
>> occurs, the page cache of this deduplicated file will be accessed.
>
> So the first opener wins and others point to it? That would lead to
> some really annoying life time rules. Or you allocate a hidden backing
> file and have everyone point to it (the backing_file related subject
> kinda hints at that), which would be much more sensible, but then the
> above descriptions would not be correct.
Your latter thought is correct, I think the words above
are ambiguous.
>
>>
>> When the file is opened, if the corresponding erofs inode is newly
>> created, then perform the following actions:
>> 1. add the erofs inode to the backing list of the deduplicated inode;
>> 2. increase the reference count of the deduplicated inode.
>
> This on the other hand suggests the fist opener is used approach again?
Not quite sure about this part; assuming you've read the
patches, it's just similar to the backing_file approach.
>
>> Assuming the deduplication inode's page cache is PGCache_dedup, there
>
> What is PGCache_dedup?
Maybe it's just an outdated expression from the older versions
from Hongzhen. I think just ignore this part.
>
>> Iomap and the layers below will involve disk I/O operations. As
>> described in 2.1, the deduplicated inode itself is not bound to a
>> specific device. The deduplicated inode will select an erofs inode from
>> the backing list (by default, the first one) to complete the
>> corresponding iomap operation.
>
> What happens for mmap I/O where folio->mapping is kinda important?
`folio->mapping` will just return the anon inode, but
(meta)data I/Os will be submitted to one of the real
filesystems (that is why a real inode needs to be
passed into iomap), and the data is used to fill the
anon inode's page cache. The anon inode acts like a
backing_file, and vma->vm_file will point to the
hidden backing file backed by the anon inode.
Thanks,
Gao Xiang
>
> Also do you have a git tree for the whole feature?
* Re: [PATCH v15 0/9] erofs: Introduce page cache sharing feature
2026-01-16 15:36 ` [PATCH v15 0/9] erofs: Introduce page cache sharing feature Christoph Hellwig
2026-01-16 16:30 ` Gao Xiang
@ 2026-01-16 16:43 ` Gao Xiang
2026-01-19 1:23 ` Hongbo Li
1 sibling, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2026-01-16 16:43 UTC (permalink / raw)
To: Christoph Hellwig, Hongbo Li
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel
On 2026/1/16 23:36, Christoph Hellwig wrote:
>
> Also do you have a git tree for the whole feature?
I prepared a test tree for Hongbo but it's v14:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=erofs/pagecache-share
I think v15 is close to the final state. I hope Hongbo
addresses your comments; I will review the remaining
parts too and apply the series to linux-next at the
beginning of next week.
Thanks,
Gao Xiang
* Re: [PATCH v15 0/9] erofs: Introduce page cache sharing feature
2026-01-16 16:43 ` Gao Xiang
@ 2026-01-19 1:23 ` Hongbo Li
0 siblings, 0 replies; 46+ messages in thread
From: Hongbo Li @ 2026-01-19 1:23 UTC (permalink / raw)
To: Gao Xiang, Christoph Hellwig
Cc: chao, brauner, djwong, amir73il, linux-fsdevel, linux-erofs,
linux-kernel
Thanks for comments and attention.
So in short: two files with identical content fingerprints
(user-defined, such as sha256) will share the same page cache,
which is associated with an anonymous inode. Data operations are
redirected to the anonymous inode, while metadata operations
(including locating the data position on disk) are still
performed on the original inode (the real inode).
Sorry for the ambiguous wording; I will refine the cover letter
in the next iteration to make it as clear as possible.
Thanks,
Hongbo
On 2026/1/17 0:43, Gao Xiang wrote:
>
>
> On 2026/1/16 23:36, Christoph Hellwig wrote:
>
>>
>> Also do you have a git tree for the whole feature?
>
> I prepared a test tree for Hongbo but it's v14:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=erofs/pagecache-share
>
> I think v15 is almost close to the final status,
> I hope Hongbo addresses your comment and I will
> review the remaining parts too and apply to
> linux-next at the beginning of next week.
>
> Thanks,
> Gao Xiang