* [PATCH RFC 0/4] erofs: allow page cache sharing
@ 2025-07-03 12:23 Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 1/4] erofs: move `struct erofs_anon_fs_type` to super.c Christian Brauner
` (5 more replies)
0 siblings, 6 replies; 20+ messages in thread
From: Christian Brauner @ 2025-07-03 12:23 UTC (permalink / raw)
To: Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Christian Brauner
Hey!
This series is originally from Hongzhen. I'm picking it back up because
support for page cache sharing is pretty important for container and
service workloads that want to make use of erofs images. The main
obstacle currently is the inability to share page cache contents between
different erofs superblocks.
I think the mechanism that Hongzhen came up with is decent and will
remove one final obstacle.
However, I have not worked in this area in meaningful ways before so to
an experienced page cache person this might all look like a little kid
doodling on a piece of paper.
One obvious question mark I have is around mmap. The current
implementation mimicks what overlayfs is doing and I'm not sure that
it's correct or even necessary to mimick overlayfs behavior here at all.
Anyway, I would really appreciate the help!
[Background]
============
Currently, reading files with different paths (or names) but the same
content will consume multiple copies of the page cache, even if the
content of these page caches is the same. For example, reading identical
files (e.g., *.so files) from two different minor versions of container
images will cost multiple copies of the same page cache, since different
containers have different mount points. Therefore, sharing the page cache
for files with the same content can save memory.
[Implementation]
================
This introduces the page cache share feature in erofs. During the mkfs
phase, the file content is hashed and the hash value is stored in the
`trusted.erofs.fingerprint` extended attribute. Inodes of files with the
same `trusted.erofs.fingerprint` are mapped to the same anonymous inode
(indicated by the `ano_inode` field). When a read request occurs, the
anonymous inode serves as a "container" whose page cache is shared. The
actual operations involving the iomap are carried out by the original
inode which is mapped to the anonymous inode.
[Effect]
========
I conducted experiments on two aspects across two different minor versions of
container images:
1. reading all files in two different minor versions of container images
2. run workloads or use the default entrypoint within the containers^[1]
Below is the memory usage for reading all files in two different minor
versions of container images:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 241 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 163 | 33% |
+-------------------+------------------+-------------+---------------+
| | No | 872 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 630 | 28% |
+-------------------+------------------+-------------+---------------+
| | No | 2771 | - |
| tensorflow +------------------+-------------+---------------+
| 1.11.0 & 2.11.1 | Yes | 2340 | 16% |
+-------------------+------------------+-------------+---------------+
| | No | 926 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 735 | 21% |
+-------------------+------------------+-------------+---------------+
| | No | 390 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 219 | 44% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 924 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 474 | 49% |
+-------------------+------------------+-------------+---------------+
Additionally, the table below shows the runtime memory usage of the
container:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 35 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 28 | 20% |
+-------------------+------------------+-------------+---------------+
| | No | 149 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 95 | 37% |
+-------------------+------------------+-------------+---------------+
| | No | 1028 | - |
| tensorflow +------------------+-------------+---------------+
| 1.11.0 & 2.11.1 | Yes | 930 | 10% |
+-------------------+------------------+-------------+---------------+
| | No | 155 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 132 | 15% |
+-------------------+------------------+-------------+---------------+
| | No | 25 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 20 | 20% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 186 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 98 | 48% |
+-------------------+------------------+-------------+---------------+
It can be observed that when reading all the files in the image, the reduced
memory usage varies from 16% to 49%, depending on the specific image.
Additionally, the container's runtime memory usage reduction ranges from 10%
to 48%.
[1] Below are the workload for these images:
- redis: redis-benchmark
- postgres: sysbench
- tensorflow: app.py of tensorflow.python.platform
- mysql: sysbench
- nginx: wrk
- tomcat: default entrypoint
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Hongzhen Luo (4):
erofs: move `struct erofs_anon_fs_type` to super.c
erofs: introduce page cache share feature
erofs: apply the page cache share feature
erofs: introduce .fadvise for page cache share
fs/erofs/Kconfig | 10 ++
fs/erofs/Makefile | 1 +
fs/erofs/data.c | 67 +++++++++++
fs/erofs/fscache.c | 13 ---
fs/erofs/inode.c | 15 ++-
fs/erofs/internal.h | 11 ++
fs/erofs/pagecache_share.c | 281 +++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/pagecache_share.h | 22 ++++
fs/erofs/super.c | 62 ++++++++++
fs/erofs/zdata.c | 32 ++++++
10 files changed, 500 insertions(+), 14 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250703-work-erofs-pcs-f6f3d0722401
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH RFC 1/4] erofs: move `struct erofs_anon_fs_type` to super.c
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
@ 2025-07-03 12:23 ` Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 2/4] erofs: introduce page cache share feature Christian Brauner
` (4 subsequent siblings)
5 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2025-07-03 12:23 UTC (permalink / raw)
To: Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Christian Brauner
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
Move the `struct erofs_anon_fs_type` to the super.c and
expose it in preparation for the upcoming page cache share
feature.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/20240902110620.2202586-2-hongzhen@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/erofs/fscache.c | 13 -------------
fs/erofs/internal.h | 2 ++
fs/erofs/super.c | 24 ++++++++++++++++++++++++
3 files changed, 26 insertions(+), 13 deletions(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 34517ca9df91..fe9216dd27f8 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -3,7 +3,6 @@
* Copyright (C) 2022, Alibaba Cloud
* Copyright (C) 2022, Bytedance Inc. All rights reserved.
*/
-#include <linux/pseudo_fs.h>
#include <linux/fscache.h>
#include "internal.h"
@@ -13,18 +12,6 @@ static LIST_HEAD(erofs_domain_list);
static LIST_HEAD(erofs_domain_cookies_list);
static struct vfsmount *erofs_pseudo_mnt;
-static int erofs_anon_init_fs_context(struct fs_context *fc)
-{
- return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type erofs_anon_fs_type = {
- .owner = THIS_MODULE,
- .name = "pseudo_erofs",
- .init_fs_context = erofs_anon_init_fs_context,
- .kill_sb = kill_anon_super,
-};
-
struct erofs_fscache_io {
struct netfs_cache_resources cres;
struct iov_iter iter;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index a32c03a80c70..30380f7baf5e 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -372,6 +372,8 @@ extern const struct file_operations erofs_dir_fops;
extern const struct iomap_ops z_erofs_iomap_report_ops;
+extern struct file_system_type erofs_anon_fs_type;
+
/* flags for erofs_fscache_register_cookie() */
#define EROFS_REG_COOKIE_SHARE 0x0001
#define EROFS_REG_COOKIE_NEED_NOEXIST 0x0002
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index e1e9f06e8342..8fb14df2a343 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -11,6 +11,7 @@
#include <linux/fs_parser.h>
#include <linux/exportfs.h>
#include <linux/backing-dev.h>
+#include <linux/pseudo_fs.h>
#include "xattr.h"
#define CREATE_TRACE_POINTS
@@ -889,6 +890,29 @@ static struct file_system_type erofs_fs_type = {
};
MODULE_ALIAS_FS("erofs");
+static const struct super_operations erofs_anon_sops = {
+ .statfs = simple_statfs,
+};
+
+static int erofs_anon_init_fs_context(struct fs_context *fc)
+{
+ struct pseudo_fs_context *ctx;
+
+ ctx = init_pseudo(fc, EROFS_SUPER_MAGIC);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ops = &erofs_anon_sops;
+ return 0;
+}
+
+struct file_system_type erofs_anon_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "pseudo_erofs",
+ .init_fs_context = erofs_anon_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
static int __init erofs_module_init(void)
{
int err;
--
2.47.2
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 1/4] erofs: move `struct erofs_anon_fs_type` to super.c Christian Brauner
@ 2025-07-03 12:23 ` Christian Brauner
2025-07-04 21:06 ` Gao Xiang
2025-07-05 1:09 ` Hongzhen Luo
2025-07-03 12:23 ` [PATCH RFC 3/4] erofs: apply the " Christian Brauner
` (3 subsequent siblings)
5 siblings, 2 replies; 20+ messages in thread
From: Christian Brauner @ 2025-07-03 12:23 UTC (permalink / raw)
To: Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Christian Brauner
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
Currently, reading files with different paths (or names) but the same
content will consume multiple copies of the page cache, even if the
content of these page caches is the same. For example, reading identical
files (e.g., *.so files) from two different minor versions of container
images will cost multiple copies of the same page cache, since different
containers have different mount points. Therefore, sharing the page cache
for files with the same content can save memory.
This introduces the page cache share feature in erofs. During the mkfs
phase, the file content is hashed and the hash value is stored in the
`trusted.erofs.fingerprint` extended attribute. Inodes of files with the
same `trusted.erofs.fingerprint` are mapped to the same anonymous inode
(indicated by the `ano_inode` field). When a read request occurs, the
anonymous inode serves as a "container" whose page cache is shared. The
actual operations involving the iomap are carried out by the original
inode which is mapped to the anonymous inode.
Below is the memory usage for reading all files in two different minor
versions of container images:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 241 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 163 | 33% |
+-------------------+------------------+-------------+---------------+
| | No | 872 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 630 | 28% |
+-------------------+------------------+-------------+---------------+
| | No | 2771 | - |
| tensorflow +------------------+-------------+---------------+
| 1.11.0 & 2.11.1 | Yes | 2340 | 16% |
+-------------------+------------------+-------------+---------------+
| | No | 926 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 735 | 21% |
+-------------------+------------------+-------------+---------------+
| | No | 390 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 219 | 44% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 924 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 474 | 49% |
+-------------------+------------------+-------------+---------------+
Additionally, the table below shows the runtime memory usage of the
container:
+-------------------+------------------+-------------+---------------+
| Image | Page Cache Share | Memory (MB) | Memory |
| | | | Reduction (%) |
+-------------------+------------------+-------------+---------------+
| | No | 35 | - |
| redis +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 28 | 20% |
+-------------------+------------------+-------------+---------------+
| | No | 149 | - |
| postgres +------------------+-------------+---------------+
| 16.1 & 16.2 | Yes | 95 | 37% |
+-------------------+------------------+-------------+---------------+
| | No | 1028 | - |
| tensorflow +------------------+-------------+---------------+
| 1.11.0 & 2.11.1 | Yes | 930 | 10% |
+-------------------+------------------+-------------+---------------+
| | No | 155 | - |
| mysql +------------------+-------------+---------------+
| 8.0.11 & 8.0.12 | Yes | 132 | 15% |
+-------------------+------------------+-------------+---------------+
| | No | 25 | - |
| nginx +------------------+-------------+---------------+
| 7.2.4 & 7.2.5 | Yes | 20 | 20% |
+-------------------+------------------+-------------+---------------+
| tomcat | No | 186 | - |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
| | Yes | 98 | 48% |
+-------------------+------------------+-------------+---------------+
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/20240902110620.2202586-3-hongzhen@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/erofs/Kconfig | 10 +++
fs/erofs/Makefile | 1 +
fs/erofs/internal.h | 4 +
fs/erofs/pagecache_share.c | 204 +++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/pagecache_share.h | 20 +++++
5 files changed, 239 insertions(+)
diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index 6beeb7063871..553770068fee 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -192,3 +192,13 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
at higher priority.
If unsure, say N.
+
+config EROFS_FS_PAGE_CACHE_SHARE
+ bool "EROFS page cache share support"
+ depends on EROFS_FS
+ default n
+ help
+ This permits EROFS to share page cache for files with same
+ fingerprints.
+
+ If unsure, say N.
diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 549abc424763..f4141fdfcb0b 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
+erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += pagecache_share.o
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 30380f7baf5e..47136894d17d 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -273,6 +273,9 @@ struct erofs_inode {
};
#endif /* CONFIG_EROFS_FS_ZIP */
};
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct inode *ano_inode;
+#endif
/* the corresponding vfs inode */
struct inode vfs_inode;
};
@@ -369,6 +372,7 @@ extern const struct inode_operations erofs_dir_iops;
extern const struct file_operations erofs_file_fops;
extern const struct file_operations erofs_dir_fops;
+extern const struct file_operations erofs_pcs_file_fops;
extern const struct iomap_ops z_erofs_iomap_report_ops;
diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
new file mode 100644
index 000000000000..309b33cc6c30
--- /dev/null
+++ b/fs/erofs/pagecache_share.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024, Alibaba Cloud
+ */
+#include <linux/xxhash.h>
+#include <linux/refcount.h>
+#include "pagecache_share.h"
+#include "internal.h"
+#include "xattr.h"
+
+#define PCS_FPRT_IDX 4
+#define PCS_FPRT_NAME "erofs.fingerprint"
+#define PCS_FPRT_MAXLEN (sizeof(size_t) + 1024)
+
+static DEFINE_MUTEX(pseudo_mnt_lock);
+static refcount_t pseudo_mnt_count;
+static struct vfsmount *erofs_pcs_mnt;
+
+int erofs_pcs_init_mnt(void)
+{
+ struct vfsmount *mnt;
+
+ if (refcount_inc_not_zero(&pseudo_mnt_count))
+ return 0;
+
+ guard(mutex)(&pseudo_mnt_lock);
+ if (erofs_pcs_mnt) {
+ refcount_inc(&pseudo_mnt_count);
+ return 0;
+ }
+
+ mnt = kern_mount(&erofs_anon_fs_type);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
+ rcu_read_lock();
+ rcu_assign_pointer(erofs_pcs_mnt, mnt);
+ rcu_read_unlock();
+ refcount_set_release(&pseudo_mnt_count, 1);
+ return 0;
+}
+
+void erofs_pcs_free_mnt(void)
+{
+ struct vfsmount *mnt = NULL;
+
+ if (refcount_dec_not_one(&pseudo_mnt_count))
+ return;
+
+ scoped_guard(mutex, &pseudo_mnt_lock) {
+ rcu_read_lock();
+ if (refcount_dec_and_test(&pseudo_mnt_count))
+ mnt = rcu_replace_pointer(erofs_pcs_mnt, NULL, true);
+ rcu_read_unlock();
+ }
+ if (mnt)
+ kern_unmount(mnt);
+}
+
+static int erofs_pcs_eq(struct inode *inode, void *data)
+{
+ return inode->i_private && memcmp(inode->i_private, data,
+ sizeof(size_t) + *(size_t *)data) == 0 ? 1 : 0;
+}
+
+static int erofs_pcs_set_fprt(struct inode *inode, void *data)
+{
+ /* fprt length and content */
+ inode->i_private = kmalloc(*(size_t *)data + sizeof(size_t),
+ GFP_KERNEL);
+ memcpy(inode->i_private, data, sizeof(size_t) + *(size_t *)data);
+ return 0;
+}
+
+void erofs_pcs_fill_inode(struct inode *inode)
+{
+ struct erofs_inode *vi = EROFS_I(inode);
+ char fprt[PCS_FPRT_MAXLEN];
+ struct inode *ano_inode;
+ unsigned long fprt_hash;
+ size_t fprt_len;
+
+ vi->ano_inode = NULL;
+ fprt_len = erofs_getxattr(inode, PCS_FPRT_IDX, PCS_FPRT_NAME,
+ fprt + sizeof(size_t), PCS_FPRT_MAXLEN);
+ if (fprt_len > 0 && fprt_len <= PCS_FPRT_MAXLEN) {
+ *(size_t *)fprt = fprt_len;
+ fprt_hash = xxh32(fprt + sizeof(size_t), fprt_len, 0);
+ ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, fprt_hash,
+ erofs_pcs_eq, erofs_pcs_set_fprt,
+ fprt);
+ vi->ano_inode = ano_inode;
+ if (ano_inode->i_state & I_NEW) {
+ if (erofs_inode_is_data_compressed(vi->datalayout))
+ ano_inode->i_mapping->a_ops = &z_erofs_aops;
+ else
+ ano_inode->i_mapping->a_ops = &erofs_aops;
+ ano_inode->i_size = inode->i_size;
+ unlock_new_inode(ano_inode);
+ }
+ }
+}
+
+/*
+ * TODO: Hm, could we leverage our fancy new backing file infrastructure
+ * as for overlayfs and fuse?
+ */
+static struct file *erofs_pcs_alloc_file(struct file *file,
+ struct inode *ano_inode)
+{
+ struct file *ano_file;
+
+ ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt, "[erofs_pcs_f]",
+ O_RDONLY, &erofs_file_fops);
+ file_ra_state_init(&ano_file->f_ra, file->f_mapping);
+ ano_file->private_data = EROFS_I(file_inode(file));
+ return ano_file;
+}
+
+static int erofs_pcs_file_open(struct inode *inode, struct file *file)
+{
+ struct file *ano_file;
+ struct inode *ano_inode;
+ struct erofs_inode *vi = EROFS_I(inode);
+
+ ano_inode = vi->ano_inode;
+ if (!ano_inode)
+ return -EINVAL;
+
+ ano_file = erofs_pcs_alloc_file(file, ano_inode);
+ if (IS_ERR(ano_file))
+ return PTR_ERR(ano_file);
+
+ file->private_data = ano_file;
+ return 0;
+}
+
+static int erofs_pcs_file_release(struct inode *inode, struct file *file)
+{
+ struct file *ano_file __free(fput) = NULL;
+
+ if (WARN_ON_ONCE(!file->private_data))
+ return -EINVAL;
+
+ swap(file->private_data, ano_file);
+ return 0;
+}
+
+static ssize_t erofs_pcs_file_read_iter(struct kiocb *iocb,
+ struct iov_iter *to)
+{
+ struct file *file, *ano_file;
+ struct kiocb ano_iocb;
+ ssize_t res;
+
+ if (!iov_iter_count(to))
+ return 0;
+
+#ifdef CONFIG_FS_DAX
+ if (IS_DAX(inode))
+ return iocb->ki_filp->f_op->read_iter(iocb, to);
+#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return iocb->ki_filp->f_op->read_iter(iocb, to);
+
+ memcpy(&ano_iocb, iocb, sizeof(struct kiocb));
+ file = iocb->ki_filp;
+ ano_file = file->private_data;
+ if (WARN_ON_ONCE(!ano_file))
+ return -EINVAL;
+ ano_iocb.ki_filp = ano_file;
+ res = filemap_read(&ano_iocb, to, 0);
+ memcpy(iocb, &ano_iocb, sizeof(struct kiocb));
+ iocb->ki_filp = file;
+ file_accessed(file);
+ return res;
+}
+
+/*
+ * TODO: Amir, you've got some experience in this area due to overlayfs
+ * and fuse. Does that work?
+ */
+static int erofs_pcs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file *ano_file = file->private_data;
+
+ vma_set_file(vma, ano_file);
+ vma->vm_ops = &generic_file_vm_ops;
+ return 0;
+}
+
+const struct file_operations erofs_pcs_file_fops = {
+ .open = erofs_pcs_file_open,
+ /*
+ * TODO: Why doesn't .llseek require similar treatment as
+ * .read_iter?
+ */
+ .llseek = generic_file_llseek,
+ .read_iter = erofs_pcs_file_read_iter,
+ .mmap = erofs_pcs_mmap,
+ .release = erofs_pcs_file_release,
+ .get_unmapped_area = thp_get_unmapped_area,
+ .splice_read = filemap_splice_read,
+};
diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
new file mode 100644
index 000000000000..b8111291cf79
--- /dev/null
+++ b/fs/erofs/pagecache_share.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2024, Alibaba Cloud
+ */
+#ifndef __EROFS_PAGECACHE_SHARE_H
+#define __EROFS_PAGECACHE_SHARE_H
+
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/rwlock.h>
+#include <linux/mutex.h>
+#include "internal.h"
+
+int erofs_pcs_init_mnt(void);
+void erofs_pcs_free_mnt(void);
+void erofs_pcs_fill_inode(struct inode *inode);
+
+extern const struct vm_operations_struct generic_file_vm_ops;
+
+#endif
--
2.47.2
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH RFC 3/4] erofs: apply the page cache share feature
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 1/4] erofs: move `struct erofs_anon_fs_type` to super.c Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 2/4] erofs: introduce page cache share feature Christian Brauner
@ 2025-07-03 12:23 ` Christian Brauner
2025-07-04 20:45 ` Gao Xiang
2025-07-03 12:23 ` [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share Christian Brauner
` (2 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2025-07-03 12:23 UTC (permalink / raw)
To: Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Christian Brauner
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
This modifies relevant functions to apply the page cache
share feature.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/20240902110620.2202586-4-hongzhen@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/erofs/data.c | 33 +++++++++++++++++++++++++++++++++
fs/erofs/inode.c | 15 ++++++++++++++-
fs/erofs/super.c | 29 +++++++++++++++++++++++++++++
fs/erofs/zdata.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 6a329c329f43..fb54162f4c54 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -351,12 +351,45 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
*/
static int erofs_read_folio(struct file *file, struct folio *folio)
{
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct erofs_inode *vi = NULL;
+ int ret;
+
+ if (file && file->private_data) {
+ vi = file->private_data;
+ if (vi->ano_inode == file_inode(file))
+ folio->mapping->host = &vi->vfs_inode;
+ else
+ vi = NULL;
+ }
+ ret = iomap_read_folio(folio, &erofs_iomap_ops);
+ if (vi)
+ folio->mapping->host = file_inode(file);
+ return ret;
+#else
return iomap_read_folio(folio, &erofs_iomap_ops);
+#endif
}
static void erofs_readahead(struct readahead_control *rac)
{
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct erofs_inode *vi = NULL;
+ struct file *file = rac->file;
+
+ if (file && file->private_data) {
+ vi = file->private_data;
+ if (vi->ano_inode == file_inode(file))
+ rac->mapping->host = &vi->vfs_inode;
+ else
+ vi = NULL;
+ }
+ iomap_readahead(rac, &erofs_iomap_ops);
+ if (vi)
+ rac->mapping->host = file_inode(file);
+#else
return iomap_readahead(rac, &erofs_iomap_ops);
+#endif
}
static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index a0ae0b4f7b01..57a23bd9d196 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -4,9 +4,11 @@
* https://www.huawei.com/
* Copyright (C) 2021, Alibaba Cloud
*/
-#include "xattr.h"
#include <trace/events/erofs.h>
+#include "xattr.h"
+#include "pagecache_share.h"
+
static int erofs_fill_symlink(struct inode *inode, void *kaddr,
unsigned int m_pofs)
{
@@ -212,10 +214,21 @@ static int erofs_fill_inode(struct inode *inode)
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
inode->i_op = &erofs_generic_iops;
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ erofs_pcs_fill_inode(inode);
+ if (vi->ano_inode)
+ inode->i_fop = &erofs_pcs_file_fops;
+ else if (erofs_inode_is_data_compressed(vi->datalayout))
+ inode->i_fop = &generic_ro_fops;
+ else
+ inode->i_fop = &erofs_file_fops;
+#else
if (erofs_inode_is_data_compressed(vi->datalayout))
inode->i_fop = &generic_ro_fops;
else
inode->i_fop = &erofs_file_fops;
+#endif
+
break;
case S_IFDIR:
inode->i_op = &erofs_dir_iops;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 8fb14df2a343..b9a71840cc45 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -13,6 +13,7 @@
#include <linux/backing-dev.h>
#include <linux/pseudo_fs.h>
#include "xattr.h"
+#include "pagecache_share.h"
#define CREATE_TRACE_POINTS
#include <trace/events/erofs.h>
@@ -81,6 +82,12 @@ static void erofs_free_inode(struct inode *inode)
{
struct erofs_inode *vi = EROFS_I(inode);
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ if (S_ISREG(inode->i_mode) && vi->ano_inode) {
+ iput(vi->ano_inode);
+ vi->ano_inode = NULL;
+ }
+#endif
if (inode->i_op == &erofs_fast_symlink_iops)
kfree(inode->i_link);
kfree(vi->xattr_shared_xattrs);
@@ -716,6 +723,12 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
if (err)
return err;
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ err = erofs_pcs_init_mnt();
+ if (err)
+ return err;
+#endif
+
erofs_info(sb, "mounted with root inode @ nid %llu.", sbi->root_nid);
return 0;
}
@@ -863,6 +876,9 @@ static void erofs_kill_sb(struct super_block *sb)
kill_block_super(sb);
erofs_drop_internal_inodes(sbi);
fs_put_dax(sbi->dif0.dax_dev, NULL);
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ erofs_pcs_free_mnt();
+#endif
erofs_fscache_unregister_fs(sb);
erofs_sb_free(sbi);
sb->s_fs_info = NULL;
@@ -890,8 +906,21 @@ static struct file_system_type erofs_fs_type = {
};
MODULE_ALIAS_FS("erofs");
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+static void erofs_free_anon_inode(struct inode *inode)
+{
+ if (inode->i_private) {
+ kfree(inode->i_private);
+ inode->i_private = NULL;
+ }
+}
+#endif
+
static const struct super_operations erofs_anon_sops = {
.statfs = simple_statfs,
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ .free_inode = erofs_free_anon_inode,
+#endif
};
static int erofs_anon_init_fs_context(struct fs_context *fc)
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index fe8071844724..346d34aa6a6c 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1862,6 +1862,17 @@ static void z_erofs_pcluster_readmore(struct z_erofs_frontend *f,
static int z_erofs_read_folio(struct file *file, struct folio *folio)
{
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct erofs_inode *vi = NULL;
+
+ if (file && file->private_data) {
+ vi = file->private_data;
+ if (vi->ano_inode == file_inode(file))
+ folio->mapping->host = &vi->vfs_inode;
+ else
+ vi = NULL;
+ }
+#endif
struct inode *const inode = folio->mapping->host;
Z_EROFS_DEFINE_FRONTEND(f, inode, folio_pos(folio));
int err;
@@ -1880,11 +1891,27 @@ static int z_erofs_read_folio(struct file *file, struct folio *folio)
erofs_put_metabuf(&f.map.buf);
erofs_release_pages(&f.pagepool);
+
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ if (vi)
+ folio->mapping->host = file_inode(file);
+#endif
return err;
}
static void z_erofs_readahead(struct readahead_control *rac)
{
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct erofs_inode *vi = NULL;
+
+ if (rac->file && rac->file->private_data) {
+ vi = rac->file->private_data;
+ if (vi->ano_inode == file_inode(rac->file))
+ rac->mapping->host = &vi->vfs_inode;
+ else
+ vi = NULL;
+ }
+#endif
struct inode *const inode = rac->mapping->host;
Z_EROFS_DEFINE_FRONTEND(f, inode, readahead_pos(rac));
unsigned int nrpages = readahead_count(rac);
@@ -1914,6 +1941,11 @@ static void z_erofs_readahead(struct readahead_control *rac)
(void)z_erofs_runqueue(&f, nrpages);
erofs_put_metabuf(&f.map.buf);
erofs_release_pages(&f.pagepool);
+
+#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ if (vi)
+ rac->mapping->host = file_inode(rac->file);
+#endif
}
const struct address_space_operations z_erofs_aops = {
--
2.47.2
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
` (2 preceding siblings ...)
2025-07-03 12:23 ` [PATCH RFC 3/4] erofs: apply the " Christian Brauner
@ 2025-07-03 12:23 ` Christian Brauner
2025-07-04 21:09 ` Gao Xiang
2025-07-03 12:53 ` [PATCH RFC 0/4] erofs: allow page cache sharing Gao Xiang
2025-07-05 0:51 ` Hongzhen Luo
5 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2025-07-03 12:23 UTC (permalink / raw)
To: Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Christian Brauner
From: Hongzhen Luo <hongzhen@linux.alibaba.com>
When using .fadvise to release a file's page cache, it frees page cache
pages that were first read by this file. To achieve this, an interval
tree is added in the inode of that file to track the segments first
read by that inode.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/20240902110620.2202586-5-hongzhen@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/erofs/data.c | 38 ++++++++++++++++++++--
fs/erofs/internal.h | 5 +++
fs/erofs/pagecache_share.c | 81 ++++++++++++++++++++++++++++++++++++++++++++--
fs/erofs/pagecache_share.h | 2 ++
fs/erofs/super.c | 9 ++++++
5 files changed, 131 insertions(+), 4 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index fb54162f4c54..61a42a95d26b 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -7,6 +7,7 @@
#include "internal.h"
#include <linux/sched/mm.h>
#include <trace/events/erofs.h>
+#include "pagecache_share.h"
void erofs_unmap_metabuf(struct erofs_buf *buf)
{
@@ -353,6 +354,7 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
{
#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
struct erofs_inode *vi = NULL;
+ struct interval_tree_node *seg;
int ret;
if (file && file->private_data) {
@@ -363,8 +365,22 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
vi = NULL;
}
ret = iomap_read_folio(folio, &erofs_iomap_ops);
- if (vi)
+ if (vi) {
folio->mapping->host = file_inode(file);
+ seg = erofs_pcs_alloc_seg();
+ if (!seg)
+ return -ENOMEM;
+ seg->start = folio->index;
+ seg->last = seg->start + (folio_size(folio) >> PAGE_SHIFT);
+ if (seg->last > (vi->vfs_inode.i_size >> PAGE_SHIFT))
+ seg->last = vi->vfs_inode.i_size >> PAGE_SHIFT;
+ if (seg->last >= seg->start) {
+ mutex_lock(&vi->segs_mutex);
+ interval_tree_insert(seg, &vi->segs);
+ mutex_unlock(&vi->segs_mutex);
+ } else
+ erofs_pcs_free_seg(seg);
+ }
return ret;
#else
return iomap_read_folio(folio, &erofs_iomap_ops);
@@ -376,6 +392,8 @@ static void erofs_readahead(struct readahead_control *rac)
#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
struct erofs_inode *vi = NULL;
struct file *file = rac->file;
+ struct interval_tree_node *seg;
+ erofs_off_t start, end;
if (file && file->private_data) {
vi = file->private_data;
@@ -383,10 +401,26 @@ static void erofs_readahead(struct readahead_control *rac)
rac->mapping->host = &vi->vfs_inode;
else
vi = NULL;
+ start = readahead_pos(rac);
+ end = start + readahead_length(rac);
+ if (end > vi->vfs_inode.i_size)
+ end = vi->vfs_inode.i_size;
}
iomap_readahead(rac, &erofs_iomap_ops);
- if (vi)
+ if (vi) {
rac->mapping->host = file_inode(file);
+ seg = erofs_pcs_alloc_seg();
+ if (!seg)
+ return;
+ seg->start = start >> PAGE_SHIFT;
+ seg->last = end >> PAGE_SHIFT;
+ if (seg->last >= seg->start) {
+ mutex_lock(&vi->segs_mutex);
+ interval_tree_insert(seg, &vi->segs);
+ mutex_unlock(&vi->segs_mutex);
+ } else
+ erofs_pcs_free_seg(seg);
+ }
#else
return iomap_readahead(rac, &erofs_iomap_ops);
#endif
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 47136894d17d..5aa1215ce734 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -18,6 +18,8 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/iomap.h>
+#include <linux/interval_tree.h>
+#include <linux/mutex.h>
#include "erofs_fs.h"
__printf(2, 3) void _erofs_printk(struct super_block *sb, const char *fmt, ...);
@@ -275,6 +277,9 @@ struct erofs_inode {
};
#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
struct inode *ano_inode;
+ /* segments attributed by this inode */
+ struct rb_root_cached segs;
+ struct mutex segs_mutex;
#endif
/* the corresponding vfs inode */
struct inode vfs_inode;
diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
index 309b33cc6c30..84713c0f20c8 100644
--- a/fs/erofs/pagecache_share.c
+++ b/fs/erofs/pagecache_share.c
@@ -4,6 +4,9 @@
*/
#include <linux/xxhash.h>
#include <linux/refcount.h>
+#include <uapi/linux/fadvise.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
#include "pagecache_share.h"
#include "internal.h"
#include "xattr.h"
@@ -15,10 +18,12 @@
static DEFINE_MUTEX(pseudo_mnt_lock);
static refcount_t pseudo_mnt_count;
static struct vfsmount *erofs_pcs_mnt;
+struct kmem_cache *erofs_pcs_segsp;
int erofs_pcs_init_mnt(void)
{
struct vfsmount *mnt;
+ struct kmem_cache *cache;
if (refcount_inc_not_zero(&pseudo_mnt_count))
return 0;
@@ -29,12 +34,21 @@ int erofs_pcs_init_mnt(void)
return 0;
}
+ cache = kmem_cache_create("erofs_pcs_segs",
+ sizeof(struct interval_tree_node), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT, NULL);
+ if (!cache)
+ return -ENOMEM;
+
mnt = kern_mount(&erofs_anon_fs_type);
- if (IS_ERR(mnt))
+ if (IS_ERR(mnt)) {
+ kmem_cache_destroy(cache);
return PTR_ERR(mnt);
+ }
rcu_read_lock();
rcu_assign_pointer(erofs_pcs_mnt, mnt);
+ rcu_assign_pointer(erofs_pcs_segsp, cache);
rcu_read_unlock();
refcount_set_release(&pseudo_mnt_count, 1);
return 0;
@@ -43,18 +57,34 @@ int erofs_pcs_init_mnt(void)
void erofs_pcs_free_mnt(void)
{
struct vfsmount *mnt = NULL;
+ struct kmem_cache *cache = NULL;
if (refcount_dec_not_one(&pseudo_mnt_count))
return;
scoped_guard(mutex, &pseudo_mnt_lock) {
rcu_read_lock();
- if (refcount_dec_and_test(&pseudo_mnt_count))
+ if (refcount_dec_and_test(&pseudo_mnt_count)) {
mnt = rcu_replace_pointer(erofs_pcs_mnt, NULL, true);
+ cache = rcu_replace_pointer(erofs_pcs_segsp, NULL, true);
+ }
rcu_read_unlock();
}
+
if (mnt)
kern_unmount(mnt);
+ if (cache)
+ kmem_cache_destroy(cache);
+}
+
+struct interval_tree_node *erofs_pcs_alloc_seg(void)
+{
+ return kmem_cache_alloc(erofs_pcs_segsp, GFP_KERNEL);
+}
+
+void erofs_pcs_free_seg(struct interval_tree_node *seg)
+{
+ kmem_cache_free(erofs_pcs_segsp, seg);
}
static int erofs_pcs_eq(struct inode *inode, void *data)
@@ -90,6 +120,8 @@ void erofs_pcs_fill_inode(struct inode *inode)
erofs_pcs_eq, erofs_pcs_set_fprt,
fprt);
vi->ano_inode = ano_inode;
+ vi->segs = RB_ROOT_CACHED;
+ mutex_init(&vi->segs_mutex);
if (ano_inode->i_state & I_NEW) {
if (erofs_inode_is_data_compressed(vi->datalayout))
ano_inode->i_mapping->a_ops = &z_erofs_aops;
@@ -189,6 +221,50 @@ static int erofs_pcs_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static int erofs_pcs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+ struct erofs_inode *vi = EROFS_I(file_inode(file));
+ struct interval_tree_node *seg, *next_seg, *new_seg;
+ struct file *ano_file = file->private_data;
+ erofs_off_t start, end;
+ int err = 0;
+ u64 l, r;
+
+ if (advice != POSIX_FADV_DONTNEED)
+ return generic_fadvise(ano_file, offset, len, advice);
+
+ start = offset >> PAGE_SHIFT;
+ /* len = 0 means EOF */
+ end = (!len ? LLONG_MAX : offset + len) >> PAGE_SHIFT;
+
+ mutex_lock(&vi->segs_mutex);
+ seg = interval_tree_iter_first(&vi->segs, start, end);
+ while (seg) {
+ next_seg = interval_tree_iter_next(seg, start, end);
+ l = max_t(u64, seg->start | 0ULL, start);
+ r = min_t(u64, seg->last | 0ULL, end);
+ if (l > r)
+ continue;
+ (void)invalidate_mapping_pages(ano_file->f_mapping, l, r);
+ if (seg->start < l) {
+ new_seg = erofs_pcs_alloc_seg();
+ new_seg->start = seg->start;
+ new_seg->last = l;
+ interval_tree_insert(new_seg, &vi->segs);
+ }
+ if (r < seg->last) {
+ new_seg = erofs_pcs_alloc_seg();
+ new_seg->start = r;
+ new_seg->last = seg->last;
+ interval_tree_insert(new_seg, &vi->segs);
+ }
+ interval_tree_remove(seg, &vi->segs);
+ seg = next_seg;
+ }
+ mutex_unlock(&vi->segs_mutex);
+ return err;
+}
+
const struct file_operations erofs_pcs_file_fops = {
.open = erofs_pcs_file_open,
/*
@@ -201,4 +277,5 @@ const struct file_operations erofs_pcs_file_fops = {
.release = erofs_pcs_file_release,
.get_unmapped_area = thp_get_unmapped_area,
.splice_read = filemap_splice_read,
+ .fadvise = erofs_pcs_fadvise,
};
diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
index b8111291cf79..eb5869070d4b 100644
--- a/fs/erofs/pagecache_share.h
+++ b/fs/erofs/pagecache_share.h
@@ -14,6 +14,8 @@
int erofs_pcs_init_mnt(void);
void erofs_pcs_free_mnt(void);
void erofs_pcs_fill_inode(struct inode *inode);
+struct interval_tree_node *erofs_pcs_alloc_seg(void);
+void erofs_pcs_free_seg(struct interval_tree_node *seg);
extern const struct vm_operations_struct generic_file_vm_ops;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index b9a71840cc45..607dc94a45a0 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -83,10 +83,19 @@ static void erofs_free_inode(struct inode *inode)
struct erofs_inode *vi = EROFS_I(inode);
#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
+ struct interval_tree_node *seg, *next_seg;
+
if (S_ISREG(inode->i_mode) && vi->ano_inode) {
iput(vi->ano_inode);
vi->ano_inode = NULL;
}
+ seg = interval_tree_iter_first(&vi->segs, 0, ULONG_MAX);
+ while (seg) {
+ next_seg = interval_tree_iter_next(seg, 0, ULONG_MAX);
+ interval_tree_remove(seg, &vi->segs);
+ erofs_pcs_free_seg(seg);
+ seg = next_seg;
+ }
#endif
if (inode->i_op == &erofs_fast_symlink_iops)
kfree(inode->i_link);
--
2.47.2
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 0/4] erofs: allow page cache sharing
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
` (3 preceding siblings ...)
2025-07-03 12:23 ` [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share Christian Brauner
@ 2025-07-03 12:53 ` Gao Xiang
2025-07-05 0:51 ` Hongzhen Luo
5 siblings, 0 replies; 20+ messages in thread
From: Gao Xiang @ 2025-07-03 12:53 UTC (permalink / raw)
To: Christian Brauner
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
On 2025/7/3 20:23, Christian Brauner wrote:
> Hey!
>
> This series is originally from Hongzhen. I'm picking it back up because
> support for page cache sharing is pretty important for container and
> service workloads that want to make use of erofs images. The main
> obstacle currently is the inability to share page cache contents between
> different erofs superblocks.
Hi Christian,
Many thanks for your effort. I do hope this feature can be
landed upstream too since as you said it's important to
container workloads.
Hongzhen had a job change recently, so he may not follow this.
It's on my own TODO list too, but it would be awesome if you
could take this!
Anyway, we could consider resending this to -fsdevel for more
discussion (I think you forgot about it ;-)). But personally
I think using anon inode mapping (vma->vm_file = anon_file)
for mmap is fine to userspace, it just needs more work to
handle edge cases like data source redirection if one sb is
gone.
Thanks,
Gao Xiang
>
> I think the mechanism that Hongzhen came up with is decent and will
> remove one final obstacle.
>
> However, I have not worked in this area in meaningful ways before so to
> an experienced page cache person this might all look like a little kid
> doodling on a piece of paper.
>
> One obvious question mark I have is around mmap. The current
> implementation mimicks what overlayfs is doing and I'm not sure that
> it's correct or even necessary to mimick overlayfs behavior here at all.
>
> Anyway, I would really appreciate the help!
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 3/4] erofs: apply the page cache share feature
2025-07-03 12:23 ` [PATCH RFC 3/4] erofs: apply the " Christian Brauner
@ 2025-07-04 20:45 ` Gao Xiang
0 siblings, 0 replies; 20+ messages in thread
From: Gao Xiang @ 2025-07-04 20:45 UTC (permalink / raw)
To: Christian Brauner
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Gao Xiang, Jan Kara, Amir Goldstein, Jeff Layton, Hongzhen Luo,
Matthew Wilcox
Hi Christian,
On 2025/7/3 20:23, Christian Brauner wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>
> This modifies relevant functions to apply the page cache
> share feature.
>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Link: https://lore.kernel.org/20240902110620.2202586-4-hongzhen@linux.alibaba.com
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/erofs/data.c | 33 +++++++++++++++++++++++++++++++++
> fs/erofs/inode.c | 15 ++++++++++++++-
> fs/erofs/super.c | 29 +++++++++++++++++++++++++++++
> fs/erofs/zdata.c | 32 ++++++++++++++++++++++++++++++++
> 4 files changed, 108 insertions(+), 1 deletion(-)
>
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 6a329c329f43..fb54162f4c54 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -351,12 +351,45 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> */
> static int erofs_read_folio(struct file *file, struct folio *folio)
> {
> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
> + struct erofs_inode *vi = NULL;
> + int ret;
> +
> + if (file && file->private_data) {
> + vi = file->private_data;
> + if (vi->ano_inode == file_inode(file))
> + folio->mapping->host = &vi->vfs_inode;
This is one of the parts I asked Hongzhen to refactor
because it simply doesn't work.
The background is that:
- since the folio is from the anon_inode mapping now,
so .iomap_begin() will use the anon inode rather
than a real inode in a sb, which causes iomap
don't find the real data source to read;
- Hongzhen just switch `folio->mapping->host` but it's
just broken. Also `file` here can be NULL if the
request is a kernel-internal request, so we'd better
not rely on `file` argument.
- My suggestion was that maintain a inode list so that
erofs_iomap_begin() can find any valid sb inode to
find data source instead.
> + else
> + vi = NULL;
> + }
> + ret = iomap_read_folio(folio, &erofs_iomap_ops);
> + if (vi)
> + folio->mapping->host = file_inode(file);
here.
> + return ret;
> +#else
> return iomap_read_folio(folio, &erofs_iomap_ops);
> +#endif
> }
>
> static void erofs_readahead(struct readahead_control *rac)
> {
> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
Also the #ifdef are too ugly from the upstream code cycle.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-03 12:23 ` [PATCH RFC 2/4] erofs: introduce page cache share feature Christian Brauner
@ 2025-07-04 21:06 ` Gao Xiang
2025-07-05 0:54 ` Hongzhen Luo
2025-07-05 8:25 ` Amir Goldstein
2025-07-05 1:09 ` Hongzhen Luo
1 sibling, 2 replies; 20+ messages in thread
From: Gao Xiang @ 2025-07-04 21:06 UTC (permalink / raw)
To: Christian Brauner, Amir Goldstein, Hongzhen Luo
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Gao Xiang, Jan Kara, Jeff Layton, Matthew Wilcox
Hi Christian,
On 2025/7/3 20:23, Christian Brauner wrote:
...
> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
> new file mode 100644
> index 000000000000..309b33cc6c30
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.c
> @@ -0,0 +1,204 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
> +#include <linux/xxhash.h>
> +#include <linux/refcount.h>
> +#include "pagecache_share.h"
> +#include "internal.h"
> +#include "xattr.h"
> +
> +#define PCS_FPRT_IDX 4
> +#define PCS_FPRT_NAME "erofs.fingerprint"
> +#define PCS_FPRT_MAXLEN (sizeof(size_t) + 1024)
One thing I told Hongzhen to work on is that I really
don't like a hardcode xattr like this.
Because EROFS can store common long xattr prefixes, see:
https://docs.kernel.org/filesystems/erofs.html#long-extended-attribute-name-prefixes
So it would be nice to just record a name prefix in the
ondisk superblock so that users can use their own xattr
names for this usage.
For example, users could use "overlay.metacopy" xattr
as page cache sharing fingerprint to identify different
inodes if overlayfs fsverity feature is on, see:
https://docs.kernel.org/filesystems/overlayfs.html#fs-verity-support
But if you really don't have more time to know the EROFS
internals here, you could just leave as-is. I could
handle myself.
> +
...
> +}
> +
> +/*
> + * TODO: Hm, could we leverage our fancy new backing file infrastructure
> + * as for overlayfs and fuse?
If some code can be lifted up as a vfs helper, it would be
much helpful as the backing file infrastructure was lifted
from overlayfs.
But I'm not sure if it's really needed for now anyway
because it's only EROFS-specific, and I only maintain and
can speak of EROFS.
> + */
> +static struct file *erofs_pcs_alloc_file(struct file *file,
> + struct inode *ano_inode)
> +{
> + struct file *ano_file;
> +
> + ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt, "[erofs_pcs_f]",
> + O_RDONLY, &erofs_file_fops);
> + file_ra_state_init(&ano_file->f_ra, file->f_mapping);
> + ano_file->private_data = EROFS_I(file_inode(file));
> + return ano_file;
> +}
> +
...
> +
> +/*
> + * TODO: Amir, you've got some experience in this area due to overlayfs
> + * and fuse. Does that work?
> + */
Hi Amir,
I do think it will work, if you have chance please help
take a quick look too.
It's much similar to overlayfs, the difference is that the real
inodes is not in some other fs, but anon inodes from a pseudo
sb which shares among the whole host to share page cache for
containers.
> +static int erofs_pcs_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file *ano_file = file->private_data;
> +
> + vma_set_file(vma, ano_file);
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +const struct file_operations erofs_pcs_file_fops = {
> + .open = erofs_pcs_file_open,
> + /*
> + * TODO: Why doesn't .llseek require similar treatment as
> + * .read_iter?
> + */
I don't know some specific reason because it wrote by
Hongzhen.
Hongzhen is still at work until by the end of the month,
I hope he could answer some question.
> + .llseek = generic_file_llseek,
> + .read_iter = erofs_pcs_file_read_iter,
> + .mmap = erofs_pcs_mmap,
> + .release = erofs_pcs_file_release,
> + .get_unmapped_area = thp_get_unmapped_area,
> + .splice_read = filemap_splice_read,
> +};
> diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
> new file mode 100644
> index 000000000000..b8111291cf79
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
BTW, it seems that this header is too small, maybe just
fold them into internal.h.
Thanks,
Gao Xiang
> +#ifndef __EROFS_PAGECACHE_SHARE_H
> +#define __EROFS_PAGECACHE_SHARE_H
> +
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/rwlock.h>
> +#include <linux/mutex.h>
> +#include "internal.h"
> +
> +int erofs_pcs_init_mnt(void);
> +void erofs_pcs_free_mnt(void);
> +void erofs_pcs_fill_inode(struct inode *inode);
> +
> +extern const struct vm_operations_struct generic_file_vm_ops;
> +
> +#endif
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share
2025-07-03 12:23 ` [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share Christian Brauner
@ 2025-07-04 21:09 ` Gao Xiang
2025-07-05 1:15 ` Hongzhen Luo
0 siblings, 1 reply; 20+ messages in thread
From: Gao Xiang @ 2025-07-04 21:09 UTC (permalink / raw)
To: Christian Brauner, Gao Xiang, Jan Kara, Amir Goldstein,
Jeff Layton, Hongzhen Luo, Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs
On 2025/7/3 20:23, Christian Brauner wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>
> When using .fadvise to release a file's page cache, it frees page cache
> pages that were first read by this file. To achieve this, an interval
> tree is added in the inode of that file to track the segments first
> read by that inode.
>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Link: https://lore.kernel.org/20240902110620.2202586-5-hongzhen@linux.alibaba.com
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/erofs/data.c | 38 ++++++++++++++++++++--
> fs/erofs/internal.h | 5 +++
> fs/erofs/pagecache_share.c | 81 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/erofs/pagecache_share.h | 2 ++
> fs/erofs/super.c | 9 ++++++
> 5 files changed, 131 insertions(+), 4 deletions(-)
>
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index fb54162f4c54..61a42a95d26b 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -7,6 +7,7 @@
> #include "internal.h"
> #include <linux/sched/mm.h>
> #include <trace/events/erofs.h>
> +#include "pagecache_share.h"
>
> void erofs_unmap_metabuf(struct erofs_buf *buf)
> {
> @@ -353,6 +354,7 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
> {
> #ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
> struct erofs_inode *vi = NULL;
> + struct interval_tree_node *seg;
> int ret;
>
> if (file && file->private_data) {
> @@ -363,8 +365,22 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
> vi = NULL;
> }
> ret = iomap_read_folio(folio, &erofs_iomap_ops);
> - if (vi)
> + if (vi) {
> folio->mapping->host = file_inode(file);
> + seg = erofs_pcs_alloc_seg();
> + if (!seg)
> + return -ENOMEM;
> + seg->start = folio->index;
> + seg->last = seg->start + (folio_size(folio) >> PAGE_SHIFT);
> + if (seg->last > (vi->vfs_inode.i_size >> PAGE_SHIFT))
> + seg->last = vi->vfs_inode.i_size >> PAGE_SHIFT;
> + if (seg->last >= seg->start) {
> + mutex_lock(&vi->segs_mutex);
> + interval_tree_insert(seg, &vi->segs);
> + mutex_unlock(&vi->segs_mutex);
> + } else
> + erofs_pcs_free_seg(seg);
> + }
I don't know what Hongzhen is trying to do in this patch and
it seems too odd on my side, maybe it needs to reimplement
this patch later but we should support .fadvise().
Thanks,
Gao Xiang
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 0/4] erofs: allow page cache sharing
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
` (4 preceding siblings ...)
2025-07-03 12:53 ` [PATCH RFC 0/4] erofs: allow page cache sharing Gao Xiang
@ 2025-07-05 0:51 ` Hongzhen Luo
5 siblings, 0 replies; 20+ messages in thread
From: Hongzhen Luo @ 2025-07-05 0:51 UTC (permalink / raw)
To: Christian Brauner, Gao Xiang, Jan Kara, Amir Goldstein,
Jeff Layton, Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs
[-- Attachment #1: Type: text/plain, Size: 8515 bytes --]
On 2025/7/3 20:23, Christian Brauner wrote:
> Hey!
>
> This series is originally from Hongzhen. I'm picking it back up because
> support for page cache sharing is pretty important for container and
> service workloads that want to make use of erofs images. The main
> obstacle currently is the inability to share page cache contents between
> different erofs superblocks.
>
> I think the mechanism that Hongzhen came up with is decent and will
> remove one final obstacle.
>
> However, I have not worked in this area in meaningful ways before so to
> an experienced page cache person this might all look like a little kid
> doodling on a piece of paper.
>
> One obvious question mark I have is around mmap. The current
> implementation mimicks what overlayfs is doing and I'm not sure that
> it's correct or even necessary to mimick overlayfs behavior here at all.
>
> Anyway, I would really appreciate the help!
Hi Christian, glad to hear you're interested in my previous patch – and
please forgive my delayed
response, as I was swamped with other tasks. Finally catching up now
that it's the weekend. Due to
work change, I can no longer continue driving this patch series upstream.
This patch series seems to be outdated, and some of the implementations
are quite hacky. Please
take a look at the latest RFC patch series (v6):
https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/
> [Background]
> ============
> Currently, reading files with different paths (or names) but the same
> content will consume multiple copies of the page cache, even if the
> content of these page caches is the same. For example, reading identical
> files (e.g., *.so files) from two different minor versions of container
> images will cost multiple copies of the same page cache, since different
> containers have different mount points. Therefore, sharing the page cache
> for files with the same content can save memory.
>
> [Implementation]
> ================
> This introduces the page cache share feature in erofs. During the mkfs
> phase, the file content is hashed and the hash value is stored in the
> `trusted.erofs.fingerprint` extended attribute. Inodes of files with the
> same `trusted.erofs.fingerprint` are mapped to the same anonymous inode
> (indicated by the `ano_inode` field). When a read request occurs, the
> anonymous inode serves as a "container" whose page cache is shared. The
> actual operations involving the iomap are carried out by the original
> inode which is mapped to the anonymous inode.
>
> [Effect]
> ========
> I conducted experiments on two aspects across two different minor versions of
> container images:
>
> 1. reading all files in two different minor versions of container images
>
> 2. run workloads or use the default entrypoint within the containers^[1]
>
> Below is the memory usage for reading all files in two different minor
> versions of container images:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 241 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 163 | 33% |
> +-------------------+------------------+-------------+---------------+
> | | No | 872 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 630 | 28% |
> +-------------------+------------------+-------------+---------------+
> | | No | 2771 | - |
> | tensorflow +------------------+-------------+---------------+
> | 1.11.0 & 2.11.1 | Yes | 2340 | 16% |
> +-------------------+------------------+-------------+---------------+
> | | No | 926 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 735 | 21% |
> +-------------------+------------------+-------------+---------------+
> | | No | 390 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 219 | 44% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 924 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 474 | 49% |
> +-------------------+------------------+-------------+---------------+
>
> Additionally, the table below shows the runtime memory usage of the
> container:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 35 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 28 | 20% |
> +-------------------+------------------+-------------+---------------+
> | | No | 149 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 95 | 37% |
> +-------------------+------------------+-------------+---------------+
> | | No | 1028 | - |
> | tensorflow +------------------+-------------+---------------+
> | 1.11.0 & 2.11.1 | Yes | 930 | 10% |
> +-------------------+------------------+-------------+---------------+
> | | No | 155 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 132 | 15% |
> +-------------------+------------------+-------------+---------------+
> | | No | 25 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 20 | 20% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 186 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 98 | 48% |
> +-------------------+------------------+-------------+---------------+
>
> It can be observed that when reading all the files in the image, the reduced
> memory usage varies from 16% to 49%, depending on the specific image.
> Additionally, the container's runtime memory usage reduction ranges from 10%
> to 48%.
>
> [1] Below are the workload for these images:
> - redis: redis-benchmark
> - postgres: sysbench
> - tensorflow: app.py of tensorflow.python.platform
> - mysql: sysbench
> - nginx: wrk
> - tomcat: default entrypoint
>
> Signed-off-by: Christian Brauner<brauner@kernel.org>
> ---
> Hongzhen Luo (4):
> erofs: move `struct erofs_anon_fs_type` to super.c
> erofs: introduce page cache share feature
> erofs: apply the page cache share feature
> erofs: introduce .fadvise for page cache share
>
> fs/erofs/Kconfig | 10 ++
> fs/erofs/Makefile | 1 +
> fs/erofs/data.c | 67 +++++++++++
> fs/erofs/fscache.c | 13 ---
> fs/erofs/inode.c | 15 ++-
> fs/erofs/internal.h | 11 ++
> fs/erofs/pagecache_share.c | 281 +++++++++++++++++++++++++++++++++++++++++++++
> fs/erofs/pagecache_share.h | 22 ++++
> fs/erofs/super.c | 62 ++++++++++
> fs/erofs/zdata.c | 32 ++++++
> 10 files changed, 500 insertions(+), 14 deletions(-)
> ---
> base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> change-id: 20250703-work-erofs-pcs-f6f3d0722401
>
[-- Attachment #2: Type: text/html, Size: 14857 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-04 21:06 ` Gao Xiang
@ 2025-07-05 0:54 ` Hongzhen Luo
2025-07-05 8:25 ` Amir Goldstein
1 sibling, 0 replies; 20+ messages in thread
From: Hongzhen Luo @ 2025-07-05 0:54 UTC (permalink / raw)
To: Gao Xiang, Christian Brauner, Amir Goldstein
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs,
Gao Xiang, Jan Kara, Jeff Layton, Matthew Wilcox
On 2025/7/5 05:06, Gao Xiang wrote:
> Hi Christian,
>
> On 2025/7/3 20:23, Christian Brauner wrote:
>
> ...
>
>> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
>> new file mode 100644
>> index 000000000000..309b33cc6c30
>> --- /dev/null
>> +++ b/fs/erofs/pagecache_share.c
>> @@ -0,0 +1,204 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) 2024, Alibaba Cloud
>> + */
>> +#include <linux/xxhash.h>
>> +#include <linux/refcount.h>
>> +#include "pagecache_share.h"
>> +#include "internal.h"
>> +#include "xattr.h"
>> +
>> +#define PCS_FPRT_IDX 4
>> +#define PCS_FPRT_NAME "erofs.fingerprint"
>> +#define PCS_FPRT_MAXLEN (sizeof(size_t) + 1024)
>
> One thing I told Hongzhen to work on is that I really
> don't like a hardcode xattr like this.
>
> Because EROFS can store common long xattr prefixes, see:
> https://docs.kernel.org/filesystems/erofs.html#long-extended-attribute-name-prefixes
>
>
> So it would be nice to just record a name prefix in the
> ondisk superblock so that users can use their own xattr
> names for this usage.
>
> For example, users could use "overlay.metacopy" xattr
> as page cache sharing fingerprint to identify different
> inodes if overlayfs fsverity feature is on, see:
> https://docs.kernel.org/filesystems/overlayfs.html#fs-verity-support
Yes, please refer to the latest patch series here:
https://lore.kernel.org/all/20250301145002.2420830-3-hongzhen@linux.alibaba.com/
>
> But if you really don't have more time to know the EROFS
> internals here, you could just leave as-is. I could
> handle myself.
>
>> +
>
> ...
>
>> +}
>> +
>> +/*
>> + * TODO: Hm, could we leverage our fancy new backing file
>> infrastructure
>> + * as for overlayfs and fuse?
>
> If some code can be lifted up as a vfs helper, it would be
> much helpful as the backing file infrastructure was lifted
> from overlayfs.
>
> But I'm not sure if it's really needed for now anyway
> because it's only EROFS-specific, and I only maintain and
> can speak of EROFS.
>
>> + */
>> +static struct file *erofs_pcs_alloc_file(struct file *file,
>> + struct inode *ano_inode)
>> +{
>> + struct file *ano_file;
>> +
>> + ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt,
>> "[erofs_pcs_f]",
>> + O_RDONLY, &erofs_file_fops);
>> + file_ra_state_init(&ano_file->f_ra, file->f_mapping);
>> + ano_file->private_data = EROFS_I(file_inode(file));
>> + return ano_file;
>> +}
>> +
>
> ...
>
>> +
>> +/*
>> + * TODO: Amir, you've got some experience in this area due to overlayfs
>> + * and fuse. Does that work?
>> + */
>
>
>
> Hi Amir,
>
> I do think it will work, if you have chance please help
> take a quick look too.
>
> It's much similar to overlayfs, the difference is that the real
> inodes is not in some other fs, but anon inodes from a pseudo
> sb which shares among the whole host to share page cache for
> containers.
>
>> +static int erofs_pcs_mmap(struct file *file, struct vm_area_struct
>> *vma)
>> +{
>> + struct file *ano_file = file->private_data;
>> +
>> + vma_set_file(vma, ano_file);
>> + vma->vm_ops = &generic_file_vm_ops;
>> + return 0;
>> +}
>> +
>> +const struct file_operations erofs_pcs_file_fops = {
>> + .open = erofs_pcs_file_open,
>> + /*
>> + * TODO: Why doesn't .llseek require similar treatment as
>> + * .read_iter?
>> + */
>
> I don't know some specific reason because it wrote by
> Hongzhen.
>
> Hongzhen is still at work until by the end of the month,
> I hope he could answer some question.
>
>> + .llseek = generic_file_llseek,
>> + .read_iter = erofs_pcs_file_read_iter,
>> + .mmap = erofs_pcs_mmap,
>> + .release = erofs_pcs_file_release,
>> + .get_unmapped_area = thp_get_unmapped_area,
>> + .splice_read = filemap_splice_read,
>> +};
>> diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
>> new file mode 100644
>> index 000000000000..b8111291cf79
>> --- /dev/null
>> +++ b/fs/erofs/pagecache_share.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2024, Alibaba Cloud
>> + */
>
>
> BTW, it seems that this header is too small, maybe just
> fold them into internal.h.
>
> Thanks,
> Gao Xiang
>
>> +#ifndef __EROFS_PAGECACHE_SHARE_H
>> +#define __EROFS_PAGECACHE_SHARE_H
>> +
>> +#include <linux/fs.h>
>> +#include <linux/mount.h>
>> +#include <linux/rwlock.h>
>> +#include <linux/mutex.h>
>> +#include "internal.h"
>> +
>> +int erofs_pcs_init_mnt(void);
>> +void erofs_pcs_free_mnt(void);
>> +void erofs_pcs_fill_inode(struct inode *inode);
>> +
>> +extern const struct vm_operations_struct generic_file_vm_ops;
>> +
>> +#endif
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-03 12:23 ` [PATCH RFC 2/4] erofs: introduce page cache share feature Christian Brauner
2025-07-04 21:06 ` Gao Xiang
@ 2025-07-05 1:09 ` Hongzhen Luo
1 sibling, 0 replies; 20+ messages in thread
From: Hongzhen Luo @ 2025-07-05 1:09 UTC (permalink / raw)
To: Christian Brauner, Gao Xiang, Jan Kara, Amir Goldstein,
Jeff Layton, Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs
On 2025/7/3 20:23, Christian Brauner wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>
> Currently, reading files with different paths (or names) but the same
> content will consume multiple copies of the page cache, even if the
> content of these page caches is the same. For example, reading identical
> files (e.g., *.so files) from two different minor versions of container
> images will cost multiple copies of the same page cache, since different
> containers have different mount points. Therefore, sharing the page cache
> for files with the same content can save memory.
>
> This introduces the page cache share feature in erofs. During the mkfs
> phase, the file content is hashed and the hash value is stored in the
> `trusted.erofs.fingerprint` extended attribute. Inodes of files with the
> same `trusted.erofs.fingerprint` are mapped to the same anonymous inode
> (indicated by the `ano_inode` field). When a read request occurs, the
> anonymous inode serves as a "container" whose page cache is shared. The
> actual operations involving the iomap are carried out by the original
> inode which is mapped to the anonymous inode.
>
> Below is the memory usage for reading all files in two different minor
> versions of container images:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 241 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 163 | 33% |
> +-------------------+------------------+-------------+---------------+
> | | No | 872 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 630 | 28% |
> +-------------------+------------------+-------------+---------------+
> | | No | 2771 | - |
> | tensorflow +------------------+-------------+---------------+
> | 1.11.0 & 2.11.1 | Yes | 2340 | 16% |
> +-------------------+------------------+-------------+---------------+
> | | No | 926 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 735 | 21% |
> +-------------------+------------------+-------------+---------------+
> | | No | 390 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 219 | 44% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 924 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 474 | 49% |
> +-------------------+------------------+-------------+---------------+
>
> Additionally, the table below shows the runtime memory usage of the
> container:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 35 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 28 | 20% |
> +-------------------+------------------+-------------+---------------+
> | | No | 149 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 95 | 37% |
> +-------------------+------------------+-------------+---------------+
> | | No | 1028 | - |
> | tensorflow +------------------+-------------+---------------+
> | 1.11.0 & 2.11.1 | Yes | 930 | 10% |
> +-------------------+------------------+-------------+---------------+
> | | No | 155 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 132 | 15% |
> +-------------------+------------------+-------------+---------------+
> | | No | 25 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 20 | 20% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 186 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 98 | 48% |
> +-------------------+------------------+-------------+---------------+
>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Link: https://lore.kernel.org/20240902110620.2202586-3-hongzhen@linux.alibaba.com
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/erofs/Kconfig | 10 +++
> fs/erofs/Makefile | 1 +
> fs/erofs/internal.h | 4 +
> fs/erofs/pagecache_share.c | 204 +++++++++++++++++++++++++++++++++++++++++++++
> fs/erofs/pagecache_share.h | 20 +++++
> 5 files changed, 239 insertions(+)
>
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index 6beeb7063871..553770068fee 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -192,3 +192,13 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
> at higher priority.
>
> If unsure, say N.
> +
> +config EROFS_FS_PAGE_CACHE_SHARE
> + bool "EROFS page cache share support"
> + depends on EROFS_FS
> + default n
> + help
> + This permits EROFS to share page cache for files with same
> + fingerprints.
> +
> + If unsure, say N.
> diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
> index 549abc424763..f4141fdfcb0b 100644
> --- a/fs/erofs/Makefile
> +++ b/fs/erofs/Makefile
> @@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
> erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
> erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
> erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
> +erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += pagecache_share.o
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index 30380f7baf5e..47136894d17d 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -273,6 +273,9 @@ struct erofs_inode {
> };
> #endif /* CONFIG_EROFS_FS_ZIP */
> };
> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
> + struct inode *ano_inode;
> +#endif
> /* the corresponding vfs inode */
> struct inode vfs_inode;
> };
> @@ -369,6 +372,7 @@ extern const struct inode_operations erofs_dir_iops;
>
> extern const struct file_operations erofs_file_fops;
> extern const struct file_operations erofs_dir_fops;
> +extern const struct file_operations erofs_pcs_file_fops;
>
> extern const struct iomap_ops z_erofs_iomap_report_ops;
>
> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
> new file mode 100644
> index 000000000000..309b33cc6c30
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.c
> @@ -0,0 +1,204 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
> +#include <linux/xxhash.h>
> +#include <linux/refcount.h>
> +#include "pagecache_share.h"
> +#include "internal.h"
> +#include "xattr.h"
> +
> +#define PCS_FPRT_IDX 4
> +#define PCS_FPRT_NAME "erofs.fingerprint"
> +#define PCS_FPRT_MAXLEN (sizeof(size_t) + 1024)
> +
> +static DEFINE_MUTEX(pseudo_mnt_lock);
> +static refcount_t pseudo_mnt_count;
> +static struct vfsmount *erofs_pcs_mnt;
> +
> +int erofs_pcs_init_mnt(void)
> +{
> + struct vfsmount *mnt;
> +
> + if (refcount_inc_not_zero(&pseudo_mnt_count))
> + return 0;
> +
> + guard(mutex)(&pseudo_mnt_lock);
> + if (erofs_pcs_mnt) {
> + refcount_inc(&pseudo_mnt_count);
> + return 0;
> + }
> +
> + mnt = kern_mount(&erofs_anon_fs_type);
> + if (IS_ERR(mnt))
> + return PTR_ERR(mnt);
> +
> + rcu_read_lock();
> + rcu_assign_pointer(erofs_pcs_mnt, mnt);
> + rcu_read_unlock();
> + refcount_set_release(&pseudo_mnt_count, 1);
> + return 0;
> +}
> +
> +void erofs_pcs_free_mnt(void)
> +{
> + struct vfsmount *mnt = NULL;
> +
> + if (refcount_dec_not_one(&pseudo_mnt_count))
> + return;
> +
> + scoped_guard(mutex, &pseudo_mnt_lock) {
> + rcu_read_lock();
> + if (refcount_dec_and_test(&pseudo_mnt_count))
> + mnt = rcu_replace_pointer(erofs_pcs_mnt, NULL, true);
> + rcu_read_unlock();
> + }
> + if (mnt)
> + kern_unmount(mnt);
> +}
> +
> +static int erofs_pcs_eq(struct inode *inode, void *data)
> +{
> + return inode->i_private && memcmp(inode->i_private, data,
> + sizeof(size_t) + *(size_t *)data) == 0 ? 1 : 0;
> +}
> +
> +static int erofs_pcs_set_fprt(struct inode *inode, void *data)
> +{
> + /* fprt length and content */
> + inode->i_private = kmalloc(*(size_t *)data + sizeof(size_t),
> + GFP_KERNEL);
> + memcpy(inode->i_private, data, sizeof(size_t) + *(size_t *)data);
> + return 0;
> +}
> +
> +void erofs_pcs_fill_inode(struct inode *inode)
> +{
> + struct erofs_inode *vi = EROFS_I(inode);
> + char fprt[PCS_FPRT_MAXLEN];
> + struct inode *ano_inode;
> + unsigned long fprt_hash;
> + size_t fprt_len;
> +
> + vi->ano_inode = NULL;
> + fprt_len = erofs_getxattr(inode, PCS_FPRT_IDX, PCS_FPRT_NAME,
> + fprt + sizeof(size_t), PCS_FPRT_MAXLEN);
> + if (fprt_len > 0 && fprt_len <= PCS_FPRT_MAXLEN) {
> + *(size_t *)fprt = fprt_len;
> + fprt_hash = xxh32(fprt + sizeof(size_t), fprt_len, 0);
> + ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, fprt_hash,
> + erofs_pcs_eq, erofs_pcs_set_fprt,
> + fprt);
> + vi->ano_inode = ano_inode;
> + if (ano_inode->i_state & I_NEW) {
> + if (erofs_inode_is_data_compressed(vi->datalayout))
> + ano_inode->i_mapping->a_ops = &z_erofs_aops;
> + else
> + ano_inode->i_mapping->a_ops = &erofs_aops;
> + ano_inode->i_size = inode->i_size;
> + unlock_new_inode(ano_inode);
> + }
> + }
> +}
> +
> +/*
> + * TODO: Hm, could we leverage our fancy new backing file infrastructure
> + * as for overlayfs and fuse?
> + */
> +static struct file *erofs_pcs_alloc_file(struct file *file,
> + struct inode *ano_inode)
> +{
> + struct file *ano_file;
> +
> + ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt, "[erofs_pcs_f]",
> + O_RDONLY, &erofs_file_fops);
> + file_ra_state_init(&ano_file->f_ra, file->f_mapping);
> + ano_file->private_data = EROFS_I(file_inode(file));
> + return ano_file;
> +}
> +
> +static int erofs_pcs_file_open(struct inode *inode, struct file *file)
> +{
> + struct file *ano_file;
> + struct inode *ano_inode;
> + struct erofs_inode *vi = EROFS_I(inode);
> +
> + ano_inode = vi->ano_inode;
> + if (!ano_inode)
> + return -EINVAL;
> +
> + ano_file = erofs_pcs_alloc_file(file, ano_inode);
> + if (IS_ERR(ano_file))
> + return PTR_ERR(ano_file);
> +
> + file->private_data = ano_file;
> + return 0;
> +}
> +
> +static int erofs_pcs_file_release(struct inode *inode, struct file *file)
> +{
> + struct file *ano_file __free(fput) = NULL;
> +
> + if (WARN_ON_ONCE(!file->private_data))
> + return -EINVAL;
> +
> + swap(file->private_data, ano_file);
> + return 0;
> +}
> +
> +static ssize_t erofs_pcs_file_read_iter(struct kiocb *iocb,
> + struct iov_iter *to)
> +{
> + struct file *file, *ano_file;
> + struct kiocb ano_iocb;
> + ssize_t res;
> +
> + if (!iov_iter_count(to))
> + return 0;
> +
> +#ifdef CONFIG_FS_DAX
> + if (IS_DAX(inode))
> + return iocb->ki_filp->f_op->read_iter(iocb, to);
> +#endif
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return iocb->ki_filp->f_op->read_iter(iocb, to);
> +
> + memcpy(&ano_iocb, iocb, sizeof(struct kiocb));
> + file = iocb->ki_filp;
> + ano_file = file->private_data;
> + if (WARN_ON_ONCE(!ano_file))
> + return -EINVAL;
> + ano_iocb.ki_filp = ano_file;
> + res = filemap_read(&ano_iocb, to, 0);
> + memcpy(iocb, &ano_iocb, sizeof(struct kiocb));
> + iocb->ki_filp = file;
> + file_accessed(file);
> + return res;
> +}
> +
> +/*
> + * TODO: Amir, you've got some experience in this area due to overlayfs
> + * and fuse. Does that work?
> + */
> +static int erofs_pcs_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file *ano_file = file->private_data;
> +
> + vma_set_file(vma, ano_file);
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +const struct file_operations erofs_pcs_file_fops = {
> + .open = erofs_pcs_file_open,
> + /*
> + * TODO: Why doesn't .llseek require similar treatment as
> + * .read_iter?
> + */
.llseek only needs to calculate the offset and requires no excessive
handling for regular files, while .read_iter
involves actual data retrieval (EROFS-specific logic such as
decompressing compressed data and cross-block reads),
thus necessitating page cache sharing logic.
Thanks,
Hongzhen
> + .llseek = generic_file_llseek,
> + .read_iter = erofs_pcs_file_read_iter,
> + .mmap = erofs_pcs_mmap,
> + .release = erofs_pcs_file_release,
> + .get_unmapped_area = thp_get_unmapped_area,
> + .splice_read = filemap_splice_read,
> +};
> diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
> new file mode 100644
> index 000000000000..b8111291cf79
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
> +#ifndef __EROFS_PAGECACHE_SHARE_H
> +#define __EROFS_PAGECACHE_SHARE_H
> +
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/rwlock.h>
> +#include <linux/mutex.h>
> +#include "internal.h"
> +
> +int erofs_pcs_init_mnt(void);
> +void erofs_pcs_free_mnt(void);
> +void erofs_pcs_fill_inode(struct inode *inode);
> +
> +extern const struct vm_operations_struct generic_file_vm_ops;
> +
> +#endif
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share
2025-07-04 21:09 ` Gao Xiang
@ 2025-07-05 1:15 ` Hongzhen Luo
2025-07-05 1:25 ` Gao Xiang
0 siblings, 1 reply; 20+ messages in thread
From: Hongzhen Luo @ 2025-07-05 1:15 UTC (permalink / raw)
To: Gao Xiang, Christian Brauner, Gao Xiang, Jan Kara, Amir Goldstein,
Jeff Layton, Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs
On 2025/7/5 05:09, Gao Xiang wrote:
>
>
> On 2025/7/3 20:23, Christian Brauner wrote:
>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>
>> When using .fadvise to release a file's page cache, it frees page cache
>> pages that were first read by this file. To achieve this, an interval
>> tree is added in the inode of that file to track the segments first
>> read by that inode.
>>
>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>> Link:
>> https://lore.kernel.org/20240902110620.2202586-5-hongzhen@linux.alibaba.com
>> Signed-off-by: Christian Brauner <brauner@kernel.org>
>> ---
>> fs/erofs/data.c | 38 ++++++++++++++++++++--
>> fs/erofs/internal.h | 5 +++
>> fs/erofs/pagecache_share.c | 81
>> ++++++++++++++++++++++++++++++++++++++++++++--
>> fs/erofs/pagecache_share.h | 2 ++
>> fs/erofs/super.c | 9 ++++++
>> 5 files changed, 131 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>> index fb54162f4c54..61a42a95d26b 100644
>> --- a/fs/erofs/data.c
>> +++ b/fs/erofs/data.c
>> @@ -7,6 +7,7 @@
>> #include "internal.h"
>> #include <linux/sched/mm.h>
>> #include <trace/events/erofs.h>
>> +#include "pagecache_share.h"
>> void erofs_unmap_metabuf(struct erofs_buf *buf)
>> {
>> @@ -353,6 +354,7 @@ static int erofs_read_folio(struct file *file,
>> struct folio *folio)
>> {
>> #ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
>> struct erofs_inode *vi = NULL;
>> + struct interval_tree_node *seg;
>> int ret;
>> if (file && file->private_data) {
>> @@ -363,8 +365,22 @@ static int erofs_read_folio(struct file *file,
>> struct folio *folio)
>> vi = NULL;
>> }
>> ret = iomap_read_folio(folio, &erofs_iomap_ops);
>> - if (vi)
>> + if (vi) {
>> folio->mapping->host = file_inode(file);
>> + seg = erofs_pcs_alloc_seg();
>> + if (!seg)
>> + return -ENOMEM;
>> + seg->start = folio->index;
>> + seg->last = seg->start + (folio_size(folio) >> PAGE_SHIFT);
>> + if (seg->last > (vi->vfs_inode.i_size >> PAGE_SHIFT))
>> + seg->last = vi->vfs_inode.i_size >> PAGE_SHIFT;
>> + if (seg->last >= seg->start) {
>> + mutex_lock(&vi->segs_mutex);
>> + interval_tree_insert(seg, &vi->segs);
>> + mutex_unlock(&vi->segs_mutex);
>> + } else
>> + erofs_pcs_free_seg(seg);
>> + }
>
> I don't know what Hongzhen is trying to do in this patch and
> it seems too odd on my side, maybe it needs to reimplement
> this patch later but we should support .fadvise().
The original approach aimed to maintain a first-read interval tree per
inode, ensuring
that .fadvise would only release cached pages within its own mapped
ranges, thereby
preventing interference with other file operations. However, this
introduced unnecessary
complexity. The latest patch series adopts overlayfs-style handling:
https://lore.kernel.org/all/20250301145002.2420830-8-hongzhen@linux.alibaba.com/
Thanks,
Hongzhen
>
> Thanks,
> Gao Xiang
>
>>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share
2025-07-05 1:15 ` Hongzhen Luo
@ 2025-07-05 1:25 ` Gao Xiang
0 siblings, 0 replies; 20+ messages in thread
From: Gao Xiang @ 2025-07-05 1:25 UTC (permalink / raw)
To: Hongzhen Luo, Christian Brauner, Gao Xiang, Jan Kara,
Amir Goldstein, Jeff Layton, Matthew Wilcox
Cc: Daan De Meyer, Lennart Poettering, Mike Yuan,
Zbigniew Jędrzejewski-Szmek, lihongbo22, linux-erofs
On 2025/7/5 09:15, Hongzhen Luo wrote:
>
> On 2025/7/5 05:09, Gao Xiang wrote:
>>
>>
>> On 2025/7/3 20:23, Christian Brauner wrote:
>>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>>
>>> When using .fadvise to release a file's page cache, it frees page cache
>>> pages that were first read by this file. To achieve this, an interval
>>> tree is added in the inode of that file to track the segments first
>>> read by that inode.
>>>
>>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>> Link: https://lore.kernel.org/20240902110620.2202586-5-hongzhen@linux.alibaba.com
>>> Signed-off-by: Christian Brauner <brauner@kernel.org>
>>> ---
>>> fs/erofs/data.c | 38 ++++++++++++++++++++--
>>> fs/erofs/internal.h | 5 +++
>>> fs/erofs/pagecache_share.c | 81 ++++++++++++++++++++++++++++++++++++++++++++--
>>> fs/erofs/pagecache_share.h | 2 ++
>>> fs/erofs/super.c | 9 ++++++
>>> 5 files changed, 131 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>> index fb54162f4c54..61a42a95d26b 100644
>>> --- a/fs/erofs/data.c
>>> +++ b/fs/erofs/data.c
>>> @@ -7,6 +7,7 @@
>>> #include "internal.h"
>>> #include <linux/sched/mm.h>
>>> #include <trace/events/erofs.h>
>>> +#include "pagecache_share.h"
>>> void erofs_unmap_metabuf(struct erofs_buf *buf)
>>> {
>>> @@ -353,6 +354,7 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
>>> {
>>> #ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
>>> struct erofs_inode *vi = NULL;
>>> + struct interval_tree_node *seg;
>>> int ret;
>>> if (file && file->private_data) {
>>> @@ -363,8 +365,22 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
>>> vi = NULL;
>>> }
>>> ret = iomap_read_folio(folio, &erofs_iomap_ops);
>>> - if (vi)
>>> + if (vi) {
>>> folio->mapping->host = file_inode(file);
>>> + seg = erofs_pcs_alloc_seg();
>>> + if (!seg)
>>> + return -ENOMEM;
>>> + seg->start = folio->index;
>>> + seg->last = seg->start + (folio_size(folio) >> PAGE_SHIFT);
>>> + if (seg->last > (vi->vfs_inode.i_size >> PAGE_SHIFT))
>>> + seg->last = vi->vfs_inode.i_size >> PAGE_SHIFT;
>>> + if (seg->last >= seg->start) {
>>> + mutex_lock(&vi->segs_mutex);
>>> + interval_tree_insert(seg, &vi->segs);
>>> + mutex_unlock(&vi->segs_mutex);
>>> + } else
>>> + erofs_pcs_free_seg(seg);
>>> + }
>>
>> I don't know what Hongzhen is trying to do in this patch and
>> it seems too odd on my side, maybe it needs to reimplement
>> this patch later but we should support .fadvise().
>
> The original approach aimed to maintain a first-read interval tree per inode, ensuring
>
> that .fadvise would only release cached pages within its own mapped ranges, thereby
>
> preventing interference with other file operations. However, this introduced unnecessary
>
> complexity. The latest patch series adopts overlayfs-style handling:
> https://lore.kernel.org/all/20250301145002.2420830-8-hongzhen@linux.alibaba.com/
Yes, that patch makes more sense for me since mm
code will handle it as page cache ops.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-04 21:06 ` Gao Xiang
2025-07-05 0:54 ` Hongzhen Luo
@ 2025-07-05 8:25 ` Amir Goldstein
2025-07-05 10:58 ` Gao Xiang
1 sibling, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2025-07-05 8:25 UTC (permalink / raw)
To: Gao Xiang
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
> > +}
> > +
> > +/*
> > + * TODO: Hm, could we leverage our fancy new backing file infrastructure
> > + * as for overlayfs and fuse?
>
> If some code can be lifted up as a vfs helper, it would be
> much helpful as the backing file infrastructure was lifted
> from overlayfs.
>
> But I'm not sure if it's really needed for now anyway
> because it's only EROFS-specific, and I only maintain and
> can speak of EROFS.
>
> > + */
> > +static struct file *erofs_pcs_alloc_file(struct file *file,
> > + struct inode *ano_inode)
> > +{
> > + struct file *ano_file;
> > +
> > + ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt, "[erofs_pcs_f]",
> > + O_RDONLY, &erofs_file_fops);
> > + file_ra_state_init(&ano_file->f_ra, file->f_mapping);
> > + ano_file->private_data = EROFS_I(file_inode(file));
> > + return ano_file;
> > +}
> > +
>
> ...
>
> > +
> > +/*
> > + * TODO: Amir, you've got some experience in this area due to overlayfs
> > + * and fuse. Does that work?
> > + */
>
>
>
> Hi Amir,
>
> I do think it will work, if you have chance please help
> take a quick look too.
>
> It's much similar to overlayfs, the difference is that the real
> inodes is not in some other fs, but anon inodes from a pseudo
> sb which shares among the whole host to share page cache for
> containers.
>
I will answer that from two different POV.
1. Will the backing_file helpers reduce much complicated code?
Not really IMO, because EROFS shared inode does not support
dio/aio and does not require override_cred(), so the remaining bit
of read_ite/splice_read/mmap are pretty trivial IMO.
2. When is it ok to use the backing_file helpers?
The current patch allocates a struct file with
alloc_file_pseudo() and not a struct backing_file,
so using the backing_file helpers is going to be awkward and
misleading and I think in this case it will we wize to refrain
from using a local var name backing_file.
The thing that you need to ask yourself is do you want
backing_file_set_user_path() for the erofs shared inode.
That means, what do you want users to see when they
look at /proc/self/map_files symlinks.
With the current patch set I believe they will see a symlink to
something like "[erofs_pcs_f]" for any mapped file
which is somewhat odd.
I think it would have been nice if users saw something like
"[erofs_pcs_md5digest:XXXXXXX]"
IMO, making the page cache dedupe visible in map_files is
a very nice feature.
Overlayfs took the approach that users are expecting to see
the actual path of the (non-anonymous) file that they mapped
when looking at map_files symlinks.
If you do not display the path to erofs mount in map_files,
then lsof will not be able to blame a process with mapped files
as the reason for keeping erofs mount busy.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-05 8:25 ` Amir Goldstein
@ 2025-07-05 10:58 ` Gao Xiang
2025-07-05 12:34 ` Amir Goldstein
0 siblings, 1 reply; 20+ messages in thread
From: Gao Xiang @ 2025-07-05 10:58 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
Hi Amir,
On 2025/7/5 16:25, Amir Goldstein wrote:
...
>
> 2. When is it ok to use the backing_file helpers?
>
> The current patch allocates a struct file with
> alloc_file_pseudo() and not a struct backing_file,
> so using the backing_file helpers is going to be awkward and
> misleading and I think in this case it will we wize to refrain
> from using a local var name backing_file.
>
> The thing that you need to ask yourself is do you want
> backing_file_set_user_path() for the erofs shared inode.
Yes, agreed, that should be improved as you said.
>
> That means, what do you want users to see when they
> look at /proc/self/map_files symlinks.
>
> With the current patch set I believe they will see a symlink to
> something like "[erofs_pcs_f]" for any mapped file
> which is somewhat odd.
Agreed.
>
> I think it would have been nice if users saw something like
> "[erofs_pcs_md5digest:XXXXXXX]"
But if we use `overlay.metacopy` for example, it's not
a string anymore. IOWs, I'd like to support binary
footprint too.
And I tend to avoid md5 calcuation or something in
the kernel. The kernel just uses footprint xattrs
instead.
> IMO, making the page cache dedupe visible in map_files is
> a very nice feature.
> > Overlayfs took the approach that users are expecting to see
> the actual path of the (non-anonymous) file that they mapped
> when looking at map_files symlinks.
> If you do not display the path to erofs mount in map_files,
> then lsof will not be able to blame a process with mapped files
> as the reason for keeping erofs mount busy.
I think users should see `the actual path` rather than
"[erofs_pcs_xxxxx]" too, but I don't have any chance to
check this part yet.
If it's possible for overlayfs, I guess it's possible for
erofs page cache sharing, maybe?
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-05 10:58 ` Gao Xiang
@ 2025-07-05 12:34 ` Amir Goldstein
2025-07-05 12:53 ` Gao Xiang
0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2025-07-05 12:34 UTC (permalink / raw)
To: Gao Xiang
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
On Sat, Jul 5, 2025 at 12:58 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Amir,
>
> On 2025/7/5 16:25, Amir Goldstein wrote:
>
> ...
>
> >
> > 2. When is it ok to use the backing_file helpers?
> >
> > The current patch allocates a struct file with
> > alloc_file_pseudo() and not a struct backing_file,
> > so using the backing_file helpers is going to be awkward and
> > misleading and I think in this case it will we wize to refrain
> > from using a local var name backing_file.
> >
> > The thing that you need to ask yourself is do you want
> > backing_file_set_user_path() for the erofs shared inode.
>
> Yes, agreed, that should be improved as you said.
>
> >
> > That means, what do you want users to see when they
> > look at /proc/self/map_files symlinks.
> >
> > With the current patch set I believe they will see a symlink to
> > something like "[erofs_pcs_f]" for any mapped file
> > which is somewhat odd.
>
> Agreed.
>
> >
> > I think it would have been nice if users saw something like
> > "[erofs_pcs_md5digest:XXXXXXX]"
>
> But if we use `overlay.metacopy` for example, it's not
> a string anymore. IOWs, I'd like to support binary
> footprint too.
>
> And I tend to avoid md5 calcuation or something in
> the kernel. The kernel just uses footprint xattrs
> instead.
>
> > IMO, making the page cache dedupe visible in map_files is
> > a very nice feature.
> > > Overlayfs took the approach that users are expecting to see
> > the actual path of the (non-anonymous) file that they mapped
> > when looking at map_files symlinks.
> > If you do not display the path to erofs mount in map_files,
> > then lsof will not be able to blame a process with mapped files
> > as the reason for keeping erofs mount busy.
>
> I think users should see `the actual path` rather than
> "[erofs_pcs_xxxxx]" too, but I don't have any chance to
> check this part yet.
>
> If it's possible for overlayfs, I guess it's possible for
> erofs page cache sharing, maybe?
>
Yes, I am sorry if that wasn't clear.
If you want the map_files to show the user mapped file's path,
you can use the backing_file helpers and more specifically
backing_file_open() and all should work as in overlayfs.
Obviously, you'd need to wrap the back_file helper with
erofs specific logic, such as don't allow dio and such.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-05 12:34 ` Amir Goldstein
@ 2025-07-05 12:53 ` Gao Xiang
2025-07-05 13:53 ` Amir Goldstein
0 siblings, 1 reply; 20+ messages in thread
From: Gao Xiang @ 2025-07-05 12:53 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
On 2025/7/5 20:34, Amir Goldstein wrote:
> On Sat, Jul 5, 2025 at 12:58 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Amir,
>>
>> On 2025/7/5 16:25, Amir Goldstein wrote:
>>
>> ...
>>
>>>
>>> 2. When is it ok to use the backing_file helpers?
>>>
>>> The current patch allocates a struct file with
>>> alloc_file_pseudo() and not a struct backing_file,
>>> so using the backing_file helpers is going to be awkward and
>>> misleading and I think in this case it will we wize to refrain
>>> from using a local var name backing_file.
>>>
>>> The thing that you need to ask yourself is do you want
>>> backing_file_set_user_path() for the erofs shared inode.
>>
>> Yes, agreed, that should be improved as you said.
>>
>>>
>>> That means, what do you want users to see when they
>>> look at /proc/self/map_files symlinks.
>>>
>>> With the current patch set I believe they will see a symlink to
>>> something like "[erofs_pcs_f]" for any mapped file
>>> which is somewhat odd.
>>
>> Agreed.
>>
>>>
>>> I think it would have been nice if users saw something like
>>> "[erofs_pcs_md5digest:XXXXXXX]"
>>
>> But if we use `overlay.metacopy` for example, it's not
>> a string anymore. IOWs, I'd like to support binary
>> footprint too.
>>
>> And I tend to avoid md5 calcuation or something in
>> the kernel. The kernel just uses footprint xattrs
>> instead.
>>
>>> IMO, making the page cache dedupe visible in map_files is
>>> a very nice feature.
>>>> Overlayfs took the approach that users are expecting to see
>>> the actual path of the (non-anonymous) file that they mapped
>>> when looking at map_files symlinks.
>>> If you do not display the path to erofs mount in map_files,
>>> then lsof will not be able to blame a process with mapped files
>>> as the reason for keeping erofs mount busy.
>>
>> I think users should see `the actual path` rather than
>> "[erofs_pcs_xxxxx]" too, but I don't have any chance to
>> check this part yet.
>>
>> If it's possible for overlayfs, I guess it's possible for
>> erofs page cache sharing, maybe?
>>
>
> Yes, I am sorry if that wasn't clear.
> If you want the map_files to show the user mapped file's path,
> you can use the backing_file helpers and more specifically
> backing_file_open() and all should work as in overlayfs.
Yep, makes sense, and a quick look I think we should leverage
`file_real_path()` to reveal the user-shown file's path.
But I also think erofs don't need every backingfile infra as
you said we don't need to override_creds and call in the
underlay fs but that use erofs io directly instead.
>
> Obviously, you'd need to wrap the back_file helper with
> erofs specific logic, such as don't allow dio and such.
I think maybe only `struct backing_file` is really needed
to support the user mapped file's path.
Other thing it seems a overkill to use `fs/backing-file.c`
for inode page cache use cases.
Also, maybe I misunderstand your point. I think DIO can
work too, .read_iter() should be per-sb and we may just
use the original DIO logic and pass down iocb and iov etc?
Since only mmap i/o needs to be handled via anon inodes.
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-05 12:53 ` Gao Xiang
@ 2025-07-05 13:53 ` Amir Goldstein
2025-07-05 15:14 ` Gao Xiang
0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2025-07-05 13:53 UTC (permalink / raw)
To: Gao Xiang
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
On Sat, Jul 5, 2025 at 2:53 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/5 20:34, Amir Goldstein wrote:
> > On Sat, Jul 5, 2025 at 12:58 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>
> >> Hi Amir,
> >>
> >> On 2025/7/5 16:25, Amir Goldstein wrote:
> >>
> >> ...
> >>
> >>>
> >>> 2. When is it ok to use the backing_file helpers?
> >>>
> >>> The current patch allocates a struct file with
> >>> alloc_file_pseudo() and not a struct backing_file,
> >>> so using the backing_file helpers is going to be awkward and
> >>> misleading and I think in this case it will we wize to refrain
> >>> from using a local var name backing_file.
> >>>
> >>> The thing that you need to ask yourself is do you want
> >>> backing_file_set_user_path() for the erofs shared inode.
> >>
> >> Yes, agreed, that should be improved as you said.
> >>
> >>>
> >>> That means, what do you want users to see when they
> >>> look at /proc/self/map_files symlinks.
> >>>
> >>> With the current patch set I believe they will see a symlink to
> >>> something like "[erofs_pcs_f]" for any mapped file
> >>> which is somewhat odd.
> >>
> >> Agreed.
> >>
> >>>
> >>> I think it would have been nice if users saw something like
> >>> "[erofs_pcs_md5digest:XXXXXXX]"
> >>
> >> But if we use `overlay.metacopy` for example, it's not
> >> a string anymore. IOWs, I'd like to support binary
> >> footprint too.
> >>
> >> And I tend to avoid md5 calcuation or something in
> >> the kernel. The kernel just uses footprint xattrs
> >> instead.
> >>
> >>> IMO, making the page cache dedupe visible in map_files is
> >>> a very nice feature.
> >>>> Overlayfs took the approach that users are expecting to see
> >>> the actual path of the (non-anonymous) file that they mapped
> >>> when looking at map_files symlinks.
> >>> If you do not display the path to erofs mount in map_files,
> >>> then lsof will not be able to blame a process with mapped files
> >>> as the reason for keeping erofs mount busy.
> >>
> >> I think users should see `the actual path` rather than
> >> "[erofs_pcs_xxxxx]" too, but I don't have any chance to
> >> check this part yet.
> >>
> >> If it's possible for overlayfs, I guess it's possible for
> >> erofs page cache sharing, maybe?
> >>
> >
> > Yes, I am sorry if that wasn't clear.
> > If you want the map_files to show the user mapped file's path,
> > you can use the backing_file helpers and more specifically
> > backing_file_open() and all should work as in overlayfs.
>
> Yep, makes sense, and a quick look I think we should leverage
> `file_real_path()` to reveal the user-shown file's path.
>
> But I also think erofs don't need every backingfile infra as
> you said we don't need to override_creds and call in the
> underlay fs but that use erofs io directly instead.
>
True.
> >
> > Obviously, you'd need to wrap the back_file helper with
> > erofs specific logic, such as don't allow dio and such.
>
> I think maybe only `struct backing_file` is really needed
> to support the user mapped file's path.
>
> Other thing it seems a overkill to use `fs/backing-file.c`
> for inode page cache use cases.
True.
>
> Also, maybe I misunderstand your point. I think DIO can
> work too, .read_iter() should be per-sb and we may just
> use the original DIO logic and pass down iocb and iov etc?
> Since only mmap i/o needs to be handled via anon inodes.
>
The title of the patch set is page cache sharing
and DIO has nothing to do with page cache so I thought it
wasn't relevant.
If the inodes are also sharing the backing disk blocks,
then I guess you can do dio either on the shared inode
or the original inode, but it doesn't matter much.
I meant that the only part of backing_file_read_iter()
for which you may want to consider the helper is the aio
part, but since aio is only for directo io, I think there is
no real benefit of erofs to use the helper.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
2025-07-05 13:53 ` Amir Goldstein
@ 2025-07-05 15:14 ` Gao Xiang
0 siblings, 0 replies; 20+ messages in thread
From: Gao Xiang @ 2025-07-05 15:14 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Hongzhen Luo, Daan De Meyer,
Lennart Poettering, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
lihongbo22, linux-erofs, Gao Xiang, Jan Kara, Jeff Layton,
Matthew Wilcox
On 2025/7/5 21:53, Amir Goldstein wrote:
> On Sat, Jul 5, 2025 at 2:53 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/7/5 20:34, Amir Goldstein wrote:
>>> On Sat, Jul 5, 2025 at 12:58 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>
>>>> Hi Amir,
>>>>
>>>> On 2025/7/5 16:25, Amir Goldstein wrote:
>>>>
>>>> ...
>>>>
>>>>>
>>>>> 2. When is it ok to use the backing_file helpers?
>>>>>
>>>>> The current patch allocates a struct file with
>>>>> alloc_file_pseudo() and not a struct backing_file,
>>>>> so using the backing_file helpers is going to be awkward and
>>>>> misleading and I think in this case it will we wize to refrain
>>>>> from using a local var name backing_file.
>>>>>
>>>>> The thing that you need to ask yourself is do you want
>>>>> backing_file_set_user_path() for the erofs shared inode.
>>>>
>>>> Yes, agreed, that should be improved as you said.
>>>>
>>>>>
>>>>> That means, what do you want users to see when they
>>>>> look at /proc/self/map_files symlinks.
>>>>>
>>>>> With the current patch set I believe they will see a symlink to
>>>>> something like "[erofs_pcs_f]" for any mapped file
>>>>> which is somewhat odd.
>>>>
>>>> Agreed.
>>>>
>>>>>
>>>>> I think it would have been nice if users saw something like
>>>>> "[erofs_pcs_md5digest:XXXXXXX]"
>>>>
>>>> But if we use `overlay.metacopy` for example, it's not
>>>> a string anymore. IOWs, I'd like to support binary
>>>> footprint too.
>>>>
>>>> And I tend to avoid md5 calcuation or something in
>>>> the kernel. The kernel just uses footprint xattrs
>>>> instead.
>>>>
>>>>> IMO, making the page cache dedupe visible in map_files is
>>>>> a very nice feature.
>>>>>> Overlayfs took the approach that users are expecting to see
>>>>> the actual path of the (non-anonymous) file that they mapped
>>>>> when looking at map_files symlinks.
>>>>> If you do not display the path to erofs mount in map_files,
>>>>> then lsof will not be able to blame a process with mapped files
>>>>> as the reason for keeping erofs mount busy.
>>>>
>>>> I think users should see `the actual path` rather than
>>>> "[erofs_pcs_xxxxx]" too, but I don't have any chance to
>>>> check this part yet.
>>>>
>>>> If it's possible for overlayfs, I guess it's possible for
>>>> erofs page cache sharing, maybe?
>>>>
>>>
>>> Yes, I am sorry if that wasn't clear.
>>> If you want the map_files to show the user mapped file's path,
>>> you can use the backing_file helpers and more specifically
>>> backing_file_open() and all should work as in overlayfs.
>>
>> Yep, makes sense, and a quick look I think we should leverage
>> `file_real_path()` to reveal the user-shown file's path.
>>
>> But I also think erofs don't need every backingfile infra as
>> you said we don't need to override_creds and call in the
>> underlay fs but that use erofs io directly instead.
>>
>
> True.
>
>>>
>>> Obviously, you'd need to wrap the back_file helper with
>>> erofs specific logic, such as don't allow dio and such.
>>
>> I think maybe only `struct backing_file` is really needed
>> to support the user mapped file's path.
>>
>> Other thing it seems a overkill to use `fs/backing-file.c`
>> for inode page cache use cases.
>
> True.
>
>>
>> Also, maybe I misunderstand your point. I think DIO can
>> work too, .read_iter() should be per-sb and we may just
>> use the original DIO logic and pass down iocb and iov etc?
>> Since only mmap i/o needs to be handled via anon inodes.
>>
>
> The title of the patch set is page cache sharing
> and DIO has nothing to do with page cache so I thought it
> wasn't relevant.
>
> If the inodes are also sharing the backing disk blocks,
> then I guess you can do dio either on the shared inode
> or the original inode, but it doesn't matter much.
>
> I meant that the only part of backing_file_read_iter()
> for which you may want to consider the helper is the aio
> part, but since aio is only for directo io, I think there is
> no real benefit of erofs to use the helper.
Agreed, thank you Amir!
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2025-07-05 15:14 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-03 12:23 [PATCH RFC 0/4] erofs: allow page cache sharing Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 1/4] erofs: move `struct erofs_anon_fs_type` to super.c Christian Brauner
2025-07-03 12:23 ` [PATCH RFC 2/4] erofs: introduce page cache share feature Christian Brauner
2025-07-04 21:06 ` Gao Xiang
2025-07-05 0:54 ` Hongzhen Luo
2025-07-05 8:25 ` Amir Goldstein
2025-07-05 10:58 ` Gao Xiang
2025-07-05 12:34 ` Amir Goldstein
2025-07-05 12:53 ` Gao Xiang
2025-07-05 13:53 ` Amir Goldstein
2025-07-05 15:14 ` Gao Xiang
2025-07-05 1:09 ` Hongzhen Luo
2025-07-03 12:23 ` [PATCH RFC 3/4] erofs: apply the " Christian Brauner
2025-07-04 20:45 ` Gao Xiang
2025-07-03 12:23 ` [PATCH RFC 4/4] erofs: introduce .fadvise for page cache share Christian Brauner
2025-07-04 21:09 ` Gao Xiang
2025-07-05 1:15 ` Hongzhen Luo
2025-07-05 1:25 ` Gao Xiang
2025-07-03 12:53 ` [PATCH RFC 0/4] erofs: allow page cache sharing Gao Xiang
2025-07-05 0:51 ` Hongzhen Luo
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.