From: Hongzhen Luo <hongzhen@linux.alibaba.com>
To: Christian Brauner <brauner@kernel.org>,
	Gao Xiang <xiang@kernel.org>, Jan Kara <jack@suse.cz>,
	Amir Goldstein <amir73il@gmail.com>,
	Jeff Layton <jlayton@kernel.org>,
	Matthew Wilcox <willy@infradead.org>
Cc: "Daan De Meyer" <daan.j.demeyer@gmail.com>,
	"Lennart Poettering" <lennart@poettering.net>,
	"Mike Yuan" <me@yhndnzj.com>,
	"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>,
	lihongbo22@huawei.com, linux-erofs@lists.ozlabs.org
Subject: Re: [PATCH RFC 2/4] erofs: introduce page cache share feature
Date: Sat, 5 Jul 2025 09:09:26 +0800
Message-ID: <a67f082c-7328-41ea-94ef-2efd18e593ce@linux.alibaba.com>
In-Reply-To: <20250703-work-erofs-pcs-v1-2-0ce1f6be28ee@kernel.org>


On 2025/7/3 20:23, Christian Brauner wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>
> Currently, reading files with different paths (or names) but the same
> content will consume multiple copies of the page cache, even if the
> content of these page caches is the same. For example, reading identical
> files (e.g., *.so files) from two different minor versions of container
> images will cost multiple copies of the same page cache, since different
> containers have different mount points. Therefore, sharing the page cache
> for files with the same content can save memory.
>
> This introduces the page cache share feature in erofs. During the mkfs
> phase, the file content is hashed and the hash value is stored in the
> `trusted.erofs.fingerprint` extended attribute. Inodes of files with the
> same `trusted.erofs.fingerprint` are mapped to the same anonymous inode
> (indicated by the `ano_inode` field). When a read request occurs, the
> anonymous inode serves as a "container" whose page cache is shared. The
> actual operations involving the iomap are carried out by the original
> inode which is mapped to the anonymous inode.
>
> Below is the memory usage for reading all files in two different minor
> versions of container images:
>
> +-------------------+------------------+-------------+---------------+
> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
> |                   |                  |             | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     241     |       -       |
> |       redis       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     872     |       -       |
> |      postgres     +------------------+-------------+---------------+
> |    16.1 & 16.2    |        Yes       |     630     |      28%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     2771    |       -       |
> |     tensorflow    +------------------+-------------+---------------+
> |  1.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     926     |       -       |
> |       mysql       +------------------+-------------+---------------+
> |  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     390     |       -       |
> |       nginx       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
> +-------------------+------------------+-------------+---------------+
> |       tomcat      |        No        |     924     |       -       |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> |                   |        Yes       |     474     |      49%      |
> +-------------------+------------------+-------------+---------------+
>
> Additionally, the table below shows the runtime memory usage of the
> container:
>
> +-------------------+------------------+-------------+---------------+
> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
> |                   |                  |             | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |      35     |       -       |
> |       redis       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |      28     |      20%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     149     |       -       |
> |      postgres     +------------------+-------------+---------------+
> |    16.1 & 16.2    |        Yes       |      95     |      37%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     1028    |       -       |
> |     tensorflow    +------------------+-------------+---------------+
> |  1.11.0 & 2.11.1  |        Yes       |     930     |      10%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     155     |       -       |
> |       mysql       +------------------+-------------+---------------+
> |  8.0.11 & 8.0.12  |        Yes       |     132     |      15%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |      25     |       -       |
> |       nginx       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |      20     |      20%      |
> +-------------------+------------------+-------------+---------------+
> |       tomcat      |        No        |     186     |       -       |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> |                   |        Yes       |      98     |      48%      |
> +-------------------+------------------+-------------+---------------+
>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Link: https://lore.kernel.org/20240902110620.2202586-3-hongzhen@linux.alibaba.com
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>   fs/erofs/Kconfig           |  10 +++
>   fs/erofs/Makefile          |   1 +
>   fs/erofs/internal.h        |   4 +
>   fs/erofs/pagecache_share.c | 204 +++++++++++++++++++++++++++++++++++++++++++++
>   fs/erofs/pagecache_share.h |  20 +++++
>   5 files changed, 239 insertions(+)
>
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index 6beeb7063871..553770068fee 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -192,3 +192,13 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
>   	  at higher priority.
>   
>   	  If unsure, say N.
> +
> +config EROFS_FS_PAGE_CACHE_SHARE
> +       bool "EROFS page cache share support"
> +       depends on EROFS_FS
> +       default n
> +	help
> +	  This permits EROFS to share page cache for files with same
> +	  fingerprints.
> +
> +	  If unsure, say N.
> diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
> index 549abc424763..f4141fdfcb0b 100644
> --- a/fs/erofs/Makefile
> +++ b/fs/erofs/Makefile
> @@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
>   erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
>   erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
>   erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
> +erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += pagecache_share.o
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index 30380f7baf5e..47136894d17d 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -273,6 +273,9 @@ struct erofs_inode {
>   		};
>   #endif	/* CONFIG_EROFS_FS_ZIP */
>   	};
> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
> +	struct inode *ano_inode;
> +#endif
>   	/* the corresponding vfs inode */
>   	struct inode vfs_inode;
>   };
> @@ -369,6 +372,7 @@ extern const struct inode_operations erofs_dir_iops;
>   
>   extern const struct file_operations erofs_file_fops;
>   extern const struct file_operations erofs_dir_fops;
> +extern const struct file_operations erofs_pcs_file_fops;
>   
>   extern const struct iomap_ops z_erofs_iomap_report_ops;
>   
> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
> new file mode 100644
> index 000000000000..309b33cc6c30
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.c
> @@ -0,0 +1,204 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
> +#include <linux/xxhash.h>
> +#include <linux/refcount.h>
> +#include "pagecache_share.h"
> +#include "internal.h"
> +#include "xattr.h"
> +
> +#define PCS_FPRT_IDX	4
> +#define PCS_FPRT_NAME	"erofs.fingerprint"
> +#define PCS_FPRT_MAXLEN (sizeof(size_t) + 1024)
> +
> +static DEFINE_MUTEX(pseudo_mnt_lock);
> +static refcount_t pseudo_mnt_count;
> +static struct vfsmount *erofs_pcs_mnt;
> +
> +int erofs_pcs_init_mnt(void)
> +{
> +	struct vfsmount *mnt;
> +
> +	if (refcount_inc_not_zero(&pseudo_mnt_count))
> +		return 0;
> +
> +	guard(mutex)(&pseudo_mnt_lock);
> +	if (erofs_pcs_mnt) {
> +		refcount_inc(&pseudo_mnt_count);
> +		return 0;
> +	}
> +
> +	mnt = kern_mount(&erofs_anon_fs_type);
> +	if (IS_ERR(mnt))
> +		return PTR_ERR(mnt);
> +
> +	rcu_read_lock();
> +	rcu_assign_pointer(erofs_pcs_mnt, mnt);
> +	rcu_read_unlock();
> +	refcount_set_release(&pseudo_mnt_count, 1);
> +	return 0;
> +}
> +
> +void erofs_pcs_free_mnt(void)
> +{
> +	struct vfsmount *mnt = NULL;
> +
> +	if (refcount_dec_not_one(&pseudo_mnt_count))
> +		return;
> +
> +	scoped_guard(mutex, &pseudo_mnt_lock) {
> +		rcu_read_lock();
> +		if (refcount_dec_and_test(&pseudo_mnt_count))
> +			mnt = rcu_replace_pointer(erofs_pcs_mnt, NULL, true);
> +		rcu_read_unlock();
> +	}
> +	if (mnt)
> +		kern_unmount(mnt);
> +}
> +
> +static int erofs_pcs_eq(struct inode *inode, void *data)
> +{
> +	return inode->i_private && memcmp(inode->i_private, data,
> +			sizeof(size_t) + *(size_t *)data) == 0 ? 1 : 0;
> +}
> +
> +static int erofs_pcs_set_fprt(struct inode *inode, void *data)
> +{
> +	/* fprt length and content */
> +	inode->i_private = kmalloc(*(size_t *)data + sizeof(size_t),
> +				   GFP_KERNEL);
> +	memcpy(inode->i_private, data, sizeof(size_t) + *(size_t *)data);
> +	return 0;
> +}
> +
> +void erofs_pcs_fill_inode(struct inode *inode)
> +{
> +	struct erofs_inode *vi = EROFS_I(inode);
> +	char fprt[PCS_FPRT_MAXLEN];
> +	struct inode *ano_inode;
> +	unsigned long fprt_hash;
> +	size_t fprt_len;
> +
> +	vi->ano_inode = NULL;
> +	fprt_len = erofs_getxattr(inode, PCS_FPRT_IDX, PCS_FPRT_NAME,
> +				  fprt + sizeof(size_t), PCS_FPRT_MAXLEN);
> +	if (fprt_len > 0 && fprt_len <= PCS_FPRT_MAXLEN) {
> +		*(size_t *)fprt = fprt_len;
> +		fprt_hash = xxh32(fprt + sizeof(size_t), fprt_len, 0);
> +		ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, fprt_hash,
> +					 erofs_pcs_eq, erofs_pcs_set_fprt,
> +					 fprt);
> +		vi->ano_inode = ano_inode;
> +		if (ano_inode->i_state & I_NEW) {
> +			if (erofs_inode_is_data_compressed(vi->datalayout))
> +				ano_inode->i_mapping->a_ops = &z_erofs_aops;
> +			else
> +				ano_inode->i_mapping->a_ops = &erofs_aops;
> +			ano_inode->i_size = inode->i_size;
> +			unlock_new_inode(ano_inode);
> +		}
> +	}
> +}
> +
> +/*
> + * TODO: Hm, could we leverage our fancy new backing file infrastructure
> + * as for overlayfs and fuse?
> + */
> +static struct file *erofs_pcs_alloc_file(struct file *file,
> +					 struct inode *ano_inode)
> +{
> +	struct file *ano_file;
> +
> +	ano_file = alloc_file_pseudo(ano_inode, erofs_pcs_mnt, "[erofs_pcs_f]",
> +				     O_RDONLY, &erofs_file_fops);
> +	file_ra_state_init(&ano_file->f_ra, file->f_mapping);
> +	ano_file->private_data = EROFS_I(file_inode(file));
> +	return ano_file;
> +}
> +
> +static int erofs_pcs_file_open(struct inode *inode, struct file *file)
> +{
> +	struct file *ano_file;
> +	struct inode *ano_inode;
> +	struct erofs_inode *vi = EROFS_I(inode);
> +
> +	ano_inode = vi->ano_inode;
> +	if (!ano_inode)
> +		return -EINVAL;
> +
> +	ano_file = erofs_pcs_alloc_file(file, ano_inode);
> +	if (IS_ERR(ano_file))
> +		return PTR_ERR(ano_file);
> +
> +	file->private_data = ano_file;
> +	return 0;
> +}
> +
> +static int erofs_pcs_file_release(struct inode *inode, struct file *file)
> +{
> +	struct file *ano_file __free(fput) = NULL;
> +
> +	if (WARN_ON_ONCE(!file->private_data))
> +		return -EINVAL;
> +
> +	swap(file->private_data, ano_file);
> +	return 0;
> +}
> +
> +static ssize_t erofs_pcs_file_read_iter(struct kiocb *iocb,
> +					struct iov_iter *to)
> +{
> +	struct file *file, *ano_file;
> +	struct kiocb ano_iocb;
> +	ssize_t res;
> +
> +	if (!iov_iter_count(to))
> +		return 0;
> +
> +#ifdef CONFIG_FS_DAX
> +	if (IS_DAX(inode))
> +		return iocb->ki_filp->f_op->read_iter(iocb, to);
> +#endif
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return iocb->ki_filp->f_op->read_iter(iocb, to);
> +
> +	memcpy(&ano_iocb, iocb, sizeof(struct kiocb));
> +	file = iocb->ki_filp;
> +	ano_file = file->private_data;
> +	if (WARN_ON_ONCE(!ano_file))
> +		return -EINVAL;
> +	ano_iocb.ki_filp = ano_file;
> +	res = filemap_read(&ano_iocb, to, 0);
> +	memcpy(iocb, &ano_iocb, sizeof(struct kiocb));
> +	iocb->ki_filp = file;
> +	file_accessed(file);
> +	return res;
> +}
> +
> +/*
> + * TODO: Amir, you've got some experience in this area due to overlayfs
> + * and fuse. Does that work?
> + */
> +static int erofs_pcs_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct file *ano_file = file->private_data;
> +
> +	vma_set_file(vma, ano_file);
> +	vma->vm_ops = &generic_file_vm_ops;
> +	return 0;
> +}
> +
> +const struct file_operations erofs_pcs_file_fops = {
> +	.open		= erofs_pcs_file_open,
> +	/*
> +	 * TODO: Why doesn't .llseek require similar treatment as
> +	 * .read_iter?
> +	 */

.llseek only needs to compute the file offset and requires no special
handling for regular files, whereas .read_iter involves actual data
retrieval (EROFS-specific logic such as decompressing compressed data
and reads that span blocks), so it has to go through the page cache
sharing logic.
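
To illustrate the distinction, here is a minimal userspace sketch (not
kernel code). All names in it (shared_cache, model_file, model_llseek,
model_read) are hypothetical and exist only for this illustration; the
"cache" is a plain buffer standing in for the anonymous inode's page
cache. It only shows that seeking touches per-file state, while reading
must be redirected to the shared data:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: stands in for the anonymous inode's page cache. */
struct shared_cache {
	const char *data;	/* file content shared by all users */
	size_t size;
	int fills;		/* how often the content had to be fetched */
	int cached;		/* non-zero once the content is resident */
};

/* Hypothetical model of one open file on a real inode. */
struct model_file {
	size_t pos;			/* per-open file position */
	size_t size;			/* size of the real inode */
	struct shared_cache *cache;	/* shared by same-fingerprint files */
};

/* Seeking is pure arithmetic on per-file state; the cache is untouched. */
static long model_llseek(struct model_file *f, long off)
{
	if (off < 0 || (size_t)off > f->size)
		return -1;
	f->pos = (size_t)off;
	return off;
}

/* Reading retrieves data, so it is redirected to the shared cache. */
static size_t model_read(struct model_file *f, char *buf, size_t len)
{
	struct shared_cache *c = f->cache;

	assert(c != NULL);
	if (!c->cached) {
		/* Only the first reader pays for fetching the content. */
		c->fills++;
		c->cached = 1;
	}
	if (f->pos >= c->size)
		return 0;
	if (len > c->size - f->pos)
		len = c->size - f->pos;
	memcpy(buf, c->data + f->pos, len);
	f->pos += len;
	return len;
}
```

In this model, two model_file instances pointing at the same
shared_cache keep independent positions, and a seek on either one never
triggers a fetch; only the first read does.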

Thanks,
Hongzhen

> +	.llseek		= generic_file_llseek,
> +	.read_iter	= erofs_pcs_file_read_iter,
> +	.mmap		= erofs_pcs_mmap,
> +	.release	= erofs_pcs_file_release,
> +	.get_unmapped_area = thp_get_unmapped_area,
> +	.splice_read	= filemap_splice_read,
> +};
> diff --git a/fs/erofs/pagecache_share.h b/fs/erofs/pagecache_share.h
> new file mode 100644
> index 000000000000..b8111291cf79
> --- /dev/null
> +++ b/fs/erofs/pagecache_share.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2024, Alibaba Cloud
> + */
> +#ifndef __EROFS_PAGECACHE_SHARE_H
> +#define __EROFS_PAGECACHE_SHARE_H
> +
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/rwlock.h>
> +#include <linux/mutex.h>
> +#include "internal.h"
> +
> +int erofs_pcs_init_mnt(void);
> +void erofs_pcs_free_mnt(void);
> +void erofs_pcs_fill_inode(struct inode *inode);
> +
> +extern const struct vm_operations_struct generic_file_vm_ops;
> +
> +#endif
>

