linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v8 0/9] erofs: inode page cache share feature
@ 2025-11-14  9:55 Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter Hongbo Li
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

Enabling page cache sharing in container scenarios has become increasingly
crucial, as it can significantly reduce memory usage. In previous efforts,
Hongzhen has done substantial work to push this feature into the EROFS
mainline. Due to other commitments, he hasn't been able to continue his
work recently, and I'm very pleased to build upon his work and continue
to refine this implementation.

This patch series is based on Hongzhen's original EROFS shared pagecache
implementation which was posted about half a year ago:
https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u

In addition to the forward-port, I have also fixed several bugs, resolved
some prerequisite dependencies and performed some minor cleanup.

(A recap of Hongzhen's original cover letter is below, edited slightly
for this series:)

Background
==============
Currently, reading files with different paths (or names) but the same
content can consume multiple copies of the page cache, even if the
content of these caches is identical. For example, reading identical
files (e.g., *.so files) from two different minor versions of container
images can result in multiple copies of the same page cache, since
different containers have different mount points. Therefore, sharing
the page cache for files with the same content can save memory.

Proposal
==============

1. determining file identity
----------------------------
First, we need a way to check whether two files have the same content.
This series compares the values of the xattr that holds each file's
fingerprint: matching values are taken to mean identical content. When
creating the EROFS image, users can specify the name of the fingerprint
xattr, and that name will be stored in the packfile. The on-disk
`ishare_key_start` field indicates the index of the xattr name within
the prefix xattrs:

```
struct erofs_super_block {
	__u8 xattr_filter_reserved; /* reserved for xattr name filter */
-	__u8 reserved[3];
+	__u8 ishare_key_start;	/* start of ishare key */
+	__u8 reserved[2];
};
```
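
The lookup implied by `ishare_key_start` can be sketched in user space as
follows. This is only an illustration modelled on the check that patch 4/9
performs: `struct prefix_item` and `ishare_key_lookup()` are hypothetical
names, and the 0x80 flag / 0x7f mask values are assumed to match the EROFS
long-prefix encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror EROFS_XATTR_LONG_PREFIX and its mask. */
#define LONG_PREFIX       0x80u
#define LONG_PREFIX_MASK  0x7fu

/* Hypothetical stand-in for one entry of the long xattr prefix table. */
struct prefix_item {
	const char *name;
};

/*
 * Return the prefix entry that names the fingerprint xattr, or NULL if the
 * image does not carry a usable ishare key.
 */
static const struct prefix_item *
ishare_key_lookup(const struct prefix_item *tbl, size_t count,
		  uint8_t key_start)
{
	size_t idx;

	/* Only 7 bits of index are available. */
	assert(count <= LONG_PREFIX_MASK + 1u);

	/* No prefix table, or the long-prefix flag is not set: no key. */
	if (!tbl || !(key_start & LONG_PREFIX))
		return NULL;

	idx = key_start & LONG_PREFIX_MASK;
	return idx < count ? &tbl[idx] : NULL;
}
```

An out-of-range index is treated here as "no key" rather than as an error,
which follows what the posted kernel helper does.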

For example, users can specify the first long prefix as the name for the
file fingerprint as follows:

```
mkfs.erofs  --ishare_key=trusted.erofs.fingerprint  erofs.img ./dir
```

In this way, `trusted.erofs.fingerprint` serves as the name of the xattr
for the file fingerprint. The relevant patch for erofs-utils has been posted
in:

https://lore.kernel.org/all/20251114092845.207368-1-lihongbo22@huawei.com/

At the same time, for security reasons, this patch series only shares
files within the same domain, which is achieved by adding
"-o domain_id=xxxx" during the mounting process:

```
mount -t erofs -o domain_id=trusted.erofs.fingerprint erofs.img /mnt
```

If no domain ID is specified, it will fall back to the non-page cache
sharing mode.
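
Put together, the sharing rule described above can be sketched as a simple
predicate. This is a user-space illustration only; `struct share_key` and
`can_share()` are hypothetical names and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of what decides whether two files may share pages. */
struct share_key {
	const char *domain_id;    /* from "-o domain_id=..."; NULL if not given */
	const void *fingerprint;  /* fingerprint xattr value; NULL if absent */
	size_t fp_len;
};

static bool can_share(const struct share_key *a, const struct share_key *b)
{
	assert(a && b);

	/* Without a domain ID the mount falls back to non-sharing mode. */
	if (!a->domain_id || !b->domain_id)
		return false;
	/* A file without a fingerprint can never be matched. */
	if (!a->fingerprint || !b->fingerprint)
		return false;
	/* Sharing is confined to one domain. */
	if (strcmp(a->domain_id, b->domain_id) != 0)
		return false;
	/* Same domain: share only when the fingerprints are identical. */
	return a->fp_len == b->fp_len &&
	       memcmp(a->fingerprint, b->fingerprint, a->fp_len) == 0;
}
```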

2. Implementation
==================

2.1. file open & close
----------------------
When a file is opened, the ->private_data field of the opened file
(say, file A or file B, which have identical content) is set to point
to an internal deduplicated file. When the actual read occurs, the
page cache of this deduplicated file will be accessed.

When the file is opened, if the corresponding erofs inode is newly
created, then perform the following actions:
1. add the erofs inode to the backing list of the deduplicated inode;
2. increase the reference count of the deduplicated inode.

The purpose of step 1 above is to ensure that when a real I/O operation
occurs, the deduplicated inode can locate one of the disk devices
(as the deduplicated inode itself is not bound to a specific device).
Step 2 is for managing the lifecycle of the deduplicated inode.

When the erofs inode is destroyed, the opposite actions mentioned above
will be taken.
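
The bookkeeping above can be sketched in user space as follows. The
structure and function names are hypothetical, locking is omitted, and
inserting at the head of the list is an arbitrary choice made for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an erofs inode that has a device behind it. */
struct real_inode {
	struct real_inode *next;  /* link in the backing list */
	int linked;               /* non-zero while on a backing list */
};

/* Hypothetical stand-in for the deduplicated inode. */
struct dedup_inode {
	struct real_inode *backing;  /* head of the backing list */
	int refcount;                /* one reference per attached inode */
};

/* Steps 1 and 2: link the new inode and take a reference. */
static void dedup_attach(struct dedup_inode *d, struct real_inode *ri)
{
	assert(!ri->linked);
	ri->next = d->backing;
	d->backing = ri;
	ri->linked = 1;
	d->refcount++;
}

/* The opposite actions, taken when the erofs inode is destroyed. */
static void dedup_detach(struct dedup_inode *d, struct real_inode *ri)
{
	struct real_inode **pp = &d->backing;

	assert(ri->linked);
	while (*pp && *pp != ri)
		pp = &(*pp)->next;
	assert(*pp == ri);
	*pp = ri->next;
	ri->next = NULL;
	ri->linked = 0;
	d->refcount--;
}
```

In this sketch the deduplicated inode could be freed once `refcount` drops
to zero, since no device is left to serve its I/O.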

2.2. file reading
-----------------
Assuming the deduplicated inode's page cache is PGCache_dedup, there
are two possible scenarios when reading a file:
1) the content being read is already present in PGCache_dedup;
2) the content being read is not present in PGCache_dedup.

The second scenario involves an iomap operation to read from the disk.

2.2.1. reading existing data in PGCache_dedup
-------------------------------------------
In this case, the overall read flowchart is as follows (take ksys_read()
for example):

         ksys_read
             │
             │
             ▼
            ...
             │
             │
             ▼
erofs_ishare_file_read_iter (switch to backing deduplicated file)
             │
             │
             ▼

 read PGCache_dedup & return

At this point, the content in PGCache_dedup will be read directly and
returned.

2.2.2. reading non-existent content in PGCache_dedup
----------------------------------------------------
In this case, disk I/O operations will be involved. Taking the reading
of an uncompressed file as an example, here is the reading process:

         ksys_read
             │
             │
             ▼
            ...
             │
             │
             ▼
erofs_ishare_file_read_iter (switch to backing deduplicated file)
             │
             │
             ▼
            ... (allocate pages)
             │
             │
             ▼
erofs_read_folio/erofs_readahead
             │
             │
             ▼
            ... (iomap)
             │
             │
             ▼
        erofs_iomap_begin
             │
             │
             ▼
            ...

Iomap and the layers below will involve disk I/O operations. As
described in 2.1, the deduplicated inode itself is not bound to a
specific device. The deduplicated inode will select an erofs inode from
the backing list (by default, the first one) to complete the
corresponding iomap operation.
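
The two scenarios from 2.2.1 and 2.2.2 can be sketched with a toy cache in
user space. Everything here is hypothetical (tiny pages, an in-memory
"device", no locking); it only illustrates that a hit never touches a
device, while a miss is served through the first backing entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  4
#define PAGE_SZ 8

/* Hypothetical backing device: a flat buffer plus an I/O counter. */
struct device {
	const char *data;
	int reads;
};

/* Hypothetical deduplicated inode with its own page cache. */
struct dedup_cache {
	bool present[NPAGES];
	char pages[NPAGES][PAGE_SZ];
	struct device *backing[2];  /* devices of the inodes on the backing list */
	size_t nr_backing;
};

static const char *dedup_read_page(struct dedup_cache *c, size_t idx)
{
	assert(idx < NPAGES);

	if (!c->present[idx]) {
		struct device *dev;

		/* Miss: without any backing inode there is nowhere to read from. */
		if (!c->nr_backing)
			return NULL;
		dev = c->backing[0];  /* default choice: the first entry */
		memcpy(c->pages[idx], dev->data + idx * PAGE_SZ, PAGE_SZ);
		dev->reads++;
		c->present[idx] = true;
	}
	/* Hit (or freshly filled): serve from the shared cache. */
	return c->pages[idx];
}
```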

2.2.3. optimized inode selection
--------------------------------
The inode selection method described in 2.2.2 may select an "inactive"
inode. An inactive inode indicates that there may have been no read
operations on the inode's device for a long time, and there is a high
likelihood that the device may be unmounted. In this case, unmounting
the device may experience a slight delay due to other read requests
being routed to that device. Therefore, we need to select some "active"
inodes for the iomap operation.

To achieve optimized inode selection, an additional `processing` list
has been added. At the beginning of erofs_{read_folio,readahead}(), the
corresponding erofs inode will be added to the `processing` list
(because they are active). And it is removed at the end of
erofs_{read_folio,readahead}(). In erofs_read_begin(), the selected
erofs inode's count is incremented, and in erofs_read_end(), the count
is decremented.

In this way, even after the erofs inode is removed from the `processing`
list, the increment in the reference count can ensure the integrity of
the data reading process. This is somewhat similar to RCU (not exactly
the same, but similar).
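
A user-space sketch of this selection is shown below. The names are
hypothetical, the sketch is single-threaded, and the fallback to the first
backing inode when the `processing` list is empty is an assumption made
here, not something the description above spells out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an erofs inode. */
struct real_inode {
	int pin_count;  /* reads currently routed through this inode */
};

/* Hypothetical stand-in for the deduplicated inode. */
struct dedup_inode {
	struct real_inode *backing;     /* first inode on the backing list */
	struct real_inode *processing;  /* first inode on the processing list */
};

/* Pick an inode for the iomap operation and pin it. */
static struct real_inode *read_begin(struct dedup_inode *d)
{
	/* Prefer an inode that is actively reading right now. */
	struct real_inode *ri = d->processing ? d->processing : d->backing;

	if (ri)
		ri->pin_count++;
	return ri;
}

/* Drop the pin once the read has completed. */
static void read_end(struct real_inode *ri)
{
	assert(ri->pin_count > 0);
	ri->pin_count--;
}
```

The pin taken in `read_begin()` is what keeps the selected inode usable
even if it leaves the `processing` list before the read finishes.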

2.3. release page cache
-----------------------
Similar to overlayfs, when dropping the page cache via .fadvise, erofs
locates the deduplicated file and applies vfs_fadvise to that specific
file.
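
From user space, such a request is normally issued with posix_fadvise(2).
The helper below is a minimal sketch (`drop_file_cache()` is a hypothetical
name); with this series applied, a call like this on a shared file would be
expected to reach the deduplicated file as described above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to drop the cached pages of a whole file.
 * Returns 0 on success or an errno value on failure.
 */
static int drop_file_cache(const char *path)
{
	int fd, err;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return errno;

	/* offset 0 and len 0 cover the file from start to end */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return err;
}
```

Note that posix_fadvise() returns the error number directly instead of
setting errno, which is why the helper passes its return value through.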

Effect
==================
I conducted experiments on two aspects across two different minor
versions of container images:

1. reading all files in two different minor versions of container images

2. running workloads or using the default entrypoint within the containers [1]

Below is the memory usage for reading all files in two different minor
versions of container images:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     2771    |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     924     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+

Additionally, the table below shows the runtime memory usage of the
container:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     34.9    |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    149.1    |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |      95     |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    1027.9   |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    155.0    |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     25.4    |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     186     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |      99     |      47%      |
+-------------------+------------------+-------------+---------------+

It can be observed that when reading all the files in the image, the
reduced memory usage varies from 16% to 49%, depending on the specific
image. Additionally, the container's runtime memory usage reduction
ranges from 4% to 47%.
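
The percentages in both tables appear consistent with computing
(before - after) / before and rounding up to the next whole percent. That
rounding rule is inferred from the numbers, not stated by the measurements
themselves. A small sketch, with values passed in tenths of a MB to stay in
integer arithmetic (`reduction_pct_ceil()` is a hypothetical helper):

```c
#include <assert.h>

/*
 * Memory reduction in percent, rounded up.
 * Both arguments use the same unit, e.g. tenths of a MB.
 */
static unsigned int reduction_pct_ceil(unsigned long before,
				       unsigned long after)
{
	assert(before > 0 && after <= before);
	/* integer ceiling of 100 * (before - after) / before */
	return (unsigned int)((100 * (before - after) + before - 1) / before);
}
```

For example, redis in the first table is `reduction_pct_ceil(2410, 1630)`
and postgres in the second table is `reduction_pct_ceil(1491, 950)`.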

[1] Below are the workloads for these images:
      - redis: redis-benchmark
      - postgres: sysbench
      - tensorflow: app.py of tensorflow.python.platform
      - mysql: sysbench
      - nginx: wrk
      - tomcat: default entrypoint

Changes:
v7: https://lore.kernel.org/all/20251021104815.70662-1-lihongbo22@huawei.com/
v6: https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u
v5: https://lore.kernel.org/all/20250105151208.3797385-1-hongzhen@linux.alibaba.com/
v4: https://lore.kernel.org/all/20240902110620.2202586-1-hongzhen@linux.alibaba.com/
v3: https://lore.kernel.org/all/20240828111959.3677011-1-hongzhen@linux.alibaba.com/
v2: https://lore.kernel.org/all/20240731080704.678259-1-hongzhen@linux.alibaba.com/
v1: https://lore.kernel.org/all/20240722065355.1396365-1-hongzhen@linux.alibaba.com/


Diffstat:
 fs/erofs/Kconfig       |   9 ++
 fs/erofs/Makefile      |   1 +
 fs/erofs/data.c        | 115 +++++++++++---
 fs/erofs/erofs_fs.h    |   6 +-
 fs/erofs/fscache.c     |  13 --
 fs/erofs/inode.c       |   5 +
 fs/erofs/internal.h    |  44 ++++++
 fs/erofs/ishare.c      | 341 +++++++++++++++++++++++++++++++++++++++++
 fs/erofs/ishare.h      |  46 ++++++
 fs/erofs/super.c       |  55 ++++++-
 fs/erofs/xattr.c       |  26 ++++
 fs/erofs/xattr.h       |   6 +
 fs/erofs/zdata.c       |  56 +++++--
 fs/fuse/file.c         |   4 +-
 fs/iomap/buffered-io.c |   6 +-
 include/linux/iomap.h  |   8 +-
 16 files changed, 682 insertions(+), 59 deletions(-)
 create mode 100644 fs/erofs/ishare.c
 create mode 100644 fs/erofs/ishare.h

-- 
2.22.0



* [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-16 11:53   ` Gao Xiang
  2025-11-16 11:54   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed Hongbo Li
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

It is useful to pass filesystem-specific information to
iomap_{begin,end} through the existing private field of the
@iomap_iter for advanced uses of iomap buffered reads, much as the
current iomap DIO path already does.

For example, EROFS needs it to:

 - implement an efficient page cache sharing feature, since iomap
   needs to apply to anon inode page cache but we'd like to get the
   backing inode/fs instead, so filesystem-specific private data is
   needed to keep such information;

 - pass in both struct page * and void * for inline data to avoid
   kmap_to_page() usage (which is bogus).

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/fuse/file.c         | 4 ++--
 fs/iomap/buffered-io.c | 6 ++++--
 include/linux/iomap.h  | 8 ++++----
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8275b6681b9b..98dd20f0bb53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -973,7 +973,7 @@ static int fuse_read_folio(struct file *file, struct folio *folio)
 		return -EIO;
 	}
 
-	iomap_read_folio(&fuse_iomap_ops, &ctx);
+	iomap_read_folio(&fuse_iomap_ops, &ctx, NULL);
 	fuse_invalidate_atime(inode);
 	return 0;
 }
@@ -1075,7 +1075,7 @@ static void fuse_readahead(struct readahead_control *rac)
 	if (fuse_is_bad(inode))
 		return;
 
-	iomap_readahead(&fuse_iomap_ops, &ctx);
+	iomap_readahead(&fuse_iomap_ops, &ctx, NULL);
 }
 
 static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6ae031ac8058..8e79303c074e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -496,13 +496,14 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
 }
 
 void iomap_read_folio(const struct iomap_ops *ops,
-		struct iomap_read_folio_ctx *ctx)
+		struct iomap_read_folio_ctx *ctx, void *private)
 {
 	struct folio *folio = ctx->cur_folio;
 	struct iomap_iter iter = {
 		.inode		= folio->mapping->host,
 		.pos		= folio_pos(folio),
 		.len		= folio_size(folio),
+		.private	= private,
 	};
 	size_t bytes_pending = 0;
 	int ret;
@@ -560,13 +561,14 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
  * the filesystem to be reentered.
  */
 void iomap_readahead(const struct iomap_ops *ops,
-		struct iomap_read_folio_ctx *ctx)
+		struct iomap_read_folio_ctx *ctx, void *private)
 {
 	struct readahead_control *rac = ctx->rac;
 	struct iomap_iter iter = {
 		.inode	= rac->mapping->host,
 		.pos	= readahead_pos(rac),
 		.len	= readahead_length(rac),
+		.private = private,
 	};
 	size_t cur_bytes_pending;
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b1ac08c7474..c3ecbbdb14e8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -341,9 +341,9 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops,
 		const struct iomap_write_ops *write_ops, void *private);
 void iomap_read_folio(const struct iomap_ops *ops,
-		struct iomap_read_folio_ctx *ctx);
+		struct iomap_read_folio_ctx *ctx, void *private);
 void iomap_readahead(const struct iomap_ops *ops,
-		struct iomap_read_folio_ctx *ctx);
+		struct iomap_read_folio_ctx *ctx, void *private);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
@@ -594,7 +594,7 @@ static inline void iomap_bio_read_folio(struct folio *folio,
 		.cur_folio	= folio,
 	};
 
-	iomap_read_folio(ops, &ctx);
+	iomap_read_folio(ops, &ctx, NULL);
 }
 
 static inline void iomap_bio_readahead(struct readahead_control *rac,
@@ -605,7 +605,7 @@ static inline void iomap_bio_readahead(struct readahead_control *rac,
 		.rac		= rac,
 	};
 
-	iomap_readahead(ops, &ctx);
+	iomap_readahead(ops, &ctx, NULL);
 }
 #endif /* CONFIG_BLOCK */
 
-- 
2.22.0



* [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-16 12:01   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 3/9] erofs: move `struct erofs_anon_fs_type` to super.c Hongbo Li
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

The upcoming page cache sharing needs to pass a read context to
iomap_iter, so unify the way the read context is passed in EROFS.
Moreover, bmap and fiemap don't need to map the inline data.

Note that we keep `struct page *` in `struct erofs_iomap_iter_ctx` as
well to avoid bogus kmap_to_page usage.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/data.c | 79 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 59 insertions(+), 20 deletions(-)

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index bb13c4cb8455..bd3d85c61341 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -266,14 +266,23 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 	folio_end_read(folio, !(v & BIT(EROFS_ONLINEFOLIO_EIO)));
 }
 
+struct erofs_iomap_iter_ctx {
+	struct page *page;
+	void *base;
+};
+
 static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	int ret;
+	struct erofs_iomap_iter_ctx *ctx;
 	struct super_block *sb = inode->i_sb;
 	struct erofs_map_blocks map;
 	struct erofs_map_dev mdev;
+	struct iomap_iter *iter;
 
+	iter = container_of(iomap, struct iomap_iter, iomap);
+	ctx = iter->private;
 	map.m_la = offset;
 	map.m_llen = length;
 	ret = erofs_map_blocks(inode, &map);
@@ -283,7 +292,8 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	iomap->offset = map.m_la;
 	iomap->length = map.m_llen;
 	iomap->flags = 0;
-	iomap->private = NULL;
+	if (ctx)
+		ctx->base = NULL;
 	iomap->addr = IOMAP_NULL_ADDR;
 	if (!(map.m_flags & EROFS_MAP_MAPPED)) {
 		iomap->type = IOMAP_HOLE;
@@ -309,16 +319,20 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	}
 
 	if (map.m_flags & EROFS_MAP_META) {
-		void *ptr;
-		struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
-
 		iomap->type = IOMAP_INLINE;
-		ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
-					 erofs_inode_in_metabox(inode));
-		if (IS_ERR(ptr))
-			return PTR_ERR(ptr);
-		iomap->inline_data = ptr;
-		iomap->private = buf.base;
+		/* read context should read the inlined data */
+		if (ctx) {
+			void *ptr;
+			struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
+
+			ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
+						 erofs_inode_in_metabox(inode));
+			if (IS_ERR(ptr))
+				return PTR_ERR(ptr);
+			iomap->inline_data = ptr;
+			ctx->page = buf.page;
+			ctx->base = buf.base;
+		}
 	} else {
 		iomap->type = IOMAP_MAPPED;
 	}
@@ -328,18 +342,19 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 static int erofs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 		ssize_t written, unsigned int flags, struct iomap *iomap)
 {
-	void *ptr = iomap->private;
+	struct erofs_iomap_iter_ctx *ctx;
+	struct iomap_iter *iter;
 
-	if (ptr) {
+	iter = container_of(iomap, struct iomap_iter, iomap);
+	ctx = iter->private;
+	if (ctx && ctx->base) {
 		struct erofs_buf buf = {
-			.page = kmap_to_page(ptr),
-			.base = ptr,
+			.page = ctx->page,
+			.base = ctx->base,
 		};
 
 		DBG_BUGON(iomap->type != IOMAP_INLINE);
 		erofs_put_metabuf(&buf);
-	} else {
-		DBG_BUGON(iomap->type == IOMAP_INLINE);
 	}
 	return written;
 }
@@ -369,18 +384,36 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
  */
 static int erofs_read_folio(struct file *file, struct folio *folio)
 {
+	struct iomap_read_folio_ctx read_ctx = {
+		.ops		= &iomap_bio_read_ops,
+		.cur_folio	= folio,
+	};
+	struct erofs_iomap_iter_ctx iter_ctx = {
+		.page		= NULL,
+		.base		= NULL,
+	};
+
 	trace_erofs_read_folio(folio, true);
 
-	iomap_bio_read_folio(folio, &erofs_iomap_ops);
+	iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
 	return 0;
 }
 
 static void erofs_readahead(struct readahead_control *rac)
 {
+	struct iomap_read_folio_ctx read_ctx = {
+		.ops		= &iomap_bio_read_ops,
+		.rac		= rac,
+	};
+	struct erofs_iomap_iter_ctx iter_ctx = {
+		.page		= NULL,
+		.base		= NULL,
+	};
+
 	trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
 					readahead_count(rac), true);
 
-	iomap_bio_readahead(rac, &erofs_iomap_ops);
+	iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
 }
 
 static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
@@ -400,9 +433,15 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (IS_DAX(inode))
 		return dax_iomap_rw(iocb, to, &erofs_iomap_ops);
 #endif
-	if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev)
+	if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev) {
+		struct erofs_iomap_iter_ctx iter_ctx = {
+			.page = NULL,
+			.base = NULL,
+		};
+
 		return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
-				    NULL, 0, NULL, 0);
+				    NULL, 0, &iter_ctx, 0);
+	}
 	return filemap_read(iocb, to, 0);
 }
 
-- 
2.22.0



* [PATCH v8 3/9] erofs: move `struct erofs_anon_fs_type` to super.c
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-16 12:02   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 4/9] erofs: support user-defined fingerprint name Hongbo Li
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

Move `struct erofs_anon_fs_type` to super.c and expose it in
preparation for the upcoming page cache share feature.

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/fscache.c  | 13 -------------
 fs/erofs/internal.h |  4 ++++
 fs/erofs/super.c    | 15 +++++++++++++++
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 362acf828279..2d1683479fc0 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -3,7 +3,6 @@
  * Copyright (C) 2022, Alibaba Cloud
  * Copyright (C) 2022, Bytedance Inc. All rights reserved.
  */
-#include <linux/pseudo_fs.h>
 #include <linux/fscache.h>
 #include "internal.h"
 
@@ -13,18 +12,6 @@ static LIST_HEAD(erofs_domain_list);
 static LIST_HEAD(erofs_domain_cookies_list);
 static struct vfsmount *erofs_pseudo_mnt;
 
-static int erofs_anon_init_fs_context(struct fs_context *fc)
-{
-	return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type erofs_anon_fs_type = {
-	.owner		= THIS_MODULE,
-	.name           = "pseudo_erofs",
-	.init_fs_context = erofs_anon_init_fs_context,
-	.kill_sb        = kill_anon_super,
-};
-
 struct erofs_fscache_io {
 	struct netfs_cache_resources cres;
 	struct iov_iter		iter;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index f7f622836198..e80b35db18e4 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -188,6 +188,10 @@ static inline bool erofs_is_fileio_mode(struct erofs_sb_info *sbi)
 	return IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) && sbi->dif0.file;
 }
 
+#if defined(CONFIG_EROFS_FS_ONDEMAND)
+extern struct file_system_type erofs_anon_fs_type;
+#endif
+
 static inline bool erofs_is_fscache_mode(struct super_block *sb)
 {
 	return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) &&
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index f3f8d8c066e4..0d88c04684b9 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -11,6 +11,7 @@
 #include <linux/fs_parser.h>
 #include <linux/exportfs.h>
 #include <linux/backing-dev.h>
+#include <linux/pseudo_fs.h>
 #include "xattr.h"
 
 #define CREATE_TRACE_POINTS
@@ -920,6 +921,20 @@ static struct file_system_type erofs_fs_type = {
 };
 MODULE_ALIAS_FS("erofs");
 
+#if defined(CONFIG_EROFS_FS_ONDEMAND)
+static int erofs_anon_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
+}
+
+struct file_system_type erofs_anon_fs_type = {
+	.owner		= THIS_MODULE,
+	.name           = "pseudo_erofs",
+	.init_fs_context = erofs_anon_init_fs_context,
+	.kill_sb        = kill_anon_super,
+};
+#endif
+
 static int __init erofs_module_init(void)
 {
 	int err;
-- 
2.22.0



* [PATCH v8 4/9] erofs: support user-defined fingerprint name
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (2 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 3/9] erofs: move `struct erofs_anon_fs_type` to super.c Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-17  2:54   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 5/9] erofs: support domain-specific page cache share Hongbo Li
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

When creating the EROFS image, users can specify the fingerprint name.
This is to prepare for the upcoming inode page cache share.

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/Kconfig    |  9 +++++++++
 fs/erofs/erofs_fs.h |  6 ++++--
 fs/erofs/internal.h |  6 ++++++
 fs/erofs/super.c    |  5 ++++-
 fs/erofs/xattr.c    | 26 ++++++++++++++++++++++++++
 fs/erofs/xattr.h    |  6 ++++++
 6 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index d81f3318417d..1b5c0cd99203 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -194,3 +194,12 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
 	  at higher priority.
 
 	  If unsure, say N.
+
+config EROFS_FS_INODE_SHARE
+	bool "EROFS inode page cache share support (experimental)"
+	depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
+	help
+	  This permits EROFS to share page cache for files with same
+	  fingerprints.
+
+	  If unsure, say N.
\ No newline at end of file
diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index 3d5738f80072..104518cd161d 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -35,8 +35,9 @@
 #define EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES	0x00000040
 #define EROFS_FEATURE_INCOMPAT_48BIT		0x00000080
 #define EROFS_FEATURE_INCOMPAT_METABOX		0x00000100
+#define EROFS_FEATURE_INCOMPAT_ISHARE_KEY	0x00000200
 #define EROFS_ALL_FEATURE_INCOMPAT		\
-	((EROFS_FEATURE_INCOMPAT_METABOX << 1) - 1)
+	((EROFS_FEATURE_INCOMPAT_ISHARE_KEY << 1) - 1)
 
 #define EROFS_SB_EXTSLOT_SIZE	16
 
@@ -83,7 +84,8 @@ struct erofs_super_block {
 	__le32 xattr_prefix_start;	/* start of long xattr prefixes */
 	__le64 packed_nid;	/* nid of the special packed inode */
 	__u8 xattr_filter_reserved; /* reserved for xattr name filter */
-	__u8 reserved[3];
+	__u8 ishare_key_start;	/* start of ishare key */
+	__u8 reserved[2];
 	__le32 build_time;	/* seconds added to epoch for mkfs time */
 	__le64 rootnid_8b;	/* (48BIT on) nid of root directory */
 	__le64 reserved2;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index e80b35db18e4..3ebbb7c5d085 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -167,6 +167,11 @@ struct erofs_sb_info {
 	struct erofs_domain *domain;
 	char *fsid;
 	char *domain_id;
+
+	/* inode page cache share support */
+	u8 ishare_key_start;
+	u8 ishare_key_idx;
+	char *ishare_key;
 };
 
 #define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
@@ -236,6 +241,7 @@ EROFS_FEATURE_FUNCS(dedupe, incompat, INCOMPAT_DEDUPE)
 EROFS_FEATURE_FUNCS(xattr_prefixes, incompat, INCOMPAT_XATTR_PREFIXES)
 EROFS_FEATURE_FUNCS(48bit, incompat, INCOMPAT_48BIT)
 EROFS_FEATURE_FUNCS(metabox, incompat, INCOMPAT_METABOX)
+EROFS_FEATURE_FUNCS(ishare_key, incompat, INCOMPAT_ISHARE_KEY)
 EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)
 EROFS_FEATURE_FUNCS(xattr_filter, compat, COMPAT_XATTR_FILTER)
 EROFS_FEATURE_FUNCS(shared_ea_in_metabox, compat, COMPAT_SHARED_EA_IN_METABOX)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 0d88c04684b9..3561473cb789 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -339,7 +339,7 @@ static int erofs_read_superblock(struct super_block *sb)
 			return -EFSCORRUPTED;	/* self-loop detection */
 	}
 	sbi->inos = le64_to_cpu(dsb->inos);
-
+	sbi->ishare_key_start = dsb->ishare_key_start;
 	sbi->epoch = (s64)le64_to_cpu(dsb->epoch);
 	sbi->fixed_nsec = le32_to_cpu(dsb->fixed_nsec);
 	super_set_uuid(sb, (void *)dsb->uuid, sizeof(dsb->uuid));
@@ -738,6 +738,9 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (err)
 		return err;
 
+	err = erofs_xattr_set_ishare_key(sb);
+	if (err)
+		return err;
 	erofs_set_sysfs_name(sb);
 	err = erofs_register_sysfs(sb);
 	if (err)
diff --git a/fs/erofs/xattr.c b/fs/erofs/xattr.c
index 396536d9a862..3c99091f39a5 100644
--- a/fs/erofs/xattr.c
+++ b/fs/erofs/xattr.c
@@ -564,3 +564,29 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu)
 	return acl;
 }
 #endif
+
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+int erofs_xattr_set_ishare_key(struct super_block *sb)
+{
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+	struct erofs_xattr_prefix_item *pf;
+	char *ishare_key;
+
+	if (!sbi->xattr_prefixes ||
+	    !(sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX))
+		return 0;
+
+	pf = sbi->xattr_prefixes +
+		(sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX_MASK);
+	if (!pf || pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
+		return 0;
+	ishare_key = kmalloc(pf->infix_len + 1, GFP_KERNEL);
+	if (!ishare_key)
+		return -ENOMEM;
+	memcpy(ishare_key, pf->prefix->infix, pf->infix_len);
+	ishare_key[pf->infix_len] = '\0';
+	sbi->ishare_key = ishare_key;
+	sbi->ishare_key_idx = pf->prefix->base_index;
+	return 0;
+}
+#endif
diff --git a/fs/erofs/xattr.h b/fs/erofs/xattr.h
index 6317caa8413e..21684359662c 100644
--- a/fs/erofs/xattr.h
+++ b/fs/erofs/xattr.h
@@ -67,4 +67,10 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu);
 #define erofs_get_acl	(NULL)
 #endif
 
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+int erofs_xattr_set_ishare_key(struct super_block *sb);
+#else
+static inline int erofs_xattr_set_ishare_key(struct super_block *sb) { return 0; }
+#endif
+
 #endif
-- 
2.22.0


* [PATCH v8 5/9] erofs: support domain-specific page cache share
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (3 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 4/9] erofs: support user-defined fingerprint name Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 6/9] erofs: introduce the page cache share feature Hongbo Li
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

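Only files in the same domain will share the page cache. To prepare for
the upcoming page cache share feature, also accept the domain_id mount
option when CONFIG_EROFS_FS_INODE_SHARE is enabled, and stop using the
"domain_id,fsid" sysfs name when an ishare key is set.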
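
As a rough userspace illustration of how a domain can scope sharing, the
sketch below models the fingerprint layout used later in this series
(length, then xattr value, then domain_id). The helper names fp_build()
and fp_equal() are hypothetical and exist only for this example; they
are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace model of the fingerprint buffer:
 *   [ size_t len ][ xattr value ][ domain_id ]
 * where len counts the bytes that follow the length field.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *fp_build(const void *value, size_t vlen, const char *domain_id)
{
	size_t dlen = strlen(domain_id);
	size_t len = vlen + dlen;
	char *fp = malloc(sizeof(len) + len);

	if (!fp)
		return NULL;
	memcpy(fp, &len, sizeof(len));
	memcpy(fp + sizeof(len), value, vlen);
	memcpy(fp + sizeof(len) + vlen, domain_id, dlen);
	return fp;
}

/* Two fingerprints match only if their lengths and payloads are identical. */
static bool fp_equal(const char *a, const char *b)
{
	size_t la, lb;

	memcpy(&la, a, sizeof(la));
	memcpy(&lb, b, sizeof(lb));
	return la == lb && memcmp(a + sizeof(la), b + sizeof(lb), la) == 0;
}
```

With this model, the same xattr value under "dom1" and "dom2" yields
two different fingerprints, so the two files would not share a cache.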

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/super.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 3561473cb789..ce95454c9ee7 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -515,6 +515,8 @@ static int erofs_fc_parse_param(struct fs_context *fc,
 		if (!sbi->fsid)
 			return -ENOMEM;
 		break;
+#endif
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_INODE_SHARE)
 	case Opt_domain_id:
 		kfree(sbi->domain_id);
 		sbi->domain_id = kstrdup(param->string, GFP_KERNEL);
@@ -615,7 +617,7 @@ static void erofs_set_sysfs_name(struct super_block *sb)
 {
 	struct erofs_sb_info *sbi = EROFS_SB(sb);
 
-	if (sbi->domain_id)
+	if (sbi->domain_id && !sbi->ishare_key)
 		super_set_sysfs_name_generic(sb, "%s,%s", sbi->domain_id,
 					     sbi->fsid);
 	else if (sbi->fsid)
@@ -1036,6 +1038,8 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
 #ifdef CONFIG_EROFS_FS_ONDEMAND
 	if (sbi->fsid)
 		seq_printf(seq, ",fsid=%s", sbi->fsid);
+#endif
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_INODE_SHARE)
 	if (sbi->domain_id)
 		seq_printf(seq, ",domain_id=%s", sbi->domain_id);
 #endif
-- 
2.22.0


* [PATCH v8 6/9] erofs: introduce the page cache share feature
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (4 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 5/9] erofs: support domain-specific page cache share Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-17  3:06   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 7/9] erofs: support unencoded inodes for page cache share Hongbo Li
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

Currently, reading files with different paths (or names) but the same
content will consume multiple copies of the page cache, even if the
content of these page caches is the same. For example, reading
identical files (e.g., *.so files) from two different minor versions of
container images will cost multiple copies of the same page cache,
since different containers have different mount points. Therefore,
sharing the page cache for files with the same content can save memory.

This introduces the page cache share feature in erofs. It allocates a
deduplicated inode and uses its page cache as the shared one. Reads for files
with identical content will ultimately be routed to the page cache of
the deduplicated inode. In this way, a single page cache satisfies
multiple read requests for different files with the same contents.
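
The routing idea can be sketched in plain userspace C. This is only a
toy model under stated assumptions: a fixed-size table stands in for
the inode hash used by iget5_locked(), a string stands in for the
fingerprint, and the names shared_get()/shared_put() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SHARED 8

/* Stand-in for the deduplicated inode that owns the shared page cache. */
struct shared_inode {
	char key[64];	/* fingerprint (here: a plain string) */
	int refs;	/* number of backing files attached; 0 = free slot */
};

static struct shared_inode table[MAX_SHARED];

/* Look up the shared inode for a fingerprint, creating it on first use. */
static struct shared_inode *shared_get(const char *key)
{
	struct shared_inode *free_slot = NULL;
	size_t i;

	if (strlen(key) >= sizeof(table[0].key))
		return NULL;
	for (i = 0; i < MAX_SHARED; i++) {
		if (table[i].refs && !strcmp(table[i].key, key)) {
			table[i].refs++;
			return &table[i];
		}
		if (!table[i].refs && !free_slot)
			free_slot = &table[i];
	}
	if (!free_slot)
		return NULL;
	strcpy(free_slot->key, key);
	free_slot->refs = 1;
	return free_slot;
}

/* Drop one reference; the slot becomes reusable when refs reaches 0. */
static void shared_put(struct shared_inode *si)
{
	assert(si->refs > 0);
	si->refs--;
}
```

Two files with the same fingerprint get the same shared_inode back, so
any cache attached to it would be populated once and reused.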

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/Makefile   |   1 +
 fs/erofs/internal.h |  32 +++++-
 fs/erofs/ishare.c   | 236 ++++++++++++++++++++++++++++++++++++++++++++
 fs/erofs/ishare.h   |  28 ++++++
 fs/erofs/super.c    |  30 +++++-
 5 files changed, 324 insertions(+), 3 deletions(-)
 create mode 100644 fs/erofs/ishare.c
 create mode 100644 fs/erofs/ishare.h

diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 549abc424763..102a23bf5dec 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
 erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
 erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
 erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
+erofs-$(CONFIG_EROFS_FS_INODE_SHARE) += ishare.o
\ No newline at end of file
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 3ebbb7c5d085..26772458fda7 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -193,7 +193,7 @@ static inline bool erofs_is_fileio_mode(struct erofs_sb_info *sbi)
 	return IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) && sbi->dif0.file;
 }
 
-#if defined(CONFIG_EROFS_FS_ONDEMAND)
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_INODE_SHARE)
 extern struct file_system_type erofs_anon_fs_type;
 #endif
 
@@ -203,6 +203,19 @@ static inline bool erofs_is_fscache_mode(struct super_block *sb)
 			!erofs_is_fileio_mode(EROFS_SB(sb)) && !sb->s_bdev;
 }
 
+#if defined(CONFIG_EROFS_FS_INODE_SHARE)
+static inline bool erofs_is_ishare_inode(struct inode *inode)
+{
+	/* assumes FS_ONDEMAND and FS_INODE_SHARE are mutually exclusive */
+	return inode->i_sb->s_type == &erofs_anon_fs_type;
+}
+#else
+static inline bool erofs_is_ishare_inode(struct inode *inode)
+{
+	return false;
+}
+#endif
+
 enum {
 	EROFS_ZIP_CACHE_DISABLED,
 	EROFS_ZIP_CACHE_READAHEAD,
@@ -310,6 +323,22 @@ struct erofs_inode {
 		};
 #endif	/* CONFIG_EROFS_FS_ZIP */
 	};
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+	union {
+		/* internal dedup inode */
+		struct {
+			char *fingerprint;
+			spinlock_t lock;
+			/* all backing inodes */
+			struct list_head backing_head;
+		};
+
+		struct {
+			struct inode *ishare;
+			struct list_head backing_link;
+		};
+	};
+#endif
 	/* the corresponding vfs inode */
 	struct inode vfs_inode;
 };
@@ -416,6 +445,7 @@ extern const struct inode_operations erofs_dir_iops;
 
 extern const struct file_operations erofs_file_fops;
 extern const struct file_operations erofs_dir_fops;
+extern const struct file_operations erofs_ishare_fops;
 
 extern const struct iomap_ops z_erofs_iomap_report_ops;
 
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
new file mode 100644
index 000000000000..910b732bf8e7
--- /dev/null
+++ b/fs/erofs/ishare.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024, Alibaba Cloud
+ */
+#include <linux/xxhash.h>
+#include <linux/refcount.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/ramfs.h>
+#include "ishare.h"
+#include "internal.h"
+#include "xattr.h"
+
+static DEFINE_MUTEX(erofs_ishare_lock);
+static struct vfsmount *erofs_ishare_mnt;
+static refcount_t erofs_ishare_supers;
+
+int erofs_ishare_init(struct super_block *sb)
+{
+	struct vfsmount *mnt = NULL;
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+
+	if (!sbi->ishare_key)
+		return 0;
+
+	mutex_lock(&erofs_ishare_lock);
+	if (erofs_ishare_mnt) {
+		refcount_inc(&erofs_ishare_supers);
+	} else {
+		mnt = kern_mount(&erofs_anon_fs_type);
+		if (!IS_ERR(mnt)) {
+			erofs_ishare_mnt = mnt;
+			refcount_set(&erofs_ishare_supers, 1);
+		}
+	}
+	mutex_unlock(&erofs_ishare_lock);
+	return IS_ERR(mnt) ? PTR_ERR(mnt) : 0;
+}
+
+void erofs_ishare_exit(struct super_block *sb)
+{
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+	struct vfsmount *tmp;
+
+	if (!sbi->ishare_key || !erofs_ishare_mnt)
+		return;
+
+	mutex_lock(&erofs_ishare_lock);
+	if (refcount_dec_and_test(&erofs_ishare_supers)) {
+		tmp = erofs_ishare_mnt;
+		erofs_ishare_mnt = NULL;
+		mutex_unlock(&erofs_ishare_lock);
+		kern_unmount(tmp);
+		mutex_lock(&erofs_ishare_lock);
+	}
+	mutex_unlock(&erofs_ishare_lock);
+	kfree(sbi->ishare_key);
+	sbi->ishare_key = NULL;
+}
+
+static int erofs_ishare_iget5_eq(struct inode *inode, void *data)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+
+	return vi->fingerprint && memcmp(vi->fingerprint, data,
+			sizeof(size_t) + *(size_t *)data) == 0;
+}
+
+static int erofs_ishare_iget5_set(struct inode *inode, void *data)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+
+	vi->fingerprint = data;
+	INIT_LIST_HEAD(&vi->backing_head);
+	spin_lock_init(&vi->lock);
+	return 0;
+}
+
+bool erofs_ishare_fill_inode(struct inode *inode)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+	struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);
+	struct inode *idedup;
+	/*
+	 * fingerprint layout:
+	 * fingerprint length + fingerprint content (xattr_value + domain_id)
+	 */
+	char *ishare_key = sbi->ishare_key, *fingerprint;
+	ssize_t ishare_vlen;
+	unsigned long hash;
+	int key_idx;
+
+	if (!sbi->domain_id || !ishare_key)
+		return false;
+
+	key_idx = sbi->ishare_key_idx;
+	ishare_vlen = erofs_getxattr(inode, key_idx, ishare_key, NULL, 0);
+	if (ishare_vlen <= 0 || ishare_vlen > (1 << sbi->blkszbits))
+		return false;
+
+	fingerprint = kmalloc(sizeof(ssize_t) + ishare_vlen +
+			      strlen(sbi->domain_id), GFP_KERNEL);
+	if (!fingerprint)
+		return false;
+
+	*(ssize_t *)fingerprint = ishare_vlen + strlen(sbi->domain_id);
+	if (ishare_vlen != erofs_getxattr(inode, key_idx, ishare_key,
+					  fingerprint + sizeof(ssize_t),
+					  ishare_vlen)) {
+		kfree(fingerprint);
+		return false;
+	}
+
+	memcpy(fingerprint + sizeof(ssize_t) + ishare_vlen,
+	       sbi->domain_id, strlen(sbi->domain_id));
+	hash = xxh32(fingerprint + sizeof(ssize_t),
+		     ishare_vlen + strlen(sbi->domain_id), 0);
+	idedup = iget5_locked(erofs_ishare_mnt->mnt_sb, hash,
+			      erofs_ishare_iget5_eq, erofs_ishare_iget5_set,
+			      fingerprint);
+	if (!idedup) {
+		kfree(fingerprint);
+		return false;
+	}
+
+	INIT_LIST_HEAD(&vi->backing_link);
+	vi->ishare = idedup;
+	spin_lock(&EROFS_I(idedup)->lock);
+	list_add(&vi->backing_link, &EROFS_I(idedup)->backing_head);
+	spin_unlock(&EROFS_I(idedup)->lock);
+
+	if (!(idedup->i_state & I_NEW)) {
+		kfree(fingerprint);
+		return true;
+	}
+
+	if (erofs_inode_is_data_compressed(vi->datalayout))
+		idedup->i_mapping->a_ops = &z_erofs_aops;
+	else
+		idedup->i_mapping->a_ops = &erofs_aops;
+	idedup->i_mode = vi->vfs_inode.i_mode;
+	i_size_write(idedup, vi->vfs_inode.i_size);
+	unlock_new_inode(idedup);
+	return true;
+}
+
+void erofs_ishare_free_inode(struct inode *inode)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+	struct inode *idedup = vi->ishare;
+
+	if (!idedup)
+		return;
+
+	spin_lock(&EROFS_I(idedup)->lock);
+	list_del(&vi->backing_link);
+	spin_unlock(&EROFS_I(idedup)->lock);
+	iput(idedup);
+	vi->ishare = NULL;
+}
+
+static int erofs_ishare_file_open(struct inode *inode, struct file *file)
+{
+	struct file *realfile;
+	struct inode *dedup;
+
+	dedup = EROFS_I(inode)->ishare;
+	if (!dedup)
+		return -EINVAL;
+
+	realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, "erofs_ishare_file",
+				     O_RDONLY, &erofs_file_fops);
+	if (IS_ERR(realfile))
+		return PTR_ERR(realfile);
+
+	file_ra_state_init(&realfile->f_ra, file->f_mapping);
+	realfile->private_data = EROFS_I(inode);
+	file->private_data = realfile;
+	return 0;
+}
+
+static int erofs_ishare_file_release(struct inode *inode, struct file *file)
+{
+	struct file *realfile = file->private_data;
+
+	if (!realfile)
+		return -EINVAL;
+	fput(realfile);
+	realfile->private_data = NULL;
+	return 0;
+}
+
+static ssize_t erofs_ishare_file_read_iter(struct kiocb *iocb,
+						    struct iov_iter *to)
+{
+	struct file *realfile = iocb->ki_filp->private_data;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct kiocb dedup_iocb;
+	ssize_t nread;
+
+	if (!realfile)
+		return -EINVAL;
+	if (!iov_iter_count(to))
+		return 0;
+
+	/* fallback to the original file in DAX or DIRECT mode */
+	if (IS_DAX(inode) || (iocb->ki_flags & IOCB_DIRECT))
+		realfile = iocb->ki_filp;
+
+	kiocb_clone(&dedup_iocb, iocb, realfile);
+	nread = filemap_read(&dedup_iocb, to, 0);
+	iocb->ki_pos = dedup_iocb.ki_pos;
+	touch_atime(&iocb->ki_filp->f_path);
+	return nread;
+}
+
+static int erofs_ishare_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file *realfile = file->private_data;
+
+	if (!realfile)
+		return -EINVAL;
+
+	vma_set_file(vma, realfile);
+	return generic_file_readonly_mmap(file, vma);
+}
+
+const struct file_operations erofs_ishare_fops = {
+	.open		= erofs_ishare_file_open,
+	.llseek		= generic_file_llseek,
+	.read_iter	= erofs_ishare_file_read_iter,
+	.mmap		= erofs_ishare_mmap,
+	.release	= erofs_ishare_file_release,
+	.get_unmapped_area = thp_get_unmapped_area,
+	.splice_read	= filemap_splice_read,
+};
diff --git a/fs/erofs/ishare.h b/fs/erofs/ishare.h
new file mode 100644
index 000000000000..54f2251c8179
--- /dev/null
+++ b/fs/erofs/ishare.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2024, Alibaba Cloud
+ */
+#ifndef __EROFS_ISHARE_H
+#define __EROFS_ISHARE_H
+
+#include <linux/fs.h>
+#include <linux/spinlock.h>
+#include "internal.h"
+
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+
+int erofs_ishare_init(struct super_block *sb);
+void erofs_ishare_exit(struct super_block *sb);
+bool erofs_ishare_fill_inode(struct inode *inode);
+void erofs_ishare_free_inode(struct inode *inode);
+
+#else
+
+static inline int erofs_ishare_init(struct super_block *sb) { return 0; }
+static inline void erofs_ishare_exit(struct super_block *sb) {}
+static inline bool erofs_ishare_fill_inode(struct inode *inode) { return false; }
+static inline void erofs_ishare_free_inode(struct inode *inode) {}
+
+#endif // CONFIG_EROFS_FS_INODE_SHARE
+
+#endif
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index ce95454c9ee7..613dfbe988de 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -13,6 +13,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pseudo_fs.h>
 #include "xattr.h"
+#include "ishare.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/erofs.h>
@@ -81,6 +82,10 @@ static void erofs_free_inode(struct inode *inode)
 {
 	struct erofs_inode *vi = EROFS_I(inode);
 
+	if (erofs_is_ishare_inode(inode)) {
+		erofs_free_dedup_inode(vi);
+		return;
+	}
 	if (inode->i_op == &erofs_fast_symlink_iops)
 		kfree(inode->i_link);
 	kfree(vi->xattr_shared_xattrs);
@@ -926,10 +931,31 @@ static struct file_system_type erofs_fs_type = {
 };
 MODULE_ALIAS_FS("erofs");
 
-#if defined(CONFIG_EROFS_FS_ONDEMAND)
+#if defined(CONFIG_EROFS_FS_ONDEMAND) || defined(CONFIG_EROFS_FS_INODE_SHARE)
+static void erofs_free_anon_inode(struct inode *inode)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+	kfree(vi->fingerprint);
+#endif
+	kmem_cache_free(erofs_inode_cachep, vi);
+}
+
+static const struct super_operations erofs_anon_sops = {
+	.alloc_inode = erofs_alloc_inode,
+	.free_inode = erofs_free_anon_inode,
+};
+
 static int erofs_anon_init_fs_context(struct fs_context *fc)
 {
-	return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, EROFS_SUPER_MAGIC);
+	if (ctx)
+		ctx->ops = &erofs_anon_sops;
+
+	return ctx ? 0 : -ENOMEM;
 }
 
 struct file_system_type erofs_anon_fs_type = {
-- 
2.22.0


* [PATCH v8 7/9] erofs: support unencoded inodes for page cache share
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (5 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 6/9] erofs: introduce the page cache share feature Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-17  3:44   ` Gao Xiang
  2025-11-14  9:55 ` [PATCH v8 8/9] erofs: support compressed " Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 9/9] erofs: implement .fadvise " Hongbo Li
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

This patch adds inode page cache sharing functionality for unencoded
files.
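
The diff below keeps a per-backing-inode "processing" count in
erofs_read_begin()/erofs_read_end(), so that an inode with reads in
flight stays on the processing list until the last read finishes. A
single-threaded model of that counting is sketched here; the struct and
function names are hypothetical, and the real code does this under a
spinlock with a list_head rather than a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-backing-inode bookkeeping. */
struct backing {
	bool on_processing_list;
	int processing_count;
};

/* A read starts: join the list on first use, otherwise bump the count. */
static void read_begin(struct backing *b)
{
	if (b->on_processing_list) {
		b->processing_count++;
	} else {
		b->on_processing_list = true;
		b->processing_count = 1;
	}
}

/* A read ends: leave the list only when no other read is in flight. */
static void read_end(struct backing *b)
{
	assert(b->processing_count > 0);
	if (--b->processing_count == 0)
		b->on_processing_list = false;
}
```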

I conducted experiments in a container environment. Below is the
memory usage after reading all files in two different minor versions
of each container image:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     2771    |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     924     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+

Additionally, the table below shows the runtime memory usage of the
container:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |      35     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |      28     |      20%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     149     |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |      95     |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     1028    |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |     930     |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     155     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     132     |      15%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |      25     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |      20     |      20%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     186     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |      98     |      48%      |
+-------------------+------------------+-------------+---------------+
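
For reference, the reduction column can be recomputed from the two
memory figures in each row. The helper below is a sketch, not part of
the patch; it assumes the percentages are rounded up to the next whole
number, which is consistent with the values in both tables.

```c
#include <assert.h>

/*
 * Percentage of memory saved when usage drops from `before` MB to
 * `after` MB, rounded up. Requires before > 0 and after <= before.
 */
static unsigned int reduction_pct(unsigned int before, unsigned int after)
{
	unsigned int saved = before - after;

	return (saved * 100 + before - 1) / before;
}
```

For example, the redis row of the first table gives
reduction_pct(241, 163), and the tomcat row of the second table gives
reduction_pct(186, 98).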

Co-developed-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/data.c     | 38 +++++++++++++++---
 fs/erofs/inode.c    |  5 +++
 fs/erofs/internal.h |  4 ++
 fs/erofs/ishare.c   | 98 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/erofs/ishare.h   | 18 +++++++++
 fs/erofs/super.c    | 11 +++--
 6 files changed, 163 insertions(+), 11 deletions(-)

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index bd3d85c61341..c459104e4734 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2021, Alibaba Cloud
  */
 #include "internal.h"
+#include "ishare.h"
 #include <linux/sched/mm.h>
 #include <trace/events/erofs.h>
 
@@ -269,23 +270,27 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 struct erofs_iomap_iter_ctx {
 	struct page *page;
 	void *base;
+	struct inode *realinode;
 };
 
 static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
 {
-	int ret;
 	struct erofs_iomap_iter_ctx *ctx;
-	struct super_block *sb = inode->i_sb;
 	struct erofs_map_blocks map;
 	struct erofs_map_dev mdev;
 	struct iomap_iter *iter;
+	struct inode *realinode;
+	struct super_block *sb;
+	int ret;
 
 	iter = container_of(iomap, struct iomap_iter, iomap);
 	ctx = iter->private;
+	realinode = ctx ? ctx->realinode : inode;
+	sb = realinode->i_sb;
 	map.m_la = offset;
 	map.m_llen = length;
-	ret = erofs_map_blocks(inode, &map);
+	ret = erofs_map_blocks(realinode, &map);
 	if (ret < 0)
 		return ret;
 
@@ -300,7 +305,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		return 0;
 	}
 
-	if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(inode)) {
+	if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(realinode)) {
 		mdev = (struct erofs_map_dev) {
 			.m_deviceid = map.m_deviceid,
 			.m_pa = map.m_pa,
@@ -326,7 +331,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
 
 			ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
-						 erofs_inode_in_metabox(inode));
+						 erofs_inode_in_metabox(realinode));
 			if (IS_ERR(ptr))
 				return PTR_ERR(ptr);
 			iomap->inline_data = ptr;
@@ -384,6 +389,7 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
  */
 static int erofs_read_folio(struct file *file, struct folio *folio)
 {
+	struct inode *inode = folio_inode(folio);
 	struct iomap_read_folio_ctx read_ctx = {
 		.ops		= &iomap_bio_read_ops,
 		.cur_folio	= folio,
@@ -391,16 +397,27 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
 	struct erofs_iomap_iter_ctx iter_ctx = {
 		.page		= NULL,
 		.base		= NULL,
+		.realinode	= erofs_ishare_iget(inode),
+	};
+	struct erofs_read_ctx rdctx = {
+		.file		= file,
+		.inode		= inode,
 	};
 
+	if (!iter_ctx.realinode)
+		return -EIO;
 	trace_erofs_read_folio(folio, true);
 
+	erofs_read_begin(&rdctx);
 	iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+	erofs_read_end(&rdctx);
+	erofs_ishare_iput(iter_ctx.realinode);
 	return 0;
 }
 
 static void erofs_readahead(struct readahead_control *rac)
 {
+	struct inode *inode = rac->mapping->host;
 	struct iomap_read_folio_ctx read_ctx = {
 		.ops		= &iomap_bio_read_ops,
 		.rac		= rac,
@@ -408,12 +425,22 @@ static void erofs_readahead(struct readahead_control *rac)
 	struct erofs_iomap_iter_ctx iter_ctx = {
 		.page		= NULL,
 		.base		= NULL,
+		.realinode	= erofs_ishare_iget(inode),
+	};
+	struct erofs_read_ctx rdctx = {
+		.file		= rac->file,
+		.inode		= inode,
 	};
 
+	if (!iter_ctx.realinode)
+		return;
 	trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
 					readahead_count(rac), true);
 
+	erofs_read_begin(&rdctx);
 	iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+	erofs_read_end(&rdctx);
+	erofs_ishare_iput(iter_ctx.realinode);
 }
 
 static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
@@ -437,6 +464,7 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		struct erofs_iomap_iter_ctx iter_ctx = {
 			.page = NULL,
 			.base = NULL,
+			.realinode = inode,
 		};
 
 		return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index cb780c095d28..fe45e6c18f8e 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2021, Alibaba Cloud
  */
 #include "xattr.h"
+#include "ishare.h"
 #include <linux/compat.h>
 #include <trace/events/erofs.h>
 
@@ -215,6 +216,10 @@ static int erofs_fill_inode(struct inode *inode)
 	case S_IFREG:
 		inode->i_op = &erofs_generic_iops;
 		inode->i_fop = &erofs_file_fops;
+#ifdef CONFIG_EROFS_FS_INODE_SHARE
+		if (erofs_ishare_fill_inode(inode))
+			inode->i_fop = &erofs_ishare_fops;
+#endif
 		break;
 	case S_IFDIR:
 		inode->i_op = &erofs_dir_iops;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 26772458fda7..6f7d441955c6 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -331,11 +331,15 @@ struct erofs_inode {
 			spinlock_t lock;
 			/* all backing inodes */
 			struct list_head backing_head;
+			/* processing list */
+			struct list_head processing_head;
 		};
 
 		struct {
 			struct inode *ishare;
 			struct list_head backing_link;
+			struct list_head processing_link;
+			atomic_t processing_count;
 		};
 	};
 #endif
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index 910b732bf8e7..14b2690055c5 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -72,6 +72,7 @@ static int erofs_ishare_iget5_set(struct inode *inode, void *data)
 
 	vi->fingerprint = data;
 	INIT_LIST_HEAD(&vi->backing_head);
+	INIT_LIST_HEAD(&vi->processing_head);
 	spin_lock_init(&vi->lock);
 	return 0;
 }
@@ -124,7 +125,9 @@ bool erofs_ishare_fill_inode(struct inode *inode)
 	}
 
 	INIT_LIST_HEAD(&vi->backing_link);
+	INIT_LIST_HEAD(&vi->processing_link);
 	vi->ishare = idedup;
+
 	spin_lock(&EROFS_I(idedup)->lock);
 	list_add(&vi->backing_link, &EROFS_I(idedup)->backing_head);
 	spin_unlock(&EROFS_I(idedup)->lock);
@@ -163,17 +166,28 @@ static int erofs_ishare_file_open(struct inode *inode, struct file *file)
 {
 	struct file *realfile;
 	struct inode *dedup;
+	char *buf, *filepath;
 
 	dedup = EROFS_I(inode)->ishare;
 	if (!dedup)
 		return -EINVAL;
 
-	realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, "erofs_ishare_file",
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	filepath = file_path(file, buf, PATH_MAX);
+	if (IS_ERR(filepath)) {
+		kfree(buf);
+		return PTR_ERR(filepath);
+	}
+	realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, filepath + 1,
 				     O_RDONLY, &erofs_file_fops);
+	kfree(buf);
 	if (IS_ERR(realfile))
 		return PTR_ERR(realfile);
 
 	file_ra_state_init(&realfile->f_ra, file->f_mapping);
+	ihold(dedup);
 	realfile->private_data = EROFS_I(inode);
 	file->private_data = realfile;
 	return 0;
@@ -185,8 +199,8 @@ static int erofs_ishare_file_release(struct inode *inode, struct file *file)
 
 	if (!realfile)
 		return -EINVAL;
+	file->private_data = NULL;
 	fput(realfile);
-	realfile->private_data = NULL;
 	return 0;
 }
 
@@ -234,3 +248,83 @@ const struct file_operations erofs_ishare_fops = {
 	.get_unmapped_area = thp_get_unmapped_area,
 	.splice_read	= filemap_splice_read,
 };
+
+void erofs_read_begin(struct erofs_read_ctx *rdctx)
+{
+	struct erofs_inode *vi, *vi_dedup;
+
+	if (!rdctx->file || !erofs_is_ishare_inode(rdctx->inode))
+		return;
+
+	vi = rdctx->file->private_data;
+	vi_dedup = EROFS_I(file_inode(rdctx->file));
+
+	spin_lock(&vi_dedup->lock);
+	if (!list_empty(&vi->processing_link)) {
+		atomic_inc(&vi->processing_count);
+	} else {
+		list_add(&vi->processing_link,
+			 &vi_dedup->processing_head);
+		atomic_set(&vi->processing_count, 1);
+	}
+	spin_unlock(&vi_dedup->lock);
+}
+
+void erofs_read_end(struct erofs_read_ctx *rdctx)
+{
+	struct erofs_inode *vi, *vi_dedup;
+
+	if (!rdctx->file || !erofs_is_ishare_inode(rdctx->inode))
+		return;
+
+	vi = rdctx->file->private_data;
+	vi_dedup = EROFS_I(file_inode(rdctx->file));
+
+	spin_lock(&vi_dedup->lock);
+	if (atomic_dec_and_test(&vi->processing_count))
+		list_del_init(&vi->processing_link);
+	spin_unlock(&vi_dedup->lock);
+}
+
+/*
+ * erofs_ishare_iget - find the backing inode.
+ */
+struct inode *erofs_ishare_iget(struct inode *inode)
+{
+	struct erofs_inode *vi, *vi_dedup;
+	struct inode *realinode;
+
+	if (!erofs_is_ishare_inode(inode))
+		return igrab(inode);
+
+	vi_dedup = EROFS_I(inode);
+	spin_lock(&vi_dedup->lock);
+	/* try processing inodes first */
+	if (!list_empty(&vi_dedup->processing_head)) {
+		list_for_each_entry(vi, &vi_dedup->processing_head,
+				    processing_link) {
+			realinode = igrab(&vi->vfs_inode);
+			if (realinode) {
+				spin_unlock(&vi_dedup->lock);
+				return realinode;
+			}
+		}
+	}
+
+	/* fall back to all backing inodes */
+	DBG_BUGON(list_empty(&vi_dedup->backing_head));
+	list_for_each_entry(vi, &vi_dedup->backing_head, backing_link) {
+		realinode = igrab(&vi->vfs_inode);
+		if (realinode)
+			break;
+	}
+	spin_unlock(&vi_dedup->lock);
+
+	DBG_BUGON(!realinode);
+	return realinode;
+}
+
+void erofs_ishare_iput(struct inode *realinode)
+{
+	iput(realinode);
+}
diff --git a/fs/erofs/ishare.h b/fs/erofs/ishare.h
index 54f2251c8179..b85fa240507b 100644
--- a/fs/erofs/ishare.h
+++ b/fs/erofs/ishare.h
@@ -9,6 +9,11 @@
 #include <linux/spinlock.h>
 #include "internal.h"
 
+struct erofs_read_ctx {
+	struct file *file; /* may be NULL */
+	struct inode *inode;
+};
+
 #ifdef CONFIG_EROFS_FS_INODE_SHARE
 
 int erofs_ishare_init(struct super_block *sb);
@@ -16,6 +21,13 @@ void erofs_ishare_exit(struct super_block *sb);
 bool erofs_ishare_fill_inode(struct inode *inode);
 void erofs_ishare_free_inode(struct inode *inode);
 
+/* read/readahead */
+void erofs_read_begin(struct erofs_read_ctx *rdctx);
+void erofs_read_end(struct erofs_read_ctx *rdctx);
+
+struct inode *erofs_ishare_iget(struct inode *inode);
+void erofs_ishare_iput(struct inode *realinode);
+
 #else
 
 static inline int erofs_ishare_init(struct super_block *sb) { return 0; }
@@ -23,6 +35,12 @@ static inline void erofs_ishare_exit(struct super_block *sb) {}
 static inline bool erofs_ishare_fill_inode(struct inode *inode) { return false; }
 static inline void erofs_ishare_free_inode(struct inode *inode) {}
 
+static inline void erofs_read_begin(struct erofs_read_ctx *rdctx) {}
+static inline void erofs_read_end(struct erofs_read_ctx *rdctx) {}
+
+static inline struct inode *erofs_ishare_iget(struct inode *inode) { return inode; }
+static inline void erofs_ishare_iput(struct inode *realinode) {}
+
 #endif // CONFIG_EROFS_FS_INODE_SHARE
 
 #endif
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 613dfbe988de..2af82171dd78 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -82,10 +82,6 @@ static void erofs_free_inode(struct inode *inode)
 {
 	struct erofs_inode *vi = EROFS_I(inode);
 
-	if (erofs_is_ishare_inode(inode)) {
-		erofs_free_dedup_inode(vi);
-		return;
-	}
 	if (inode->i_op == &erofs_fast_symlink_iops)
 		kfree(inode->i_link);
 	kfree(vi->xattr_shared_xattrs);
@@ -753,6 +749,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (err)
 		return err;
 
+	err = erofs_ishare_init(sb);
+	if (err)
+		return err;
+
 	sbi->dir_ra_bytes = EROFS_DIR_RA_BYTES;
 	erofs_info(sb, "mounted with root inode @ nid %llu.", sbi->root_nid);
 	return 0;
@@ -902,6 +902,7 @@ static void erofs_kill_sb(struct super_block *sb)
 		kill_anon_super(sb);
 	else
 		kill_block_super(sb);
+
 	erofs_drop_internal_inodes(sbi);
 	fs_put_dax(sbi->dif0.dax_dev, NULL);
 	erofs_fscache_unregister_fs(sb);
@@ -913,6 +914,7 @@ static void erofs_put_super(struct super_block *sb)
 {
 	struct erofs_sb_info *const sbi = EROFS_SB(sb);
 
+	erofs_ishare_exit(sb);
 	erofs_unregister_sysfs(sb);
 	erofs_shrinker_unregister(sb);
 	erofs_xattr_prefixes_cleanup(sb);
@@ -1081,6 +1083,7 @@ static void erofs_evict_inode(struct inode *inode)
 		dax_break_layout_final(inode);
 #endif
 
+	erofs_ishare_free_inode(inode);
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 }
-- 
2.22.0



* [PATCH v8 8/9] erofs: support compressed inodes for page cache share
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (6 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 7/9] erofs: support unencoded inodes for page cache share Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-14  9:55 ` [PATCH v8 9/9] erofs: implement .fadvise " Hongbo Li
  8 siblings, 0 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

This patch adds page cache sharing functionality for compressed inodes.

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/zdata.c | 56 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index bc80cfe482f7..e76421de86cb 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2022 Alibaba Cloud
  */
 #include "compress.h"
+#include "ishare.h"
 #include <linux/psi.h>
 #include <linux/cpuhotplug.h>
 #include <trace/events/erofs.h>
@@ -493,7 +494,7 @@ enum z_erofs_pclustermode {
 };
 
 struct z_erofs_frontend {
-	struct inode *const inode;
+	struct inode *inode;
 	struct erofs_map_blocks map;
 	struct z_erofs_bvec_iter biter;
 
@@ -1870,10 +1871,24 @@ static void z_erofs_pcluster_readmore(struct z_erofs_frontend *f,
 
 static int z_erofs_read_folio(struct file *file, struct folio *folio)
 {
-	struct inode *const inode = folio->mapping->host;
-	Z_EROFS_DEFINE_FRONTEND(f, inode, folio_pos(folio));
+	struct inode *const inode = folio->mapping->host, *realinode;
+	Z_EROFS_DEFINE_FRONTEND(f, NULL, folio_pos(folio));
+	struct erofs_read_ctx rdctx = {
+		.file = file,
+		.inode = inode,
+	};
 	int err;
 
+	trace_erofs_read_folio(folio, false);
+
+	erofs_read_begin(&rdctx);
+
+	if (erofs_is_ishare_inode(inode))
+		realinode = erofs_ishare_iget(inode);
+	else
+		realinode = inode;
+
+	f.inode = realinode;
 	trace_erofs_read_folio(folio, false);
 	z_erofs_pcluster_readmore(&f, NULL, true);
 	err = z_erofs_scan_folio(&f, folio, false);
@@ -1883,23 +1898,39 @@ static int z_erofs_read_folio(struct file *file, struct folio *folio)
 	/* if some pclusters are ready, need submit them anyway */
 	err = z_erofs_runqueue(&f, 0) ?: err;
 	if (err && err != -EINTR)
-		erofs_err(inode->i_sb, "read error %d @ %lu of nid %llu",
-			  err, folio->index, EROFS_I(inode)->nid);
+		erofs_err(realinode->i_sb, "read error %d @ %lu of nid %llu",
+			  err, folio->index, EROFS_I(realinode)->nid);
 
 	erofs_put_metabuf(&f.map.buf);
 	erofs_release_pages(&f.pagepool);
+
+	if (erofs_is_ishare_inode(inode))
+		erofs_ishare_iput(realinode);
+
+	erofs_read_end(&rdctx);
 	return err;
 }
 
 static void z_erofs_readahead(struct readahead_control *rac)
 {
-	struct inode *const inode = rac->mapping->host;
-	Z_EROFS_DEFINE_FRONTEND(f, inode, readahead_pos(rac));
+	struct inode *const inode = rac->mapping->host, *realinode;
+	Z_EROFS_DEFINE_FRONTEND(f, NULL, readahead_pos(rac));
 	unsigned int nrpages = readahead_count(rac);
 	struct folio *head = NULL, *folio;
+	struct erofs_read_ctx rdctx = {
+		.file = rac->file,
+		.inode = inode,
+	};
 	int err;
 
-	trace_erofs_readahead(inode, readahead_index(rac), nrpages, false);
+	erofs_read_begin(&rdctx);
+	if (erofs_is_ishare_inode(inode))
+		realinode = erofs_ishare_iget(inode);
+	else
+		realinode = inode;
+
+	f.inode = realinode;
+	trace_erofs_readahead(realinode, readahead_index(rac), nrpages, false);
 	z_erofs_pcluster_readmore(&f, rac, true);
 	while ((folio = readahead_folio(rac))) {
 		folio->private = head;
@@ -1913,8 +1944,8 @@ static void z_erofs_readahead(struct readahead_control *rac)
 
 		err = z_erofs_scan_folio(&f, folio, true);
 		if (err && err != -EINTR)
-			erofs_err(inode->i_sb, "readahead error at folio %lu @ nid %llu",
-				  folio->index, EROFS_I(inode)->nid);
+			erofs_err(realinode->i_sb, "readahead error at folio %lu @ nid %llu",
+				  folio->index, EROFS_I(realinode)->nid);
 	}
 	z_erofs_pcluster_readmore(&f, rac, false);
 	z_erofs_pcluster_end(&f);
@@ -1922,6 +1953,11 @@ static void z_erofs_readahead(struct readahead_control *rac)
 	(void)z_erofs_runqueue(&f, nrpages);
 	erofs_put_metabuf(&f.map.buf);
 	erofs_release_pages(&f.pagepool);
+
+	if (erofs_is_ishare_inode(inode))
+		erofs_ishare_iput(realinode);
+
+	erofs_read_end(&rdctx);
 }
 
 const struct address_space_operations z_erofs_aops = {
-- 
2.22.0
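
For readers following the diff above, the get/put pairing around the real
inode can be modelled in user space roughly as below. This is only a sketch
under assumptions: `struct demo_inode`, `demo_iget()`, `demo_iput()` and
`demo_read()` are hypothetical stand-ins, not EROFS code, and the model takes
a reference in both the shared and the non-shared case for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct demo_inode {
	int refcount;			/* simplified reference count */
	bool shared;			/* true for a page-cache-sharing inode */
	struct demo_inode *backing;	/* real inode behind a shared one */
};

/* Resolve the inode that actually owns the data and pin it. */
static struct demo_inode *demo_iget(struct demo_inode *inode)
{
	struct demo_inode *real = inode->shared ? inode->backing : inode;

	real->refcount++;
	return real;
}

/* Drop the reference taken by demo_iget(). */
static void demo_iput(struct demo_inode *real)
{
	assert(real->refcount > 0);
	real->refcount--;
}

/*
 * Model of a read path: pin the real inode, "read" from it, unpin it.
 * Returns the refcount observed while the read was in flight.
 */
static int demo_read(struct demo_inode *inode)
{
	struct demo_inode *real = demo_iget(inode);
	int during = real->refcount;

	demo_iput(real);
	return during;
}
```

The point of the pairing is that the backing inode cannot go away while the
read is in flight, and that its reference count is unchanged afterwards.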



* [PATCH v8 9/9] erofs: implement .fadvise for page cache share
  2025-11-14  9:55 [PATCH v8 0/9] erofs: inode page cache share feature Hongbo Li
                   ` (7 preceding siblings ...)
  2025-11-14  9:55 ` [PATCH v8 8/9] erofs: support compressed " Hongbo Li
@ 2025-11-14  9:55 ` Hongbo Li
  2025-11-17  3:48   ` Gao Xiang
  8 siblings, 1 reply; 23+ messages in thread
From: Hongbo Li @ 2025-11-14  9:55 UTC (permalink / raw)
  To: hsiangkao, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

From: Hongzhen Luo <hongzhen@linux.alibaba.com>

This patch implements the .fadvise interface for page cache share.
Similar to overlayfs, it drops those clean, unused pages through
vfs_fadvise().

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 fs/erofs/ishare.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index 14b2690055c5..88c4af3f8993 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -239,6 +239,16 @@ static int erofs_ishare_mmap(struct file *file, struct vm_area_struct *vma)
 	return generic_file_readonly_mmap(file, vma);
 }
 
+static int erofs_ishare_fadvise(struct file *file, loff_t offset,
+				loff_t len, int advice)
+{
+	struct file *realfile = file->private_data;
+
+	if (!realfile)
+		return -EINVAL;
+	return vfs_fadvise(realfile, offset, len, advice);
+}
+
 const struct file_operations erofs_ishare_fops = {
 	.open		= erofs_ishare_file_open,
 	.llseek		= generic_file_llseek,
@@ -247,6 +257,7 @@ const struct file_operations erofs_ishare_fops = {
 	.release	= erofs_ishare_file_release,
 	.get_unmapped_area = thp_get_unmapped_area,
 	.splice_read	= filemap_splice_read,
+	.fadvise	= erofs_ishare_fadvise,
 };
 
 void erofs_read_begin(struct erofs_read_ctx *rdctx)
-- 
2.22.0
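
From user space, a .fadvise handler like the one above is normally reached
through posix_fadvise(2). The snippet below is a minimal sketch of such a
caller. It assumes POSIX_FADV_DONTNEED is the advice of interest, which the
commit message implies but does not spell out, and `demo_drop_cache()` is a
hypothetical helper name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* expose posix_fadvise() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Ask the kernel to drop clean, unused cached pages of an open file.
 * posix_fadvise() returns 0 on success or a positive errno value
 * (for example EBADF) on failure; it does not set errno.
 */
static int demo_drop_cache(int fd)
{
	/* offset 0 with len 0 means "from the start to the end of the file" */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A caller would open a file on the mounted image, read it, and then call
demo_drop_cache(fd) to hint that the cached pages are no longer needed.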



* Re: [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter
  2025-11-14  9:55 ` [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter Hongbo Li
@ 2025-11-16 11:53   ` Gao Xiang
  2025-11-16 11:54   ` Gao Xiang
  1 sibling, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-16 11:53 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> It's useful to get filesystem-specific information using the
> existing private field in the @iomap_iter passed to iomap_{begin,end}
> for advanced usage for iomap buffered reads, which is much like the
> current iomap DIO.
> 
> For example, EROFS needs it to:
> 
>   - implement an efficient page cache sharing feature, since iomap
>     needs to apply to anon inode page cache but we'd like to get the
>     backing inode/fs instead, so filesystem-specific private data is
>     needed to keep such information;
> 
>   - pass in both struct page * and void * for inline data to avoid
>     kmap_to_page() usage (which is bogus).
> 
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/fuse/file.c         | 4 ++--
>   fs/iomap/buffered-io.c | 6 ++++--
>   include/linux/iomap.h  | 8 ++++----
>   3 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8275b6681b9b..98dd20f0bb53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -973,7 +973,7 @@ static int fuse_read_folio(struct file *file, struct folio *folio)
>   		return -EIO;
>   	}
>   
> -	iomap_read_folio(&fuse_iomap_ops, &ctx);
> +	iomap_read_folio(&fuse_iomap_ops, &ctx, NULL);
>   	fuse_invalidate_atime(inode);
>   	return 0;
>   }
> @@ -1075,7 +1075,7 @@ static void fuse_readahead(struct readahead_control *rac)
>   	if (fuse_is_bad(inode))
>   		return;
>   
> -	iomap_readahead(&fuse_iomap_ops, &ctx);
> +	iomap_readahead(&fuse_iomap_ops, &ctx, NULL);
>   }
>   
>   static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6ae031ac8058..8e79303c074e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -496,13 +496,14 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
>   }
>   
>   void iomap_read_folio(const struct iomap_ops *ops,
> -		struct iomap_read_folio_ctx *ctx)
> +		struct iomap_read_folio_ctx *ctx, void *private)
>   {
>   	struct folio *folio = ctx->cur_folio;
>   	struct iomap_iter iter = {
>   		.inode		= folio->mapping->host,
>   		.pos		= folio_pos(folio),
>   		.len		= folio_size(folio),
> +		.private	= private,
>   	};
>   	size_t bytes_pending = 0;
>   	int ret;
> @@ -560,13 +561,14 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
>    * the filesystem to be reentered.
>    */
>   void iomap_readahead(const struct iomap_ops *ops,
> -		struct iomap_read_folio_ctx *ctx)
> +		struct iomap_read_folio_ctx *ctx, void *private)
>   {
>   	struct readahead_control *rac = ctx->rac;
>   	struct iomap_iter iter = {
>   		.inode	= rac->mapping->host,
>   		.pos	= readahead_pos(rac),
>   		.len	= readahead_length(rac),
> +		.private = private,
>   	};
>   	size_t cur_bytes_pending;
>   
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8b1ac08c7474..c3ecbbdb14e8 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -341,9 +341,9 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>   		const struct iomap_ops *ops,
>   		const struct iomap_write_ops *write_ops, void *private);
>   void iomap_read_folio(const struct iomap_ops *ops,
> -		struct iomap_read_folio_ctx *ctx);
> +		struct iomap_read_folio_ctx *ctx, void *private);
>   void iomap_readahead(const struct iomap_ops *ops,
> -		struct iomap_read_folio_ctx *ctx);
> +		struct iomap_read_folio_ctx *ctx, void *private);
>   bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>   struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>   bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
> @@ -594,7 +594,7 @@ static inline void iomap_bio_read_folio(struct folio *folio,
>   		.cur_folio	= folio,
>   	};
>   
> -	iomap_read_folio(ops, &ctx);
> +	iomap_read_folio(ops, &ctx, NULL);
>   }
>   
>   static inline void iomap_bio_readahead(struct readahead_control *rac,
> @@ -605,7 +605,7 @@ static inline void iomap_bio_readahead(struct readahead_control *rac,
>   		.rac		= rac,
>   	};
>   
> -	iomap_readahead(ops, &ctx);
> +	iomap_readahead(ops, &ctx, NULL);
>   }
>   #endif /* CONFIG_BLOCK */
>   



* Re: [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter
  2025-11-14  9:55 ` [PATCH v8 1/9] iomap: stash iomap read ctx in the private field of iomap_iter Hongbo Li
  2025-11-16 11:53   ` Gao Xiang
@ 2025-11-16 11:54   ` Gao Xiang
  1 sibling, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-16 11:54 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> It's useful to get filesystem-specific information using the
> existing private field in the @iomap_iter passed to iomap_{begin,end}
> for advanced usage for iomap buffered reads, which is much like the
> current iomap DIO.
> 
> For example, EROFS needs it to:
> 
>   - implement an efficient page cache sharing feature, since iomap
>     needs to apply to anon inode page cache but we'd like to get the
>     backing inode/fs instead, so filesystem-specific private data is
>     needed to keep such information;
> 
>   - pass in both struct page * and void * for inline data to avoid
>     kmap_to_page() usage (which is bogus).
> 
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>

LGTM, and I think the case 2) is useful even without
the main feature:

Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Thanks,
Gao Xiang
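
As an illustration of the mechanism discussed above, the way a callback that
only receives a `struct iomap *` can get back to the iterator's private field
can be modelled in user space as follows. This is a sketch under assumptions:
the structs are cut-down stand-ins for the kernel ones, and
`struct demo_iter_ctx` and `demo_ctx_from_iomap()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins; the real kernel structs have many more fields. */
struct iomap {
	int type;
};

struct iomap_iter {
	long long pos;
	struct iomap iomap;	/* embedded, so container_of() can find the iter */
	void *private;		/* filesystem-specific read context */
};

/* Hypothetical per-read context, loosely modelled on the EROFS one. */
struct demo_iter_ctx {
	void *page;
	void *base;
};

/* Mimics an ->iomap_begin() callback that is only handed the iomap. */
static struct demo_iter_ctx *demo_ctx_from_iomap(struct iomap *iomap)
{
	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);

	return iter->private;	/* NULL when the caller passed no context */
}
```

Callers that pass NULL as the private pointer, as the fuse and
iomap_bio_* call sites in the patch do, simply get NULL back here.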


* Re: [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed
  2025-11-14  9:55 ` [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed Hongbo Li
@ 2025-11-16 12:01   ` Gao Xiang
  2025-11-17  1:45     ` Hongbo Li
  0 siblings, 1 reply; 23+ messages in thread
From: Gao Xiang @ 2025-11-16 12:01 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> Upcoming page cache sharing needs to pass read context to iomap_iter,
> here we unify the way of passing the read context in EROFS. Moreover,
> bmap and fiemap don't need to map the inline data.
> 
> Note that we keep `struct page *` in `struct erofs_iomap_iter_ctx` as
> well to avoid bogus kmap_to_page usage.
> 
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/erofs/data.c | 79 ++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 59 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index bb13c4cb8455..bd3d85c61341 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -266,14 +266,23 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>   	folio_end_read(folio, !(v & BIT(EROFS_ONLINEFOLIO_EIO)));
>   }
>   
> +struct erofs_iomap_iter_ctx {
> +	struct page *page;
> +	void *base;
> +};
> +
>   static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
>   {
>   	int ret;
> +	struct erofs_iomap_iter_ctx *ctx;
>   	struct super_block *sb = inode->i_sb;
>   	struct erofs_map_blocks map;
>   	struct erofs_map_dev mdev;
> +	struct iomap_iter *iter;
>   
> +	iter = container_of(iomap, struct iomap_iter, iomap);
> +	ctx = iter->private;

Can you just rearrange it as:

	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
	struct erofs_iomap_iter_ctx *ctx = iter->private;

?

>   	map.m_la = offset;
>   	map.m_llen = length;
>   	ret = erofs_map_blocks(inode, &map);
> @@ -283,7 +292,8 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   	iomap->offset = map.m_la;
>   	iomap->length = map.m_llen;
>   	iomap->flags = 0;
> -	iomap->private = NULL;
> +	if (ctx)
> +		ctx->base = NULL;

I think this line is unnecessary if iter->private == ctx;

>   	iomap->addr = IOMAP_NULL_ADDR;
>   	if (!(map.m_flags & EROFS_MAP_MAPPED)) {
>   		iomap->type = IOMAP_HOLE;
> @@ -309,16 +319,20 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   	}
>   
>   	if (map.m_flags & EROFS_MAP_META) {
> -		void *ptr;
> -		struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
> -
>   		iomap->type = IOMAP_INLINE;
> -		ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
> -					 erofs_inode_in_metabox(inode));
> -		if (IS_ERR(ptr))
> -			return PTR_ERR(ptr);
> -		iomap->inline_data = ptr;
> -		iomap->private = buf.base;
> +		/* read context should read the inlined data */
> +		if (ctx) {
> +			void *ptr;
> +			struct erofs_buf buf = __EROFS_BUF_INITIALIZER;

better to reorder them as:
			struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
			void *ptr;

> +
> +			ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
> +						 erofs_inode_in_metabox(inode));
> +			if (IS_ERR(ptr))
> +				return PTR_ERR(ptr);
> +			iomap->inline_data = ptr;
> +			ctx->page = buf.page;
> +			ctx->base = buf.base;
> +		}
>   	} else {
>   		iomap->type = IOMAP_MAPPED;
>   	}
> @@ -328,18 +342,19 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   static int erofs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>   		ssize_t written, unsigned int flags, struct iomap *iomap)
>   {
> -	void *ptr = iomap->private;
> +	struct erofs_iomap_iter_ctx *ctx;
> +	struct iomap_iter *iter;
>   
> -	if (ptr) {
> +	iter = container_of(iomap, struct iomap_iter, iomap);
> +	ctx = iter->private;
> +	if (ctx && ctx->base) {
>   		struct erofs_buf buf = {
> -			.page = kmap_to_page(ptr),
> -			.base = ptr,
> +			.page = ctx->page,
> +			.base = ctx->base,
>   		};
>   
>   		DBG_BUGON(iomap->type != IOMAP_INLINE);
>   		erofs_put_metabuf(&buf);

so ctx->base needs to be nullified here:

		ctx->base = NULL;

> -	} else {
> -		DBG_BUGON(iomap->type == IOMAP_INLINE);
>   	}
>   	return written;
>   }
> @@ -369,18 +384,36 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>    */
>   static int erofs_read_folio(struct file *file, struct folio *folio)
>   {
> +	struct iomap_read_folio_ctx read_ctx = {
> +		.ops		= &iomap_bio_read_ops,
> +		.cur_folio	= folio,
> +	};
> +	struct erofs_iomap_iter_ctx iter_ctx = {
> +		.page		= NULL,
> +		.base		= NULL,
> +	};

it can be initialized just by:
	struct erofs_iomap_iter_ctx iter_ctx = {};

> +
>   	trace_erofs_read_folio(folio, true);
>   
> -	iomap_bio_read_folio(folio, &erofs_iomap_ops);
> +	iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
>   	return 0;
>   }
>   
>   static void erofs_readahead(struct readahead_control *rac)
>   {
> +	struct iomap_read_folio_ctx read_ctx = {
> +		.ops		= &iomap_bio_read_ops,
> +		.rac		= rac,
> +	};
> +	struct erofs_iomap_iter_ctx iter_ctx = {
> +		.page		= NULL,
> +		.base		= NULL,
> +	};

Same here.

> +
>   	trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
>   					readahead_count(rac), true);
>   
> -	iomap_bio_readahead(rac, &erofs_iomap_ops);
> +	iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
>   }
>   
>   static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
> @@ -400,9 +433,15 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   	if (IS_DAX(inode))
>   		return dax_iomap_rw(iocb, to, &erofs_iomap_ops);
>   #endif
> -	if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev)
> +	if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev) {
> +		struct erofs_iomap_iter_ctx iter_ctx = {
> +			.page = NULL,
> +			.base = NULL,
> +		};

Same here again.

Thanks,
Gao Xiang
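
The context lifecycle discussed in this review can be modelled in user space
roughly as below: begin stashes the mapped buffer in the context, end releases
it and clears `base` so a second end call does nothing. This is only a sketch
under assumptions; every `demo_*` name is hypothetical, and `{0}` is used
where the review suggests `{}` purely for portability to older C standards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct erofs_buf and the iter context. */
struct demo_buf {
	void *page;
	void *base;
};

struct demo_iter_ctx {
	void *page;
	void *base;
};

static int demo_put_calls;	/* counts how many buffers were released */

/* Stand-in for releasing a mapped metadata buffer. */
static void demo_put_metabuf(struct demo_buf *buf)
{
	if (buf->base)
		demo_put_calls++;
	buf->page = NULL;
	buf->base = NULL;
}

/* begin: only read paths pass a context; bmap/fiemap-like callers pass NULL. */
static void demo_iomap_begin(struct demo_iter_ctx *ctx, void *page, void *base)
{
	if (!ctx)
		return;
	ctx->page = page;
	ctx->base = base;
}

/* end: release the stashed buffer once, then forget it. */
static void demo_iomap_end(struct demo_iter_ctx *ctx)
{
	if (ctx && ctx->base) {
		struct demo_buf buf = {
			.page = ctx->page,
			.base = ctx->base,
		};

		demo_put_metabuf(&buf);
		ctx->base = NULL;	/* prevents a second release */
	}
}
```

With the `ctx->base = NULL` line in end, begin no longer has to reset the
field itself, provided the context starts out zero-initialized.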


* Re: [PATCH v8 3/9] erofs: move `struct erofs_anon_fs_type` to super.c
  2025-11-14  9:55 ` [PATCH v8 3/9] erofs: move `struct erofs_anon_fs_type` to super.c Hongbo Li
@ 2025-11-16 12:02   ` Gao Xiang
  0 siblings, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-16 12:02 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
> 
> Move the `struct erofs_anon_fs_type` to the super.c and
> expose it in preparation for the upcoming page cache share
> feature.
> 
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/erofs/fscache.c  | 13 -------------
>   fs/erofs/internal.h |  4 ++++
>   fs/erofs/super.c    | 15 +++++++++++++++
>   3 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
> index 362acf828279..2d1683479fc0 100644
> --- a/fs/erofs/fscache.c
> +++ b/fs/erofs/fscache.c
> @@ -3,7 +3,6 @@
>    * Copyright (C) 2022, Alibaba Cloud
>    * Copyright (C) 2022, Bytedance Inc. All rights reserved.
>    */
> -#include <linux/pseudo_fs.h>
>   #include <linux/fscache.h>
>   #include "internal.h"
>   
> @@ -13,18 +12,6 @@ static LIST_HEAD(erofs_domain_list);
>   static LIST_HEAD(erofs_domain_cookies_list);
>   static struct vfsmount *erofs_pseudo_mnt;
>   
> -static int erofs_anon_init_fs_context(struct fs_context *fc)
> -{
> -	return init_pseudo(fc, EROFS_SUPER_MAGIC) ? 0 : -ENOMEM;
> -}
> -
> -static struct file_system_type erofs_anon_fs_type = {
> -	.owner		= THIS_MODULE,
> -	.name           = "pseudo_erofs",
> -	.init_fs_context = erofs_anon_init_fs_context,
> -	.kill_sb        = kill_anon_super,
> -};
> -
>   struct erofs_fscache_io {
>   	struct netfs_cache_resources cres;
>   	struct iov_iter		iter;
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index f7f622836198..e80b35db18e4 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -188,6 +188,10 @@ static inline bool erofs_is_fileio_mode(struct erofs_sb_info *sbi)
>   	return IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) && sbi->dif0.file;
>   }
>   
> +#if defined(CONFIG_EROFS_FS_ONDEMAND)
> +extern struct file_system_type erofs_anon_fs_type;
> +#endif

It's unnecessary to use #ifdef for "extern"; otherwise
it looks good to me.

Thanks,
Gao Xiang


* Re: [PATCH v8 2/9] erofs: hold read context in iomap_iter if needed
  2025-11-16 12:01   ` Gao Xiang
@ 2025-11-17  1:45     ` Hongbo Li
  0 siblings, 0 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-17  1:45 UTC (permalink / raw)
  To: Gao Xiang, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

Hi Xiang,

On 2025/11/16 20:01, Gao Xiang wrote:
> 
> 
> On 2025/11/14 17:55, Hongbo Li wrote:
>> Upcoming page cache sharing needs to pass read context to iomap_iter,
>> here we unify the way of passing the read context in EROFS. Moreover,
>> bmap and fiemap don't need to map the inline data.
>>
>> Note that we keep `struct page *` in `struct erofs_iomap_iter_ctx` as
>> well to avoid bogus kmap_to_page usage.
>>
>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>> ---
>>   fs/erofs/data.c | 79 ++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 59 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>> index bb13c4cb8455..bd3d85c61341 100644
>> --- a/fs/erofs/data.c
>> +++ b/fs/erofs/data.c
>> @@ -266,14 +266,23 @@ void erofs_onlinefolio_end(struct folio *folio, 
>> int err, bool dirty)
>>       folio_end_read(folio, !(v & BIT(EROFS_ONLINEFOLIO_EIO)));
>>   }
>> +struct erofs_iomap_iter_ctx {
>> +    struct page *page;
>> +    void *base;
>> +};
>> +
>>   static int erofs_iomap_begin(struct inode *inode, loff_t offset, 
>> loff_t length,
>>           unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
>>   {
>>       int ret;
>> +    struct erofs_iomap_iter_ctx *ctx;
>>       struct super_block *sb = inode->i_sb;
>>       struct erofs_map_blocks map;
>>       struct erofs_map_dev mdev;
>> +    struct iomap_iter *iter;
>> +    iter = container_of(iomap, struct iomap_iter, iomap);
>> +    ctx = iter->private;
> 
> Can you just rearrange it as:
> 
>      struct iomap_iter *iter = container_of(iomap, struct iomap_iter, 
> iomap);
>      struct erofs_iomap_iter_ctx *ctx = iter->private;
> 
> ?
> 

Thanks for your thorough review. The points you raised are quite
reasonable, and I will address them in a later version.

Thanks,
Hongbo

>>       map.m_la = offset;
>>       map.m_llen = length;
>>       ret = erofs_map_blocks(inode, &map);
>> @@ -283,7 +292,8 @@ static int erofs_iomap_begin(struct inode *inode, 
>> loff_t offset, loff_t length,
>>       iomap->offset = map.m_la;
>>       iomap->length = map.m_llen;
>>       iomap->flags = 0;
>> -    iomap->private = NULL;
>> +    if (ctx)
>> +        ctx->base = NULL;
> 
> I think this line is unnecessary if iter->private == ctx;
> 
>>       iomap->addr = IOMAP_NULL_ADDR;
>>       if (!(map.m_flags & EROFS_MAP_MAPPED)) {
>>           iomap->type = IOMAP_HOLE;
>> @@ -309,16 +319,20 @@ static int erofs_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>       }
>>       if (map.m_flags & EROFS_MAP_META) {
>> -        void *ptr;
>> -        struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
>> -
>>           iomap->type = IOMAP_INLINE;
>> -        ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
>> -                     erofs_inode_in_metabox(inode));
>> -        if (IS_ERR(ptr))
>> -            return PTR_ERR(ptr);
>> -        iomap->inline_data = ptr;
>> -        iomap->private = buf.base;
>> +        /* read context should read the inlined data */
>> +        if (ctx) {
>> +            void *ptr;
>> +            struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
> 
> better to reorder them as:
>              struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
>              void *ptr;
> 
>> +
>> +            ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
>> +                         erofs_inode_in_metabox(inode));
>> +            if (IS_ERR(ptr))
>> +                return PTR_ERR(ptr);
>> +            iomap->inline_data = ptr;
>> +            ctx->page = buf.page;
>> +            ctx->base = buf.base;
>> +        }
>>       } else {
>>           iomap->type = IOMAP_MAPPED;
>>       }
>> @@ -328,18 +342,19 @@ static int erofs_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>   static int erofs_iomap_end(struct inode *inode, loff_t pos, loff_t 
>> length,
>>           ssize_t written, unsigned int flags, struct iomap *iomap)
>>   {
>> -    void *ptr = iomap->private;
>> +    struct erofs_iomap_iter_ctx *ctx;
>> +    struct iomap_iter *iter;
>> -    if (ptr) {
>> +    iter = container_of(iomap, struct iomap_iter, iomap);
>> +    ctx = iter->private;
>> +    if (ctx && ctx->base) {
>>           struct erofs_buf buf = {
>> -            .page = kmap_to_page(ptr),
>> -            .base = ptr,
>> +            .page = ctx->page,
>> +            .base = ctx->base,
>>           };
>>           DBG_BUGON(iomap->type != IOMAP_INLINE);
>>           erofs_put_metabuf(&buf);
> 
> so ctx->base needs to be nullified here:
> 
>          ctx->base = NULL;
> 
>> -    } else {
>> -        DBG_BUGON(iomap->type == IOMAP_INLINE);
>>       }
>>       return written;
>>   }
>> @@ -369,18 +384,36 @@ int erofs_fiemap(struct inode *inode, struct 
>> fiemap_extent_info *fieinfo,
>>    */
>>   static int erofs_read_folio(struct file *file, struct folio *folio)
>>   {
>> +    struct iomap_read_folio_ctx read_ctx = {
>> +        .ops        = &iomap_bio_read_ops,
>> +        .cur_folio    = folio,
>> +    };
>> +    struct erofs_iomap_iter_ctx iter_ctx = {
>> +        .page        = NULL,
>> +        .base        = NULL,
>> +    };
> 
> it can be initialized just by:
>      struct erofs_iomap_iter_ctx iter_ctx = {};
> 
>> +
>>       trace_erofs_read_folio(folio, true);
>> -    iomap_bio_read_folio(folio, &erofs_iomap_ops);
>> +    iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
>>       return 0;
>>   }
>>   static void erofs_readahead(struct readahead_control *rac)
>>   {
>> +    struct iomap_read_folio_ctx read_ctx = {
>> +        .ops        = &iomap_bio_read_ops,
>> +        .rac        = rac,
>> +    };
>> +    struct erofs_iomap_iter_ctx iter_ctx = {
>> +        .page        = NULL,
>> +        .base        = NULL,
>> +    };
> 
> Same here.
> 
>> +
>>       trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
>>                       readahead_count(rac), true);
>> -    iomap_bio_readahead(rac, &erofs_iomap_ops);
>> +    iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
>>   }
>>   static sector_t erofs_bmap(struct address_space *mapping, sector_t 
>> block)
>> @@ -400,9 +433,15 @@ static ssize_t erofs_file_read_iter(struct kiocb 
>> *iocb, struct iov_iter *to)
>>       if (IS_DAX(inode))
>>           return dax_iomap_rw(iocb, to, &erofs_iomap_ops);
>>   #endif
>> -    if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev)
>> +    if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev) {
>> +        struct erofs_iomap_iter_ctx iter_ctx = {
>> +            .page = NULL,
>> +            .base = NULL,
>> +        };
> 
> Same here again.
> 
> Thanks,
> Gao Xiang


* Re: [PATCH v8 4/9] erofs: support user-defined fingerprint name
  2025-11-14  9:55 ` [PATCH v8 4/9] erofs: support user-defined fingerprint name Hongbo Li
@ 2025-11-17  2:54   ` Gao Xiang
  2025-11-17  7:41     ` Hongbo Li
  0 siblings, 1 reply; 23+ messages in thread
From: Gao Xiang @ 2025-11-17  2:54 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
> 
> When creating the EROFS image, users can specify the fingerprint name.
> This is to prepare for the upcoming inode page cache share.
> 
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/erofs/Kconfig    |  9 +++++++++
>   fs/erofs/erofs_fs.h |  6 ++++--
>   fs/erofs/internal.h |  6 ++++++
>   fs/erofs/super.c    |  5 ++++-
>   fs/erofs/xattr.c    | 26 ++++++++++++++++++++++++++
>   fs/erofs/xattr.h    |  6 ++++++
>   6 files changed, 55 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index d81f3318417d..1b5c0cd99203 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -194,3 +194,12 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
>   	  at higher priority.
>   
>   	  If unsure, say N.
> +
> +config EROFS_FS_INODE_SHARE
> +	bool "EROFS inode page cache share support (experimental)"
> +	depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
> +	help
> +	  This permits EROFS to share page cache for files with same
> +	  fingerprints.

I tend to use "EROFS_FS_PAGE_CACHE_SHARE" since it's closer to the
user-visible impact (inode sharing is ambiguous), but we could keep
"ishare.c" since it's closer to the implementation details.

And how about:

config EROFS_FS_PAGE_CACHE_SHARE
	bool "EROFS page cache share support (experimental)"
	depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
	help
	  This enables page cache sharing among inodes with identical
	  content fingerprints on the same device.

	  If unsure, say N.

> +
> +	  If unsure, say N.
> \ No newline at end of file

"\ No newline at end of file" should be fixed.

> diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
> index 3d5738f80072..104518cd161d 100644
> --- a/fs/erofs/erofs_fs.h
> +++ b/fs/erofs/erofs_fs.h
> @@ -35,8 +35,9 @@
>   #define EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES	0x00000040
>   #define EROFS_FEATURE_INCOMPAT_48BIT		0x00000080
>   #define EROFS_FEATURE_INCOMPAT_METABOX		0x00000100
> +#define EROFS_FEATURE_INCOMPAT_ISHARE_KEY	0x00000200

I do think it should be a compatible feature since images can be
mounted on old kernels without any issue, and it should be
renamed to

EROFS_FEATURE_COMPAT_ISHARE_XATTRS
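
To illustrate the compat/incompat distinction, here is a minimal userspace
sketch, not the actual EROFS code. It assumes the usual convention that a
mount is refused only when the image sets an incompat bit the running kernel
does not know, while unknown compat bits are ignored. All DEMO_* names and
bit values are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; they are not taken from erofs_fs.h. */
#define DEMO_INCOMPAT_NEWEST	0x00000100u
#define DEMO_ALL_INCOMPAT	((DEMO_INCOMPAT_NEWEST << 1) - 1)
#define DEMO_COMPAT_ISHARE	0x00000008u	/* hypothetical compat bit */

/* The "all known" mask must be a contiguous run of low bits. */
static_assert((DEMO_ALL_INCOMPAT & (DEMO_ALL_INCOMPAT + 1)) == 0,
	      "incompat mask must be contiguous");

/*
 * Model of a mount-time check: refuse only when the image sets an
 * incompat bit this kernel does not know. Compat bits never block.
 */
static bool demo_can_mount(uint32_t incompat, uint32_t compat)
{
	(void)compat;	/* unknown compat bits are simply ignored */
	return (incompat & ~DEMO_ALL_INCOMPAT) == 0;
}
```

Under this model, an image carrying only a new compat bit still mounts on a
kernel that predates the feature, whereas a new incompat bit (0x200 here)
would be rejected.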

>   #define EROFS_ALL_FEATURE_INCOMPAT		\
> -	((EROFS_FEATURE_INCOMPAT_METABOX << 1) - 1)
> +	((EROFS_FEATURE_INCOMPAT_ISHARE_KEY << 1) - 1)
>   
>   #define EROFS_SB_EXTSLOT_SIZE	16
>   
> @@ -83,7 +84,8 @@ struct erofs_super_block {
>   	__le32 xattr_prefix_start;	/* start of long xattr prefixes */
>   	__le64 packed_nid;	/* nid of the special packed inode */
>   	__u8 xattr_filter_reserved; /* reserved for xattr name filter */
> -	__u8 reserved[3];
> +	__u8 ishare_key_start;	/* start of ishare key */

ishare_xattr_prefix_id; ?

> +	__u8 reserved[2];
>   	__le32 build_time;	/* seconds added to epoch for mkfs time */
>   	__le64 rootnid_8b;	/* (48BIT on) nid of root directory */
>   	__le64 reserved2;
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index e80b35db18e4..3ebbb7c5d085 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -167,6 +167,11 @@ struct erofs_sb_info {
>   	struct erofs_domain *domain;
>   	char *fsid;
>   	char *domain_id;
> +
> +	/* inode page cache share support */
> +	u8 ishare_key_start;

	u8 ishare_xattr_pfx;

> +	u8 ishare_key_idx;

Why is this needed, considering we could just use

sbi->xattr_prefixes[sbi->ishare_xattr_pfx]

to get it?
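
A rough userspace sketch of such a direct lookup follows. It is only a model:
the demo_* structures are hypothetical stand-ins for the in-kernel ones, and
it assumes the stored id keeps the long-prefix flag bit and is masked and
bounds-checked the same way the posted patch does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LONG_PREFIX	0x80u	/* flag bit: "this is a long prefix" */
#define DEMO_LONG_PREFIX_MASK	0x7fu	/* low bits: index into the table */

struct demo_prefix_item {
	uint8_t base_index;
	const char *infix;
};

struct demo_sb_info {
	const struct demo_prefix_item *xattr_prefixes;
	uint8_t xattr_prefix_count;
	uint8_t ishare_xattr_pfx;
};

/* Return the fingerprint prefix entry, or NULL if none is configured. */
static const struct demo_prefix_item *
demo_ishare_prefix(const struct demo_sb_info *sbi)
{
	uint8_t id;

	assert(sbi != NULL);
	if (!sbi->xattr_prefixes ||
	    !(sbi->ishare_xattr_pfx & DEMO_LONG_PREFIX))
		return NULL;
	id = sbi->ishare_xattr_pfx & DEMO_LONG_PREFIX_MASK;
	if (id >= sbi->xattr_prefix_count)
		return NULL;	/* out-of-range id: treat as "not set" */
	return &sbi->xattr_prefixes[id];
}

/* Hypothetical two-entry table used only to exercise the lookup. */
static const struct demo_prefix_item demo_table[] = {
	{ 1, "example.a" },
	{ 1, "example.fp" },
};

static const char *demo_lookup_infix(uint8_t pfx)
{
	struct demo_sb_info sbi = { demo_table, 2, pfx };
	const struct demo_prefix_item *pf = demo_ishare_prefix(&sbi);

	return pf ? pf->infix : NULL;
}
```

With a lookup like this there is no separately allocated key string to keep
in sync or to free, which is the point of dropping the duplicate copy.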

> +	char *ishare_key;
>   };
>   
>   #define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
> @@ -236,6 +241,7 @@ EROFS_FEATURE_FUNCS(dedupe, incompat, INCOMPAT_DEDUPE)
>   EROFS_FEATURE_FUNCS(xattr_prefixes, incompat, INCOMPAT_XATTR_PREFIXES)
>   EROFS_FEATURE_FUNCS(48bit, incompat, INCOMPAT_48BIT)
>   EROFS_FEATURE_FUNCS(metabox, incompat, INCOMPAT_METABOX)
> +EROFS_FEATURE_FUNCS(ishare_key, incompat, INCOMPAT_ISHARE_KEY)
>   EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)
>   EROFS_FEATURE_FUNCS(xattr_filter, compat, COMPAT_XATTR_FILTER)
>   EROFS_FEATURE_FUNCS(shared_ea_in_metabox, compat, COMPAT_SHARED_EA_IN_METABOX)
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index 0d88c04684b9..3561473cb789 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -339,7 +339,7 @@ static int erofs_read_superblock(struct super_block *sb)
>   			return -EFSCORRUPTED;	/* self-loop detection */
>   	}
>   	sbi->inos = le64_to_cpu(dsb->inos);
> -
> +	sbi->ishare_key_start = dsb->ishare_key_start;
>   	sbi->epoch = (s64)le64_to_cpu(dsb->epoch);
>   	sbi->fixed_nsec = le32_to_cpu(dsb->fixed_nsec);
>   	super_set_uuid(sb, (void *)dsb->uuid, sizeof(dsb->uuid));
> @@ -738,6 +738,9 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
>   	if (err)
>   		return err;
>   
> +	err = erofs_xattr_set_ishare_key(sb);

I don't think it's necessary to duplicate the copy, just use
"sbi->xattr_prefixes[sbi->ishare_xattr_pfx]" directly.

Thanks,
Gao Xiang

> +	if (err)
> +		return err;
>   	erofs_set_sysfs_name(sb);
>   	err = erofs_register_sysfs(sb);
>   	if (err)
> diff --git a/fs/erofs/xattr.c b/fs/erofs/xattr.c
> index 396536d9a862..3c99091f39a5 100644
> --- a/fs/erofs/xattr.c
> +++ b/fs/erofs/xattr.c
> @@ -564,3 +564,29 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu)
>   	return acl;
>   }
>   #endif
> +
> +#ifdef CONFIG_EROFS_FS_INODE_SHARE
> +int erofs_xattr_set_ishare_key(struct super_block *sb)
> +{
> +	struct erofs_sb_info *sbi = EROFS_SB(sb);
> +	struct erofs_xattr_prefix_item *pf;
> +	char *ishare_key;
> +
> +	if (!sbi->xattr_prefixes ||
> +	    !(sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX))
> +		return 0;
> +
> +	pf = sbi->xattr_prefixes +
> +		(sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX_MASK);
> +	if (!pf || pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
> +		return 0;
> +	ishare_key = kmalloc(pf->infix_len + 1, GFP_KERNEL);
> +	if (!ishare_key)
> +		return -ENOMEM;
> +	memcpy(ishare_key, pf->prefix->infix, pf->infix_len);
> +	ishare_key[pf->infix_len] = '\0';
> +	sbi->ishare_key = ishare_key;
> +	sbi->ishare_key_idx = pf->prefix->base_index;
> +	return 0;
> +}
> +#endif
> diff --git a/fs/erofs/xattr.h b/fs/erofs/xattr.h
> index 6317caa8413e..21684359662c 100644
> --- a/fs/erofs/xattr.h
> +++ b/fs/erofs/xattr.h
> @@ -67,4 +67,10 @@ struct posix_acl *erofs_get_acl(struct inode *inode, int type, bool rcu);
>   #define erofs_get_acl	(NULL)
>   #endif
>   
> +#ifdef CONFIG_EROFS_FS_INODE_SHARE
> +int erofs_xattr_set_ishare_key(struct super_block *sb);
> +#else
> +static inline int erofs_xattr_set_ishare_key(struct super_block *sb) { return 0; }
> +#endif
> +
>   #endif



* Re: [PATCH v8 6/9] erofs: introduce the page cache share feature
  2025-11-14  9:55 ` [PATCH v8 6/9] erofs: introduce the page cache share feature Hongbo Li
@ 2025-11-17  3:06   ` Gao Xiang
  2025-11-17  3:14     ` Hongbo Li
  0 siblings, 1 reply; 23+ messages in thread
From: Gao Xiang @ 2025-11-17  3:06 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
> 
> Currently, reading files with different paths (or names) but the same
> content will consume multiple copies of the page cache, even if the
> content of these page caches is the same. For example, reading
> identical files (e.g., *.so files) from two different minor versions of
> container images will cost multiple copies of the same page cache,
> since different containers have different mount points. Therefore,
> sharing the page cache for files with the same content can save memory.
> 
> This introduces the page cache share feature in erofs. It allocates a
> deduplicated inode and uses its page cache as shared. Reads for files
> with identical content will ultimately be routed to the page cache of
> the deduplicated inode. In this way, a single page cache satisfies
> multiple read requests for different files with the same contents.
> 
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---

...


> +
> +static int erofs_ishare_file_open(struct inode *inode, struct file *file)
> +{
> +	struct file *realfile;
> +	struct inode *dedup;
> +
> +	dedup = EROFS_I(inode)->ishare;
> +	if (!dedup)
> +		return -EINVAL;
> +
> +	realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, "erofs_ishare_file",
> +				     O_RDONLY, &erofs_file_fops);
> +	if (IS_ERR(realfile))
> +		return PTR_ERR(realfile);
> +
> +	file_ra_state_init(&realfile->f_ra, file->f_mapping);
> +	realfile->private_data = EROFS_I(inode);
> +	file->private_data = realfile;
> +	return 0;

Again, as Amir mentioned before, it should be converted to use (at least)
some of the backing file interfaces, please see:
   file_user_path() and file_user_inode() in include/linux/fs.h

Or are you sure /proc/<pid>/maps is shown as expected?

Thanks,
Gao Xiang


* Re: [PATCH v8 6/9] erofs: introduce the page cache share feature
  2025-11-17  3:06   ` Gao Xiang
@ 2025-11-17  3:14     ` Hongbo Li
  2025-11-17  3:18       ` Hongbo Li
  2025-11-17  3:30       ` Gao Xiang
  0 siblings, 2 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-17  3:14 UTC (permalink / raw)
  To: Gao Xiang, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

Hi Xiang

On 2025/11/17 11:06, Gao Xiang wrote:
> 
> 
> On 2025/11/14 17:55, Hongbo Li wrote:
>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>
>> Currently, reading files with different paths (or names) but the same
>> content will consume multiple copies of the page cache, even if the
>> content of these page caches is the same. For example, reading
>> identical files (e.g., *.so files) from two different minor versions of
>> container images will cost multiple copies of the same page cache,
>> since different containers have different mount points. Therefore,
>> sharing the page cache for files with the same content can save memory.
>>
>> This introduces the page cache share feature in erofs. It allocates a
>> deduplicated inode and uses its page cache as shared. Reads for files
>> with identical content will ultimately be routed to the page cache of
>> the deduplicated inode. In this way, a single page cache satisfies
>> multiple read requests for different files with the same contents.
>>
>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>> ---
> 
> ...
> 
> 
>> +
>> +static int erofs_ishare_file_open(struct inode *inode, struct file 
>> *file)
>> +{
>> +    struct file *realfile;
>> +    struct inode *dedup;
>> +
>> +    dedup = EROFS_I(inode)->ishare;
>> +    if (!dedup)
>> +        return -EINVAL;
>> +
>> +    realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, 
>> "erofs_ishare_file",
>> +                     O_RDONLY, &erofs_file_fops);
>> +    if (IS_ERR(realfile))
>> +        return PTR_ERR(realfile);
>> +
>> +    file_ra_state_init(&realfile->f_ra, file->f_mapping);
>> +    realfile->private_data = EROFS_I(inode);
>> +    file->private_data = realfile;
>> +    return 0;
> 

My apologies, I got it wrong. The latest code wasn't synced. The most 
current version should be this one.

static int erofs_ishare_file_open(struct inode *inode, struct file *file)
{
	struct file *realfile;
	struct inode *dedup;
	char *buf, *filepath;

	dedup = EROFS_I(inode)->ishare;
	if (!dedup)
		return -EINVAL;

	buf = kmalloc(PATH_MAX, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	filepath = file_path(file, buf, PATH_MAX);
	if (IS_ERR(filepath)) {
		kfree(buf);
		return PTR_ERR(filepath);
	}
	realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, filepath + 1,
				     O_RDONLY, &erofs_file_fops);
	kfree(buf);
	if (IS_ERR(realfile))
		return PTR_ERR(realfile);

	file_ra_state_init(&realfile->f_ra, file->f_mapping);
	ihold(dedup);
	realfile->private_data = EROFS_I(inode);
	file->private_data = realfile;
	return 0;
}

I replaced "erofs_ishare_file" with filepath + 1 so that the real path
of the original file is displayed.
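
On the error path of the function above: PTR_ERR() already yields a negative
errno, so it should be returned as is, without an extra minus sign. The
userspace sketch below models that convention. The demo_* helpers are
hypothetical stand-ins for ERR_PTR()/PTR_ERR()/IS_ERR() and for file_path(),
not the kernel implementations:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO	4095

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical path lookup: fails on request, else returns a path. */
static char *demo_file_path(int fail)
{
	static char path[] = "/dir/file";

	return fail ? demo_err_ptr(-ENAMETOOLONG) : path;
}

/* Open-like helper: 0 on success, negative errno on failure. */
static int demo_open(int fail)
{
	char *filepath = demo_file_path(fail);

	if (demo_is_err(filepath))
		return (int)demo_ptr_err(filepath);	/* already negative */
	assert(filepath[0] == '/');	/* "filepath + 1" would skip this slash */
	return 0;
}
```

Negating the value again would hand a positive number back to the caller,
which an .open() caller would not treat as an error code.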

Thanks,
Hongbo

> Again, as Amir mentioned before, it should be converted to use (at least)
> some of backing file interfaces, please see:
>    file_user_path() and file_user_inode() in include/linux/fs.h
> 
> Or are you sure /proc/<pid>/maps is shown as expected?
> 
> Thanks,
> Gao Xiang


* Re: [PATCH v8 6/9] erofs: introduce the page cache share feature
  2025-11-17  3:14     ` Hongbo Li
@ 2025-11-17  3:18       ` Hongbo Li
  2025-11-17  3:30       ` Gao Xiang
  1 sibling, 0 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-17  3:18 UTC (permalink / raw)
  To: Gao Xiang, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

Hi Xiang,

On 2025/11/17 11:14, Hongbo Li wrote:
> Hi Xiang
> 
> On 2025/11/17 11:06, Gao Xiang wrote:
>>
>>
>> On 2025/11/14 17:55, Hongbo Li wrote:
>>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>>
>>> Currently, reading files with different paths (or names) but the same
>>> content will consume multiple copies of the page cache, even if the
>>> content of these page caches is the same. For example, reading
>>> identical files (e.g., *.so files) from two different minor versions of
>>> container images will cost multiple copies of the same page cache,
>>> since different containers have different mount points. Therefore,
>>> sharing the page cache for files with the same content can save memory.
>>>
>>> This introduces the page cache share feature in erofs. It allocates a
>>> deduplicated inode and uses its page cache as shared. Reads for files
>>> with identical content will ultimately be routed to the page cache of
>>> the deduplicated inode. In this way, a single page cache satisfies
>>> multiple read requests for different files with the same contents.
>>>
>>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>>> ---
>>
>> ...
>>
>>
>>> +
>>> +static int erofs_ishare_file_open(struct inode *inode, struct file 
>>> *file)
>>> +{
>>> +    struct file *realfile;
>>> +    struct inode *dedup;
>>> +
>>> +    dedup = EROFS_I(inode)->ishare;
>>> +    if (!dedup)
>>> +        return -EINVAL;
>>> +
>>> +    realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, 
>>> "erofs_ishare_file",
>>> +                     O_RDONLY, &erofs_file_fops);
>>> +    if (IS_ERR(realfile))
>>> +        return PTR_ERR(realfile);
>>> +
>>> +    file_ra_state_init(&realfile->f_ra, file->f_mapping);
>>> +    realfile->private_data = EROFS_I(inode);
>>> +    file->private_data = realfile;
>>> +    return 0;
>>
> 
> My apologies, I got it wrong. The latest code wasn't synced. The most 
> current version should be this one.
> 
> static int erofs_ishare_file_open(struct inode *inode, struct file *file)
> {
>      struct file *realfile;
>      struct inode *dedup;
>      char *buf, *filepath;
> 
>      dedup = EROFS_I(inode)->ishare;
>      if (!dedup)
>          return -EINVAL;
> 
>      buf = kmalloc(PATH_MAX, GFP_KERNEL);
>      if (!buf)
>          return -ENOMEM;
>      filepath = file_path(file, buf, PATH_MAX);
>      if (IS_ERR(filepath)) {
>          kfree(buf);
>          return PTR_ERR(filepath);
>      }
>      realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, filepath + 1,
>                       O_RDONLY, &erofs_file_fops);
>      kfree(buf);
>      if (IS_ERR(realfile))
>          return PTR_ERR(realfile);
> 
>      file_ra_state_init(&realfile->f_ra, file->f_mapping);
>      ihold(dedup);
>      realfile->private_data = EROFS_I(inode);
>      file->private_data = realfile;
>      return 0;
> }
> 
> I changed the "erofs_ishare_file" with filepath + 1 to display the 
> realpath of the original file.

I made this change in patch 7, which caused the misunderstanding here.

Thanks,
Hongbo

> 
> Thanks,
> Hongbo
> 
>> Again, as Amir mentioned before, it should be converted to use (at least)
>> some of backing file interfaces, please see:
>>    file_user_path() and file_user_inode() in include/linux/fs.h
>>
>> Or are you sure /proc/<pid>/maps is shown as expected?
>>
>> Thanks,
>> Gao Xiang


* Re: [PATCH v8 6/9] erofs: introduce the page cache share feature
  2025-11-17  3:14     ` Hongbo Li
  2025-11-17  3:18       ` Hongbo Li
@ 2025-11-17  3:30       ` Gao Xiang
  1 sibling, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-17  3:30 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/17 11:14, Hongbo Li wrote:
> Hi Xiang
> 
> On 2025/11/17 11:06, Gao Xiang wrote:
>>
>>
>> On 2025/11/14 17:55, Hongbo Li wrote:
>>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>>
>>> Currently, reading files with different paths (or names) but the same
>>> content will consume multiple copies of the page cache, even if the
>>> content of these page caches is the same. For example, reading
>>> identical files (e.g., *.so files) from two different minor versions of
>>> container images will cost multiple copies of the same page cache,
>>> since different containers have different mount points. Therefore,
>>> sharing the page cache for files with the same content can save memory.
>>>
>>> This introduces the page cache share feature in erofs. It allocates a
>>> deduplicated inode and uses its page cache as shared. Reads for files
>>> with identical content will ultimately be routed to the page cache of
>>> the deduplicated inode. In this way, a single page cache satisfies
>>> multiple read requests for different files with the same contents.
>>>
>>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>>> ---
>>
>> ...
>>
>>
>>> +
>>> +static int erofs_ishare_file_open(struct inode *inode, struct file *file)
>>> +{
>>> +    struct file *realfile;
>>> +    struct inode *dedup;
>>> +
>>> +    dedup = EROFS_I(inode)->ishare;
>>> +    if (!dedup)
>>> +        return -EINVAL;
>>> +
>>> +    realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, "erofs_ishare_file",
>>> +                     O_RDONLY, &erofs_file_fops);
>>> +    if (IS_ERR(realfile))
>>> +        return PTR_ERR(realfile);
>>> +
>>> +    file_ra_state_init(&realfile->f_ra, file->f_mapping);
>>> +    realfile->private_data = EROFS_I(inode);
>>> +    file->private_data = realfile;
>>> +    return 0;
>>
> 
> My apologies, I got it wrong. The latest code wasn't synced. The most current version should be this one.
> 
> static int erofs_ishare_file_open(struct inode *inode, struct file *file)
> {
>      struct file *realfile;
>      struct inode *dedup;
>      char *buf, *filepath;
> 
>      dedup = EROFS_I(inode)->ishare;
>      if (!dedup)
>          return -EINVAL;
> 
>      buf = kmalloc(PATH_MAX, GFP_KERNEL);
>      if (!buf)
>          return -ENOMEM;
>      filepath = file_path(file, buf, PATH_MAX);
>      if (IS_ERR(filepath)) {
>          kfree(buf);
>          return PTR_ERR(filepath);
>      }
>      realfile = alloc_file_pseudo(dedup, erofs_ishare_mnt, filepath + 1,
>                       O_RDONLY, &erofs_file_fops);
>      kfree(buf);
>      if (IS_ERR(realfile))
>          return PTR_ERR(realfile);
> 
>      file_ra_state_init(&realfile->f_ra, file->f_mapping);
>      ihold(dedup);
>      realfile->private_data = EROFS_I(inode);
>      file->private_data = realfile;
>      return 0;
> }
> 
> I changed the "erofs_ishare_file" with filepath + 1 to display the realpath of the original file.

Although it could work for file_user_path() [though it looks unclean to me],
file_user_inode() still doesn't work.

You should adopt the backing_file infrastructure instead.

Thanks,
Gao Xiang


* Re: [PATCH v8 7/9] erofs: support unencoded inodes for page cache share
  2025-11-14  9:55 ` [PATCH v8 7/9] erofs: support unencoded inodes for page cache share Hongbo Li
@ 2025-11-17  3:44   ` Gao Xiang
  0 siblings, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-17  3:44 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> This patch adds inode page cache sharing functionality for unencoded
> files.
> 
> I conducted experiments in the container environment. Below is the
> memory usage for reading all files in two different minor versions
> of container images:
> 
> +-------------------+------------------+-------------+---------------+
> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
> |                   |                  |             | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     241     |       -       |
> |       redis       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     872     |       -       |
> |      postgres     +------------------+-------------+---------------+
> |    16.1 & 16.2    |        Yes       |     630     |      28%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     2771    |       -       |
> |     tensorflow    +------------------+-------------+---------------+
> |  2.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     926     |       -       |
> |       mysql       +------------------+-------------+---------------+
> |  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     390     |       -       |
> |       nginx       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
> +-------------------+------------------+-------------+---------------+
> |       tomcat      |        No        |     924     |       -       |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> |                   |        Yes       |     474     |      49%      |
> +-------------------+------------------+-------------+---------------+
> 
> Additionally, the table below shows the runtime memory usage of the
> container:
> 
> +-------------------+------------------+-------------+---------------+
> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
> |                   |                  |             | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |      35     |       -       |
> |       redis       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |      28     |      20%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     149     |       -       |
> |      postgres     +------------------+-------------+---------------+
> |    16.1 & 16.2    |        Yes       |      95     |      37%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     1028    |       -       |
> |     tensorflow    +------------------+-------------+---------------+
> |  2.11.0 & 2.11.1  |        Yes       |     930     |      10%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |     155     |       -       |
> |       mysql       +------------------+-------------+---------------+
> |  8.0.11 & 8.0.12  |        Yes       |     132     |      15%      |
> +-------------------+------------------+-------------+---------------+
> |                   |        No        |      25     |       -       |
> |       nginx       +------------------+-------------+---------------+
> |   7.2.4 & 7.2.5   |        Yes       |      20     |      20%      |
> +-------------------+------------------+-------------+---------------+
> |       tomcat      |        No        |     186     |       -       |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> |                   |        Yes       |      98     |      48%      |
> +-------------------+------------------+-------------+---------------+
> 
> Co-developed-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/erofs/data.c     | 38 +++++++++++++++---
>   fs/erofs/inode.c    |  5 +++
>   fs/erofs/internal.h |  4 ++
>   fs/erofs/ishare.c   | 98 ++++++++++++++++++++++++++++++++++++++++++++-
>   fs/erofs/ishare.h   | 18 +++++++++
>   fs/erofs/super.c    | 11 +++--
>   6 files changed, 163 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index bd3d85c61341..c459104e4734 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -5,6 +5,7 @@
>    * Copyright (C) 2021, Alibaba Cloud
>    */
>   #include "internal.h"
> +#include "ishare.h"

Can we just get rid of the extra "ishare.h"? These can be moved into
internal.h:

#ifdef CONFIG_EROFS_FS_INODE_SHARE

int erofs_ishare_init(struct super_block *sb);
void erofs_ishare_exit(struct super_block *sb);
bool erofs_ishare_fill_inode(struct inode *inode);
void erofs_ishare_free_inode(struct inode *inode);

#else

static inline int erofs_ishare_init(struct super_block *sb) { return 0; }
static inline void erofs_ishare_exit(struct super_block *sb) {}
static inline bool erofs_ishare_fill_inode(struct inode *inode) { return false; }
static inline void erofs_ishare_free_inode(struct inode *inode) {}

#endif // CONFIG_EROFS_FS_INODE_SHARE

>   #include <linux/sched/mm.h>
>   #include <trace/events/erofs.h>
>   
> @@ -269,23 +270,27 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>   struct erofs_iomap_iter_ctx {
>   	struct page *page;
>   	void *base;
> +	struct inode *realinode;
>   };
>   
>   static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
>   {
> -	int ret;
>   	struct erofs_iomap_iter_ctx *ctx;
> -	struct super_block *sb = inode->i_sb;
>   	struct erofs_map_blocks map;
>   	struct erofs_map_dev mdev;
>   	struct iomap_iter *iter;
> +	struct inode *realinode;
> +	struct super_block *sb;

	struct inode *realinode = ctx ? ctx->realinode : inode;
	struct super_block *sb = realinode->i_sb;

> +	int ret;
>   
>   	iter = container_of(iomap, struct iomap_iter, iomap);
>   	ctx = iter->private;
> +	realinode = ctx ? ctx->realinode : inode;
> +	sb = realinode->i_sb;
>   	map.m_la = offset;
>   	map.m_llen = length;
> -	ret = erofs_map_blocks(inode, &map);
> +	ret = erofs_map_blocks(realinode, &map);
>   	if (ret < 0)
>   		return ret;
>   
> @@ -300,7 +305,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		return 0;
>   	}
>   
> -	if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(inode)) {
> +	if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(realinode)) {
>   		mdev = (struct erofs_map_dev) {
>   			.m_deviceid = map.m_deviceid,
>   			.m_pa = map.m_pa,
> @@ -326,7 +331,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   			struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
>   
>   			ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
> -						 erofs_inode_in_metabox(inode));
> +						 erofs_inode_in_metabox(realinode));
>   			if (IS_ERR(ptr))
>   				return PTR_ERR(ptr);
>   			iomap->inline_data = ptr;

...

>   
> @@ -234,3 +248,83 @@ const struct file_operations erofs_ishare_fops = {
>   	.get_unmapped_area = thp_get_unmapped_area,
>   	.splice_read	= filemap_splice_read,
>   };
> +
> +void erofs_read_begin(struct erofs_read_ctx *rdctx)

I think if backing_head and backing_link (although I don't like
the naming) are valid, erofs_read_begin() and erofs_read_end()
are unneeded here, since we maintain the backing validity using
the .open() and .release() hooks.

The odd erofs_read_{begin,end} can be avoided then...

Thanks,
Gao Xiang


* Re: [PATCH v8 9/9] erofs: implement .fadvise for page cache share
  2025-11-14  9:55 ` [PATCH v8 9/9] erofs: implement .fadvise " Hongbo Li
@ 2025-11-17  3:48   ` Gao Xiang
  0 siblings, 0 replies; 23+ messages in thread
From: Gao Xiang @ 2025-11-17  3:48 UTC (permalink / raw)
  To: Hongbo Li, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel



On 2025/11/14 17:55, Hongbo Li wrote:
> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
> 
> This patch implements the .fadvise interface for page cache share.
> Similar to overlayfs, it drops those clean, unused pages through
> vfs_fadvise().
> 
> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
> ---
>   fs/erofs/ishare.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
> index 14b2690055c5..88c4af3f8993 100644
> --- a/fs/erofs/ishare.c
> +++ b/fs/erofs/ishare.c
> @@ -239,6 +239,16 @@ static int erofs_ishare_mmap(struct file *file, struct vm_area_struct *vma)
>   	return generic_file_readonly_mmap(file, vma);
>   }
>   
> +static int erofs_ishare_fadvice(struct file *file, loff_t offset,
> +				      loff_t len, int advice)

s/fadvice/fadvise/

Otherwise it looks good to me,
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Thanks,
Gao Xiang


* Re: [PATCH v8 4/9] erofs: support user-defined fingerprint name
  2025-11-17  2:54   ` Gao Xiang
@ 2025-11-17  7:41     ` Hongbo Li
  0 siblings, 0 replies; 23+ messages in thread
From: Hongbo Li @ 2025-11-17  7:41 UTC (permalink / raw)
  To: Gao Xiang, chao, brauner, djwong, amir73il, joannelkoong
  Cc: linux-fsdevel, linux-erofs, linux-kernel

Hi Xiang,

On 2025/11/17 10:54, Gao Xiang wrote:
> 
> 
> On 2025/11/14 17:55, Hongbo Li wrote:
>> From: Hongzhen Luo <hongzhen@linux.alibaba.com>
>>
>> When creating the EROFS image, users can specify the fingerprint name.
>> This is to prepare for the upcoming inode page cache share.
>>
>> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
>> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
>> ---
>>   fs/erofs/Kconfig    |  9 +++++++++
>>   fs/erofs/erofs_fs.h |  6 ++++--
>>   fs/erofs/internal.h |  6 ++++++
>>   fs/erofs/super.c    |  5 ++++-
>>   fs/erofs/xattr.c    | 26 ++++++++++++++++++++++++++
>>   fs/erofs/xattr.h    |  6 ++++++
>>   6 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>> index d81f3318417d..1b5c0cd99203 100644
>> --- a/fs/erofs/Kconfig
>> +++ b/fs/erofs/Kconfig
>> @@ -194,3 +194,12 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
>>         at higher priority.
>>         If unsure, say N.
>> +
>> +config EROFS_FS_INODE_SHARE
>> +    bool "EROFS inode page cache share support (experimental)"
>> +    depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
>> +    help
>> +      This permits EROFS to share page cache for files with same
>> +      fingerprints.
> 
> I tend to use "EROFS_FS_PAGE_CACHE_SHARE" since it's closer to the
> user-visible impact (inode sharing is ambiguous), but we could keep
> "ishare.c" since it's closer to the implementation details.
> 
> And how about:
> 
> config EROFS_FS_PAGE_CACHE_SHARE
>      bool "EROFS page cache share support (experimental)"
>      depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
>      help
>        This enables page cache sharing among inodes with identical
>        content fingerprints on the same device.
> 
>        If unsure, say N.
> 
>> +
>> +      If unsure, say N.
>> \ No newline at end of file
> 
> "\ No newline at end of file" should be fixed.
> 
>> diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
>> index 3d5738f80072..104518cd161d 100644
>> --- a/fs/erofs/erofs_fs.h
>> +++ b/fs/erofs/erofs_fs.h
>> @@ -35,8 +35,9 @@
>>   #define EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES    0x00000040
>>   #define EROFS_FEATURE_INCOMPAT_48BIT        0x00000080
>>   #define EROFS_FEATURE_INCOMPAT_METABOX        0x00000100
>> +#define EROFS_FEATURE_INCOMPAT_ISHARE_KEY    0x00000200
> 
> I do think it should be a compatible feature since images can be
> mounted in the old kernels without any issue, and it should be
> renamed as
> 
> EROFS_FEATURE_COMPAT_ISHARE_XATTRS
> 
>>   #define EROFS_ALL_FEATURE_INCOMPAT        \
>> -    ((EROFS_FEATURE_INCOMPAT_METABOX << 1) - 1)
>> +    ((EROFS_FEATURE_INCOMPAT_ISHARE_KEY << 1) - 1)
>>   #define EROFS_SB_EXTSLOT_SIZE    16
>> @@ -83,7 +84,8 @@ struct erofs_super_block {
>>       __le32 xattr_prefix_start;    /* start of long xattr prefixes */
>>       __le64 packed_nid;    /* nid of the special packed inode */
>>       __u8 xattr_filter_reserved; /* reserved for xattr name filter */
>> -    __u8 reserved[3];
>> +    __u8 ishare_key_start;    /* start of ishare key */
> 
> ishare_xattr_prefix_id; ?
> 
>> +    __u8 reserved[2];
>>       __le32 build_time;    /* seconds added to epoch for mkfs time */
>>       __le64 rootnid_8b;    /* (48BIT on) nid of root directory */
>>       __le64 reserved2;
>> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
>> index e80b35db18e4..3ebbb7c5d085 100644
>> --- a/fs/erofs/internal.h
>> +++ b/fs/erofs/internal.h
>> @@ -167,6 +167,11 @@ struct erofs_sb_info {
>>       struct erofs_domain *domain;
>>       char *fsid;
>>       char *domain_id;
>> +
>> +    /* inode page cache share support */
>> +    u8 ishare_key_start;
> 
>      u8 ishare_xattr_pfx;
> 
>> +    u8 ishare_key_idx;
> 
> why need this, considering we could just use
> 
> sbi->xattr_prefixes[sbi->ishare_xattr_pfx]
> 
> to get this.
> 
>> +    char *ishare_key;
>>   };
>>   #define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
>> @@ -236,6 +241,7 @@ EROFS_FEATURE_FUNCS(dedupe, incompat, 
>> INCOMPAT_DEDUPE)
>>   EROFS_FEATURE_FUNCS(xattr_prefixes, incompat, INCOMPAT_XATTR_PREFIXES)
>>   EROFS_FEATURE_FUNCS(48bit, incompat, INCOMPAT_48BIT)
>>   EROFS_FEATURE_FUNCS(metabox, incompat, INCOMPAT_METABOX)
>> +EROFS_FEATURE_FUNCS(ishare_key, incompat, INCOMPAT_ISHARE_KEY)
>>   EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)
>>   EROFS_FEATURE_FUNCS(xattr_filter, compat, COMPAT_XATTR_FILTER)
>>   EROFS_FEATURE_FUNCS(shared_ea_in_metabox, compat, 
>> COMPAT_SHARED_EA_IN_METABOX)
>> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
>> index 0d88c04684b9..3561473cb789 100644
>> --- a/fs/erofs/super.c
>> +++ b/fs/erofs/super.c
>> @@ -339,7 +339,7 @@ static int erofs_read_superblock(struct 
>> super_block *sb)
>>               return -EFSCORRUPTED;    /* self-loop detection */
>>       }
>>       sbi->inos = le64_to_cpu(dsb->inos);
>> -
>> +    sbi->ishare_key_start = dsb->ishare_key_start;
>>       sbi->epoch = (s64)le64_to_cpu(dsb->epoch);
>>       sbi->fixed_nsec = le32_to_cpu(dsb->fixed_nsec);
>>       super_set_uuid(sb, (void *)dsb->uuid, sizeof(dsb->uuid));
>> @@ -738,6 +738,9 @@ static int erofs_fc_fill_super(struct super_block 
>> *sb, struct fs_context *fc)
>>       if (err)
>>           return err;
>> +    err = erofs_xattr_set_ishare_key(sb);
> 
> I don't think it's necessary to duplicate the copy, just use
> "sbi->xattr_prefixes[sbi->ishare_xattr_pfx]" directly.
> 

Thanks for the review. However, we need to pass a char * to
erofs_getxattr() here to obtain the xattr length and value, and
xattr_prefixes packs all entries together, so we cannot transform
sbi->xattr_prefixes[sbi->ishare_xattr_pfx] into a char * directly.
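
To illustrate the point, here is a minimal user-space sketch of the copy
that the quoted erofs_xattr_set_ishare_key() performs. The two structs are
simplified, hypothetical stand-ins that only mirror the fields the patch
touches (prefix->infix, prefix->base_index, infix_len); they are not the
kernel definitions, and dup_infix() is a name made up for this example.
The assumption is that the infix is length-delimited rather than
NUL-terminated, which is why a terminated copy is needed before it can be
used as a C string:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a long xattr prefix: the infix bytes
 * follow the base index and carry no trailing NUL. */
struct long_prefix {
	uint8_t base_index;
	char infix[];
};

/* Hypothetical stand-in for one in-memory prefix item: the length of
 * the infix is stored separately from the bytes. */
struct prefix_item {
	struct long_prefix *prefix;
	uint8_t infix_len;
};

/* Return a freshly allocated, NUL-terminated copy of the infix, or
 * NULL on allocation failure. The caller frees the result. */
static char *dup_infix(const struct prefix_item *pf)
{
	char *key;

	assert(pf && pf->prefix);
	key = malloc((size_t)pf->infix_len + 1);
	if (!key)
		return NULL;
	memcpy(key, pf->prefix->infix, pf->infix_len);
	key[pf->infix_len] = '\0';	/* terminate: the source has no NUL */
	return key;
}
```

In kernel code the same copy-and-terminate step could likely be written
with kmemdup_nul(), but that is only a suggestion and not something the
patch does.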

Thanks,
Hongbo

> Thanks,
> Gao Xiang
> 
>> +    if (err)
>> +        return err;
>>       erofs_set_sysfs_name(sb);
>>       err = erofs_register_sysfs(sb);
>>       if (err)
>> diff --git a/fs/erofs/xattr.c b/fs/erofs/xattr.c
>> index 396536d9a862..3c99091f39a5 100644
>> --- a/fs/erofs/xattr.c
>> +++ b/fs/erofs/xattr.c
>> @@ -564,3 +564,29 @@ struct posix_acl *erofs_get_acl(struct inode 
>> *inode, int type, bool rcu)
>>       return acl;
>>   }
>>   #endif
>> +
>> +#ifdef CONFIG_EROFS_FS_INODE_SHARE
>> +int erofs_xattr_set_ishare_key(struct super_block *sb)
>> +{
>> +    struct erofs_sb_info *sbi = EROFS_SB(sb);
>> +    struct erofs_xattr_prefix_item *pf;
>> +    char *ishare_key;
>> +
>> +    if (!sbi->xattr_prefixes ||
>> +        !(sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX))
>> +        return 0;
>> +
>> +    pf = sbi->xattr_prefixes +
>> +        (sbi->ishare_key_start & EROFS_XATTR_LONG_PREFIX_MASK);
>> +    if (!pf || pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
>> +        return 0;
>> +    ishare_key = kmalloc(pf->infix_len + 1, GFP_KERNEL);
>> +    if (!ishare_key)
>> +        return -ENOMEM;
>> +    memcpy(ishare_key, pf->prefix->infix, pf->infix_len);
>> +    ishare_key[pf->infix_len] = '\0';
>> +    sbi->ishare_key = ishare_key;
>> +    sbi->ishare_key_idx = pf->prefix->base_index;
>> +    return 0;
>> +}
>> +#endif
>> diff --git a/fs/erofs/xattr.h b/fs/erofs/xattr.h
>> index 6317caa8413e..21684359662c 100644
>> --- a/fs/erofs/xattr.h
>> +++ b/fs/erofs/xattr.h
>> @@ -67,4 +67,10 @@ struct posix_acl *erofs_get_acl(struct inode 
>> *inode, int type, bool rcu);
>>   #define erofs_get_acl    (NULL)
>>   #endif
>> +#ifdef CONFIG_EROFS_FS_INODE_SHARE
>> +int erofs_xattr_set_ishare_key(struct super_block *sb);
>> +#else
>> +static inline int erofs_xattr_set_ishare_key(struct super_block *sb) 
>> { return 0; }
>> +#endif
>> +
>>   #endif
> 


