linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
@ 2025-02-21 21:13 Kalesh Singh
  2025-02-22 18:03 ` Kent Overstreet
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-21 21:13 UTC (permalink / raw)
  To: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel
  Cc: Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

Hi organizers of LSF/MM,

I realize this is a late submission, but I was hoping there might
still be a chance to have this topic considered for discussion.

Problem Statement
=================

Readahead can result in unnecessary page cache pollution for mapped
regions that are never accessed. Current mechanisms to disable
readahead lack granularity, operating instead at the level of an
entire file or VMA. This proposal seeks to initiate discussion at
LSF/MM to explore potential solutions for optimizing page
cache/readahead behavior.


Background
==========

The readahead heuristics on file-backed memory mappings can
inadvertently populate the page cache with pages corresponding to
regions that user-space processes are known never to access, e.g. ELF
LOAD segment padding regions. While these pages are ultimately
reclaimable, their presence causes unnecessary I/O operations,
particularly when a substantial number of such regions exists.

Although the underlying file can be made sparse in these regions to
mitigate the I/O, readahead will still allocate discrete zero pages
when populating the page cache within these ranges. These pages, while
subject to reclaim, introduce additional churn to the LRU. This
reclaim overhead is further exacerbated in filesystems that support
"fault-around" semantics, which can populate the PTEs of surrounding
pages if they are found present in the page cache.

While the memory impact may be negligible for large files containing a
limited number of sparse regions, it becomes appreciable for many
small mappings characterized by numerous holes. This scenario can
arise from efforts to minimize vm_area_struct slab memory footprint.

Limitations of Existing Mechanisms
==================================

fadvise(..., POSIX_FADV_RANDOM, ...): Disables readahead for the
entire file, rather than for specific sub-regions. The offset and
length parameters primarily serve the POSIX_FADV_WILLNEED [1] and
POSIX_FADV_DONTNEED [2] cases.

madvise(..., MADV_RANDOM, ...): Similarly, this applies to the entire
VMA, rather than to specific sub-regions. [3]

Guard Regions: While guard regions for file-backed VMAs circumvent
fault-around concerns, the fundamental issue of unnecessary page cache
population persists. [4]

Empirical Demonstration
=======================

Below is a simple program to demonstrate the issue. Assume that the
last 20 pages of the mapping are a region known never to be accessed
(perhaps a guard region).

cachestat is a simple C program I wrote that prints the nr_cache for
the entire file using the new cachestat() syscall [5].

cat pollute_page_cache.sh

#!/bin/bash

FILE="myfile.txt"

echo "Creating sparse file of size 25 pages"
truncate -s 100k $FILE

apparent_size=$(ls -lahs $FILE | awk '{ print $6 }')
echo "Apparent Size: $apparent_size"

real_size=$(ls -lahs $FILE | awk '{ print $1 }')
echo "Real Size: $real_size"

nr_cached=$(./cachestat $FILE | grep nr_cache: | awk '{ print $2 }')
echo "Number cached pages: $nr_cached"

echo "Reading first 5 pages..."
head -c 20k $FILE

nr_cached=$(./cachestat $FILE | grep nr_cache: | awk '{ print $2 }')
echo "Number cached pages: $nr_cached"

rm $FILE

-------

./pollute_page_cache.sh
Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading first 5 pages...
Number cached pages: 25


Thanks,
Kalesh

[1] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/fadvise.c#L96
[2] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/fadvise.c#L113
[3] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/madvise.c#L1277
[4] https://lore.kernel.org/r/cover.1739469950.git.lorenzo.stoakes@oracle.com/
[5] https://lore.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com/


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-21 21:13 [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior Kalesh Singh
@ 2025-02-22 18:03 ` Kent Overstreet
  2025-02-23  5:36   ` Kalesh Singh
  2025-02-23  5:34 ` Ritesh Harjani
  2025-02-24 14:14 ` [Lsf-pc] " Jan Kara
  2 siblings, 1 reply; 26+ messages in thread
From: Kent Overstreet @ 2025-02-22 18:03 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> Hi organizers of LSF/MM,
> 
> I realize this is a late submission, but I was hoping there might
> still be a chance to have this topic considered for discussion.
> 
> Problem Statement
> ===============
> 
> Readahead can result in unnecessary page cache pollution for mapped
> regions that are never accessed. Current mechanisms to disable
> readahead lack granularity and rather operate at the file or VMA
> level. This proposal seeks to initiate discussion at LSFMM to explore
> potential solutions for optimizing page cache/readahead behavior.
> 
> 
> Background
> =========
> 
> The read-ahead heuristics on file-backed memory mappings can
> inadvertently populate the page cache with pages corresponding to
> regions that user-space processes are known never to access e.g ELF
> LOAD segment padding regions. While these pages are ultimately
> reclaimable, their presence precipitates unnecessary I/O operations,
> particularly when a substantial quantity of such regions exists.
> 
> Although the underlying file can be made sparse in these regions to
> mitigate I/O, readahead will still allocate discrete zero pages when
> populating the page cache within these ranges. These pages, while
> subject to reclaim, introduce additional churn to the LRU. This
> reclaim overhead is further exacerbated in filesystems that support
> "fault-around" semantics, that can populate the surrounding pages’
> PTEs if found present in the page cache.
> 
> While the memory impact may be negligible for large files containing a
> limited number of sparse regions, it becomes appreciable for many
> small mappings characterized by numerous holes. This scenario can
> arise from efforts to minimize vm_area_struct slab memory footprint.
> 
> Limitations of Existing Mechanisms
> ===========================
> 
> fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> entire file, rather than specific sub-regions. The offset and length
> parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> POSIX_FADV_DONTNEED [2] cases.
> 
> madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> VMA, rather than specific sub-regions. [3]
> Guard Regions: While guard regions for file-backed VMAs circumvent
> fault-around concerns, the fundamental issue of unnecessary page cache
> population persists. [4]

What if we introduced something like

madvise(..., MADV_READAHEAD_BOUNDARY, offset)

Would that be sufficient? And would a single readahead boundary offset
suffice?



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-21 21:13 [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior Kalesh Singh
  2025-02-22 18:03 ` Kent Overstreet
@ 2025-02-23  5:34 ` Ritesh Harjani
  2025-02-23  6:50   ` Kalesh Singh
  2025-02-24 12:56   ` David Sterba
  2025-02-24 14:14 ` [Lsf-pc] " Jan Kara
  2 siblings, 2 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-02-23  5:34 UTC (permalink / raw)
  To: Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel
  Cc: Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

Kalesh Singh <kaleshsingh@google.com> writes:

> Hi organizers of LSF/MM,
>
> I realize this is a late submission, but I was hoping there might
> still be a chance to have this topic considered for discussion.
>
> Problem Statement
> ===============
>
> Readahead can result in unnecessary page cache pollution for mapped
> regions that are never accessed. Current mechanisms to disable
> readahead lack granularity and rather operate at the file or VMA

From what I understand, the readahead setting is done at the per-bdi
level (default 128K). That means we don't get to control the amount of
readahead on a per-file basis. If, say, we could control the number of
readahead pages per open fd, would that solve the problem you are
facing? That way we wouldn't need to change the setting for the entire
system; we could control this knob on a per-fd basis.

I just quickly hacked fcntl to allow setting the number of ra_pages in
inode->i_ra_pages. The readahead algorithm then picks up this setting
whenever it initializes the readahead control in file_ra_state_init().
So after opening the file, one can use fcntl F_SET_FILE_READAHEAD to
set the preferred value on the open fd.


Note: I am not claiming the implementation is 100% correct. It's just
a quick working PoC to discuss whether this is the right approach to
the given problem.

-ritesh


<quick patch>
===========
fcntl: Add control to set per inode readahead pages

As of now readahead setting is done in units of pages at the bdi level.
(default 128K).
But sometimes the user wants to have more granular control over this
knob on a per file basis. This adds support to control readahead pages
on an open fd.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/btrfs/defrag.c           |  2 +-
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/relocation.c       |  2 +-
 fs/btrfs/send.c             |  2 +-
 fs/cramfs/inode.c           |  2 +-
 fs/fcntl.c                  | 44 +++++++++++++++++++++++++++++++++++++
 fs/nfs/nfs4file.c           |  2 +-
 fs/open.c                   |  2 +-
 include/linux/fs.h          |  4 +++-
 include/uapi/linux/fcntl.h  |  2 ++
 mm/readahead.c              |  7 ++++--
 11 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 968dae953948..c6616d69a9af 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -261,7 +261,7 @@ static int btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 	range.len = (u64)-1;
 	range.start = cur;
 	range.extent_thresh = defrag->extent_thresh;
-	file_ra_state_init(ra, inode->i_mapping);
+	file_ra_state_init(ra, inode);
 
 	sb_start_write(fs_info->sb);
 	ret = btrfs_defrag_file(inode, ra, &range, defrag->transid,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cfa52ef40b06..ac240b148747 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -373,7 +373,7 @@ static void readahead_cache(struct inode *inode)
 	struct file_ra_state ra;
 	unsigned long last_index;
 
-	file_ra_state_init(&ra, inode->i_mapping);
+	file_ra_state_init(&ra, inode);
 	last_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;
 
 	page_cache_sync_readahead(inode->i_mapping, &ra, NULL, 0, last_index);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index bf267bdfa8f8..7688b79ae7e7 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3057,7 +3057,7 @@ static int relocate_file_extent_cluster(struct reloc_control *rc)
 	if (ret)
 		goto out;
 
-	file_ra_state_init(ra, inode->i_mapping);
+	file_ra_state_init(ra, inode);
 
 	ret = setup_relocation_extent_mapping(rc);
 	if (ret)
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7254279c3cc9..b22fc2a426e4 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5745,7 +5745,7 @@ static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
 			return err;
 		}
 		memset(&sctx->ra, 0, sizeof(struct file_ra_state));
-		file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
+		file_ra_state_init(&sctx->ra, sctx->cur_inode);
 
 		/*
 		 * It's very likely there are no pages from this inode in the page
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index b84d1747a020..917f09040f6e 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -214,7 +214,7 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
 	devsize = bdev_nr_bytes(sb->s_bdev) >> PAGE_SHIFT;
 
 	/* Ok, read in BLKS_PER_BUF pages completely first. */
-	file_ra_state_init(&ra, mapping);
+	file_ra_state_init(&ra, mapping->host);
 	page_cache_sync_readahead(mapping, &ra, NULL, blocknr, BLKS_PER_BUF);
 
 	for (i = 0; i < BLKS_PER_BUF; i++) {
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 49884fa3c81d..277afe78536f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -394,6 +394,44 @@ static long fcntl_set_rw_hint(struct file *file, unsigned int cmd,
 	return 0;
 }
 
+static long fcntl_get_file_readahead(struct file *file, unsigned int cmd,
+			      unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	u64 __user *argp = (u64 __user *)arg;
+	u64 ra_pages = READ_ONCE(inode->i_ra_pages);
+
+	if (copy_to_user(argp, &ra_pages, sizeof(*argp)))
+		return -EFAULT;
+	return 0;
+}
+
+
+static long fcntl_set_file_readahead(struct file *file, unsigned int cmd,
+			      unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	u64 __user *argp = (u64 __user *)arg;
+	u64 ra_pages;
+
+	if (!inode_owner_or_capable(file_mnt_idmap(file), inode))
+		return -EPERM;
+
+	if (copy_from_user(&ra_pages, argp, sizeof(ra_pages)))
+		return -EFAULT;
+
+	WRITE_ONCE(inode->i_ra_pages, ra_pages);
+
+	/*
+	 * file->f_mapping->host may differ from inode. As an example,
+	 * blkdev_open() modifies file->f_mapping.
+	 */
+	if (file->f_mapping->host != inode)
+		WRITE_ONCE(file->f_mapping->host->i_ra_pages, ra_pages);
+
+	return 0;
+}
+
 /* Is the file descriptor a dup of the file? */
 static long f_dupfd_query(int fd, struct file *filp)
 {
@@ -552,6 +590,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_RW_HINT:
 		err = fcntl_set_rw_hint(filp, cmd, arg);
 		break;
+	case F_GET_FILE_READAHEAD:
+		err = fcntl_get_file_readahead(filp, cmd, arg);
+		break;
+	case F_SET_FILE_READAHEAD:
+		err = fcntl_set_file_readahead(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 1cd9652f3c28..cee84aa8aa0f 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -388,7 +388,7 @@ static struct file *__nfs42_ssc_open(struct vfsmount *ss_mnt,
 	nfs_file_set_open_context(filep, ctx);
 	put_nfs_open_context(ctx);
 
-	file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
+	file_ra_state_init(&filep->f_ra, filep->f_mapping->host);
 	res = filep;
 out_free_name:
 	kfree(read_name);
diff --git a/fs/open.c b/fs/open.c
index 0f75e220b700..466c3affe161 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -961,7 +961,7 @@ static int do_dentry_open(struct file *f,
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 	f->f_iocb_flags = iocb_flags(f);
 
-	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+	file_ra_state_init(&f->f_ra, f->f_mapping->host);
 
 	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
 		return -EINVAL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 12fe11b6e3dd..77ee23e30245 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -678,6 +678,8 @@ struct inode {
 	unsigned short          i_bytes;
 	u8			i_blkbits;
 	enum rw_hint		i_write_hint;
+	/* Per inode setting for max readahead in page_size units */
+	unsigned long		i_ra_pages;
 	blkcnt_t		i_blocks;
 
 #ifdef __NEED_I_SIZE_ORDERED
@@ -3271,7 +3273,7 @@ extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 
 
 extern void
-file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
+file_ra_state_init(struct file_ra_state *ra, struct inode *inode);
 extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
 extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6e6907e63bfc..b6e5413ca660 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -60,6 +60,8 @@
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
+#define F_GET_FILE_READAHEAD	(F_LINUX_SPECIFIC_BASE + 15)
+#define F_SET_FILE_READAHEAD	(F_LINUX_SPECIFIC_BASE + 16)
 
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
diff --git a/mm/readahead.c b/mm/readahead.c
index 2bc3abf07828..71079ae1753d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,9 +136,12 @@
  * memset *ra to zero.
  */
 void
-file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
+file_ra_state_init(struct file_ra_state *ra, struct inode *inode)
 {
-	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+	unsigned int ra_pages = inode->i_ra_pages ? inode->i_ra_pages :
+				inode_to_bdi(inode)->ra_pages;
+
+	ra->ra_pages = ra_pages;
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);

2.39.5



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-22 18:03 ` Kent Overstreet
@ 2025-02-23  5:36   ` Kalesh Singh
  2025-02-23  5:42     ` Kalesh Singh
  2025-02-23  9:30     ` Lorenzo Stoakes
  0 siblings, 2 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-23  5:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Johannes Weiner, Nhat Pham

On Sat, Feb 22, 2025 at 10:03 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> > Hi organizers of LSF/MM,
> >
> > I realize this is a late submission, but I was hoping there might
> > still be a chance to have this topic considered for discussion.
> >
> > Problem Statement
> > ===============
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. Current mechanisms to disable
> > readahead lack granularity and rather operate at the file or VMA
> > level. This proposal seeks to initiate discussion at LSFMM to explore
> > potential solutions for optimizing page cache/readahead behavior.
> >
> >
> > Background
> > =========
> >
> > The read-ahead heuristics on file-backed memory mappings can
> > inadvertently populate the page cache with pages corresponding to
> > regions that user-space processes are known never to access e.g ELF
> > LOAD segment padding regions. While these pages are ultimately
> > reclaimable, their presence precipitates unnecessary I/O operations,
> > particularly when a substantial quantity of such regions exists.
> >
> > Although the underlying file can be made sparse in these regions to
> > mitigate I/O, readahead will still allocate discrete zero pages when
> > populating the page cache within these ranges. These pages, while
> > subject to reclaim, introduce additional churn to the LRU. This
> > reclaim overhead is further exacerbated in filesystems that support
> > "fault-around" semantics, that can populate the surrounding pages’
> > PTEs if found present in the page cache.
> >
> > While the memory impact may be negligible for large files containing a
> > limited number of sparse regions, it becomes appreciable for many
> > small mappings characterized by numerous holes. This scenario can
> > arise from efforts to minimize vm_area_struct slab memory footprint.
> >
> > Limitations of Existing Mechanisms
> > ===========================
> >
> > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > entire file, rather than specific sub-regions. The offset and length
> > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > POSIX_FADV_DONTNEED [2] cases.
> >
> > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > VMA, rather than specific sub-regions. [3]
> > Guard Regions: While guard regions for file-backed VMAs circumvent
> > fault-around concerns, the fundamental issue of unnecessary page cache
> > population persists. [4]
>
Hi Kent. Thanks for taking a look at this.

> What if we introduced something like
>
> madvise(..., MADV_READAHEAD_BOUNDARY, offset)
>
> Would that be sufficient? And would a single readahead boundary offset
> suffice?

I like the idea of having boundaries. In this particular example the
single boundary suffices, though I think we’ll need to support
multiple (see below).

One requirement that we’d like to meet is that the solution doesn’t
cause VMA splits, to avoid additional slab usage, so perhaps fadvise()
is better suited to this?

Another behavior of “mmap readahead” is that it doesn’t really respect
VMA (start, end) boundaries:

The below demonstrates readahead past the end of the mapped region of the file:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading first 5 pages via mmap...
Mapping and reading pages: [0, 6) of file 'myfile.txt'
Number cached pages: 25

Similarly the readahead can bring in pages before the start of the
mapped region. I believe this is due to mmap “read-around” [6]:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading last 5 pages via mmap...
Mapping and reading pages: [20, 25) of file 'myfile.txt'
Number cached pages: 25

I’m not sure what the historical use cases for readahead past VMA
boundaries are, but at least in some scenarios this behavior is not
desirable. For instance, many apps mmap uncompressed ELF files
directly from a page-aligned offset within a zipped APK as a
space-saving and security feature. The readahead and read-around
behaviors cause unrelated resources from the zipped APK to be
populated in the page cache. I think in this case we’ll need more
than a single boundary per file.

A somewhat related but separate issue is that distinct pages are
currently allocated in the page cache when reading sparse file holes.
I think, at least in the read case, this should be avoidable.

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
./pollute_page_cache.sh

Creating sparse file of size 1GB
Apparent Size: 977M
Real Size: 0
Number cached pages: 0
Meminfo Cached:          9078768 kB
Reading 1GB of holes...
Number cached pages: 250000
Meminfo Cached:         10117324 kB

(10117324-9078768)/4 = 259639 = ~250000 pages # (global counter = some noise)

--Kalesh



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-23  5:36   ` Kalesh Singh
@ 2025-02-23  5:42     ` Kalesh Singh
  2025-02-23  9:30     ` Lorenzo Stoakes
  1 sibling, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-23  5:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Johannes Weiner, Nhat Pham

On Sat, Feb 22, 2025 at 9:36 PM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Sat, Feb 22, 2025 at 10:03 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> > > Hi organizers of LSF/MM,
> > >
> > > I realize this is a late submission, but I was hoping there might
> > > still be a chance to have this topic considered for discussion.
> > >
> > > Problem Statement
> > > ===============
> > >
> > > Readahead can result in unnecessary page cache pollution for mapped
> > > regions that are never accessed. Current mechanisms to disable
> > > readahead lack granularity and rather operate at the file or VMA
> > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > potential solutions for optimizing page cache/readahead behavior.
> > >
> > >
> > > Background
> > > =========
> > >
> > > The read-ahead heuristics on file-backed memory mappings can
> > > inadvertently populate the page cache with pages corresponding to
> > > regions that user-space processes are known never to access e.g ELF
> > > LOAD segment padding regions. While these pages are ultimately
> > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > particularly when a substantial quantity of such regions exists.
> > >
> > > Although the underlying file can be made sparse in these regions to
> > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > populating the page cache within these ranges. These pages, while
> > > subject to reclaim, introduce additional churn to the LRU. This
> > > reclaim overhead is further exacerbated in filesystems that support
> > > "fault-around" semantics, that can populate the surrounding pages’
> > > PTEs if found present in the page cache.
> > >
> > > While the memory impact may be negligible for large files containing a
> > > limited number of sparse regions, it becomes appreciable for many
> > > small mappings characterized by numerous holes. This scenario can
> > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > >
> > > Limitations of Existing Mechanisms
> > > ===========================
> > >
> > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > entire file, rather than specific sub-regions. The offset and length
> > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > POSIX_FADV_DONTNEED [2] cases.
> > >
> > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > VMA, rather than specific sub-regions. [3]
> > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > population persists. [4]
> >
> Hi Kent. Thanks for taking a look at this.
>
> > What if we introduced something like
> >
> > madvise(..., MADV_READAHEAD_BOUNDARY, offset)
> >
> > Would that be sufficient? And would a single readahead boundary offset
> > suffice?
>
> I like the idea of having boundaries. In this particular example the
> single boundary suffices, though I think we’ll need to support
> multiple (see below).
>
> One requirement that we’d like to meet is that the solution doesn’t
> cause VMA splits, to avoid additional slab usage, so perhaps fadvise()
> is better suited to this?
>
> Another behavior of “mmap readahead” is that it doesn’t really respect
> VMA (start, end) boundaries:
>
> The below demonstrates readahead past the end of the mapped region of the file:
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading first 5 pages via mmap...
> Mapping and reading pages: [0, 6) of file 'myfile.txt'
> Number cached pages: 25
>
> Similarly the readahead can bring in pages before the start of the
> mapped region. I believe this is due to mmap “read-around” [6]:

I missed the reference to read-around in my previous response:

[6] https://github.com/torvalds/linux/blob/v6.13-rc3/mm/filemap.c#L3195-L3204

>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading last 5 pages via mmap...
> Mapping and reading pages: [20, 25) of file 'myfile.txt'
> Number cached pages: 25
>
> I’m not sure what the historical use cases for readahead past the VMA
> boundaries are; but at least in some scenarios this behavior is not
> desirable. For instance, many apps mmap uncompressed ELF files
> directly from a page-aligned offset within a zipped APK as a space
> saving and security feature. The read ahead and read around behaviors
> lead to unrelated resources from the zipped APK populated in the page
> cache. I think in this case we’ll need to have more than a single
> boundary per file.
>
> A somewhat related but separate issue is that currently distinct pages
> are allocated in the page cache when reading sparse file holes. I
> think at least in the case of reading this should be avoidable.
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 1GB
> Apparent Size: 977M
> Real Size: 0
> Number cached pages: 0
> Meminfo Cached:          9078768 kB
> Reading 1GB of holes...
> Number cached pages: 250000
> Meminfo Cached:         10117324 kB
>
> (10117324-9078768)/4 = 259639 = ~250000 pages # (global counter = some noise)
>
> --Kalesh



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-23  5:34 ` Ritesh Harjani
@ 2025-02-23  6:50   ` Kalesh Singh
  2025-02-24 12:56   ` David Sterba
  1 sibling, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-23  6:50 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Sat, Feb 22, 2025 at 9:58 PM Ritesh Harjani <ritesh.list@gmail.com> wrote:
>
> Kalesh Singh <kaleshsingh@google.com> writes:
>
> > Hi organizers of LSF/MM,
> >
> > I realize this is a late submission, but I was hoping there might
> > still be a chance to have this topic considered for discussion.
> >
> > Problem Statement
> > ===============
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. Current mechanisms to disable
> > readahead lack granularity and rather operate at the file or VMA
>
> From what I understand the readahead setting is done at the per-bdi
> level (default set to 128K). That means we don't get to control the
> amount of readahead pages needed on a per file basis. If say we can
> control the amount of readahead pages on a per open fd, will that solve
> the problem you are facing? That also means we don't need to change the
> setting for the entire system, but we can control this knob on a per fd
> basis?
>
> I just quickly hacked fcntl to allow setting no. of ra_pages in
> inode->i_ra_pages. Readahead algorithm then takes this setting whenever
> it initializes the readahead control in "file_ra_state_init()"
> So after one opens the file, we can set the fcntl F_SET_FILE_READAHEAD
> to the preferred value on the open fd.
>
>
> Note: I am not saying the implementation could be 100% correct. But it's
> just a quick working PoC to discuss whether this is the right approach
> to the given problem.

Hi Ritesh,

Thank you for sharing the patch. I think the per-file approach is in
the right direction. However, for this case, we'd like to stop the
readahead once we hit certain boundaries -- somewhat like Kent
described. Rather than changing the readahead size for the entire
file, imagine that there are certain sections of the file that we
don't want the readahead to "bleed" into; for instance, ELF segment
alignment padding regions, or different resource boundaries in a
zipped APK.

  --Kalesh

>
> -ritesh
>
>
> <quick patch>
> ===========
> fcntl: Add control to set per inode readahead pages
>
> As of now readahead setting is done in units of pages at the bdi level.
> (default 128K).
> But sometimes the user wants to have more granular control over this
> knob on a per file basis. This adds support to control readahead pages
> on an open fd.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/btrfs/defrag.c           |  2 +-
>  fs/btrfs/free-space-cache.c |  2 +-
>  fs/btrfs/relocation.c       |  2 +-
>  fs/btrfs/send.c             |  2 +-
>  fs/cramfs/inode.c           |  2 +-
>  fs/fcntl.c                  | 44 +++++++++++++++++++++++++++++++++++++
>  fs/nfs/nfs4file.c           |  2 +-
>  fs/open.c                   |  2 +-
>  include/linux/fs.h          |  4 +++-
>  include/uapi/linux/fcntl.h  |  2 ++
>  mm/readahead.c              |  7 ++++--
>  11 files changed, 61 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 968dae953948..c6616d69a9af 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -261,7 +261,7 @@ static int btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>         range.len = (u64)-1;
>         range.start = cur;
>         range.extent_thresh = defrag->extent_thresh;
> -       file_ra_state_init(ra, inode->i_mapping);
> +       file_ra_state_init(ra, inode);
>
>         sb_start_write(fs_info->sb);
>         ret = btrfs_defrag_file(inode, ra, &range, defrag->transid,
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index cfa52ef40b06..ac240b148747 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -373,7 +373,7 @@ static void readahead_cache(struct inode *inode)
>         struct file_ra_state ra;
>         unsigned long last_index;
>
> -       file_ra_state_init(&ra, inode->i_mapping);
> +       file_ra_state_init(&ra, inode);
>         last_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;
>
>         page_cache_sync_readahead(inode->i_mapping, &ra, NULL, 0, last_index);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index bf267bdfa8f8..7688b79ae7e7 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3057,7 +3057,7 @@ static int relocate_file_extent_cluster(struct reloc_control *rc)
>         if (ret)
>                 goto out;
>
> -       file_ra_state_init(ra, inode->i_mapping);
> +       file_ra_state_init(ra, inode);
>
>         ret = setup_relocation_extent_mapping(rc);
>         if (ret)
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 7254279c3cc9..b22fc2a426e4 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -5745,7 +5745,7 @@ static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
>                         return err;
>                 }
>                 memset(&sctx->ra, 0, sizeof(struct file_ra_state));
> -               file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
> +               file_ra_state_init(&sctx->ra, sctx->cur_inode);
>
>                 /*
>                  * It's very likely there are no pages from this inode in the page
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index b84d1747a020..917f09040f6e 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -214,7 +214,7 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
>         devsize = bdev_nr_bytes(sb->s_bdev) >> PAGE_SHIFT;
>
>         /* Ok, read in BLKS_PER_BUF pages completely first. */
> -       file_ra_state_init(&ra, mapping);
> +       file_ra_state_init(&ra, mapping->host);
>         page_cache_sync_readahead(mapping, &ra, NULL, blocknr, BLKS_PER_BUF);
>
>         for (i = 0; i < BLKS_PER_BUF; i++) {
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 49884fa3c81d..277afe78536f 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -394,6 +394,44 @@ static long fcntl_set_rw_hint(struct file *file, unsigned int cmd,
>         return 0;
>  }
>
> +static long fcntl_get_file_readahead(struct file *file, unsigned int cmd,
> +                             unsigned long arg)
> +{
> +       struct inode *inode = file_inode(file);
> +       u64 __user *argp = (u64 __user *)arg;
> +       u64 ra_pages = READ_ONCE(inode->i_ra_pages);
> +
> +       if (copy_to_user(argp, &ra_pages, sizeof(*argp)))
> +               return -EFAULT;
> +       return 0;
> +}
> +
> +
> +static long fcntl_set_file_readahead(struct file *file, unsigned int cmd,
> +                             unsigned long arg)
> +{
> +       struct inode *inode = file_inode(file);
> +       u64 __user *argp = (u64 __user *)arg;
> +       u64 ra_pages;
> +
> +       if (!inode_owner_or_capable(file_mnt_idmap(file), inode))
> +               return -EPERM;
> +
> +       if (copy_from_user(&ra_pages, argp, sizeof(ra_pages)))
> +               return -EFAULT;
> +
> +       WRITE_ONCE(inode->i_ra_pages, ra_pages);
> +
> +       /*
> +        * file->f_mapping->host may differ from inode. As an example,
> +        * blkdev_open() modifies file->f_mapping.
> +        */
> +       if (file->f_mapping->host != inode)
> +               WRITE_ONCE(file->f_mapping->host->i_ra_pages, ra_pages);
> +
> +       return 0;
> +}
> +
>  /* Is the file descriptor a dup of the file? */
>  static long f_dupfd_query(int fd, struct file *filp)
>  {
> @@ -552,6 +590,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>         case F_SET_RW_HINT:
>                 err = fcntl_set_rw_hint(filp, cmd, arg);
>                 break;
> +       case F_GET_FILE_READAHEAD:
> +               err = fcntl_get_file_readahead(filp, cmd, arg);
> +               break;
> +       case F_SET_FILE_READAHEAD:
> +               err = fcntl_set_file_readahead(filp, cmd, arg);
> +               break;
>         default:
>                 break;
>         }
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 1cd9652f3c28..cee84aa8aa0f 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -388,7 +388,7 @@ static struct file *__nfs42_ssc_open(struct vfsmount *ss_mnt,
>         nfs_file_set_open_context(filep, ctx);
>         put_nfs_open_context(ctx);
>
> -       file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
> +       file_ra_state_init(&filep->f_ra, filep->f_mapping->host);
>         res = filep;
>  out_free_name:
>         kfree(read_name);
> diff --git a/fs/open.c b/fs/open.c
> index 0f75e220b700..466c3affe161 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -961,7 +961,7 @@ static int do_dentry_open(struct file *f,
>         f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
>         f->f_iocb_flags = iocb_flags(f);
>
> -       file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
> +       file_ra_state_init(&f->f_ra, f->f_mapping->host);
>
>         if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
>                 return -EINVAL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 12fe11b6e3dd..77ee23e30245 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -678,6 +678,8 @@ struct inode {
>         unsigned short          i_bytes;
>         u8                      i_blkbits;
>         enum rw_hint            i_write_hint;
> +       /* Per inode setting for max readahead in page_size units */
> +       unsigned long           i_ra_pages;
>         blkcnt_t                i_blocks;
>
>  #ifdef __NEED_I_SIZE_ORDERED
> @@ -3271,7 +3273,7 @@ extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
>
>
>  extern void
> -file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
> +file_ra_state_init(struct file_ra_state *ra, struct inode *inode);
>  extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
>  extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
>  extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 6e6907e63bfc..b6e5413ca660 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -60,6 +60,8 @@
>  #define F_SET_RW_HINT          (F_LINUX_SPECIFIC_BASE + 12)
>  #define F_GET_FILE_RW_HINT     (F_LINUX_SPECIFIC_BASE + 13)
>  #define F_SET_FILE_RW_HINT     (F_LINUX_SPECIFIC_BASE + 14)
> +#define F_GET_FILE_READAHEAD   (F_LINUX_SPECIFIC_BASE + 15)
> +#define F_SET_FILE_READAHEAD   (F_LINUX_SPECIFIC_BASE + 16)
>
>  /*
>   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2bc3abf07828..71079ae1753d 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -136,9 +136,12 @@
>   * memset *ra to zero.
>   */
>  void
> -file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
> +file_ra_state_init(struct file_ra_state *ra, struct inode *inode)
>  {
> -       ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> +       unsigned int ra_pages = inode->i_ra_pages ? inode->i_ra_pages :
> +                               inode_to_bdi(inode)->ra_pages;
> +
> +       ra->ra_pages = ra_pages;
>         ra->prev_pos = -1;
>  }
>  EXPORT_SYMBOL_GPL(file_ra_state_init);
>
> 2.39.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-23  5:36   ` Kalesh Singh
  2025-02-23  5:42     ` Kalesh Singh
@ 2025-02-23  9:30     ` Lorenzo Stoakes
  2025-02-23 12:24       ` Matthew Wilcox
  1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-02-23  9:30 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: Kent Overstreet, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Johannes Weiner, Nhat Pham

On Sat, Feb 22, 2025 at 09:36:48PM -0800, Kalesh Singh wrote:
> On Sat, Feb 22, 2025 at 10:03 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> > > Hi organizers of LSF/MM,
> > >
> > > I realize this is a late submission, but I was hoping there might
> > > still be a chance to have this topic considered for discussion.
> > >
> > > Problem Statement
> > > ===============
> > >
> > > Readahead can result in unnecessary page cache pollution for mapped
> > > regions that are never accessed. Current mechanisms to disable
> > > readahead lack granularity and rather operate at the file or VMA
> > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > potential solutions for optimizing page cache/readahead behavior.
> > >
> > >
> > > Background
> > > =========
> > >
> > > The read-ahead heuristics on file-backed memory mappings can
> > > inadvertently populate the page cache with pages corresponding to
> > > regions that user-space processes are known never to access e.g ELF
> > > LOAD segment padding regions. While these pages are ultimately
> > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > particularly when a substantial quantity of such regions exists.
> > >
> > > Although the underlying file can be made sparse in these regions to
> > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > populating the page cache within these ranges. These pages, while
> > > subject to reclaim, introduce additional churn to the LRU. This
> > > reclaim overhead is further exacerbated in filesystems that support
> > > "fault-around" semantics, that can populate the surrounding pages’
> > > PTEs if found present in the page cache.

One note - if you use guard regions, fault-around won't be performed on
them ;)

It seems strange to me sparse regions would place duplicate zeroed pages in
the page cache...

> > >
> > > While the memory impact may be negligible for large files containing a
> > > limited number of sparse regions, it becomes appreciable for many
> > > small mappings characterized by numerous holes. This scenario can
> > > arise from efforts to minimize vm_area_struct slab memory footprint.

Presumably we're most concerned with _synchronous_ readahead here? Because
once you establish PG_readahead markers to trigger subsequent asynchronous
readahead, I don't think you can retain control. I go into that more below.

> > >
> > > Limitations of Existing Mechanisms
> > > ===========================
> > >
> > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > entire file, rather than specific sub-regions. The offset and length
> > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > POSIX_FADV_DONTNEED [2] cases.
> > >
> > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > VMA, rather than specific sub-regions. [3]
> > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > population persists. [4]

Note, not for fault-around. But yes for readahead, unavoidably, as there is no
metadata at VMA level (intentionally).

> >
> Hi Kent. Thanks for taking a look at this.
>
> > What if we introduced something like
> >
> > madvise(..., MADV_READAHEAD_BOUNDARY, offset)
> >
> > Would that be sufficient? And would a single readahead boundary offset
> > suffice?
>
> I like the idea of having boundaries. In this particular example the
> single boundary suffices, though I think we’ll need to support
> multiple (see below).
>
> One requirement that we’d like to meet is that the solution doesn’t
> cause VMA splits, to avoid additional slab usage, so perhaps fadvise()
> is better suited to this?

+1 to not causing VMA splits, but presumably you'd madvise() the whole VMA
anyway to adopt this boundary mode?

But if you're trying to do something sub-VMA, I mean I'm not sure there's
any way for you to do this without splitting the VMA?

You end up in the same situation as guard regions which is - how do we
encode this information in such a way as to _not_ require VMA splitting,
and for guard regions the answer is 'we encode it in the page tables, and
modify _fault_ behaviour'.

Obviously that won't work here, so you really have nowhere else to put it.

While readahead state is stored in struct file (->f_ra) [which is somewhat
iffy on a few levels but still], fundamentally for asynchronous readahead
that per-mapping context is lost (more below).
>
> Another behavior of “mmap readahead” is that it doesn’t really respect
> VMA (start, end) boundaries:

Right, but doesn't readahead strictly belong to the file/folios rather than
any specific mapping?

Fine for synchronous readahead potentially, as you could say - ok we're
major faulting, only bring in up to the VMA boundary. But once you plant
PG_readahead markers to trigger asynchronous readahead on minor faults and
you're into filemap_readahead(), you lose all this kind of context.

And is it really fair if you have multiple mappings as well as potentially
read() operations on a file?

I'm not sure how feasible it is to restrict beyond _initial synchronous_
readahead, and I think you could only do that with VMA metadata, and so
you'd split the VMA, and wouldn't this defeat the purpose somewhat?

>
> The below demonstrates readahead past the end of the mapped region of the file:
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading first 5 pages via mmap...
> Mapping and reading pages: [0, 6) of file 'myfile.txt'
> Number cached pages: 25
>
> Similarly the readahead can bring in pages before the start of the
> mapped region. I believe this is due to mmap “read-around” [6]:
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading last 5 pages via mmap...
> Mapping and reading pages: [20, 25) of file 'myfile.txt'
> Number cached pages: 25
>
> I’m not sure what the historical use cases for readahead past the VMA
> boundaries are; but at least in some scenarios this behavior is not
> desirable. For instance, many apps mmap uncompressed ELF files
> directly from a page-aligned offset within a zipped APK as a space
> saving and security feature. The read ahead and read around behaviors
> lead to unrelated resources from the zipped APK populated in the page
> cache. I think in this case we’ll need to have more than a single
> boundary per file.
>
> A somewhat related but separate issue is that currently distinct pages
> are allocated in the page cache when reading sparse file holes. I
> think at least in the case of reading this should be avoidable.

This does seem like something that could be improved, seems very strange we
do this though.

>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 1GB
> Apparent Size: 977M
> Real Size: 0
> Number cached pages: 0
> Meminfo Cached:          9078768 kB
> Reading 1GB of holes...
> Number cached pages: 250000
> Meminfo Cached:         10117324 kB
>
> (10117324-9078768)/4 = 259639 = ~250000 pages # (global counter = some noise)
>
> --Kalesh



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-23  9:30     ` Lorenzo Stoakes
@ 2025-02-23 12:24       ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2025-02-23 12:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kalesh Singh, Kent Overstreet, lsf-pc,
	open list:MEMORY MANAGEMENT, linux-fsdevel, Suren Baghdasaryan,
	David Hildenbrand, Liam R. Howlett, Juan Yescas, android-mm,
	Vlastimil Babka, Michal Hocko, Johannes Weiner, Nhat Pham

On Sun, Feb 23, 2025 at 09:30:57AM +0000, Lorenzo Stoakes wrote:
> It seems strange to me sparse regions would place duplicate zeroed pages in
> the page cache...

https://lore.kernel.org/linux-mm/Z7p-SLdiyQCknetc@casper.infradead.org/



* Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-23  5:34 ` Ritesh Harjani
  2025-02-23  6:50   ` Kalesh Singh
@ 2025-02-24 12:56   ` David Sterba
  1 sibling, 0 replies; 26+ messages in thread
From: David Sterba @ 2025-02-24 12:56 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Sun, Feb 23, 2025 at 11:04:50AM +0530, Ritesh Harjani wrote:
> Kalesh Singh <kaleshsingh@google.com> writes:
> 
> > Hi organizers of LSF/MM,
> >
> > I realize this is a late submission, but I was hoping there might
> > still be a chance to have this topic considered for discussion.
> >
> > Problem Statement
> > ===============
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. Current mechanisms to disable
> > readahead lack granularity and rather operate at the file or VMA
> 
> >From what I understand the readahead setting is done at the per-bdi
> level (default set to 128K). That means we don't get to control the
> amount of readahead pages needed on a per file basis. If say we can
> control the amount of readahead pages on a per open fd, will that solve
> the problem you are facing? That also means we don't need to change the
> setting for the entire system, but we can control this knob on a per fd
> basis? 
> 
> I just quickly hacked fcntl to allow setting no. of ra_pages in
> inode->i_ra_pages. Readahead algorithm then takes this setting whenever
> it initializes the readahead control in "file_ra_state_init()"
> So after one opens the file, we can set the fcntl F_SET_FILE_READAHEAD
> to the preferred value on the open fd. 
> 
> 
> Note: I am not saying the implementation could be 100% correct. But it's
> just a quick working PoC to discuss whether this is the right approach
> to the given problem.

> @@ -678,6 +678,8 @@ struct inode {
>  	unsigned short          i_bytes;
>  	u8			i_blkbits;
>  	enum rw_hint		i_write_hint;
> +	/* Per inode setting for max readahead in page_size units */
> +	unsigned long		i_ra_pages;
>  	blkcnt_t		i_blocks;

If your final patch needs to store data in struct inode, please try to
optimize it so that the size does not change. There are at least two
4-byte holes, so if a 32-bit count in page-size units is enough for the
readahead setting, one of those should be sufficient.



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-21 21:13 [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior Kalesh Singh
  2025-02-22 18:03 ` Kent Overstreet
  2025-02-23  5:34 ` Ritesh Harjani
@ 2025-02-24 14:14 ` Jan Kara
  2025-02-24 14:21   ` Lorenzo Stoakes
  2 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2025-02-24 14:14 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

Hello!

On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> Problem Statement
> ===============
> 
> Readahead can result in unnecessary page cache pollution for mapped
> regions that are never accessed. Current mechanisms to disable
> readahead lack granularity and rather operate at the file or VMA
> level. This proposal seeks to initiate discussion at LSFMM to explore
> potential solutions for optimizing page cache/readahead behavior.
> 
> 
> Background
> =========
> 
> The read-ahead heuristics on file-backed memory mappings can
> inadvertently populate the page cache with pages corresponding to
> regions that user-space processes are known never to access e.g ELF
> LOAD segment padding regions. While these pages are ultimately
> reclaimable, their presence precipitates unnecessary I/O operations,
> particularly when a substantial quantity of such regions exists.
> 
> Although the underlying file can be made sparse in these regions to
> mitigate I/O, readahead will still allocate discrete zero pages when
> populating the page cache within these ranges. These pages, while
> subject to reclaim, introduce additional churn to the LRU. This
> reclaim overhead is further exacerbated in filesystems that support
> "fault-around" semantics, that can populate the surrounding pages’
> PTEs if found present in the page cache.
> 
> While the memory impact may be negligible for large files containing a
> limited number of sparse regions, it becomes appreciable for many
> small mappings characterized by numerous holes. This scenario can
> arise from efforts to minimize vm_area_struct slab memory footprint.

OK, I agree the behavior you describe exists. But do you have some
real-world numbers showing its extent? I'm not looking for artificial
numbers - sure, bad cases can be constructed - but how big a practical
problem is this? If you can show that the average Android phone has 10% of
these useless pages in memory then that's one thing and we should be
looking for some general solution. If it is more like 0.1%, then why bother?

> Limitations of Existing Mechanisms
> ===========================
> 
> fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> entire file, rather than specific sub-regions. The offset and length
> parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> POSIX_FADV_DONTNEED [2] cases.
> 
> madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> VMA, rather than specific sub-regions. [3]
> Guard Regions: While guard regions for file-backed VMAs circumvent
> fault-around concerns, the fundamental issue of unnecessary page cache
> population persists. [4]

Somewhere else in the thread you complain about readahead extending past
the VMA. That's relatively easy to avoid at least for readahead triggered
from filemap_fault() (i.e., do_async_mmap_readahead() and
do_sync_mmap_readahead()). I agree we could do that and that seems as a
relatively uncontroversial change. Note that if someone accesses the file
through standard read(2) or write(2) syscall or through different memory
mapping, the limits won't apply but such combinations of access are not
that common anyway.

Regarding controlling readahead for various portions of the file - I'm
skeptical. In my opinion it would require too much bookkeeping on the kernel
side for such a niche use case (but maybe your numbers will show it isn't
as niche as I think :)). I can imagine you could just completely
turn off kernel readahead for the file and do your special readahead from
userspace - I think you could use either userfaultfd for triggering it or
the new fanotify FAN_PREACCESS events.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 14:14 ` [Lsf-pc] " Jan Kara
@ 2025-02-24 14:21   ` Lorenzo Stoakes
  2025-02-24 16:31     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-02-24 14:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko

On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> Hello!
>
> On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > Problem Statement
> > ===============
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. Current mechanisms to disable
> > readahead lack granularity and rather operate at the file or VMA
> > level. This proposal seeks to initiate discussion at LSFMM to explore
> > potential solutions for optimizing page cache/readahead behavior.
> >
> >
> > Background
> > =========
> >
> > The read-ahead heuristics on file-backed memory mappings can
> > inadvertently populate the page cache with pages corresponding to
> > regions that user-space processes are known never to access e.g ELF
> > LOAD segment padding regions. While these pages are ultimately
> > reclaimable, their presence precipitates unnecessary I/O operations,
> > particularly when a substantial quantity of such regions exists.
> >
> > Although the underlying file can be made sparse in these regions to
> > mitigate I/O, readahead will still allocate discrete zero pages when
> > populating the page cache within these ranges. These pages, while
> > subject to reclaim, introduce additional churn to the LRU. This
> > reclaim overhead is further exacerbated in filesystems that support
> > "fault-around" semantics, that can populate the surrounding pages’
> > PTEs if found present in the page cache.
> >
> > While the memory impact may be negligible for large files containing a
> > limited number of sparse regions, it becomes appreciable for many
> > small mappings characterized by numerous holes. This scenario can
> > arise from efforts to minimize vm_area_struct slab memory footprint.
>
> OK, I agree the behavior you describe exists. But do you have some
> real-world numbers showing its extent? I'm not looking for some artificial
> numbers - sure bad cases can be constructed - but how big practical problem
> is this? If you can show that average Android phone has 10% of these
> useless pages in memory than that's one thing and we should be looking for
> some general solution. If it is more like 0.1%, then why bother?
>
> > Limitations of Existing Mechanisms
> > ===========================
> >
> > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > entire file, rather than specific sub-regions. The offset and length
> > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > POSIX_FADV_DONTNEED [2] cases.
> >
> > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > VMA, rather than specific sub-regions. [3]
> > Guard Regions: While guard regions for file-backed VMAs circumvent
> > fault-around concerns, the fundamental issue of unnecessary page cache
> > population persists. [4]
>
> Somewhere else in the thread you complain about readahead extending past
> the VMA. That's relatively easy to avoid at least for readahead triggered
> from filemap_fault() (i.e., do_async_mmap_readahead() and
> do_sync_mmap_readahead()). I agree we could do that and that seems as a
> relatively uncontroversial change. Note that if someone accesses the file
> through standard read(2) or write(2) syscall or through different memory
> mapping, the limits won't apply but such combinations of access are not
> that common anyway.

Hm, I'm not so sure. Map ELF files with different protections, or
mprotect() different portions of a file, and suddenly you lose all the
readahead for the rest even though you're reading sequentially?

What about shared libraries with r/o parts and exec parts?

I think we'd really need to do some pretty careful checking to ensure this
wouldn't break some real-world use cases, esp. if we really do mostly
readahead data from the page cache.

>
> Regarding controlling readahead for various portions of the file - I'm
> skeptical. In my opinion it would require too much bookkeeping on the kernel
> side for such a niche use case (but maybe your numbers will show it isn't
> as niche as I think :)). I can imagine you could just completely
> turn off kernel readahead for the file and do your special readahead from
> userspace - I think you could use either userfaultfd for triggering it or
> the new fanotify FAN_PREACCESS events.

I'm opposed to anything that'll proliferate VMAs (and from what Kalesh says, he
is too!) I don't really see how we could avoid having to do that for this kind
of case, but I may be missing something...

>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 14:21   ` Lorenzo Stoakes
@ 2025-02-24 16:31     ` Jan Kara
  2025-02-24 16:52       ` Lorenzo Stoakes
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2025-02-24 16:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > Hello!
> >
> > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > Problem Statement
> > > ===============
> > >
> > > Readahead can result in unnecessary page cache pollution for mapped
> > > regions that are never accessed. Current mechanisms to disable
> > > readahead lack granularity and rather operate at the file or VMA
> > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > potential solutions for optimizing page cache/readahead behavior.
> > >
> > >
> > > Background
> > > =========
> > >
> > > The read-ahead heuristics on file-backed memory mappings can
> > > inadvertently populate the page cache with pages corresponding to
> > > regions that user-space processes are known never to access e.g ELF
> > > LOAD segment padding regions. While these pages are ultimately
> > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > particularly when a substantial quantity of such regions exists.
> > >
> > > Although the underlying file can be made sparse in these regions to
> > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > populating the page cache within these ranges. These pages, while
> > > subject to reclaim, introduce additional churn to the LRU. This
> > > reclaim overhead is further exacerbated in filesystems that support
> > > "fault-around" semantics, that can populate the surrounding pages’
> > > PTEs if found present in the page cache.
> > >
> > > While the memory impact may be negligible for large files containing a
> > > limited number of sparse regions, it becomes appreciable for many
> > > small mappings characterized by numerous holes. This scenario can
> > > arise from efforts to minimize vm_area_struct slab memory footprint.
> >
> > OK, I agree the behavior you describe exists. But do you have some
> > real-world numbers showing its extent? I'm not looking for some artificial
> > numbers - sure bad cases can be constructed - but how big practical problem
> > is this? If you can show that average Android phone has 10% of these
> > useless pages in memory then that's one thing and we should be looking for
> > some general solution. If it is more like 0.1%, then why bother?
> >
> > > Limitations of Existing Mechanisms
> > > ===========================
> > >
> > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > entire file, rather than specific sub-regions. The offset and length
> > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > POSIX_FADV_DONTNEED [2] cases.
> > >
> > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > VMA, rather than specific sub-regions. [3]
> > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > population persists. [4]
> >
> > Somewhere else in the thread you complain about readahead extending past
> > the VMA. That's relatively easy to avoid at least for readahead triggered
> > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > relatively uncontroversial change. Note that if someone accesses the file
> > through standard read(2) or write(2) syscall or through different memory
> > mapping, the limits won't apply but such combinations of access are not
> > that common anyway.
> 
> Hm I'm not so sure: map ELF files with different mprotect(), or mprotect()
> different portions of a file and suddenly you lose all the readahead for the
> rest even though you're reading sequentially?

Well, you wouldn't lose all readahead for the rest. Just readahead won't
preread data underlying the next VMA so yes, you get a cache miss and have
to wait for a page to get loaded into cache when transitioning to the next
VMA but once you get there, you'll have readahead running at full speed
again.

So yes, sequential read of a memory mapping of a file fragmented into many
VMAs will be somewhat slower. My impression is such use is rare (sequential
readers tend to use read(2) rather than mmap) but I could be wrong.

> What about shared libraries with r/o parts and exec parts?
> 
> I think we'd really need to do some pretty careful checking to ensure this
> wouldn't break some real world use cases esp. if we really do mostly
> readahead data from page cache.

So I'm not sure if you are not conflating two things here because the above
sentence doesn't make sense to me :). Readahead is the mechanism that
brings data from underlying filesystem into the page cache. Fault-around is
the mechanism that maps into page tables pages present in the page cache
even though they were not necessarily requested by the page fault. By "do mostly
readahead data from page cache" are you speaking about fault-around? That
currently does not cross VMA boundaries anyway as far as I'm reading
do_fault_around()...

> > Regarding controlling readahead for various portions of the file - I'm
> > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > side for such a niche usecase (but maybe your numbers will show it isn't
> > such a niche as I think :)). I can imagine you could just completely
> > turn off kernel readahead for the file and do your special readahead from
> > userspace - I think you could use either userfaultfd for triggering it or
> > new fanotify FAN_PREACCESS events.
> 
> I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> says, he is too!) I don't really see how we could avoid having to do that
> for this kind of case, but I may be missing something...

I don't see why we would need to increase the number of VMAs here at all.
With FAN_PREACCESS you get notification with file & offset when it's
accessed, you can issue readahead(2) calls based on that however you like.
Similarly you can ask for userfaults for the whole mapped range and handle
those. Now thinking more about this, this approach has the downside that
you cannot implement async readahead with it (once PTE is mapped to some
page it won't trigger notifications either with FAN_PREACCESS or with
UFFD). But with UFFD you could at least trigger readahead on minor faults.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 16:31     ` Jan Kara
@ 2025-02-24 16:52       ` Lorenzo Stoakes
  2025-02-24 21:36         ` Kalesh Singh
  2025-02-25 16:21         ` Jan Kara
  0 siblings, 2 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-02-24 16:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko

On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > Hello!
> > >
> > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > Problem Statement
> > > > ===============
> > > >
> > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > regions that are never accessed. Current mechanisms to disable
> > > > readahead lack granularity and rather operate at the file or VMA
> > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > potential solutions for optimizing page cache/readahead behavior.
> > > >
> > > >
> > > > Background
> > > > =========
> > > >
> > > > The read-ahead heuristics on file-backed memory mappings can
> > > > inadvertently populate the page cache with pages corresponding to
> > > > regions that user-space processes are known never to access e.g ELF
> > > > LOAD segment padding regions. While these pages are ultimately
> > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > particularly when a substantial quantity of such regions exists.
> > > >
> > > > Although the underlying file can be made sparse in these regions to
> > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > populating the page cache within these ranges. These pages, while
> > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > reclaim overhead is further exacerbated in filesystems that support
> > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > PTEs if found present in the page cache.
> > > >
> > > > While the memory impact may be negligible for large files containing a
> > > > limited number of sparse regions, it becomes appreciable for many
> > > > small mappings characterized by numerous holes. This scenario can
> > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > >
> > > OK, I agree the behavior you describe exists. But do you have some
> > > real-world numbers showing its extent? I'm not looking for some artificial
> > > numbers - sure bad cases can be constructed - but how big practical problem
> > > is this? If you can show that average Android phone has 10% of these
> > > useless pages in memory then that's one thing and we should be looking for
> > > some general solution. If it is more like 0.1%, then why bother?
> > >
> > > > Limitations of Existing Mechanisms
> > > > ===========================
> > > >
> > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > entire file, rather than specific sub-regions. The offset and length
> > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > POSIX_FADV_DONTNEED [2] cases.
> > > >
> > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > VMA, rather than specific sub-regions. [3]
> > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > population persists. [4]
> > >
> > > Somewhere else in the thread you complain about readahead extending past
> > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > relatively uncontroversial change. Note that if someone accesses the file
> > > through standard read(2) or write(2) syscall or through different memory
> > > mapping, the limits won't apply but such combinations of access are not
> > > that common anyway.
> >
> > Hm I'm not so sure: map ELF files with different mprotect(), or mprotect()
> > different portions of a file and suddenly you lose all the readahead for the
> > rest even though you're reading sequentially?
>
> Well, you wouldn't lose all readahead for the rest. Just readahead won't
> preread data underlying the next VMA so yes, you get a cache miss and have
> to wait for a page to get loaded into cache when transitioning to the next
> VMA but once you get there, you'll have readahead running at full speed
> again.

I'm aware of how readahead works (I _believe_ there's currently a
pre-release of a book with a very extensive section on readahead written by
somebody :P).

Also been looking at it for file-backed guard regions recently, which is
why I've been commenting here specifically as it's been on my mind lately,
and also Kalesh's interest in this stems from a guard region 'scenario'
(hence my cc).

Anyway perhaps I didn't phrase this well - my concern is whether this might
impact performance in real world scenarios, such as one where a VMA is
mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
the same file, in sequential order.

From Kalesh's LPC talk, unless I misinterpreted what he said, this is
precisely what he's doing? I mean we'd not be talking here about mmap()
behaviour with readahead otherwise.

Granted, perhaps you'd only _ever_ be reading sequentially within a
specific VMA's boundaries, rather than going from one to another (excluding
PROT_NONE guards obviously) and that's very possible, if that's what you
mean.

But otherwise, surely this is a thing? And might we therefore be imposing
unnecessary cache misses?

Which is why I suggest...

>
> So yes, sequential read of a memory mapping of a file fragmented into many
> VMAs will be somewhat slower. My impression is such use is rare (sequential
> readers tend to use read(2) rather than mmap) but I could be wrong.
>
> > What about shared libraries with r/o parts and exec parts?
> >
> > I think we'd really need to do some pretty careful checking to ensure this
> > wouldn't break some real world use cases esp. if we really do mostly
> > readahead data from page cache.
>
> So I'm not sure if you are not conflating two things here because the above
> sentence doesn't make sense to me :). Readahead is the mechanism that
> brings data from underlying filesystem into the page cache. Fault-around is
> the mechanism that maps into page tables pages present in the page cache
> although they were not possibly requested by the page fault. By "do mostly
> readahead data from page cache" are you speaking about fault-around? That
> currently does not cross VMA boundaries anyway as far as I'm reading
> do_fault_around()...

...that we test this and see how it behaves :) Which is literally all I
am saying in the above. Ideally with representative workloads.

I mean, I think this shouldn't be a controversial point right? Perhaps
again I didn't communicate this well. But this is all I mean here.

BTW, I understand the difference between readahead and fault-around, you can
run git blame on do_fault_around() if you have doubts about that ;)

And yes fault around is constrained to the VMA (and actually avoids
crossing PTE boundaries).

>
> > > Regarding controlling readahead for various portions of the file - I'm
> > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > side for such a niche usecase (but maybe your numbers will show it isn't
> > > such a niche as I think :)). I can imagine you could just completely
> > > turn off kernel readahead for the file and do your special readahead from
> > > userspace - I think you could use either userfaultfd for triggering it or
> > > new fanotify FAN_PREACCESS events.
> >
> > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > says, he is too!) I don't really see how we could avoid having to do that
> > for this kind of case, but I may be missing something...
>
> I don't see why we would need to be increasing number of VMAs here at all.
> With FAN_PREACCESS you get notification with file & offset when it's
> accessed, you can issue readahead(2) calls based on that however you like.
> Similarly you can ask for userfaults for the whole mapped range and handle
> those. Now thinking more about this, this approach has the downside that
> you cannot implement async readahead with it (once PTE is mapped to some
> page it won't trigger notifications either with FAN_PREACCESS or with
> UFFD). But with UFFD you could at least trigger readahead on minor faults.

Yeah we're talking past each other on this, sorry I missed your point about
fanotify there!

uffd is probably not workable given the overhead, I would have thought.

I am really unaware of how fanotify works so I mean cool if you can find a
solution this way, awesome :)

I'm just saying, if we need to somehow retain state about regions which
should have adjusted readahead behaviour at a VMA level, I can't see how
this could be done without VMA fragmentation and I'd rather we didn't.

If we can avoid that great!

>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 16:52       ` Lorenzo Stoakes
@ 2025-02-24 21:36         ` Kalesh Singh
  2025-02-24 21:55           ` Kalesh Singh
                             ` (3 more replies)
  2025-02-25 16:21         ` Jan Kara
  1 sibling, 4 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-24 21:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel

On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > Hello!
> > > >
> > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > Problem Statement
> > > > > ===============
> > > > >
> > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > regions that are never accessed. Current mechanisms to disable
> > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > >
> > > > >
> > > > > Background
> > > > > =========
> > > > >
> > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > inadvertently populate the page cache with pages corresponding to
> > > > > regions that user-space processes are known never to access e.g ELF
> > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > particularly when a substantial quantity of such regions exists.
> > > > >
> > > > > Although the underlying file can be made sparse in these regions to
> > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > populating the page cache within these ranges. These pages, while
> > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > > PTEs if found present in the page cache.
> > > > >
> > > > > While the memory impact may be negligible for large files containing a
> > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > small mappings characterized by numerous holes. This scenario can
> > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > >

Hi Jan, Lorenzo, thanks for the comments.

> > > > OK, I agree the behavior you describe exists. But do you have some
> > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > is this? If you can show that average Android phone has 10% of these
> > > > useless pages in memory then that's one thing and we should be looking for
> > > > some general solution. If it is more like 0.1%, then why bother?
> > > >

Once I revert a workaround that we currently have to avoid
fault-around for these regions (we don't have an out-of-tree solution
to prevent the page cache population), our CI, which checks memory
usage after performing some common app user journeys, reports
regressions as shown in the snippet below. Note that the increases
here are only for the populated PTEs (bounded by the VMA), so the
actual pollution is theoretically larger.

Metric: perfetto_media.extractor#file-rss-avg
Increased by 7.495 MB (32.7%)

Metric: perfetto_/system/bin/audioserver#file-rss-avg
Increased by 6.262 MB (29.8%)

Metric: perfetto_/system/bin/mediaserver#file-rss-max
Increased by 8.325 MB (28.0%)

Metric: perfetto_/system/bin/mediaserver#file-rss-avg
Increased by 8.198 MB (28.4%)

Metric: perfetto_media.extractor#file-rss-max
Increased by 7.95 MB (33.6%)

Metric: perfetto_/system/bin/incidentd#file-rss-avg
Increased by 0.896 MB (20.4%)

Metric: perfetto_/system/bin/audioserver#file-rss-max
Increased by 6.883 MB (31.9%)

Metric: perfetto_media.swcodec#file-rss-max
Increased by 7.236 MB (34.9%)

Metric: perfetto_/system/bin/incidentd#file-rss-max
Increased by 1.003 MB (22.7%)

Metric: perfetto_/system/bin/cameraserver#file-rss-avg
Increased by 6.946 MB (34.2%)

Metric: perfetto_/system/bin/cameraserver#file-rss-max
Increased by 7.205 MB (33.8%)

Metric: perfetto_com.android.nfc#file-rss-max
Increased by 8.525 MB (9.8%)

Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
Increased by 3.715 MB (3.6%)

Metric: perfetto_media.swcodec#file-rss-avg
Increased by 5.096 MB (27.1%)

[...]

The issue is widespread across processes because, in order to support
larger page sizes, Android requires that the ELF segments be at least
16KB aligned, which leads to the padding regions (never accessed).

> > > > > Limitations of Existing Mechanisms
> > > > > ===========================
> > > > >
> > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > >
> > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > VMA, rather than specific sub-regions. [3]
> > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > population persists. [4]
> > > >
> > > > Somewhere else in the thread you complain about readahead extending past
> > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > through standard read(2) or write(2) syscall or through different memory
> > > > mapping, the limits won't apply but such combinations of access are not
> > > > that common anyway.
> > >
> > > Hm I'm not so sure: map ELF files with different mprotect(), or mprotect()
> > > different portions of a file and suddenly you lose all the readahead for the
> > > rest even though you're reading sequentially?
> >
> > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > preread data underlying the next VMA so yes, you get a cache miss and have
> > to wait for a page to get loaded into cache when transitioning to the next
> > VMA but once you get there, you'll have readahead running at full speed
> > again.
>
> I'm aware of how readahead works (I _believe_ there's currently a
> pre-release of a book with a very extensive section on readahead written by
> somebody :P).
>
> Also been looking at it for file-backed guard regions recently, which is
> why I've been commenting here specifically as it's been on my mind lately,
> and also Kalesh's interest in this stems from a guard region 'scenario'
> (hence my cc).
>
> Anyway perhaps I didn't phrase this well - my concern is whether this might
> impact performance in real world scenarios, such as one where a VMA is
> mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> the same file, in sequential order.
>
> From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> precisely what he's doing? I mean we'd not be talking here about mmap()
> behaviour with readahead otherwise.
>
> Granted, perhaps you'd only _ever_ be reading sequentially within a
> specific VMA's boundaries, rather than going from one to another (excluding
> PROT_NONE guards obviously) and that's very possible, if that's what you
> mean.
>
> But otherwise, surely this is a thing? And might we therefore be imposing
> unnecessary cache misses?
>
> Which is why I suggest...
>
> >
> > So yes, sequential read of a memory mapping of a file fragmented into many
> > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > readers tend to use read(2) rather than mmap) but I could be wrong.
> >
> > > What about shared libraries with r/o parts and exec parts?
> > >
> > > I think we'd really need to do some pretty careful checking to ensure this
> > > wouldn't break some real world use cases esp. if we really do mostly
> > > readahead data from page cache.
> >
> > So I'm not sure if you are not conflating two things here because the above
> > sentence doesn't make sense to me :). Readahead is the mechanism that
> > brings data from underlying filesystem into the page cache. Fault-around is
> > the mechanism that maps into page tables pages present in the page cache
> > although they were not possibly requested by the page fault. By "do mostly
> > readahead data from page cache" are you speaking about fault-around? That
> > currently does not cross VMA boundaries anyway as far as I'm reading
> > do_fault_around()...
>
> ...that we test this and see how it behaves :) Which is literally all I
> am saying in the above. Ideally with representative workloads.
>
> I mean, I think this shouldn't be a controversial point right? Perhaps
> again I didn't communicate this well. But this is all I mean here.
>
> BTW, I understand the difference between readahead and fault-around, you can
> run git blame on do_fault_around() if you have doubts about that ;)
>
> And yes fault around is constrained to the VMA (and actually avoids
> crossing PTE boundaries).
>
> >
> > > > Regarding controlling readahead for various portions of the file - I'm
> > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > side for such a niche usecase (but maybe your numbers will show it isn't
> > > > such a niche as I think :)). I can imagine you could just completely
> > > > turn off kernel readahead for the file and do your special readahead from
> > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > new fanotify FAN_PREACCESS events.
> > >

Something like this would be ideal for the use case where uncompressed
ELF files are mapped directly from zipped APKs without extracting them
(I don't have any real-world numbers for this case atm). I also don't
know whether the cache miss on the subsequent VMAs has significant
overhead in practice ... I'll try to collect some data for this.

> > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > says, he is too!) I don't really see how we could avoid having to do that
> > > for this kind of case, but I may be missing something...
> >
> > I don't see why we would need to be increasing number of VMAs here at all.
> > With FAN_PREACCESS you get notification with file & offset when it's
> > accessed, you can issue readahead(2) calls based on that however you like.
> > Similarly you can ask for userfaults for the whole mapped range and handle
> > those. Now thinking more about this, this approach has the downside that
> > you cannot implement async readahead with it (once PTE is mapped to some
> > page it won't trigger notifications either with FAN_PREACCESS or with
> > UFFD). But with UFFD you could at least trigger readahead on minor faults.
>
> Yeah we're talking past each other on this, sorry I missed your point about
> fanotify there!
>
> uffd is probably not reasonably workable given overhead I would have
> thought.
>
> I am really unaware of how fanotify works so I mean cool if you can find a
> solution this way, awesome :)
>
> I'm just saying, if we need to somehow retain state about regions which
> should have adjusted readahead behaviour at a VMA level, I can't see how
> this could be done without VMA fragmentation and I'd rather we didn't.
>
> If we can avoid that great!

Another possible way to look at this: for the regressions shared
above, caused by the ELF padding regions, we are able to make these
regions sparse (for *almost* all cases) -- solving the shared zero
page problem for file mappings would also eliminate much of this
overhead. So perhaps we should tackle this angle, if that's a more
tangible solution?

From the previous discussions that Matthew shared [7], it seems Dave
proposed, as an alternative to moving the extents to the VFS layer,
inverting the IO read path operations [8]. Maybe this is a more
approachable solution, since there is precedent for the same in the
write path?

[7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/

Thanks,
Kalesh
>
> >
> >                                                               Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 21:36         ` Kalesh Singh
@ 2025-02-24 21:55           ` Kalesh Singh
  2025-02-24 23:56           ` Dave Chinner
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-24 21:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel, david

On Mon, Feb 24, 2025 at 1:36 PM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > > Hello!
> > > > >
> > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > > Problem Statement
> > > > > > ===============
> > > > > >
> > > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > > regions that are never accessed. Current mechanisms to disable
> > > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > > >
> > > > > >
> > > > > > Background
> > > > > > =========
> > > > > >
> > > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > > inadvertently populate the page cache with pages corresponding to
> > > > > > regions that user-space processes are known never to access e.g ELF
> > > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > > particularly when a substantial quantity of such regions exists.
> > > > > >
> > > > > > Although the underlying file can be made sparse in these regions to
> > > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > > populating the page cache within these ranges. These pages, while
> > > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > > > PTEs if found present in the page cache.
> > > > > >
> > > > > > While the memory impact may be negligible for large files containing a
> > > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > > small mappings characterized by numerous holes. This scenario can
> > > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > > >
>
> Hi Jan, Lorenzo, thanks for the comments.
>
> > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > > is this? If you can show that average Android phone has 10% of these
> > > > > useless pages in memory than that's one thing and we should be looking for
> > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > >
>
> Once I revert a workaround that we currently have to avoid
> fault-around for these regions (we don't have an out of tree solution
> to prevent the page cache population); our CI which checks memory
> usage after performing some common app user-journeys; reports
> regressions as shown in the snippet below. Note, that the increases
> here are only for the populated PTEs (bounded by VMA) so the actual
> pollution is theoretically larger.
>
> Metric: perfetto_media.extractor#file-rss-avg
> Increased by 7.495 MB (32.7%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-avg
> Increased by 6.262 MB (29.8%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-max
> Increased by 8.325 MB (28.0%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> Increased by 8.198 MB (28.4%)
>
> Metric: perfetto_media.extractor#file-rss-max
> Increased by 7.95 MB (33.6%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-avg
> Increased by 0.896 MB (20.4%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-max
> Increased by 6.883 MB (31.9%)
>
> Metric: perfetto_media.swcodec#file-rss-max
> Increased by 7.236 MB (34.9%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-max
> Increased by 1.003 MB (22.7%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> Increased by 6.946 MB (34.2%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-max
> Increased by 7.205 MB (33.8%)
>
> Metric: perfetto_com.android.nfc#file-rss-max
> Increased by 8.525 MB (9.8%)
>
> Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> Increased by 3.715 MB (3.6%)
>
> Metric: perfetto_media.swcodec#file-rss-avg
> Increased by 5.096 MB (27.1%)
>
> [...]
>
> The issue is widespread across processes because, in order to support
> larger page sizes, Android requires that ELF segments be at least
> 16 KB aligned, which leads to the padding regions (never accessed).
>
> > > > > > Limitations of Existing Mechanisms
> > > > > > ===========================
> > > > > >
> > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > >
> > > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > > VMA, rather than specific sub-regions. [3]
> > > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > > population persists. [4]
> > > > >
> > > > > Somewhere else in the thread you complain about readahead extending past
> > > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > > through standard read(2) or write(2) syscall or through different memory
> > > > > mapping, the limits won't apply but such combinations of access are not
> > > > > that common anyway.
> > > >
> > > > Hm I'm not so sure: map ELF files with different mprotect(), or mprotect()
> > > > different portions of a file, and suddenly you lose all the readahead for the
> > > > rest even though you're reading sequentially?
> > >
> > > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > > preread data underlying the next VMA so yes, you get a cache miss and have
> > > to wait for a page to get loaded into cache when transitioning to the next
> > > VMA but once you get there, you'll have readahead running at full speed
> > > again.
> >
> > I'm aware of how readahead works (I _believe_ there's currently a
> > pre-release of a book with a very extensive section on readahead written by
> > somebody :P).
> >
> > Also been looking at it for file-backed guard regions recently, which is
> > why I've been commenting here specifically as it's been on my mind lately,
> > and also Kalesh's interest in this stems from a guard region 'scenario'
> > (hence my cc).
> >
> > Anyway perhaps I didn't phrase this well - my concern is whether this might
> > impact performance in real world scenarios, such as one where a VMA is
> > mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> > the same file, in sequential order.
> >
> > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > precisely what he's doing? I mean we'd not be talking here about mmap()
> > behaviour with readahead otherwise.
> >
> > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > specific VMA's boundaries, rather than going from one to another (excluding
> > PROT_NONE guards obviously) and that's very possible, if that's what you
> > mean.
> >
> > But otherwise, surely this is a thing? And might we therefore be imposing
> > unnecessary cache misses?
> >
> > Which is why I suggest...
> >
> > >
> > > So yes, sequential read of a memory mapping of a file fragmented into many
> > > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > > readers tend to use read(2) rather than mmap) but I could be wrong.
> > >
> > > > What about shared libraries with r/o parts and exec parts?
> > > >
> > > > I think we'd really need to do some pretty careful checking to ensure this
> > > > wouldn't break some real world use cases esp. if we really do mostly
> > > > readahead data from page cache.
> > >
> > > So I'm not sure if you are not conflating two things here because the above
> > > sentence doesn't make sense to me :). Readahead is the mechanism that
> > > brings data from underlying filesystem into the page cache. Fault-around is
> > > the mechanism that maps into page tables pages present in the page cache
> > > although they were not possibly requested by the page fault. By "do mostly
> > > readahead data from page cache" are you speaking about fault-around? That
> > > currently does not cross VMA boundaries anyway as far as I'm reading
> > > do_fault_around()...
> >
> > ...that we test this and see how it behaves :) Which is literally all I
> > am saying in the above. Ideally with representative workloads.
> >
> > I mean, I think this shouldn't be a controversial point right? Perhaps
> > again I didn't communicate this well. But this is all I mean here.
> >
> > BTW, I understand the difference between readahead and fault-around, you can
> > run git blame on do_fault_around() if you have doubts about that ;)
> >
> > And yes fault around is constrained to the VMA (and actually avoids
> > crossing PTE boundaries).
> >
> > >
> > > > > Regarding controlling readahead for various portions of the file - I'm
> > > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > > side for such a niche use case (but maybe your numbers will show it isn't
> > > > > such a niche as I think :)). I can imagine you could just completely
> > > > > turn off kernel readahead for the file and do your special readahead from
> > > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > > new fanotify FAN_PREACCESS events.
> > > >
>
> Something like this would be ideal for the use case where uncompressed
> ELF files are mapped directly from zipped APKs without extracting
> them. (I don't have any real world number for this case atm). I also
> don't know if the cache miss on the subsequent VMAs has significant
> overhead in practice ... I'll try to collect some data for this.
>
> > > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > > says, he is too!) I don't really see how we could avoid having to do that
> > > > for this kind of case, but I may be missing something...
> > >
> > > I don't see why we would need to be increasing number of VMAs here at all.
> > > With FAN_PREACCESS you get notification with file & offset when it's
> > > accessed, you can issue readahead(2) calls based on that however you like.
> > > Similarly you can ask for userfaults for the whole mapped range and handle
> > > those. Now thinking more about this, this approach has the downside that
> > > you cannot implement async readahead with it (once PTE is mapped to some
> > > page it won't trigger notifications either with FAN_PREACCESS or with
> > > UFFD). But with UFFD you could at least trigger readahead on minor faults.
> >
> > Yeah we're talking past each other on this, sorry I missed your point about
> > fanotify there!
> >
> > uffd is probably not reasonably workable given overhead I would have
> > thought.
> >
> > I am really unaware of how fanotify works so I mean cool if you can find a
> > solution this way, awesome :)
> >
> > I'm just saying, if we need to somehow retain state about regions which
> > should have adjusted readahead behaviour at a VMA level, I can't see how
> > this could be done without VMA fragmentation and I'd rather we didn't.
> >
> > If we can avoid that great!
>
> Another possible way we can look at this: in the regressions shared
> above, caused by the ELF padding regions, we are able to make these
> regions sparse (for *almost* all cases) -- solving the shared zero
> page problem for file mappings would also eliminate much of this
> overhead. So perhaps we should tackle this angle, if that's a more
> tangible solution?
>
> From the previous discussions that Matthew shared [7], it seems like
> Dave proposed an alternative to moving the extents to the VFS layer:
> invert the IO read path operations [8]. Maybe this is a more
> approachable solution, since there is precedent for the same in the
> write path?
>
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/

+ cc: Dave Chinner

>
> Thanks,
> Kalesh
> >
> > >
> > >                                                               Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 21:36         ` Kalesh Singh
  2025-02-24 21:55           ` Kalesh Singh
@ 2025-02-24 23:56           ` Dave Chinner
  2025-02-25  6:45             ` Kalesh Singh
  2025-02-27 22:12             ` Matthew Wilcox
  2025-02-25  5:44           ` Lorenzo Stoakes
  2025-02-25 16:36           ` Jan Kara
  3 siblings, 2 replies; 26+ messages in thread
From: Dave Chinner @ 2025-02-24 23:56 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: Lorenzo Stoakes, Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> > Another possible way we can look at this: in the regressions shared
> > above, caused by the ELF padding regions, we are able to make these
> > regions sparse (for *almost* all cases) -- solving the shared zero
> > page problem for file mappings would also eliminate much of this
> > overhead. So perhaps we should tackle this angle, if that's a more
> > tangible solution?
> 
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer:
> > invert the IO read path operations [8]. Maybe this is a more
> > approachable solution, since there is precedent for the same in the
> > write path?
> 
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/

Yes, if we are going to optimise away redundant zeros being stored
in the page cache over holes, we need to know where the holes in the
file are before the page cache is populated.

As for efficient hole tracking in the mapping tree, I suspect that
we should be looking at using exceptional entries in the mapping
tree for holes, not inserting multiple references to the zero folio.
i.e. the important information for data storage optimisation is that
the region covers a hole, not that it contains zeros.

For buffered reads, all that is required when such an exceptional
entry is returned is a memset of the user buffer. For buffered
writes, we simply treat it like a normal folio allocating write and
replace the exceptional entry with the allocated (and zeroed) folio.

For read page faults, the zero page gets mapped (and maybe
accounted) via the vma rather than the mapping tree entry. For write
faults, a folio gets allocated and the exceptional entry replaced
before we call into ->page_mkwrite().

Invalidation simply removes the exceptional entries.

This largely gets rid of needing to care about the zero page outside
of mmap() context where something needs to be mapped into the
userspace mm context. Let the page fault/mm context substitute the
zero page in the PTE mappings where necessary, but we don't need to
use and/or track the zero page in the page cache itself....

FWIW, this also lends itself to storing unwritten extent information
in exceptional entries. One of the problems we have is unwritten
extents can contain either zeros (only read so far) or data (been
overwritten in memory, but not yet flushed to disk). This is the problem
that SEEK_DATA has to navigate - it has to walk the page cache over
unwritten extents to determine if there is data over the unwritten
extent or not.

In this case, an exceptional entry gets added on read, which is then
replaced with an actual folio on write. Now SEEK_DATA can easily and
safely determine where the data actually lies over the unwritten
extent with a mapping tree walk instead of having to load and lock
each folio to check it is dirty or not....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 21:36         ` Kalesh Singh
  2025-02-24 21:55           ` Kalesh Singh
  2025-02-24 23:56           ` Dave Chinner
@ 2025-02-25  5:44           ` Lorenzo Stoakes
  2025-02-25  6:59             ` Kalesh Singh
  2025-02-25 16:36           ` Jan Kara
  3 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  5:44 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel

On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > > Hello!
> > > > >
> > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > > Problem Statement
> > > > > > ===============
> > > > > >
> > > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > > regions that are never accessed. Current mechanisms to disable
> > > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > > >
> > > > > >
> > > > > > Background
> > > > > > =========
> > > > > >
> > > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > > inadvertently populate the page cache with pages corresponding to
> > > > > > regions that user-space processes are known never to access, e.g. ELF
> > > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > > particularly when a substantial quantity of such regions exists.
> > > > > >
> > > > > > Although the underlying file can be made sparse in these regions to
> > > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > > populating the page cache within these ranges. These pages, while
> > > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > > > PTEs if found present in the page cache.
> > > > > >
> > > > > > While the memory impact may be negligible for large files containing a
> > > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > > small mappings characterized by numerous holes. This scenario can
> > > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > > >
>
> Hi Jan, Lorenzo, thanks for the comments.
>
> > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > > is this? If you can show that average Android phone has 10% of these
> > > > > useless pages in memory than that's one thing and we should be looking for
> > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > >
>
> Once I revert a workaround that we currently have to avoid
> fault-around for these regions (we don't have an out-of-tree solution
> to prevent the page cache population), our CI, which checks memory
> usage after performing some common app user journeys, reports
> regressions as shown in the snippet below. Note that the increases
> here are only for the populated PTEs (bounded by the VMA), so the
> actual pollution is theoretically larger.

Hm fault-around populates these duplicate zero pages? I guess it would
actually. I'd be curious to hear about this out-of-tree patch, and I wonder how
upstreamable it might be? :)

>
> Metric: perfetto_media.extractor#file-rss-avg
> Increased by 7.495 MB (32.7%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-avg
> Increased by 6.262 MB (29.8%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-max
> Increased by 8.325 MB (28.0%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> Increased by 8.198 MB (28.4%)
>
> Metric: perfetto_media.extractor#file-rss-max
> Increased by 7.95 MB (33.6%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-avg
> Increased by 0.896 MB (20.4%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-max
> Increased by 6.883 MB (31.9%)
>
> Metric: perfetto_media.swcodec#file-rss-max
> Increased by 7.236 MB (34.9%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-max
> Increased by 1.003 MB (22.7%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> Increased by 6.946 MB (34.2%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-max
> Increased by 7.205 MB (33.8%)
>
> Metric: perfetto_com.android.nfc#file-rss-max
> Increased by 8.525 MB (9.8%)
>
> Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> Increased by 3.715 MB (3.6%)
>
> Metric: perfetto_media.swcodec#file-rss-avg
> Increased by 5.096 MB (27.1%)

Yikes yeah.

>
> [...]
>
> The issue is widespread across processes because, in order to support
> larger page sizes, Android requires that ELF segments be at least
> 16 KB aligned, which leads to the padding regions (never accessed).

Again I wonder if the _really_ important problem here is this duplicate zero
page proliferation?

As Matthew points out, fixing this might be quite involved, but this isn't
pushing back on doing so, it's good to fix things even if it's hard :>)

>
> > > > > > Limitations of Existing Mechanisms
> > > > > > ===========================
> > > > > >
> > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > >
> > > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > > VMA, rather than specific sub-regions. [3]
> > > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > > population persists. [4]
> > > > >
> > > > > Somewhere else in the thread you complain about readahead extending past
> > > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > > through standard read(2) or write(2) syscall or through different memory
> > > > > mapping, the limits won't apply but such combinations of access are not
> > > > > that common anyway.
> > > >
> > > > Hm I'm not so sure: map ELF files with different mprotect(), or mprotect()
> > > > different portions of a file, and suddenly you lose all the readahead for the
> > > > rest even though you're reading sequentially?
> > >
> > > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > > preread data underlying the next VMA so yes, you get a cache miss and have
> > > to wait for a page to get loaded into cache when transitioning to the next
> > > VMA but once you get there, you'll have readahead running at full speed
> > > again.
> >
> > I'm aware of how readahead works (I _believe_ there's currently a
> > pre-release of a book with a very extensive section on readahead written by
> > somebody :P).
> >
> > Also been looking at it for file-backed guard regions recently, which is
> > why I've been commenting here specifically as it's been on my mind lately,
> > and also Kalesh's interest in this stems from a guard region 'scenario'
> > (hence my cc).
> >
> > Anyway perhaps I didn't phrase this well - my concern is whether this might
> > impact performance in real world scenarios, such as one where a VMA is
> > mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> > the same file, in sequential order.
> >
> > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > precisely what he's doing? I mean we'd not be talking here about mmap()
> > behaviour with readahead otherwise.
> >
> > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > specific VMA's boundaries, rather than going from one to another (excluding
> > PROT_NONE guards obviously) and that's very possible, if that's what you
> > mean.
> >
> > But otherwise, surely this is a thing? And might we therefore be imposing
> > unnecessary cache misses?
> >
> > Which is why I suggest...
> >
> > >
> > > So yes, sequential read of a memory mapping of a file fragmented into many
> > > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > > readers tend to use read(2) rather than mmap) but I could be wrong.
> > >
> > > > What about shared libraries with r/o parts and exec parts?
> > > >
> > > > I think we'd really need to do some pretty careful checking to ensure this
> > > > wouldn't break some real world use cases esp. if we really do mostly
> > > > readahead data from page cache.
> > >
> > > So I'm not sure if you are not conflating two things here because the above
> > > sentence doesn't make sense to me :). Readahead is the mechanism that
> > > brings data from underlying filesystem into the page cache. Fault-around is
> > > the mechanism that maps into page tables pages present in the page cache
> > > although they were not possibly requested by the page fault. By "do mostly
> > > readahead data from page cache" are you speaking about fault-around? That
> > > currently does not cross VMA boundaries anyway as far as I'm reading
> > > do_fault_around()...
> >
> > ...that we test this and see how it behaves :) Which is literally all I
> > am saying in the above. Ideally with representative workloads.
> >
> > I mean, I think this shouldn't be a controversial point right? Perhaps
> > again I didn't communicate this well. But this is all I mean here.
> >
> > BTW, I understand the difference between readahead and fault-around, you can
> > run git blame on do_fault_around() if you have doubts about that ;)
> >
> > And yes fault around is constrained to the VMA (and actually avoids
> > crossing PTE boundaries).
> >
> > >
> > > > > Regarding controlling readahead for various portions of the file - I'm
> > > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > > side for such a niche use case (but maybe your numbers will show it isn't
> > > > > such a niche as I think :)). I can imagine you could just completely
> > > > > turn off kernel readahead for the file and do your special readahead from
> > > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > > new fanotify FAN_PREACCESS events.
> > > >
>
> Something like this would be ideal for the use case where uncompressed
> ELF files are mapped directly from zipped APKs without extracting
> them. (I don't have any real world number for this case atm). I also
> don't know if the cache miss on the subsequent VMAs has significant
> overhead in practice ... I'll try to collect some data for this.
>
> > > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > > says, he is too!) I don't really see how we could avoid having to do that
> > > > for this kind of case, but I may be missing something...
> > >
> > > I don't see why we would need to be increasing number of VMAs here at all.
> > > With FAN_PREACCESS you get notification with file & offset when it's
> > > accessed, you can issue readahead(2) calls based on that however you like.
> > > Similarly you can ask for userfaults for the whole mapped range and handle
> > > those. Now thinking more about this, this approach has the downside that
> > > you cannot implement async readahead with it (once PTE is mapped to some
> > > page it won't trigger notifications either with FAN_PREACCESS or with
> > > UFFD). But with UFFD you could at least trigger readahead on minor faults.
> >
> > Yeah we're talking past each other on this, sorry I missed your point about
> > fanotify there!
> >
> > uffd is probably not reasonably workable given overhead I would have
> > thought.
> >
> > I am really unaware of how fanotify works so I mean cool if you can find a
> > solution this way, awesome :)
> >
> > I'm just saying, if we need to somehow retain state about regions which
> > should have adjusted readahead behaviour at a VMA level, I can't see how
> > this could be done without VMA fragmentation and I'd rather we didn't.
> >
> > If we can avoid that great!
>
> > Another possible way we can look at this: in the regressions shared
> > above, caused by the ELF padding regions, we are able to make these
> > regions sparse (for *almost* all cases) -- solving the shared zero
> > page problem for file mappings would also eliminate much of this
> > overhead. So perhaps we should tackle this angle, if that's a more
> > tangible solution?

To me it seems we are converging on this as at least part of the solution.

>
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer:
> > invert the IO read path operations [8]. Maybe this is a more
> > approachable solution, since there is precedent for the same in the
> > write path?
>
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
>
> Thanks,
> Kalesh
> >
> > >
> > >                                                               Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR

Overall I think we can conclude - this is a topic of interest to people for
LSF :)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 23:56           ` Dave Chinner
@ 2025-02-25  6:45             ` Kalesh Singh
  2025-02-27 22:12             ` Matthew Wilcox
  1 sibling, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-25  6:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Lorenzo Stoakes, Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Mon, Feb 24, 2025 at 3:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> > Another possible way we can look at this: in the regressions shared
> > above, caused by the ELF padding regions, we are able to make these
> > regions sparse (for *almost* all cases) -- solving the shared zero
> > page problem for file mappings would also eliminate much of this
> > overhead. So perhaps we should tackle this angle, if that's a more
> > tangible solution?
> >
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer:
> > invert the IO read path operations [8]. Maybe this is a more
> > approachable solution, since there is precedent for the same in the
> > write path?
> >
> > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
>
> Yes, if we are going to optimise away redundant zeros being stored
> in the page cache over holes, we need to know where the holes in the
> file are before the page cache is populated.
>
> As for efficient hole tracking in the mapping tree, I suspect that
> we should be looking at using exceptional entries in the mapping
> tree for holes, not inserting multiple references to the zero folio.
> i.e. the important information for data storage optimisation is that
> the region covers a hole, not that it contains zeros.
>
> For buffered reads, all that is required when such an exceptional
> entry is returned is a memset of the user buffer. For buffered
> writes, we simply treat it like a normal folio allocating write and
> replace the exceptional entry with the allocated (and zeroed) folio.
>
> For read page faults, the zero page gets mapped (and maybe
> accounted) via the vma rather than the mapping tree entry. For write
> faults, a folio gets allocated and the exceptional entry replaced
> before we call into ->page_mkwrite().
>
> Invalidation simply removes the exceptional entries.
>
> This largely gets rid of needing to care about the zero page outside
> of mmap() context where something needs to be mapped into the
> userspace mm context. Let the page fault/mm context substitute the
> zero page in the PTE mappings where necessary, but we don't need to
> use and/or track the zero page in the page cache itself....
>
> FWIW, this also lends itself to storing unwritten extent information
> in exceptional entries. One of the problems we have is unwritten
> extents can contain either zeros (only read so far) or data (been
> overwritten in memory, but not yet flushed to disk). This is the problem
> that SEEK_DATA has to navigate - it has to walk the page cache over
> unwritten extents to determine if there is data over the unwritten
> extent or not.
>
> In this case, an exceptional entry gets added on read, which is then
> replaced with an actual folio on write. Now SEEK_DATA can easily and
> safely determine where the data actually lies over the unwritten
> extent with a mapping tree walk instead of having to load and lock
> each folio to check it is dirty or not....

Thank you for the very detailed explanation Dave.

I think this approach with the exceptional entries and the allocation
decision happening at fault time would also allow us to introduce this
incrementally, for MAP_PRIVATE and then MAP_SHARED, should there be any
unforeseen issues with MAP_SHARED ...

and filemap_map_pages() would already correctly handle the exceptional
entries for fault-around ...

--Kalesh

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-25  5:44           ` Lorenzo Stoakes
@ 2025-02-25  6:59             ` Kalesh Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-25  6:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT, linux-fsdevel,
	Suren Baghdasaryan, David Hildenbrand, Liam R. Howlett,
	Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel

On Mon, Feb 24, 2025 at 9:45 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> > On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > > > Hello!
> > > > > >
> > > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > > > Problem Statement
> > > > > > > ===============
> > > > > > >
> > > > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > > > regions that are never accessed. Current mechanisms to disable
> > > > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > > > >
> > > > > > >
> > > > > > > Background
> > > > > > > =========
> > > > > > >
> > > > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > > > inadvertently populate the page cache with pages corresponding to
> > > > > > > regions that user-space processes are known never to access e.g ELF
> > > > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > > > particularly when a substantial quantity of such regions exists.
> > > > > > >
> > > > > > > Although the underlying file can be made sparse in these regions to
> > > > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > > > populating the page cache within these ranges. These pages, while
> > > > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > > > > PTEs if found present in the page cache.
> > > > > > >
> > > > > > > While the memory impact may be negligible for large files containing a
> > > > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > > > small mappings characterized by numerous holes. This scenario can
> > > > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > > > >
> >
> > Hi Jan, Lorenzo, thanks for the comments.
> >
> > > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > > > is this? If you can show that average Android phone has 10% of these
> > > > > > useless pages in memory then that's one thing and we should be looking for
> > > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > > >
> >
> > Once I revert a workaround that we currently have to avoid
> > fault-around for these regions (we don't have an out-of-tree solution
> > to prevent the page cache population), our CI, which checks memory
> > usage after performing some common app user-journeys, reports
> > regressions as shown in the snippet below. Note that the increases
> > here are only for the populated PTEs (bounded by the VMA), so the
> > actual pollution is theoretically larger.
>
> Hm fault-around populates these duplicate zero pages? I guess it would
> actually. I'd be curious to hear about this out-of-tree patch, and I wonder how
> upstreamable it might be? :)

Let's say it's a hack I'd prefer not to post on the list :) It's very
particular to our use case, so it would be great to find a generic
solution that everyone can benefit from.

>
> >
> > Metric: perfetto_media.extractor#file-rss-avg
> > Increased by 7.495 MB (32.7%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-avg
> > Increased by 6.262 MB (29.8%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-max
> > Increased by 8.325 MB (28.0%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> > Increased by 8.198 MB (28.4%)
> >
> > Metric: perfetto_media.extractor#file-rss-max
> > Increased by 7.95 MB (33.6%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-avg
> > Increased by 0.896 MB (20.4%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-max
> > Increased by 6.883 MB (31.9%)
> >
> > Metric: perfetto_media.swcodec#file-rss-max
> > Increased by 7.236 MB (34.9%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-max
> > Increased by 1.003 MB (22.7%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> > Increased by 6.946 MB (34.2%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-max
> > Increased by 7.205 MB (33.8%)
> >
> > Metric: perfetto_com.android.nfc#file-rss-max
> > Increased by 8.525 MB (9.8%)
> >
> > Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> > Increased by 3.715 MB (3.6%)
> >
> > Metric: perfetto_media.swcodec#file-rss-avg
> > Increased by 5.096 MB (27.1%)
>
> Yikes yeah.
>
> >
> > [...]
> >
> > The issue is widespread across processes because in order to support
> > larger page sizes Android has a requirement that the ELF segments are
> > at least 16KB aligned, which leads to the padding regions (never
> > accessed).
>
> Again I wonder if the _really_ important problem here is this duplicate zero
> page proliferation?
>

Initially I didn't want to bias the discussion to only working for
sparse files, since there could be never-accessed file-backed regions
that are not necessarily sparse (guard regions?). But for the major
issue / use case in this thread, yes, it suffices to solve the zero
page problem. Perhaps the other issues mentioned can be revisited
separately if/when we have some real-world numbers, as Jan suggested.
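
The sparse-file premise above is visible from userspace: a hole occupies
no disk blocks, yet any read of that range must still produce
zero-filled pages. A minimal demonstration (the temporary file name is
chosen by the OS; block counts are filesystem-dependent):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 20)         # 1 MiB hole: no blocks allocated
print(os.stat(path).st_blocks)    # typically 0 -- the file is fully sparse

data = os.pread(fd, 4096, 0)      # reading the hole returns zeros,
print(data == b"\0" * 4096)       # but pages still had to be materialized
os.close(fd)
os.remove(path)
```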

> As Matthew points out, fixing this might be quite involved, but this isn't
> pushing back on doing so, it's good to fix things even if it's hard :>)
>
> >
> > > > > > > Limitations of Existing Mechanisms
> > > > > > > ===========================
> > > > > > >
> > > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > > >
> > > > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > > > VMA, rather than specific sub-regions. [3]
> > > > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > > > population persists. [4]
> > > > > >
> > > > > > Somewhere else in the thread you complain about readahead extending past
> > > > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > > > through standard read(2) or write(2) syscall or through different memory
> > > > > > mapping, the limits won't apply but such combinations of access are not
> > > > > > that common anyway.
> > > > >
> > > > > Hm I'm not so sure, map elf files with different mprotect(), or mprotect()
> > > > > different portions of a file and suddenly you lose all the readahead for the
> > > > > rest even though you're reading sequentially?
> > > >
> > > > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > > > preread data underlying the next VMA so yes, you get a cache miss and have
> > > > to wait for a page to get loaded into cache when transitioning to the next
> > > > VMA but once you get there, you'll have readahead running at full speed
> > > > again.
> > >
> > > I'm aware of how readahead works (I _believe_ there's currently a
> > > pre-release of a book with a very extensive section on readahead written by
> > > somebody :P).
> > >
> > > Also been looking at it for file-backed guard regions recently, which is
> > > why I've been commenting here specifically as it's been on my mind lately,
> > > and also Kalesh's interest in this stems from a guard region 'scenario'
> > > (hence my cc).
> > >
> > > Anyway perhaps I didn't phrase this well - my concern is whether this might
> > > impact performance in real world scenarios, such as one where a VMA is
> > > mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> > > the same file, in sequential order.
> > >
> > > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > > precisely what he's doing? I mean we'd not be talking here about mmap()
> > > behaviour with readahead otherwise.
> > >
> > > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > > specific VMA's boundaries, rather than going from one to another (excluding
> > > PROT_NONE guards obviously) and that's very possible, if that's what you
> > > mean.
> > >
> > > But otherwise, surely this is a thing? And might we therefore be imposing
> > > unnecessary cache misses?
> > >
> > > Which is why I suggest...
> > >
> > > >
> > > > So yes, sequential read of a memory mapping of a file fragmented into many
> > > > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > > > readers tend to use read(2) rather than mmap) but I could be wrong.
> > > >
> > > > > What about shared libraries with r/o parts and exec parts?
> > > > >
> > > > > I think we'd really need to do some pretty careful checking to ensure this
> > > > > wouldn't break some real world use cases esp. if we really do mostly
> > > > > readahead data from page cache.
> > > >
> > > > So I'm not sure if you are not conflating two things here because the above
> > > > sentence doesn't make sense to me :). Readahead is the mechanism that
> > > > brings data from underlying filesystem into the page cache. Fault-around is
> > > > the mechanism that maps into page tables pages present in the page cache
> > > > although they were not possibly requested by the page fault. By "do mostly
> > > > readahead data from page cache" are you speaking about fault-around? That
> > > > currently does not cross VMA boundaries anyway as far as I'm reading
> > > > do_fault_around()...
> > >
> > > ...that we test this and see how it behaves :) Which is literally all I
> > > am saying in the above. Ideally with representative workloads.
> > >
> > > I mean, I think this shouldn't be a controversial point right? Perhaps
> > > again I didn't communicate this well. But this is all I mean here.
> > >
> > > BTW, I understand the difference between readahead and fault-around, you can
> > > run git blame on do_fault_around() if you have doubts about that ;)
> > >
> > > And yes fault around is constrained to the VMA (and actually avoids
> > > crossing PTE boundaries).
> > >
> > > >
> > > > > > Regarding controlling readahead for various portions of the file - I'm
> > > > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > > > side for such a niche usecase (but maybe your numbers will show it isn't
> > > > > > such a niche as I think :)). I can imagine you could just completely
> > > > > > turn off kernel readahead for the file and do your special readahead from
> > > > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > > > new fanotify FAN_PREACCESS events.
> > > > >
> >
> > Something like this would be ideal for the use case where uncompressed
> > ELF files are mapped directly from zipped APKs without extracting
> > them. (I don't have any real-world numbers for this case atm). I also
> > don't know if the cache miss on the subsequent VMAs has significant
> > overhead in practice ... I'll try to collect some data for this.
> >
> > > > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > > > says, he is too!) I don't really see how we could avoid having to do that
> > > > > for this kind of case, but I may be missing something...
> > > >
> > > > I don't see why we would need to be increasing number of VMAs here at all.
> > > > With FAN_PREACCESS you get notification with file & offset when it's
> > > > accessed, you can issue readahead(2) calls based on that however you like.
> > > > Similarly you can ask for userfaults for the whole mapped range and handle
> > > > those. Now thinking more about this, this approach has the downside that
> > > > you cannot implement async readahead with it (once PTE is mapped to some
> > > > page it won't trigger notifications either with FAN_PREACCESS or with
> > > > UFFD). But with UFFD you could at least trigger readahead on minor faults.
> > >
> > > Yeah we're talking past each other on this, sorry I missed your point about
> > > fanotify there!
> > >
> > > uffd is probably not reasonably workable given overhead I would have
> > > thought.
> > >
> > > I am really unaware of how fanotify works so I mean cool if you can find a
> > > solution this way, awesome :)
> > >
> > > I'm just saying, if we need to somehow retain state about regions which
> > > should have adjusted readahead behaviour at a VMA level, I can't see how
> > > this could be done without VMA fragmentation and I'd rather we didn't.
> > >
> > > If we can avoid that great!
> >
> > Another possible way we can look at this: in the regressions shared
> > above caused by the ELF padding regions, we are able to make these
> > regions sparse (for *almost* all cases) -- so solving the shared-zero
> > page problem for file mappings would also eliminate much of this
> > overhead. So perhaps we should tackle this angle, if that's a more
> > tangible solution?
>
> To me it seems we are converging on this as at least part of the solution.
>
> >
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer:
> > inverting the IO read path operations [8]. Maybe this is a more
> > approachable solution, since there is precedent for the same in the
> > write path?
> >
> > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
> >
> > Thanks,
> > Kalesh
> > >
> > > >
> > > >                                                               Honza
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
>
> Overall I think we can conclude - this is a topic of interest to people for
> LSF :)

Yes I'd love to discuss this more with all the relevant folks in person :)

Thanks,
Kalesh



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 16:52       ` Lorenzo Stoakes
  2025-02-24 21:36         ` Kalesh Singh
@ 2025-02-25 16:21         ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2025-02-25 16:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, Kalesh Singh, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Mon 24-02-25 16:52:24, Lorenzo Stoakes wrote:
> On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > Hello!
> > > >
> > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > Problem Statement
> > > > > ===============
> > > > >
> > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > regions that are never accessed. Current mechanisms to disable
> > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > >
> > > > >
> > > > > Background
> > > > > =========
> > > > >
> > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > inadvertently populate the page cache with pages corresponding to
> > > > > regions that user-space processes are known never to access e.g ELF
> > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > particularly when a substantial quantity of such regions exists.
> > > > >
> > > > > Although the underlying file can be made sparse in these regions to
> > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > populating the page cache within these ranges. These pages, while
> > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > > PTEs if found present in the page cache.
> > > > >
> > > > > While the memory impact may be negligible for large files containing a
> > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > small mappings characterized by numerous holes. This scenario can
> > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > >
> > > > OK, I agree the behavior you describe exists. But do you have some
> > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > is this? If you can show that average Android phone has 10% of these
> > > > useless pages in memory then that's one thing and we should be looking for
> > > > some general solution. If it is more like 0.1%, then why bother?
> > > >
> > > > > Limitations of Existing Mechanisms
> > > > > ===========================
> > > > >
> > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > >
> > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > VMA, rather than specific sub-regions. [3]
> > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > population persists. [4]
> > > >
> > > > Somewhere else in the thread you complain about readahead extending past
> > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > through standard read(2) or write(2) syscall or through different memory
> > > > mapping, the limits won't apply but such combinations of access are not
> > > > that common anyway.
> > >
> > > Hm I'm not so sure, map elf files with different mprotect(), or mprotect()
> > > different portions of a file and suddenly you lose all the readahead for the
> > > rest even though you're reading sequentially?
> >
> > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > preread data underlying the next VMA so yes, you get a cache miss and have
> > to wait for a page to get loaded into cache when transitioning to the next
> > VMA but once you get there, you'll have readahead running at full speed
> > again.
> 
> I'm aware of how readahead works (I _believe_ there's currently a
> pre-release of a book with a very extensive section on readahead written by
> somebody :P).

Yeah, sorry. I didn't intend to educate you about basic readahead stuff but
I just felt what you wrote didn't quite make sense to me so I wanted to
spell out basic things to hopefully come to a common understanding :).

> Also been looking at it for file-backed guard regions recently, which is
> why I've been commenting here specifically as it's been on my mind lately,
> and also Kalesh's interest in this stems from a guard region 'scenario'
> (hence my cc).
> 
> Anyway perhaps I didn't phrase this well - my concern is whether this might
> impact performance in real world scenarios, such as one where a VMA is
> mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> the same file, in sequential order.
> 
> From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> precisely what he's doing? I mean we'd not be talking here about mmap()
> behaviour with readahead otherwise.
> 
> Granted, perhaps you'd only _ever_ be reading sequentially within a
> specific VMA's boundaries, rather than going from one to another (excluding
> PROT_NONE guards obviously) and that's very possible, if that's what you
> mean.
> 
> But otherwise, surely this is a thing? And might we therefore be imposing
> unnecessary cache misses?
> 
> Which is why I suggest...
> 
> >
> > So yes, sequential read of a memory mapping of a file fragmented into many
> > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > readers tend to use read(2) rather than mmap) but I could be wrong.
> >
> > > What about shared libraries with r/o parts and exec parts?
> > >
> > > I think we'd really need to do some pretty careful checking to ensure this
> > > wouldn't break some real world use cases esp. if we really do mostly
> > > readahead data from page cache.
> >
> > So I'm not sure if you are not conflating two things here because the above
> > sentence doesn't make sense to me :). Readahead is the mechanism that
> > brings data from underlying filesystem into the page cache. Fault-around is
> > the mechanism that maps into page tables pages present in the page cache
> > although they were not possibly requested by the page fault. By "do mostly
> > readahead data from page cache" are you speaking about fault-around? That
> > currently does not cross VMA boundaries anyway as far as I'm reading
> > do_fault_around()...
> 
> ...that we test this and see how it behaves :) Which is literally all I
> am saying in the above. Ideally with representative workloads.
> 
> I mean, I think this shouldn't be a controversial point right? Perhaps
> again I didn't communicate this well. But this is all I mean here.

Ok, I was reading more than what was there. I absolutely agree that this
needs quite a bit of testing to see whether we won't regress anything :).

> > > > Regarding controlling readahead for various portions of the file - I'm
> > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > side for such a niche usecase (but maybe your numbers will show it isn't
> > > > such a niche as I think :)). I can imagine you could just completely
> > > > turn off kernel readahead for the file and do your special readahead from
> > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > new fanotify FAN_PREACCESS events.
> > >
> > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > says, he is too!) I don't really see how we could avoid having to do that
> > > for this kind of case, but I may be missing something...
> >
> > I don't see why we would need to be increasing number of VMAs here at all.
> > With FAN_PREACCESS you get notification with file & offset when it's
> > accessed, you can issue readahead(2) calls based on that however you like.
> > Similarly you can ask for userfaults for the whole mapped range and handle
> > those. Now thinking more about this, this approach has the downside that
> > you cannot implement async readahead with it (once PTE is mapped to some
> > page it won't trigger notifications either with FAN_PREACCESS or with
> > UFFD). But with UFFD you could at least trigger readahead on minor faults.
> 
> Yeah we're talking past each other on this, sorry I missed your point about
> fanotify there!
> 
> uffd is probably not reasonably workable given overhead I would have
> thought.
> 
> I am really unaware of how fanotify works so I mean cool if you can find a
> solution this way, awesome :)

Well, the overhead with fanotify will also be non-negligible (roundtrip to
userspace on major page fault - but once you do readahead, pages will be in
the page cache and so no more events need to be handled for the area you've
preread). Again it would need to be measured for the particular usecase
whether that's workable or not.

> I'm just saying, if we need to somehow retain state about regions which
> should have adjusted readahead behaviour at a VMA level, I can't see how
> this could be done without VMA fragmentation and I'd rather we didn't.

I understand and that's why I've suggested something like fanotify where
the burden of keeping the state for file / virtual memory ranges is on the
userspace so not our problem anymore (and userspace can do much more clever
and specialized things than the kernel ;).
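
As an illustrative sketch of that division of labour (the notification
source -- fanotify or uffd -- is stubbed out here, and the range
bookkeeping is a plain list): userspace keeps the per-range policy and
asks the kernel to prefetch via posix_fadvise(POSIX_FADV_WILLNEED) only
where it actually wants readahead.

```python
import os
import tempfile

def on_access(fd, offset, skip_ranges, window=128 * 1024):
    # Userspace policy: suppress prefetch inside known-useless ranges
    # (e.g. ELF padding) and kick off asynchronous readahead elsewhere.
    for start, end in skip_ranges:
        if start <= offset < end:
            return False
    os.posix_fadvise(fd, offset, window, os.POSIX_FADV_WILLNEED)
    return True

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 20)
padding = [(64 * 1024, 80 * 1024)]        # hypothetical padding range

r1 = on_access(fd, 0, padding)            # prefetch issued
r2 = on_access(fd, 70 * 1024, padding)    # inside padding: suppressed
print(r1, r2)                             # -> True False
os.close(fd)
os.remove(path)
```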

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 21:36         ` Kalesh Singh
                             ` (2 preceding siblings ...)
  2025-02-25  5:44           ` Lorenzo Stoakes
@ 2025-02-25 16:36           ` Jan Kara
  2025-02-26  0:49             ` Kalesh Singh
  3 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2025-02-25 16:36 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: Lorenzo Stoakes, Jan Kara, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Mon 24-02-25 13:36:50, Kalesh Singh wrote:
> On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > > is this? If you can show that average Android phone has 10% of these
> > > > > useless pages in memory then that's one thing and we should be looking for
> > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > >
> 
> Once I revert a workaround that we currently have to avoid
> fault-around for these regions (we don't have an out-of-tree solution
> to prevent the page cache population), our CI, which checks memory
> usage after performing some common app user-journeys, reports
> regressions as shown in the snippet below. Note that the increases
> here are only for the populated PTEs (bounded by the VMA), so the
> actual pollution is theoretically larger.
> 
> Metric: perfetto_media.extractor#file-rss-avg
> Increased by 7.495 MB (32.7%)
> 
> Metric: perfetto_/system/bin/audioserver#file-rss-avg
> Increased by 6.262 MB (29.8%)
> 
> Metric: perfetto_/system/bin/mediaserver#file-rss-max
> Increased by 8.325 MB (28.0%)
> 
> Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> Increased by 8.198 MB (28.4%)
> 
> Metric: perfetto_media.extractor#file-rss-max
> Increased by 7.95 MB (33.6%)
> 
> Metric: perfetto_/system/bin/incidentd#file-rss-avg
> Increased by 0.896 MB (20.4%)
> 
> Metric: perfetto_/system/bin/audioserver#file-rss-max
> Increased by 6.883 MB (31.9%)
> 
> Metric: perfetto_media.swcodec#file-rss-max
> Increased by 7.236 MB (34.9%)
> 
> Metric: perfetto_/system/bin/incidentd#file-rss-max
> Increased by 1.003 MB (22.7%)
> 
> Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> Increased by 6.946 MB (34.2%)
> 
> Metric: perfetto_/system/bin/cameraserver#file-rss-max
> Increased by 7.205 MB (33.8%)
> 
> Metric: perfetto_com.android.nfc#file-rss-max
> Increased by 8.525 MB (9.8%)
> 
> Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> Increased by 3.715 MB (3.6%)
> 
> Metric: perfetto_media.swcodec#file-rss-avg
> Increased by 5.096 MB (27.1%)
> 
> [...]
> 
> The issue is widespread across processes because in order to support
> larger page sizes Android has a requirement that the ELF segments are
> at least 16KB aligned, which leads to the padding regions (never
> accessed).

Thanks for the numbers! It's much more than I'd expect. So you apparently
have a lot of relatively small segments?

> Another possible way we can look at this: in the regressions shared
> above caused by the ELF padding regions, we are able to make these
> regions sparse (for *almost* all cases) -- so solving the shared-zero
> page problem for file mappings would also eliminate much of this
> overhead. So perhaps we should tackle this angle, if that's a more
> tangible solution?
> 
> From the previous discussions that Matthew shared [7], it seems like
> Dave proposed an alternative to moving the extents to the VFS layer:
> inverting the IO read path operations [8]. Maybe this is a more
> approachable solution, since there is precedent for the same in the
> write path?

Yeah, so I certainly wouldn't be opposed to this. What Dave suggests makes
a lot of sense. In principle we did something similar for DAX. But it won't be
a trivial change so details matter...

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-25 16:36           ` Jan Kara
@ 2025-02-26  0:49             ` Kalesh Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-02-26  0:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Lorenzo Stoakes, lsf-pc, open list:MEMORY MANAGEMENT,
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Liam R. Howlett, Juan Yescas, android-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Tue, Feb 25, 2025 at 8:36 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 24-02-25 13:36:50, Kalesh Singh wrote:
> > On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> > > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > > > is this? If you can show that average Android phone has 10% of these
> > > > > useless pages in memory then that's one thing and we should be looking for
> > > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > > >
> >
> > Once I revert a workaround that we currently have to avoid
> > fault-around for these regions (we don't have an out-of-tree solution
> > to prevent the page cache population), our CI, which checks memory
> > usage after performing some common app user-journeys, reports
> > regressions as shown in the snippet below. Note that the increases
> > here are only for the populated PTEs (bounded by the VMA), so the
> > actual pollution is theoretically larger.
> >
> > Metric: perfetto_media.extractor#file-rss-avg
> > Increased by 7.495 MB (32.7%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-avg
> > Increased by 6.262 MB (29.8%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-max
> > Increased by 8.325 MB (28.0%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> > Increased by 8.198 MB (28.4%)
> >
> > Metric: perfetto_media.extractor#file-rss-max
> > Increased by 7.95 MB (33.6%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-avg
> > Increased by 0.896 MB (20.4%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-max
> > Increased by 6.883 MB (31.9%)
> >
> > Metric: perfetto_media.swcodec#file-rss-max
> > Increased by 7.236 MB (34.9%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-max
> > Increased by 1.003 MB (22.7%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> > Increased by 6.946 MB (34.2%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-max
> > Increased by 7.205 MB (33.8%)
> >
> > Metric: perfetto_com.android.nfc#file-rss-max
> > Increased by 8.525 MB (9.8%)
> >
> > Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> > Increased by 3.715 MB (3.6%)
> >
> > Metric: perfetto_media.swcodec#file-rss-avg
> > Increased by 5.096 MB (27.1%)
> >
> > [...]
> >
> > The issue is widespread across processes because, in order to support
> > larger page sizes, Android has a requirement that the ELF segments are
> > at least 16KB aligned, which leads to the padding regions (never
> > accessed).
>
> Thanks for the numbers! It's much more than I'd expect. So you apparently
> have a lot of relatively small segments?

Hi Jan,

Yeah, you are right, the segments can be relatively small.

I took one app on my device as an example:

adb shell 'cat /proc/$(pidof com.google.android.youtube)/maps' | grep
'.so$' | tee youtube_so_segments.txt

cat youtube_so_segments.txt | ./total_mapped_size.sh
Total mapping length: 147980288 bytes

cat youtube_so_segments.txt | wc -l
1148

147980288/1148/1024 = 125.88 KB

Let's say very roughly on average it's 128KB per segment; the padding
region can be anywhere from 0 to 60KB of that.
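
For reference, the total_mapped_size.sh helper just sums the address-range
lengths from the maps lines. Its exact contents aren't shown here, so the
following is a rough, illustrative equivalent (names and sample lines are
made up for the sketch):

```python
import re

def total_mapped_size(lines):
    """Sum the lengths of the address ranges in /proc/<pid>/maps lines."""
    total = 0
    for line in lines:
        # Each maps line starts with "start-end" in hex, e.g.
        # "7f0000000000-7f0000004000 r-xp 00000000 fe:00 1 /lib64/libfoo.so"
        m = re.match(r"([0-9a-f]+)-([0-9a-f]+)\s", line)
        if m:
            total += int(m.group(2), 16) - int(m.group(1), 16)
    return total

maps = [
    "7f0000000000-7f0000004000 r-xp 00000000 fe:00 1 /lib64/libfoo.so",
    "7f0000004000-7f0000008000 r--p 00004000 fe:00 1 /lib64/libfoo.so",
]
print("Total mapping length:", total_mapped_size(maps), "bytes")
```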

--Kalesh

>
> > Another possible way we can look at this: for the regressions shared
> > above, caused by the ELF padding regions, we are able to make these
> > regions sparse (in *almost* all cases) -- solving the shared-zero-page
> > problem for file mappings would also eliminate much of this overhead.
> > So perhaps we should tackle this angle instead, if that's a more
> > tractable solution?
> >
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer to
> > invert the IO read path operations [8]. Maybe this is a more
> > approachable solution since there is precedent for the same in the
> > write path?
>
> Yeah, so I certainly wouldn't be opposed to this. What Dave suggests makes
> a lot of sense. In principle we did something similar for DAX. But it won't be
> a trivial change so details matter...
>
>                                                                         Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-24 23:56           ` Dave Chinner
  2025-02-25  6:45             ` Kalesh Singh
@ 2025-02-27 22:12             ` Matthew Wilcox
  2025-02-28  1:12               ` Dave Chinner
  2025-02-28  9:07               ` David Hildenbrand
  1 sibling, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2025-02-27 22:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kalesh Singh, Lorenzo Stoakes, Jan Kara, lsf-pc,
	open list:MEMORY MANAGEMENT, linux-fsdevel, Suren Baghdasaryan,
	David Hildenbrand, Liam R. Howlett, Juan Yescas, android-mm,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer to
> > invert the IO read path operations [8]. Maybe this is a more
> > approachable solution since there is precedent for the same in the
> > write path?
> > 
> > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
> 
> Yes, if we are going to optimise away redundant zeros being stored
> in the page cache over holes, we need to know where the holes in the
> file are before the page cache is populated.

Well, you shot that down when I started trying to flesh it out:
https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@dread.disaster.area/

> As for efficient hole tracking in the mapping tree, I suspect that
> we should be looking at using exceptional entries in the mapping
> tree for holes, not inserting multiple references to the zero folio.
> i.e. the important information for data storage optimisation is that
> the region covers a hole, not that it contains zeros.

The xarray is very much optimised for storing power-of-two sized &
aligned objects.  It makes no sense to try to track extents using the
mapping tree.  Now, if we abandon the radix tree for the maple tree, we
could talk about storing zero extents in the same data structure.
But that's a big change with potentially significant downsides.
It's something I want to play with, but I'm a little busy right now.

> For buffered reads, all that is required when such an exceptional
> entry is returned is a memset of the user buffer. For buffered
> writes, we simply treat it like a normal folio allocating write and
> replace the exceptional entry with the allocated (and zeroed) folio.

... and unmap the zero page from any mappings.

> For read page faults, the zero page gets mapped (and maybe
> accounted) via the vma rather than the mapping tree entry. For write
> faults, a folio gets allocated and the exception entry replaced
> before we call into ->page_mkwrite().
> 
> Invalidation simply removes the exceptional entries.

... and unmap the zero page from any mappings.
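
The scheme being discussed can be modelled in a few lines of userspace
pseudo-Python (purely illustrative -- none of these names are actual
kernel API): the mapping tree stores an exceptional "hole" entry for
known-zero ranges instead of zero folios; buffered reads of a hole just
zero-fill the user buffer, and buffered writes replace the hole entry
with an allocated, zeroed folio (the point where the zero page would
also need to be unmapped from any mappings):

```python
PAGE = 4096
HOLE = object()  # stands in for the exceptional "this is a hole" entry

class MappingTree:
    def __init__(self):
        self.entries = {}  # page index -> HOLE marker or folio bytes

    def buffered_read(self, index):
        entry = self.entries.get(index, HOLE)
        if entry is HOLE:
            return bytes(PAGE)        # just a memset of the user buffer
        return bytes(entry)

    def buffered_write(self, index, data):
        entry = self.entries.get(index)
        if entry is None or entry is HOLE:
            entry = bytearray(PAGE)   # allocate + zero a real folio
            # ...the kernel would also unmap the zero page from any
            # mappings here (unmap_mapping_range())
        folio = bytearray(entry)
        folio[:len(data)] = data
        self.entries[index] = folio

tree = MappingTree()
tree.entries[0] = HOLE                        # filesystem marked a hole
assert tree.buffered_read(0) == bytes(PAGE)   # hole reads as zeros
tree.buffered_write(0, b"hello")
assert tree.buffered_read(0)[:5] == b"hello"  # hole replaced by a folio
```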



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-27 22:12             ` Matthew Wilcox
@ 2025-02-28  1:12               ` Dave Chinner
  2025-02-28  9:07               ` David Hildenbrand
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2025-02-28  1:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kalesh Singh, Lorenzo Stoakes, Jan Kara, lsf-pc,
	open list:MEMORY MANAGEMENT, linux-fsdevel, Suren Baghdasaryan,
	David Hildenbrand, Liam R. Howlett, Juan Yescas, android-mm,
	Vlastimil Babka, Michal Hocko, Cc: Android Kernel

On Thu, Feb 27, 2025 at 10:12:50PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
> > > From the previous discussions that Matthew shared [7], it seems like
> > > Dave proposed an alternative to moving the extents to the VFS layer to
> > > invert the IO read path operations [8]. Maybe this is a more
> > > approachable solution since there is precedent for the same in the
> > > write path?
> > > 
> > > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> > > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
> > 
> > Yes, if we are going to optimise away redundant zeros being stored
> > in the page cache over holes, we need to know where the holes in the
> > file are before the page cache is populated.
> 
> Well, you shot that down when I started trying to flesh it out:
> https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@dread.disaster.area/

No, I shot down the idea of having the page cache maintain a generic
cache of file offset to LBA address mappings outside the filesystem.

Having the filesystem insert a special 'this is a hole' entry into
the mapping tree instead of allocating and inserting a page full of
zeroes is not an extent cache - it's just a different way of
representing a data range that is known to always contain zeroes.

> > As for efficient hole tracking in the mapping tree, I suspect that
> > we should be looking at using exceptional entries in the mapping
> > tree for holes, not inserting multiple references to the zero folio.
> > i.e. the important information for data storage optimisation is that
> > the region covers a hole, not that it contains zeros.
> 
> The xarray is very much optimised for storing power-of-two sized &
> aligned objects.  It makes no sense to try to track extents using the
> mapping tree.

Certainly. I'm not suggesting that we do this at all, and ....

> Now, if we abandon the radix tree for the maple tree, we
> could talk about storing zero extents in the same data structure.
> But that's a big change with potentially significant downsides.
> It's something I want to play with, but I'm a little busy right now.

.... I still do not want the page cache to try to maintain a block
mapping/extent cache in addition to what the filesystem must
already maintain for the reasons I have previously given.

> > For buffered reads, all that is required when such an exceptional
> > entry is returned is a memset of the user buffer. For buffered
> > writes, we simply treat it like a normal folio allocating write and
> > replace the exceptional entry with the allocated (and zeroed) folio.
> 
> ... and unmap the zero page from any mappings.

Sure. That's just a call to unmap_mapping_range(), yes?

> > For read page faults, the zero page gets mapped (and maybe
> > accounted) via the vma rather than the mapping tree entry. For write
> > faults, a folio gets allocated and the exception entry replaced
> > before we call into ->page_mkwrite().
> > 
> > Invalidation simply removes the exceptional entries.
> 
> ... and unmap the zero page from any mappings.

Invalidation already calls unmap_mapping_range(), so this should
already be handled, right?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-27 22:12             ` Matthew Wilcox
  2025-02-28  1:12               ` Dave Chinner
@ 2025-02-28  9:07               ` David Hildenbrand
  2025-04-02  0:13                 ` Kalesh Singh
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-28  9:07 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: Kalesh Singh, Lorenzo Stoakes, Jan Kara, lsf-pc,
	open list:MEMORY MANAGEMENT, linux-fsdevel, Suren Baghdasaryan,
	Liam R. Howlett, Juan Yescas, android-mm, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel

On 27.02.25 23:12, Matthew Wilcox wrote:
> On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
>>>  From the previous discussions that Matthew shared [7], it seems like
>>> Dave proposed an alternative to moving the extents to the VFS layer to
>>> invert the IO read path operations [8]. Maybe this is a more
>>> approachable solution since there is precedent for the same in the
>>> write path?
>>>
>>> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
>>> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
>>
>> Yes, if we are going to optimise away redundant zeros being stored
>> in the page cache over holes, we need to know where the holes in the
>> file are before the page cache is populated.
> 
> Well, you shot that down when I started trying to flesh it out:
> https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@dread.disaster.area/
> 
>> As for efficient hole tracking in the mapping tree, I suspect that
>> we should be looking at using exceptional entries in the mapping
>> tree for holes, not inserting multiple references to the zero folio.
>> i.e. the important information for data storage optimisation is that
>> the region covers a hole, not that it contains zeros.
> 
> The xarray is very much optimised for storing power-of-two sized &
> aligned objects.  It makes no sense to try to track extents using the
> mapping tree.  Now, if we abandon the radix tree for the maple tree, we
> could talk about storing zero extents in the same data structure.
> But that's a big change with potentially significant downsides.
> It's something I want to play with, but I'm a little busy right now.
> 
>> For buffered reads, all that is required when such an exceptional
>> entry is returned is a memset of the user buffer. For buffered
>> writes, we simply treat it like a normal folio allocating write and
>> replace the exceptional entry with the allocated (and zeroed) folio.
> 
> ... and unmap the zero page from any mappings.
> 
>> For read page faults, the zero page gets mapped (and maybe
>> accounted) via the vma rather than the mapping tree entry. For write
>> faults, a folio gets allocated and the exception entry replaced
>> before we call into ->page_mkwrite().
>>
>> Invalidation simply removes the exceptional entries.
> 
> ... and unmap the zero page from any mappings.
> 

I'll add one detail for future reference; not sure about the priority 
this should have, but it's one of these nasty corner cases that are not
obvious to spot when using the shared zeropage in MAP_SHARED mappings:

Currently, only FS-DAX makes use of the shared zeropage in "ordinary 
MAP_SHARED" mappings. It doesn't use it for "holes" but for "logically 
zero" pages, to avoid allocating disk blocks (-> translating to actual 
DAX memory) on read-only access.

There is one issue between gup(FOLL_LONGTERM | FOLL_PIN) and the shared 
zeropage in MAP_SHARED mappings. It so far does not apply to fsdax,
because ... we don't support FOLL_LONGTERM for fsdax at all.

I spelled out part of the issue in fce831c92092 ("mm/memory: cleanly 
support zeropage in vm_insert_page*(), vm_map_pages*() and 
vmf_insert_mixed()").

In general, the problem is that gup(FOLL_LONGTERM | FOLL_PIN) will have 
to decide if it is okay to longterm-pin the shared zeropage in a 
MAP_SHARED mapping (which might just be fine with a R/O file in some 
cases?), and if not, it would have to trigger FAULT_FLAG_UNSHARE similar 
to how we break COW in MAP_PRIVATE mappings (shared zeropage -> 
anonymous folio).

If gup(FOLL_LONGTERM | FOLL_PIN) would just always longterm-pin the 
shared zeropage, and somebody else would end up triggering replacement 
of the shared zeropage in the pagecache (e.g., write() to the file 
offset, write access to the VMA that triggers a write fault etc.), you'd 
get a disconnect between what the GUP user sees and what the pagecache 
actually contains.
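
The disconnect can be modelled in a few lines of userspace pseudo-Python
(purely illustrative, no kernel API): a longterm pin holds a reference to
the shared zeropage, and when something later replaces the zeropage in the
pagecache with a real folio, the pinned reference is not updated and keeps
seeing zeros:

```python
class Page:
    def __init__(self, data):
        self.data = data

ZERO_PAGE = Page(b"\x00" * 4096)   # shared zeropage, never written to

pagecache = {0: ZERO_PAGE}         # file range currently backed by it

pinned = pagecache[0]              # gup(FOLL_LONGTERM | FOLL_PIN) result

# write() to that file offset: the zeropage must be replaced by an
# allocated folio -- but the GUP user still holds the old page.
pagecache[0] = Page(b"data" + b"\x00" * 4092)

assert pagecache[0].data[:4] == b"data"        # pagecache has the data
assert pinned.data[:4] == b"\x00\x00\x00\x00"  # pinned view: still zeros
```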

The file system fault logic will have to be taught about 
FAULT_FLAG_UNSHARE and handle it accordingly (e.g., fill a file
hole, allocate disk space, allocate an actual folio ...).

Things like memfd_pin_folios() might require similar care -- that one in 
particular should likely never return the shared zeropage.

Likely gup(FOLL_LONGTERM | FOLL_PIN) users like RDMA or VFIO will be 
able to trigger it.


Not using the shared zeropage but instead some "hole" PTE marker could 
avoid this problem. That would, of course, not allow reading via the shared
zeropage there, but maybe that's not strictly required?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
  2025-02-28  9:07               ` David Hildenbrand
@ 2025-04-02  0:13                 ` Kalesh Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Kalesh Singh @ 2025-04-02  0:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Dave Chinner, Lorenzo Stoakes, Jan Kara, lsf-pc,
	open list:MEMORY MANAGEMENT, linux-fsdevel, Suren Baghdasaryan,
	Liam R. Howlett, Juan Yescas, android-mm, Vlastimil Babka,
	Michal Hocko, Cc: Android Kernel

On Fri, Feb 28, 2025 at 1:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.02.25 23:12, Matthew Wilcox wrote:
> > On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
> >>>  From the previous discussions that Matthew shared [7], it seems like
> >>> Dave proposed an alternative to moving the extents to the VFS layer to
> >>> invert the IO read path operations [8]. Maybe this is a more
> >>> approachable solution since there is precedent for the same in the
> >>> write path?
> >>>
> >>> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> >>> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
> >>
> >> Yes, if we are going to optimise away redundant zeros being stored
> >> in the page cache over holes, we need to know where the holes in the
> >> file are before the page cache is populated.
> >
> > Well, you shot that down when I started trying to flesh it out:
> > https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@dread.disaster.area/
> >
> >> As for efficient hole tracking in the mapping tree, I suspect that
> >> we should be looking at using exceptional entries in the mapping
> >> tree for holes, not inserting multiple references to the zero folio.
> >> i.e. the important information for data storage optimisation is that
> >> the region covers a hole, not that it contains zeros.
> >
> > The xarray is very much optimised for storing power-of-two sized &
> > aligned objects.  It makes no sense to try to track extents using the
> > mapping tree.  Now, if we abandon the radix tree for the maple tree, we
> > could talk about storing zero extents in the same data structure.
> > But that's a big change with potentially significant downsides.
> > It's something I want to play with, but I'm a little busy right now.
> >
> >> For buffered reads, all that is required when such an exceptional
> >> entry is returned is a memset of the user buffer. For buffered
> >> writes, we simply treat it like a normal folio allocating write and
> >> replace the exceptional entry with the allocated (and zeroed) folio.
> >
> > ... and unmap the zero page from any mappings.
> >
> >> For read page faults, the zero page gets mapped (and maybe
> >> accounted) via the vma rather than the mapping tree entry. For write
> >> faults, a folio gets allocated and the exception entry replaced
> >> before we call into ->page_mkwrite().
> >>
> >> Invalidation simply removes the exceptional entries.
> >
> > ... and unmap the zero page from any mappings.
> >
>
> I'll add one detail for future reference; not sure about the priority
> this should have, but it's one of these nasty corner cases that are not
> obvious to spot when using the shared zeropage in MAP_SHARED mappings:
>
> Currently, only FS-DAX makes use of the shared zeropage in "ordinary
> MAP_SHARED" mappings. It doesn't use it for "holes" but for "logically
> zero" pages, to avoid allocating disk blocks (-> translating to actual
> DAX memory) on read-only access.
>
> There is one issue between gup(FOLL_LONGTERM | FOLL_PIN) and the shared
> zeropage in MAP_SHARED mappings. It so far does not apply to fsdax,
> because ... we don't support FOLL_LONGTERM for fsdax at all.
>
> I spelled out part of the issue in fce831c92092 ("mm/memory: cleanly
> support zeropage in vm_insert_page*(), vm_map_pages*() and
> vmf_insert_mixed()").
>
> In general, the problem is that gup(FOLL_LONGTERM | FOLL_PIN) will have
> to decide if it is okay to longterm-pin the shared zeropage in a
> MAP_SHARED mapping (which might just be fine with a R/O file in some
> cases?), and if not, it would have to trigger FAULT_FLAG_UNSHARE similar
> to how we break COW in MAP_PRIVATE mappings (shared zeropage ->
> anonymous folio).
>
> If gup(FOLL_LONGTERM | FOLL_PIN) would just always longterm-pin the
> shared zeropage, and somebody else would end up triggering replacement
> of the shared zeropage in the pagecache (e.g., write() to the file
> offset, write access to the VMA that triggers a write fault etc.), you'd
> get a disconnect between what the GUP user sees and what the pagecache
> actually contains.
>
> The file system fault logic will have to be taught about
> FAULT_FLAG_UNSHARE and handle it accordingly (e.g., fill a file
> hole, allocate disk space, allocate an actual folio ...).
>
> Things like memfd_pin_folios() might require similar care -- that one in
> particular should likely never return the shared zeropage.
>
> Likely gup(FOLL_LONGTERM | FOLL_PIN) users like RDMA or VFIO will be
> able to trigger it.
>
>
> Not using the shared zeropage but instead some "hole" PTE marker could
> avoid this problem. That would, of course, not allow reading via the shared
> zeropage there, but maybe that's not strictly required?
>

Link to slides for the talk:
https://drive.google.com/file/d/1MOJu5FZurV4XaCLrQhM9S5ubN7H_jEA8/view?usp=drive_link

Thanks,
Kalesh

> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-04-02  0:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-21 21:13 [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior Kalesh Singh
2025-02-22 18:03 ` Kent Overstreet
2025-02-23  5:36   ` Kalesh Singh
2025-02-23  5:42     ` Kalesh Singh
2025-02-23  9:30     ` Lorenzo Stoakes
2025-02-23 12:24       ` Matthew Wilcox
2025-02-23  5:34 ` Ritesh Harjani
2025-02-23  6:50   ` Kalesh Singh
2025-02-24 12:56   ` David Sterba
2025-02-24 14:14 ` [Lsf-pc] " Jan Kara
2025-02-24 14:21   ` Lorenzo Stoakes
2025-02-24 16:31     ` Jan Kara
2025-02-24 16:52       ` Lorenzo Stoakes
2025-02-24 21:36         ` Kalesh Singh
2025-02-24 21:55           ` Kalesh Singh
2025-02-24 23:56           ` Dave Chinner
2025-02-25  6:45             ` Kalesh Singh
2025-02-27 22:12             ` Matthew Wilcox
2025-02-28  1:12               ` Dave Chinner
2025-02-28  9:07               ` David Hildenbrand
2025-04-02  0:13                 ` Kalesh Singh
2025-02-25  5:44           ` Lorenzo Stoakes
2025-02-25  6:59             ` Kalesh Singh
2025-02-25 16:36           ` Jan Kara
2025-02-26  0:49             ` Kalesh Singh
2025-02-25 16:21         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).