linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/10] mm: Fix various readahead quirks
@ 2024-06-25 10:18 Jan Kara
  2024-06-25 10:18 ` [PATCH 01/10] readahead: Make sure sync readahead reads needed page Jan Kara
                   ` (11 more replies)
  0 siblings, 12 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

Hello!

When we were internally testing the performance of recent kernels, we noticed
quite variable readahead performance arising from various quirks in the
readahead code. So I went on a cleaning spree there. This is the batch of
patches resulting from that. A quick test in my test VM with the following fio
job file:

[global]
direct=0
ioengine=sync
invalidate=1
blocksize=4k
size=10g
readwrite=read

[reader]
numjobs=1

shows that this patch series improves the throughput from a variable
310-340 MB/s to a rather stable 350 MB/s. As a side effect these cleanups also
address the issue noticed by Bruz Zhang [1].
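
(The job above performs a single-threaded buffered sequential read of a 10g
file in 4k blocks, i.e. exactly the access pattern readahead is meant to
optimize; it can be run simply as "fio <jobfile>".)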

								Honza

[1] https://lore.kernel.org/all/20240618114941.5935-1-zhangpengpeng0808@gmail.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 01/10] readahead: Make sure sync readahead reads needed page
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 02/10] filemap: Fix page_cache_next_miss() when no hole found Jan Kara
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

page_cache_sync_ra() is called when a folio we want to read is not in
the page cache. It is expected to create the folio (and perhaps the
following folios as well) and submit reads for them unless some error
happens. However if index == ra->start + ra->size,
ondemand_readahead() will treat the call as another async readahead hit.
Thus ra->start will be advanced and we create pages and queue reads
from ra->start + ra->size onwards. Consequently the page at 'index' is
not created and filemap_get_pages() always has to go through the
filemap_create_folio() path.
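
For context, the buffered read path in filemap_get_pages() does roughly the
following (a heavily simplified sketch, not the exact mainline code):

	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	if (!folio_batch_count(fbatch)) {
		/* Nothing cached: sync readahead should create and read
		 * the folio at 'index' (plus some readahead beyond it). */
		page_cache_sync_readahead(mapping, ra, file, index,
					  last_index - index);
		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	}
	if (!folio_batch_count(fbatch))
		/* Slow path: allocate and read the single folio ourselves. */
		err = filemap_create_folio(file, mapping, index, fbatch);

When the situation described above occurs, sync readahead queues reads
starting past 'index', so the second batch lookup stays empty and we always
end up in the filemap_create_folio() slow path.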

This behavior has particularly unfortunate consequences
when we have two IO threads sequentially reading from a shared file (as
is the case when NFS serves sequential reads). In that case what can
happen is:

suppose ra->size == ra->async_size == 128, ra->start = 512

T1					T2
reads 128 pages at index 512
  - hits async readahead mark
    filemap_readahead()
      ondemand_readahead()
        if (index == expected ...)
	  ra->start = 512 + 128 = 640
          ra->size = 128
	  ra->async_size = 128
	page_cache_ra_order()
	  blocks in ra_alloc_folio()
					reads 128 pages at index 640
					  - no page found
					  page_cache_sync_readahead()
					    ondemand_readahead()
					      if (index == expected ...)
						ra->start = 640 + 128 = 768
						ra->size = 128
						ra->async_size = 128
					    page_cache_ra_order()
					      submits reads from 768
					  - still no page found at index 640
					    filemap_create_folio()
					  - goes on to index 641
					  page_cache_sync_readahead()
					    ondemand_readahead()
					      - finds ra is confused,
					        trims it to a small size
  	  finds pages were already inserted

And as a result read performance suffers.

Fix the problem by triggering the async readahead case in
ondemand_readahead() only if we are calling the function because we hit
the readahead marker. In any other case we need to read the folio at
'index' and thus we cannot really use the current ra state.

Note that the above situation could be viewed as a special case of
file->f_ra state corruption. In fact two threads reading through the
shared file can also seemingly corrupt file->f_ra in interesting ways
due to concurrent access. I have never seen that in practice and the fix
is going to be much more complex, so for now at least fix this practical
problem while we ponder the theoretically correct solution.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c1b23989d9ca..af0fbd302a38 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -580,7 +580,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 */
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
-	if (index == expected || index == (ra->start + ra->size)) {
+	if (folio && index == expected) {
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 02/10] filemap: Fix page_cache_next_miss() when no hole found
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
  2024-06-25 10:18 ` [PATCH 01/10] readahead: Make sure sync readahead reads needed page Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 03/10] readahead: Properly shorten readahead when falling back to do_page_cache_ra() Jan Kara
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

page_cache_next_miss() should return a value outside of the specified
range when no hole is found. However currently it returns the last
index *in* the specified range, confusing ondemand_readahead() into
thinking there's a hole in the searched range and upsetting the
readahead logic.
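
To illustrate the impact, the interleaved-read handling in
ondemand_readahead() does roughly:

	start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
	if (!start || start - index > max_pages)
		return;
	ra->start = start;
	ra->size = start - index;	/* old async_size */

When the whole scanned range is already cached there is no hole to grow the
readahead window towards and the caller is supposed to bail out. If
page_cache_next_miss() instead returns the last index it scanned, the caller
treats that index as a hole and sizes a new readahead window from it.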

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 876cc64aadd7..015efc261468 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1752,12 +1752,12 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
 	while (max_scan--) {
 		void *entry = xas_next(&xas);
 		if (!entry || xa_is_value(entry))
-			break;
+			return xas.xa_index;
 		if (xas.xa_index == 0)
-			break;
+			return 0;
 	}
 
-	return xas.xa_index;
+	return index + max_scan;
 }
 EXPORT_SYMBOL(page_cache_next_miss);
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 03/10] readahead: Properly shorten readahead when falling back to do_page_cache_ra()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
  2024-06-25 10:18 ` [PATCH 01/10] readahead: Make sure sync readahead reads needed page Jan Kara
  2024-06-25 10:18 ` [PATCH 02/10] filemap: Fix page_cache_next_miss() when no hole found Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 04/10] readahead: Drop pointless index from force_page_cache_ra() Jan Kara
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

When we succeed in creating some folios in page_cache_ra_order() but
then need to fall back to single-page folios, we don't shorten the
amount to read passed to do_page_cache_ra() by the amount we've already
read. This then results in reading more than intended and also in
placing another readahead mark in the middle of the readahead window,
which confuses the readahead code. Fix the problem by properly reducing
the number of pages to read.
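
As a worked example (the numbers are illustrative only): with ra->size == 32
and the large folio loop having already created folios covering 24 pages
before failing, index - start == 24, so the fallback should ask
do_page_cache_ra() for the remaining 32 - 24 = 8 pages instead of for
another full 32.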

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index af0fbd302a38..1c58e0463be1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -491,7 +491,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		struct file_ra_state *ra, unsigned int new_order)
 {
 	struct address_space *mapping = ractl->mapping;
-	pgoff_t index = readahead_index(ractl);
+	pgoff_t start = readahead_index(ractl);
+	pgoff_t index = start;
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
@@ -544,7 +545,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!err)
 		return;
 fallback:
-	do_page_cache_ra(ractl, ra->size, ra->async_size);
+	do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
 }
 
 /*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 04/10] readahead: Drop pointless index from force_page_cache_ra()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (2 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 03/10] readahead: Properly shorten readahead when falling back to do_page_cache_ra() Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 05/10] readahead: Drop index argument of page_cache_async_readahead() Jan Kara
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

The current readahead index is tracked in readahead_control and is
properly updated by page_cache_ra_unbounded() (by read_pages(), in
fact), so there's no need to track the index separately in
force_page_cache_ra().
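
For reference, the index advances as the readahead folios are consumed by
the filesystem; roughly (paraphrasing __readahead_folio() in
include/linux/pagemap.h):

	ractl->_nr_pages -= ractl->_batch_count;
	ractl->_index += ractl->_batch_count;

so by the time read_pages() returns, ractl->_index already points just past
the chunk that was submitted and the next do_page_cache_ra() iteration in
force_page_cache_ra() continues from the right place.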

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 1c58e0463be1..455edafebb07 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -313,7 +313,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	unsigned long max_pages, index;
+	unsigned long max_pages;
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -322,7 +322,6 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 * If the request exceeds the readahead window, allow the read to
 	 * be up to the optimal hardware IO size
 	 */
-	index = readahead_index(ractl);
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
 	while (nr_to_read) {
@@ -330,10 +329,8 @@ void force_page_cache_ra(struct readahead_control *ractl,
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
-		ractl->_index = index;
 		do_page_cache_ra(ractl, this_chunk, 0);
 
-		index += this_chunk;
 		nr_to_read -= this_chunk;
 	}
 }
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 05/10] readahead: Drop index argument of page_cache_async_readahead()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (3 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 04/10] readahead: Drop pointless index from force_page_cache_ra() Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 06/10] readahead: Drop dead code in page_cache_ra_order() Jan Kara
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

The index argument of page_cache_async_readahead() is just folio->index
so there's no point in passing it separately. Drop it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/btrfs/relocation.c   | 3 +--
 fs/btrfs/send.c         | 2 +-
 include/linux/pagemap.h | 7 +++----
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8b24bb5a0aa1..e7e8fce70ccc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2975,8 +2975,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
 
 	if (folio_test_readahead(folio))
 		page_cache_async_readahead(inode->i_mapping, ra, NULL,
-					   folio, index,
-					   last_index + 1 - index);
+					   folio, last_index + 1 - index);
 
 	if (!folio_test_uptodate(folio)) {
 		btrfs_read_folio(NULL, folio);
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 3dd4a48479a9..dbf462e67ad0 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5307,7 +5307,7 @@ static int put_file_data(struct send_ctx *sctx, u64 offset, u32 len)
 
 		if (folio_test_readahead(folio))
 			page_cache_async_readahead(mapping, &sctx->ra, NULL, folio,
-						   index, last_index + 1 - index);
+						   last_index + 1 - index);
 
 		if (!folio_test_uptodate(folio)) {
 			btrfs_read_folio(NULL, folio);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ee633712bba0..7ad4decd4347 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1307,8 +1307,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
  * @mapping: address_space which holds the pagecache and I/O vectors
  * @ra: file_ra_state which holds the readahead state
  * @file: Used by the filesystem for authentication.
- * @folio: The folio at @index which triggered the readahead call.
- * @index: Index of first page to be read.
+ * @folio: The folio which triggered the readahead call.
  * @req_count: Total number of pages being read by the caller.
  *
  * page_cache_async_readahead() should be called when a page is used which
@@ -1319,9 +1318,9 @@ void page_cache_sync_readahead(struct address_space *mapping,
 static inline
 void page_cache_async_readahead(struct address_space *mapping,
 		struct file_ra_state *ra, struct file *file,
-		struct folio *folio, pgoff_t index, unsigned long req_count)
+		struct folio *folio, unsigned long req_count)
 {
-	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
+	DEFINE_READAHEAD(ractl, file, ra, mapping, folio->index);
 	page_cache_async_ra(&ractl, folio, req_count);
 }
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 06/10] readahead: Drop dead code in page_cache_ra_order()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (4 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 05/10] readahead: Drop index argument of page_cache_async_readahead() Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 07/10] readahead: Drop dead code in ondemand_readahead() Jan Kara
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

page_cache_ra_order() scales the folio order down so that it fully fits
within the readahead window. Thus the code handling the case where we
walked past the readahead window is dead code. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 455edafebb07..9ea5125a0dce 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -525,11 +525,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		index += 1UL << order;
 	}
 
-	if (index > limit) {
-		ra->size += index - limit - 1;
-		ra->async_size += index - limit - 1;
-	}
-
 	read_pages(ractl);
 	filemap_invalidate_unlock_shared(mapping);
 	memalloc_nofs_restore(nofs);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 07/10] readahead: Drop dead code in ondemand_readahead()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (5 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 06/10] readahead: Drop dead code in page_cache_ra_order() Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 08/10] readahead: Disentangle async and sync readahead Jan Kara
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

ondemand_readahead() scales up the readahead window if the current read
would hit the readahead mark placed by itself. However the condition is
mostly dead code because:
a) In the case of async readahead we always increase ra->start, so
   ra->start == index is never true.
b) In the case of sync readahead we either go through
   try_context_readahead(), in which case ra->async_size == 1 < ra->size,
   or we go through initial_readahead where ra->async_size == ra->size
   iff ra->size == max_pages.

So the only practical effect is reducing the async_size for large
initial reads. Make the code more obvious.
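
As a worked example (sizes are illustrative): for an initial 1 MB read,
req_size == 256 pages with max_pages == 128, get_init_ra_size() clamps
ra->size to 128, so ra->size <= req_size. The old code first set
ra->async_size = ra->size = 128 and the removed block then trimmed it to
max_pages >> 1 == 64; the new code arrives at the same 64 directly via
ra->size >> 1.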

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 21 ++-------------------
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 9ea5125a0dce..d92a5e8d89c4 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -549,7 +549,6 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
 	struct file_ra_state *ra = ractl->ra;
 	unsigned long max_pages = ra->ra_pages;
-	unsigned long add_pages;
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t expected, prev_index;
 	unsigned int order = folio ? folio_order(folio) : 0;
@@ -638,26 +637,10 @@ static void ondemand_readahead(struct readahead_control *ractl,
 initial_readahead:
 	ra->start = index;
 	ra->size = get_init_ra_size(req_size, max_pages);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	ra->async_size = ra->size > req_size ? ra->size - req_size :
+					       ra->size >> 1;
 
 readit:
-	/*
-	 * Will this read hit the readahead marker made by itself?
-	 * If so, trigger the readahead marker hit now, and merge
-	 * the resulted next readahead window into the current one.
-	 * Take care of maximum IO pages as above.
-	 */
-	if (index == ra->start && ra->size == ra->async_size) {
-		add_pages = get_next_ra_size(ra, max_pages);
-		if (ra->size + add_pages <= max_pages) {
-			ra->async_size = add_pages;
-			ra->size += add_pages;
-		} else {
-			ra->size = max_pages;
-			ra->async_size = max_pages >> 1;
-		}
-	}
-
 	ractl->_index = ra->start;
 	page_cache_ra_order(ractl, ra, order);
 }
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 08/10] readahead: Disentangle async and sync readahead
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (6 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 07/10] readahead: Drop dead code in ondemand_readahead() Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:18 ` [PATCH 09/10] readahead: Fold try_context_readahead() into its single caller Jan Kara
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

Both async and sync readahead are handled by the ondemand_readahead()
function, however they don't actually have much in common. Move the
async-related parts into page_cache_async_ra() and the sync-related
parts into page_cache_sync_ra(). No functional changes.
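
After the split, page_cache_sync_ra() keeps the forced/disabled readahead
handling, the initial readahead and sequential cache miss cases, context
readahead and the small random read fallback, while page_cache_async_ra()
keeps the "expected index" window ramp-up and the interleaved read recovery
based on page_cache_next_miss(). The common clamping of the request to the
optimal hardware IO size moves into the new ractl_max_pages() helper.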

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 162 +++++++++++++++++++++++--------------------------
 1 file changed, 77 insertions(+), 85 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index d92a5e8d89c4..a44daa12ebd2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -540,18 +540,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
 }
 
-/*
- * A minimal readahead algorithm for trivial sequential/random reads.
- */
-static void ondemand_readahead(struct readahead_control *ractl,
-		struct folio *folio, unsigned long req_size)
+static unsigned long ractl_max_pages(struct readahead_control *ractl,
+		unsigned long req_size)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
-	struct file_ra_state *ra = ractl->ra;
-	unsigned long max_pages = ra->ra_pages;
-	pgoff_t index = readahead_index(ractl);
-	pgoff_t expected, prev_index;
-	unsigned int order = folio ? folio_order(folio) : 0;
+	unsigned long max_pages = ractl->ra->ra_pages;
 
 	/*
 	 * If the request exceeds the readahead window, allow the read to
@@ -559,55 +552,42 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 */
 	if (req_size > max_pages && bdi->io_pages > max_pages)
 		max_pages = min(req_size, bdi->io_pages);
+	return max_pages;
+}
 
-	/*
-	 * start of file
-	 */
-	if (!index)
-		goto initial_readahead;
-
-	/*
-	 * It's the expected callback index, assume sequential access.
-	 * Ramp up sizes, and push forward the readahead window.
-	 */
-	expected = round_down(ra->start + ra->size - ra->async_size,
-			1UL << order);
-	if (folio && index == expected) {
-		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
-		ra->async_size = ra->size;
-		goto readit;
-	}
+void page_cache_sync_ra(struct readahead_control *ractl,
+		unsigned long req_count)
+{
+	pgoff_t index = readahead_index(ractl);
+	bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
+	struct file_ra_state *ra = ractl->ra;
+	unsigned long max_pages;
+	pgoff_t prev_index;
 
 	/*
-	 * Hit a marked folio without valid readahead state.
-	 * E.g. interleaved reads.
-	 * Query the pagecache for async_size, which normally equals to
-	 * readahead size. Ramp it up and use it as the new readahead size.
+	 * Even if readahead is disabled, issue this request as readahead
+	 * as we'll need it to satisfy the requested range. The forced
+	 * readahead will do the right thing and limit the read to just the
+	 * requested range, which we'll set to 1 page for this case.
 	 */
-	if (folio) {
-		pgoff_t start;
-
-		rcu_read_lock();
-		start = page_cache_next_miss(ractl->mapping, index + 1,
-				max_pages);
-		rcu_read_unlock();
-
-		if (!start || start - index > max_pages)
+	if (!ra->ra_pages || blk_cgroup_congested()) {
+		if (!ractl->file)
 			return;
+		req_count = 1;
+		do_forced_ra = true;
+	}
 
-		ra->start = start;
-		ra->size = start - index;	/* old async_size */
-		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max_pages);
-		ra->async_size = ra->size;
-		goto readit;
+	/* be dumb */
+	if (do_forced_ra) {
+		force_page_cache_ra(ractl, req_count);
+		return;
 	}
 
+	max_pages = ractl_max_pages(ractl, req_count);
 	/*
-	 * oversize read
+	 * start of file or oversized read
 	 */
-	if (req_size > max_pages)
+	if (!index || req_count > max_pages)
 		goto initial_readahead;
 
 	/*
@@ -623,7 +603,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */
-	if (try_context_readahead(ractl->mapping, ra, index, req_size,
+	if (try_context_readahead(ractl->mapping, ra, index, req_count,
 			max_pages))
 		goto readit;
 
@@ -631,53 +611,31 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	do_page_cache_ra(ractl, req_size, 0);
+	do_page_cache_ra(ractl, req_count, 0);
 	return;
 
 initial_readahead:
 	ra->start = index;
-	ra->size = get_init_ra_size(req_size, max_pages);
-	ra->async_size = ra->size > req_size ? ra->size - req_size :
-					       ra->size >> 1;
-
+	ra->size = get_init_ra_size(req_count, max_pages);
+	ra->async_size = ra->size > req_count ? ra->size - req_count :
+						ra->size >> 1;
 readit:
 	ractl->_index = ra->start;
-	page_cache_ra_order(ractl, ra, order);
-}
-
-void page_cache_sync_ra(struct readahead_control *ractl,
-		unsigned long req_count)
-{
-	bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
-
-	/*
-	 * Even if readahead is disabled, issue this request as readahead
-	 * as we'll need it to satisfy the requested range. The forced
-	 * readahead will do the right thing and limit the read to just the
-	 * requested range, which we'll set to 1 page for this case.
-	 */
-	if (!ractl->ra->ra_pages || blk_cgroup_congested()) {
-		if (!ractl->file)
-			return;
-		req_count = 1;
-		do_forced_ra = true;
-	}
-
-	/* be dumb */
-	if (do_forced_ra) {
-		force_page_cache_ra(ractl, req_count);
-		return;
-	}
-
-	ondemand_readahead(ractl, NULL, req_count);
+	page_cache_ra_order(ractl, ra, 0);
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_ra);
 
 void page_cache_async_ra(struct readahead_control *ractl,
 		struct folio *folio, unsigned long req_count)
 {
+	unsigned long max_pages;
+	struct file_ra_state *ra = ractl->ra;
+	pgoff_t index = readahead_index(ractl);
+	pgoff_t expected, start;
+	unsigned int order = folio_order(folio);
+
 	/* no readahead */
-	if (!ractl->ra->ra_pages)
+	if (!ra->ra_pages)
 		return;
 
 	/*
@@ -691,7 +649,41 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	if (blk_cgroup_congested())
 		return;
 
-	ondemand_readahead(ractl, folio, req_count);
+	max_pages = ractl_max_pages(ractl, req_count);
+	/*
+	 * It's the expected callback index, assume sequential access.
+	 * Ramp up sizes, and push forward the readahead window.
+	 */
+	expected = round_down(ra->start + ra->size - ra->async_size,
+			1UL << order);
+	if (index == expected) {
+		ra->start += ra->size;
+		ra->size = get_next_ra_size(ra, max_pages);
+		ra->async_size = ra->size;
+		goto readit;
+	}
+
+	/*
+	 * Hit a marked folio without valid readahead state.
+	 * E.g. interleaved reads.
+	 * Query the pagecache for async_size, which normally equals to
+	 * readahead size. Ramp it up and use it as the new readahead size.
+	 */
+	rcu_read_lock();
+	start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
+	rcu_read_unlock();
+
+	if (!start || start - index > max_pages)
+		return;
+
+	ra->start = start;
+	ra->size = start - index;	/* old async_size */
+	ra->size += req_count;
+	ra->size = get_next_ra_size(ra, max_pages);
+	ra->async_size = ra->size;
+readit:
+	ractl->_index = ra->start;
+	page_cache_ra_order(ractl, ra, order);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_ra);
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 09/10] readahead: Fold try_context_readahead() into its single caller
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (7 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 08/10] readahead: Disentangle async and sync readahead Jan Kara
@ 2024-06-25 10:18 ` Jan Kara
  2024-06-25 10:19 ` [PATCH 10/10] readahead: Simplify gotos in page_cache_sync_ra() Jan Kara
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

try_context_readahead() has a single caller, page_cache_sync_ra(). Fold
the function there to make the ra state modifications more obvious. No
functional changes.
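
One detail worth spelling out: the old "size >= index" check in
try_context_readahead() (history reaching all the way back to the start of
the file) is expressed in the folded code as "miss == ULONG_MAX", because
page_cache_prev_miss() only wraps around to ULONG_MAX when every page back
to index 0 is cached. The two conditions cover the same case, so this is
still not a functional change.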

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 84 +++++++++++++-------------------------------------
 1 file changed, 22 insertions(+), 62 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index a44daa12ebd2..12c0d2215329 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -410,58 +410,6 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
  * it approaches max_readhead.
  */
 
-/*
- * Count contiguously cached pages from @index-1 to @index-@max,
- * this count is a conservative estimation of
- * 	- length of the sequential read sequence, or
- * 	- thrashing threshold in memory tight systems
- */
-static pgoff_t count_history_pages(struct address_space *mapping,
-				   pgoff_t index, unsigned long max)
-{
-	pgoff_t head;
-
-	rcu_read_lock();
-	head = page_cache_prev_miss(mapping, index - 1, max);
-	rcu_read_unlock();
-
-	return index - 1 - head;
-}
-
-/*
- * page cache context based readahead
- */
-static int try_context_readahead(struct address_space *mapping,
-				 struct file_ra_state *ra,
-				 pgoff_t index,
-				 unsigned long req_size,
-				 unsigned long max)
-{
-	pgoff_t size;
-
-	size = count_history_pages(mapping, index, max);
-
-	/*
-	 * not enough history pages:
-	 * it could be a random read
-	 */
-	if (size <= req_size)
-		return 0;
-
-	/*
-	 * starts from beginning of file:
-	 * it is a strong indication of long-run stream (or whole-file-read)
-	 */
-	if (size >= index)
-		size *= 2;
-
-	ra->start = index;
-	ra->size = min(size + req_size, max);
-	ra->async_size = 1;
-
-	return 1;
-}
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
@@ -561,8 +509,8 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	pgoff_t index = readahead_index(ractl);
 	bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
 	struct file_ra_state *ra = ractl->ra;
-	unsigned long max_pages;
-	pgoff_t prev_index;
+	unsigned long max_pages, contig_count;
+	pgoff_t prev_index, miss;
 
 	/*
 	 * Even if readahead is disabled, issue this request as readahead
@@ -603,16 +551,28 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */
-	if (try_context_readahead(ractl->mapping, ra, index, req_count,
-			max_pages))
-		goto readit;
-
+	rcu_read_lock();
+	miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages);
+	rcu_read_unlock();
+	contig_count = index - miss - 1;
 	/*
-	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
+	 * Standalone, small random read. Read as is, and do not pollute the
+	 * readahead state.
 	 */
-	do_page_cache_ra(ractl, req_count, 0);
-	return;
+	if (contig_count <= req_count) {
+		do_page_cache_ra(ractl, req_count, 0);
+		return;
+	}
+	/*
+	 * File cached from the beginning:
+	 * it is a strong indication of long-run stream (or whole-file-read)
+	 */
+	if (miss == ULONG_MAX)
+		contig_count *= 2;
+	ra->start = index;
+	ra->size = min(contig_count + req_count, max_pages);
+	ra->async_size = 1;
+	goto readit;
 
 initial_readahead:
 	ra->start = index;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/10] readahead: Simplify gotos in page_cache_sync_ra()
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (8 preceding siblings ...)
  2024-06-25 10:18 ` [PATCH 09/10] readahead: Fold try_context_readahead() into its single caller Jan Kara
@ 2024-06-25 10:19 ` Jan Kara
  2024-06-25 17:12 ` [PATCH 0/10] mm: Fix various readahead quirks Josef Bacik
  2024-06-27  3:04 ` Zhang Peng
  11 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2024-06-25 10:19 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, Jan Kara

Unify all conditions for initial readahead to simplify goto logic in
page_cache_sync_ra(). No functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 12c0d2215329..d68d5ce657a7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -532,20 +532,19 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	}
 
 	max_pages = ractl_max_pages(ractl, req_count);
+	prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
 	/*
-	 * start of file or oversized read
-	 */
-	if (!index || req_count > max_pages)
-		goto initial_readahead;
-
-	/*
-	 * sequential cache miss
+	 * A start of file, oversized read, or sequential cache miss:
 	 * trivial case: (index - prev_index) == 1
 	 * unaligned reads: (index - prev_index) == 0
 	 */
-	prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
-	if (index - prev_index <= 1UL)
-		goto initial_readahead;
+	if (!index || req_count > max_pages || index - prev_index <= 1UL) {
+		ra->start = index;
+		ra->size = get_init_ra_size(req_count, max_pages);
+		ra->async_size = ra->size > req_count ? ra->size - req_count :
+							ra->size >> 1;
+		goto readit;
+	}
 
 	/*
 	 * Query the page cache and look for the traces(cached history pages)
@@ -572,13 +571,6 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	ra->start = index;
 	ra->size = min(contig_count + req_count, max_pages);
 	ra->async_size = 1;
-	goto readit;
-
-initial_readahead:
-	ra->start = index;
-	ra->size = get_init_ra_size(req_count, max_pages);
-	ra->async_size = ra->size > req_count ? ra->size - req_count :
-						ra->size >> 1;
 readit:
 	ractl->_index = ra->start;
 	page_cache_ra_order(ractl, ra, 0);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/10] mm: Fix various readahead quirks
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (9 preceding siblings ...)
  2024-06-25 10:19 ` [PATCH 10/10] readahead: Simplify gotos in page_cache_sync_ra() Jan Kara
@ 2024-06-25 17:12 ` Josef Bacik
  2024-06-27  3:04 ` Zhang Peng
  11 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2024-06-25 17:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, Andrew Morton, Matthew Wilcox, linux-fsdevel

On Tue, Jun 25, 2024 at 12:18:50PM +0200, Jan Kara wrote:
> Hello!
> 
> When we were internally testing the performance of recent kernels, we noticed
> quite variable readahead performance arising from various quirks in the
> readahead code. So I went on a cleaning spree there. This is the batch of
> patches resulting from that. A quick test in my test VM with the following fio
> job file:
> 
> [global]
> direct=0
> ioengine=sync
> invalidate=1
> blocksize=4k
> size=10g
> readwrite=read
> 
> [reader]
> numjobs=1
> 
> shows that this patch series improves the throughput from a variable
> 310-340 MB/s to a rather stable 350 MB/s. As a side effect these cleanups also
> address the issue noticed by Bruz Zhang [1].
> 

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/10] mm: Fix various readahead quirks
  2024-06-25 10:18 [PATCH 0/10] mm: Fix various readahead quirks Jan Kara
                   ` (10 preceding siblings ...)
  2024-06-25 17:12 ` [PATCH 0/10] mm: Fix various readahead quirks Josef Bacik
@ 2024-06-27  3:04 ` Zhang Peng
  2024-06-27  6:10   ` zippermonkey
  11 siblings, 1 reply; 15+ messages in thread
From: Zhang Peng @ 2024-06-27  3:04 UTC (permalink / raw)
  To: jack; +Cc: akpm, linux-fsdevel, linux-mm, willy, vernhao, zigiwang,
	bruzzhang

I tested this batch of patches with fio and it indeed gives a huge speedup
for sequential reads when the block size is 4KiB. The results are as follows;
for async reads, iodepth is set to 128, and the other settings
are self-evident.

casename                upstream    withFix     speedup
----------------        --------    --------    --------
randread-4k-sync        48991       47773       -2.4862%
seqread-4k-sync         1162758     1422955     22.3776%
seqread-1024k-sync      1460208     1452522     -0.5264%
randread-4k-libaio      47467       47309       -0.3329%
randread-4k-posixaio    49190       49512       0.6546%
seqread-4k-libaio       1085932     1234635     13.6936%
seqread-1024k-libaio    1423341     1402214     -1.4843%
seqread-4k-posixaio     1165084     1369613     17.5549%
seqread-1024k-posixaio  1435422     1408808     -1.8541%

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/10] mm: Fix various readahead quirks
  2024-06-27  3:04 ` Zhang Peng
@ 2024-06-27  6:10   ` zippermonkey
  2024-06-27 21:13     ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: zippermonkey @ 2024-06-27  6:10 UTC (permalink / raw)
  To: zhangpengpeng0808
  Cc: akpm, bruzzhang, jack, linux-fsdevel, linux-mm, vernhao, willy,
	zigiwang


Hi, Jan

This is my environment:
PCIe Gen 3 NVMe SSD,
XFS filesystem with default config,
host with 96 cores.

BTW, your patchset also seems to fix the bug that I found [1], and more
completely than mine, so you can ignore my patch :)

Thanks,
-zp

Tested-by: Zhang Peng <bruzzhang@tencent.com>


[1] 
https://lore.kernel.org/linux-mm/20240625103653.uzabtus3yq2lo3o6@quack3/T/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/10] mm: Fix various readahead quirks
  2024-06-27  6:10   ` zippermonkey
@ 2024-06-27 21:13     ` Andrew Morton
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2024-06-27 21:13 UTC (permalink / raw)
  To: zippermonkey
  Cc: zhangpengpeng0808, bruzzhang, jack, linux-fsdevel, linux-mm,
	vernhao, willy, zigiwang

On Thu, 27 Jun 2024 14:10:48 +0800 zippermonkey <zzippermonkey@outlook.com> wrote:

> Tested-by: Zhang Peng <bruzzhang@tencent.com>

Thanks.  I added this to the changelogs and pasted your testing results
into the [0/N] description.


^ permalink raw reply	[flat|nested] 15+ messages in thread
