Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Matthew Wilcox <willy@infradead.org>
To: Jan Kara <jack@suse.cz>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>,
	Yujie Liu <yujie.liu@intel.com>,
	Oliver Sang <oliver.sang@intel.com>,
	oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Guo Xuenan <guoxuenan@huawei.com>,
	linux-fsdevel@vger.kernel.org, ying.huang@intel.com,
	feng.tang@intel.com
Subject: Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression
Date: Thu, 7 Mar 2024 18:19:46 +0000	[thread overview]
Message-ID: <ZeoFQnVYLLBLNL6J@casper.infradead.org> (raw)
In-Reply-To: <20240307092308.u54fjngivmx23ty3@quack3>

On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> Thanks for testing! This is an interesting result and certainly unexpected
> for me. The readahead code allocates naturally aligned pages so based on
> the distribution of allocations it seems that before commit ab4443fe3ca6
> readahead window was at least 32 pages (128KB) aligned and so we allocated
> order 5 pages. After the commit, the readahead window somehow ended up only
> aligned to 20 modulo 32. To follow natural alignment and fill 128KB
> readahead window we allocated order 2 page (got us to offset 24 modulo 32),
> then order 3 page (got us to offset 0 modulo 32), order 4 page (larger
> would not fit in 128KB readahead window now), and order 2 page to finish
> filling the readahead window.
> 
> Now I'm not 100% sure why the readahead window alignment changed with
> different rounding when placing readahead mark - probably that's some
> artifact when readahead window is tiny in the beginning before we scale it
> up (I'll verify by tracing whether everything ends up looking correctly
> with the current code). So I don't expect this is a problem in ab4443fe3ca6
> as such but it exposes the issue that readahead page insertion code should
> perhaps strive to achieve better readahead window alignment with logical
> file offset even at the cost of occasionally performing somewhat shorter
> readahead. I'll look into this once I dig out of the huge heap of email
> after vacation...

I was surprised by what you said here, so I went and re-read the code
and it doesn't work the way I thought it did.  So I had a good long think
about how it _should_ work, and I looked for some more corner conditions,
and this is what I came up with.

The first thing I've done is separate out the two limits.  The EOF is
a hard limit; we will not allocate pages beyond EOF.  The ra->size is
a soft limit; we will allocate pages beyond ra->size, but not too far.

The second thing I noticed is that index + ra_size could wrap.  So add
a check for that, and set it to ULONG_MAX.  index + ra_size - async_size
could also wrap, but this is harmless.  We certainly don't want to kick
off any more readahead in this circumstance, so leaving 'mark' outside
the range [index..ULONG_MAX] is just fine.

The third thing is that we could allocate a folio which contains a page
at ULONG_MAX.  We don't really want that in the page cache; it makes
filesystems more complicated if they have to check for that, and we
don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
This _should_ already be prohibited by the "Don't allocate pages past EOF"
check, but let's explicitly prohibit it.

Compile tested only.

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..742e1f39035b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
-	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t limit = index + ra->size;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
@@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!mapping_large_folio_support(mapping) || ra->size < 4)
 		goto fallback;
 
-	limit = min(limit, index + ra->size - 1);
-
 	if (new_order < MAX_PAGECACHE_ORDER) {
 		new_order += 2;
 		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	}
 
+	if (limit < index)
+		limit = ULONG_MAX;
 	filemap_invalidate_lock_shared(mapping);
-	while (index <= limit) {
+	while (index < limit) {
 		unsigned int order = new_order;
 
 		/* Align with smaller pages if needed */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
+		/* Avoid wrap */
+		if (index + (1UL << order) == 0)
+			order--;
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (index + (1UL << order) - 1 > last)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)

next prev parent reply	other threads:[~2024-03-07 18:19 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-20  8:25 [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression kernel test robot
2024-02-21 11:14 ` Jan Kara
2024-02-22  1:32   ` Oliver Sang
2024-02-22 11:50     ` Jan Kara
2024-02-22 18:37       ` Jan Kara
2024-03-04  4:59         ` Yujie Liu
2024-03-04  5:35           ` Yin, Fengwei
2024-03-06  5:36             ` Yin Fengwei
2024-03-07  9:23             ` Jan Kara
2024-03-07 18:19               ` Matthew Wilcox [this message]
2024-03-08  8:37                 ` Yujie Liu
2024-03-10  6:41                 ` Yin, Fengwei
2024-03-10  6:40               ` Yin, Fengwei

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:130c0e7df99 dfblob:742e1f39035 )
 OR (
bs:"Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZeoFQnVYLLBLNL6J@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=feng.tang@intel.com \
    --cc=fengwei.yin@intel.com \
    --cc=guoxuenan@huawei.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=ying.huang@intel.com \
    --cc=yujie.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).