From: Ming Lei <ming.lei@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Mike Snitzer <snitzer@kernel.org>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Don Dutile <ddutile@redhat.com>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
linux-block@vger.kernel.org, ming.lei@redhat.com
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
Date: Mon, 29 Jan 2024 11:57:45 +0800 [thread overview]
Message-ID: <ZbciOba1h3V9mmup@fedora> (raw)
In-Reply-To: <ZbcDvTkeDKttPfJ4@dread.disaster.area>
On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > Understood. But ... the application is asking for as much readahead as
> > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > a time". So why will we not get a bug report in 1-15 years time saying
> > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > typically we allow the sysadmin to override application requests,
> > > don't we?
> >
> > The application isn't knowingly asking for readahead. It is asking to
> > mmap the file (and reporter wants it done as quickly as possible..
> > like occurred before).
>
> ... which we do within the constraints of the given configuration.
>
> > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > request size based on read-ahead setting") -- same logic, just applied
> > to callchain that ends up using madvise(MADV_WILLNEED).
>
> Not really. There is a difference between performing a synchronous
> read IO here that we must complete, compared to optimistic
> asynchronous read-ahead which we can fail or toss away without the
> user ever seeing the data the IO returned.
Yeah, the big readahead in this patch happens when the user starts to read
from the mmapped buffer, not at madvise() time.
>
> We want required IO to be done in as few, larger IOs as possible,
> and not be limited by constraints placed on background optimistic
> IOs.
>
> madvise(WILLNEED) is optimistic IO - there is no requirement that it
> complete the data reads successfully. If the data is actually
> required, we'll guarantee completion when the user accesses it, not
> when madvise() is called. IOWs, madvise is async readahead, and so
> really should be constrained by readahead bounds and not user IO
> bounds.
>
> We could change this behaviour for madvise of large ranges that we
> force into the page cache by ignoring device readahead bounds, but
> I'm not sure we want to do this in general.
>
> Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> value in this situation to override the device limit for large
> ranges (for some definition of large - say 10x bdi->ra_pages) and
> restore it once the readahead operation is done. This would make it
> behave less like readahead and more like a user read from an IO
> perspective...
->ra_pages is just a hint, 128KB by default, and either the device or
userspace can override it.
fadvise/madvise(willneed) already reads ahead up to bdi->io_pages, which is
derived from the device's max I/O size (often 10x ->ra_pages); please see
force_page_cache_ra().
The current report is as follows:

1) userspace calls madvise(willneed, 1G)

2) only the first part (sized by bdi->io_pages, suppose it is 2MB) is read
ahead during madvise(willneed, 1G), since commit 6d2be915e589

3) the remaining range (2M ~ 1G) is read ahead in units of bdi->ra_pages,
which userspace has set to 64KB, as userspace reads the mmapped buffer,
so the whole application becomes slower.
This patch changes step 3) to use bdi->io_pages as the readahead unit.
Thanks,
Ming