linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Mike Snitzer <snitzer@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Don Dutile <ddutile@redhat.com>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	linux-block@vger.kernel.org, ming.lei@redhat.com
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
Date: Tue, 30 Jan 2024 11:13:50 +0800	[thread overview]
Message-ID: <ZbhpbpeV6ChPD9NT@fedora> (raw)
In-Reply-To: <ZbghnK+Hs+if6vEz@dread.disaster.area>

On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > Follows the current report:
> > > > 
> > > > 1) usersapce call madvise(willneed, 1G)
> > > > 
> > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is
> > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > 
> > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is
> > > > set as 64KB by userspace when userspace reads the mmaped buffer, then
> > > > the whole application becomes slower.
> > > 
> > > It gets limited by file->f_ra->ra_pages being initialised to
> > > bdi->ra_pages and then never changed as the advice for access
> > > methods to the file are changed.
> > > 
> > > But the problem here is *not the readahead code*. The problem is
> > > that the user has configured the device readahead window to be far
> > > smaller than is optimal for the storage. Hence readahead is slow.
> > > The fix for that is to either increase the device readahead windows,
> > > or to change the specific readahead window for the file that has
> > > sequential access patterns.
> > > 
> > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > larger IOs for that access.
> > > 
> > > That's what should happen here - MADV_WILLNEED does not imply a
> > > specific access pattern so the application should be running
> > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > to start the readahead, and then the rest of the on-demand readahead
> > > will get the higher readahead limits.
> > > 
> > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > 
> > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > mem.load() implementation in the application converted to use
> > > MADV_SEQUENTIAL to properly indicate it's access pattern to the
> > > readahead algorithm.
> > 
> > Here the single .ra_pages may not work, that is why this patch stores
> > the willneed range in maple tree, please see the following words from
> > the original RH report:
> 
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
> 
> Yes, I've read the bug. There's no triage that tells us what the
> root cause of the application perofrmance issue might be. Just an
> assertion that "this is how we did it 10 years ago, it's been
> unchanged for all this time, the new kernel we are upgrading
> to needs to behave exactly like pre-3.10 era kernels did.
> 
> And to be totally honest, my instincts tell me this is more likely a
> problem with a root cause in poor IO scheduling decisions than be a
> problem with the page cache readahead implementation. Readahead has
> been turned down to stop the bandwidth it uses via background async
> read IO from starving latency dependent foreground random IO
> operation, and then we're being asked to turn readahead back up
> in specific situations because it's actually needed for performance
> in certain access patterns. This is the sort of thing bfq is
> intended to solve.

Reading mmaped buffer in userspace is sync IO, and page fault just
readahead 64KB. I don't understand how block IO scheduler makes a
difference in this single 64KB readahead in case of cache miss.

> 
> 
> > Also almost all these advises(SEQUENTIA, WILLNEED, NORMAL, RANDOM)
> > ignore the passed range, and the behavior becomes all or nothing,
> > instead of something only for the specified range, which may not
> > match with man, please see 'man posix_fadvise':
> 
> The man page says:
> 
> 	The advice is not binding; it merely constitutes an
> 	expectation on behalf of the application.
> 
> > It is even worse for readahead() syscall:
> > 
> > 	``` DESCRIPTION readahead()  initiates readahead on a file
> > 	so that subsequent reads from that file will be satisfied
> > 	from the cache, and not block on disk I/O (assuming the
> > 	readahead was initiated early enough and that other activity
> > 	on the system did not in the meantime flush pages from the
> > 	cache).  ```
> 
> Yes, that's been "broken" for a long time (since the changes to cap
> force_page_cache_readahead() to ra_pages way back when), but the
> assumption documented about when readahead(2) will work goes to the
> heart of why we don't let user controlled readahead actually do much
> in the way of direct readahead. i.e. too much readahead is
> typically harmful to IO and system performance and very, very few
> applications actually need files preloaded entirely into memory.

It is true for normal readahead, but not sure if it is for
advise(willneed) or readahead().

> 
> ----
> 
> All said, I'm starting to think that there isn't an immediate
> upstream kernel change needed right now.  I just did a quick check
> through the madvise() man page to see if I'd missed anything, and I
> most definitely did miss what is a relatively new addition to it:
> 
> MADV_POPULATE_READ (since Linux 5.14)
>      "Populate (prefault) page tables readable, faulting in all
>      pages in the range just as if manually reading from each page;
>      however, avoid the actual memory access that would have been
>      performed after handling the fault.
> 
>      In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
>      errors, can be applied to (parts of) existing mappings and will
>      al‐ ways  populate (prefault) page tables readable.  One
>      example use case is prefaulting a file mapping, reading all
>      file content from disk; however, pages won't be dirtied and
>      consequently won't have to be written back to disk when
>      evicting the pages from memory.
> 
> That's exactly what the application is apparently wanting
> MADV_WILLNEED to do.

Indeed, it works as expected(all mmapped pages are load in
madvise(MADV_POPULATE_READ)) in my test code except for 16 ra_pages, but
it is less important now.

Thanks for this idea!

> 
> Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> has some relevant commentary on why MADV_WILLNEED could not be
> modified to meet the pre-population requirements of the applications
> that required this pre-population behaviour from the kernel.
> 
> With this, I suspect that the application needs to be updated to
> use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> back and do some analysis of the readahead behaviour of the
> application and the MADV_POPULATE_READ operation. We may need to
> tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> because it's no longer "optimistic speculation" about whether the
> data is needed in cache - the operation being performed guarantees
> that or it fails with an error. IOWs, MADV_POPULATE_READ is
> effectively user data IO at this point, not advice about future
> access patterns...

BTW, in this report, MADV_WILLNEED is used by java library[1], and I
guess it could be difficult to update to MADV_POPULATE_READ.

[1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html



Thanks,
Ming



  reply	other threads:[~2024-01-30  3:14 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-28 14:25 [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range Ming Lei
2024-01-28 22:02 ` Matthew Wilcox
2024-01-28 23:12   ` Mike Snitzer
2024-01-29  0:21     ` Matthew Wilcox
2024-01-29  0:39       ` Mike Snitzer
2024-01-29  1:47         ` Dave Chinner
2024-01-29  2:12           ` Mike Snitzer
2024-01-29  4:56             ` Dave Chinner
2024-01-29  3:57           ` Ming Lei
2024-01-29  5:15             ` Dave Chinner
2024-01-29  8:25               ` Ming Lei
2024-01-29 13:26                 ` Matthew Wilcox
2024-01-29 22:07                 ` Dave Chinner
2024-01-30  3:13                   ` Ming Lei [this message]
2024-01-30  5:29                     ` Dave Chinner
2024-01-30 11:34                       ` Ming Lei
2024-01-29  3:20       ` Ming Lei
2024-01-29  3:00   ` Ming Lei
2024-01-29 17:19 ` Mike Snitzer
2024-01-29 17:42   ` Mike Snitzer
2024-01-29 22:12   ` Dave Chinner
2024-01-29 22:46     ` Mike Snitzer
2024-01-30 10:43       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZbhpbpeV6ChPD9NT@fedora \
    --to=ming.lei@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@fromorbit.com \
    --cc=ddutile@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=raghavendra.kt@linux.vnet.ibm.com \
    --cc=snitzer@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).