linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Ming Lei <ming.lei@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Don Dutile <ddutile@redhat.com>,
	Rafael Aquini <raquini@redhat.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ
Date: Fri, 2 Feb 2024 09:19:13 -0500	[thread overview]
Message-ID: <Zbz54MUZ2kZLTQQx@redhat.com> (raw)
In-Reply-To: <ZbzJZji95a1qmhcj@fedora>

On Fri, Feb 02 2024 at  5:52P -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> > On Thu, Feb 01 2024 at  9:20P -0500,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> > > specific range, so it is usually sequential IO if VMA is backed by
> > > file.
> > > 
> > > Set ra_pages as device max request size for the involved readahead in
> > > the ADV_POPULATE_READ, this way reduces latency of madvise(MADV_POPULATE_READ)
> > > to 1/10 when running madvise(MADV_POPULATE_READ) over one 1GB file with
> > > usual(default) 128KB of read_ahead_kb.
> > > 
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Don Dutile <ddutile@redhat.com>
> > > Cc: Rafael Aquini <raquini@redhat.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: Mike Snitzer <snitzer@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 51 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 912155a94ed5..db5452c8abdd 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >  		return -EINVAL;
> > >  }
> > >  
> > > +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> > > +{
> > > +	if (*file) {
> > > +		struct file *f = *file;
> > > +
> > > +		f->f_ra.ra_pages = ra_pages;
> > > +		fput(f);
> > > +		*file = NULL;
> > > +	}
> > > +}
> > > +
> > > +static struct file *madvise_override_ra_win(struct file *f,
> > > +		unsigned long start, unsigned long end,
> > > +		unsigned int *old_ra_pages)
> > > +{
> > > +	unsigned int io_pages;
> > > +
> > > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > > +		return NULL;
> > > +
> > > +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> > > +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> > > +		return NULL;
> > > +
> > > +	f = get_file(f);
> > > +	*old_ra_pages = f->f_ra.ra_pages;
> > > +	f->f_ra.ra_pages = io_pages;
> > > +
> > > +	return f;
> > > +}
> > > +
> > 
> > Does this override imply that madvise_populate resorts to calling
> > filemap_fault() and here you're just arming it to use the larger
> > ->io_pages for the duration of all associated faulting?
> 
> Yes.
> 
> > 
> > Wouldn't it be better to avoid faulting and build up larger page
> 
> How can we avoid the fault handling? which is needed to build VA->PA mapping.

I was wondering if it made sense to add fadvise_populate -- but given
my lack of experience with MM I then get handwavvy quick -- I have
more work ahead to round out my MM understanding so that I'm more
informed.
 
> > vectors that get sent down to the block layer in one go and let the
> 
> filemap_fault() already tries to allocate folio in big size(max order
> is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and ra_alloc_folio().
> 
> > block layer split using the device's limits? (like happens with
> > force_page_cache_ra)
> 
> Here filemap code won't deal with block directly because there is VFS &
> FS and io mapping is required, and it just calls aops->readahead() or
> aops->read_folio(), but block plug & readahead_control are applied for
> handling everything in batch.
> 
> > 
> > I'm concerned that madvise_populate isn't so efficient with filemap
> 
> That is why this patch increases readahead window, then
> madvise_populate() performance can be improved by X10 in big file-backed
> popluate read.

Right, as you know I've tested your patch, the larger readahead window
certainly did provide the much more desirable performance.  I'll reply
to your v2 (with reduced negative checks) with my Reviewed-by and
Tested-by.

I was just wondering if there an opportunity to plumb in more a
specific (and potentially better) fadvise_populate for dealing with
file backed pages.

> > due to excessive faulting (*BUT* I haven't traced to know, I'm just
> > inferring that is why twiddling f->f_ra.ra_pages helps improve
> > madvise_populate by having it issue larger IO. Apologies if I'm way
> > off base)
> 
> As mentioned, fault handling can't be avoided, but we can improve
> involved readahead IO perf.

Thanks, and sorry for asking such a naive question (put more pressure
on you to educate than I should have).

Mike

  reply	other threads:[~2024-02-02 14:19 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-02  2:20 [PATCH] mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ Ming Lei
2024-02-02  4:15 ` Matthew Wilcox
2024-02-02  4:48   ` Ming Lei
2024-02-02  4:43 ` Mike Snitzer
2024-02-02 10:52   ` Ming Lei
2024-02-02 14:19     ` Mike Snitzer [this message]
2024-02-04 23:34 ` [PATCH] " Dave Chinner
2024-02-05  9:53   ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zbz54MUZ2kZLTQQx@redhat.com \
    --to=snitzer@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=ddutile@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ming.lei@redhat.com \
    --cc=raquini@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).