* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
[not found] ` <ZbcDvTkeDKttPfJ4@dread.disaster.area>
@ 2024-01-29 3:57 ` Ming Lei
2024-01-29 5:15 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2024-01-29 3:57 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block, ming.lei
On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > Understood. But ... the application is asking for as much readahead as
> > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > a time". So why will we not get a bug report in 1-15 years time saying
> > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > typically we allow the sysadmin to override application requests,
> > > don't we?
> >
> > The application isn't knowingly asking for readahead. It is asking to
> > mmap the file (and reporter wants it done as quickly as possible..
> > like occurred before).
>
> ... which we do within the constraints of the given configuration.
>
> > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > request size based on read-ahead setting") -- same logic, just applied
> > to callchain that ends up using madvise(MADV_WILLNEED).
>
> Not really. There is a difference between performing a synchronous
> read IO here that we must complete, compared to optimistic
> asynchronous read-ahead which we can fail or toss away without the
> user ever seeing the data the IO returned.
Yeah, the big readahead in this patch happens when the user starts to read
over the mmapped buffer, not when madvise() is called.
>
> We want required IO to be done in as few, larger IOs as possible,
> and not be limited by constraints placed on background optimistic
> IOs.
>
> madvise(WILLNEED) is optimistic IO - there is no requirement that it
> complete the data reads successfully. If the data is actually
> required, we'll guarantee completion when the user accesses it, not
> when madvise() is called. IOWs, madvise is async readahead, and so
> really should be constrained by readahead bounds and not user IO
> bounds.
>
> We could change this behaviour for madvise of large ranges that we
> force into the page cache by ignoring device readahead bounds, but
> I'm not sure we want to do this in general.
>
> Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> value in this situation to override the device limit for large
> ranges (for some definition of large - say 10x bdi->ra_pages) and
> restore it once the readahead operation is done. This would make it
> behave less like readahead and more like a user read from an IO
> perspective...
->ra_pages is just one hint, 128KB by default, and either the device or
userspace can override it.
fadvise/madvise(willneed) already reads ahead up to bdi->io_pages bytes,
which is the maximum device I/O size (often 10x ->ra_pages); please see
force_page_cache_ra().
Here is the current report:
1) userspace calls madvise(willneed, 1G)
2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
read ahead in madvise(willneed, 1G) since commit 6d2be915e589
3) the other part (2M ~ 1G) is read ahead in units of bdi->ra_pages, which
was set to 64KB by userspace, when userspace reads the mmapped buffer;
the whole application then becomes slower.
This patch changes 3) to use bdi->io_pages as the readahead unit.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 3:57 ` [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range Ming Lei
@ 2024-01-29 5:15 ` Dave Chinner
2024-01-29 8:25 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-01-29 5:15 UTC (permalink / raw)
To: Ming Lei
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block
On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > Understood. But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > >
> > > The application isn't knowingly asking for readahead. It is asking to
> > > mmap the file (and reporter wants it done as quickly as possible..
> > > like occurred before).
> >
> > ... which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to callchain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
>
> Yeah, the big readahead in this patch happens when the user starts to read
> over the mmapped buffer, not when madvise() is called.
Yes, that's how it is intended to work :/
> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called. IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
>
> ->ra_pages is just one hint, 128KB by default, and either the device or
> userspace can override it.
>
> fadvise/madvise(willneed) already reads ahead up to bdi->io_pages bytes,
> which is the maximum device I/O size (often 10x ->ra_pages); please see
> force_page_cache_ra().
Yes, but if we also change vma->file->f_ra->ra_pages during the
WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a
larger readahead window for the demand-paged access portion of the
WILLNEED access...
>
> Here is the current report:
>
> 1) userspace calls madvise(willneed, 1G)
>
> 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> read ahead in madvise(willneed, 1G) since commit 6d2be915e589
>
> 3) the other part (2M ~ 1G) is read ahead in units of bdi->ra_pages, which
> was set to 64KB by userspace, when userspace reads the mmapped buffer;
> the whole application then becomes slower.
It gets limited by file->f_ra->ra_pages being initialised to
bdi->ra_pages and then never changed as the advice for access
methods to the file are changed.
But the problem here is *not the readahead code*. The problem is
that the user has configured the device readahead window to be far
smaller than is optimal for the storage. Hence readahead is slow.
The fix for that is to either increase the device readahead windows,
or to change the specific readahead window for the file that has
sequential access patterns.
Indeed, we already have that - FADV_SEQUENTIAL will set
file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
larger IOs for that access.
That's what should happen here - MADV_WILLNEED does not imply a
specific access pattern so the application should be running
MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
to start the readahead, and then the rest of the on-demand readahead
will get the higher readahead limits.
> This patch changes 3) to use bdi->io_pages as the readahead unit.
I think it really should be changing MADV/FADV_SEQUENTIAL to set
file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
mem.load() implementation in the application converted to use
MADV_SEQUENTIAL to properly indicate its access pattern to the
readahead algorithm.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 5:15 ` Dave Chinner
@ 2024-01-29 8:25 ` Ming Lei
2024-01-29 13:26 ` Matthew Wilcox
2024-01-29 22:07 ` Dave Chinner
0 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2024-01-29 8:25 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block, ming.lei
On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > > Understood. But ... the application is asking for as much readahead as
> > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > > typically we allow the sysadmin to override application requests,
> > > > > don't we?
> > > >
> > > > The application isn't knowingly asking for readahead. It is asking to
> > > > mmap the file (and reporter wants it done as quickly as possible..
> > > > like occurred before).
> > >
> > > ... which we do within the constraints of the given configuration.
> > >
> > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > > request size based on read-ahead setting") -- same logic, just applied
> > > > to callchain that ends up using madvise(MADV_WILLNEED).
> > >
> > > Not really. There is a difference between performing a synchronous
> > > read IO here that we must complete, compared to optimistic
> > > asynchronous read-ahead which we can fail or toss away without the
> > > user ever seeing the data the IO returned.
> >
> > Yeah, the big readahead in this patch happens when the user starts to read
> > over the mmapped buffer, not when madvise() is called.
>
> Yes, that's how it is intended to work :/
>
> > > We want required IO to be done in as few, larger IOs as possible,
> > > and not be limited by constraints placed on background optimistic
> > > IOs.
> > >
> > > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > > complete the data reads successfully. If the data is actually
> > > required, we'll guarantee completion when the user accesses it, not
> > > when madvise() is called. IOWs, madvise is async readahead, and so
> > > really should be constrained by readahead bounds and not user IO
> > > bounds.
> > >
> > > We could change this behaviour for madvise of large ranges that we
> > > force into the page cache by ignoring device readahead bounds, but
> > > I'm not sure we want to do this in general.
> > >
> > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > > value in this situation to override the device limit for large
> > > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > > restore it once the readahead operation is done. This would make it
> > > behave less like readahead and more like a user read from an IO
> > > perspective...
> >
> > ->ra_pages is just one hint, 128KB by default, and either the device or
> > userspace can override it.
> >
> > fadvise/madvise(willneed) already reads ahead up to bdi->io_pages bytes,
> > which is the maximum device I/O size (often 10x ->ra_pages); please see
> > force_page_cache_ra().
>
> Yes, but if we also change vma->file->f_ra->ra_pages during the
> WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a
> larger readahead window for the demand-paged access portion of the
> WILLNEED access...
>
> >
> > Here is the current report:
> >
> > 1) userspace calls madvise(willneed, 1G)
> >
> > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> >
> > 3) the other part (2M ~ 1G) is read ahead in units of bdi->ra_pages, which
> > was set to 64KB by userspace, when userspace reads the mmapped buffer;
> > the whole application then becomes slower.
>
> It gets limited by file->f_ra->ra_pages being initialised to
> bdi->ra_pages and then never changed as the advice for access
> methods to the file are changed.
>
> But the problem here is *not the readahead code*. The problem is
> that the user has configured the device readahead window to be far
> smaller than is optimal for the storage. Hence readahead is slow.
> The fix for that is to either increase the device readahead windows,
> or to change the specific readahead window for the file that has
> sequential access patterns.
>
> Indeed, we already have that - FADV_SEQUENTIAL will set
> file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> larger IOs for that access.
>
> That's what should happen here - MADV_WILLNEED does not imply a
> specific access pattern so the application should be running
> MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> to start the readahead, and then the rest of the on-demand readahead
> will get the higher readahead limits.
>
> > This patch changes 3) to use bdi->io_pages as the readahead unit.
>
> I think it really should be changing MADV/FADV_SEQUENTIAL to set
> file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> mem.load() implementation in the application converted to use
> MADV_SEQUENTIAL to properly indicate its access pattern to the
> readahead algorithm.
Here a single .ra_pages may not work; that is why this patch stores the
willneed range in a maple tree. Please see the following words from the
original RH report:
"
Increasing read ahead is not an option as it has a mixed I/O workload of
random I/O and sequential I/O, so that a large read ahead is very counterproductive
to the random I/O and is unacceptable.
"
Also, almost all of these advice hints (SEQUENTIAL, WILLNEED, NORMAL, RANDOM)
ignore the passed range, so the behavior becomes all or nothing instead of
applying only to the specified range, which may not match the man page; see
'man posix_fadvise':
```
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access
pattern for the specified data. If no advice is given for an open file,
this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with
lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
In kernels before 2.6.18, POSIX_FADV_NOREUSE had the same semantics as
POSIX_FADV_WILLNEED. This was probably a bug; since kernel 2.6.18, this
flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
```
It is even worse for the readahead() syscall:
```
DESCRIPTION
readahead() initiates readahead on a file so that subsequent reads from that
file will be satisfied from the cache, and not block on disk I/O (assuming the
readahead was initiated early enough and that other activity on the system did
not in the meantime flush pages from the cache).
```
Thanks,
Ming
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 8:25 ` Ming Lei
@ 2024-01-29 13:26 ` Matthew Wilcox
2024-01-29 22:07 ` Dave Chinner
1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-01-29 13:26 UTC (permalink / raw)
To: Ming Lei
Cc: Dave Chinner, Mike Snitzer, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> Here a single .ra_pages may not work; that is why this patch stores the
> willneed range in a maple tree. Please see the following words from the
> original RH report:
>
> "
> Increasing read ahead is not an option as it has a mixed I/O workload of
> random I/O and sequential I/O, so that a large read ahead is very counterproductive
> to the random I/O and is unacceptable.
> "
It is really frustrating having to drag this important information out of
you. Please send the full bug report (stripping identifying information
if that's what the customer needs). We seem to be covering the same
ground again in public that apparently you've already done in private.
This is no way to work with upstream.
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
[not found] <20240128142522.1524741-1-ming.lei@redhat.com>
[not found] ` <ZbbPCQZdazF7s0_b@casper.infradead.org>
@ 2024-01-29 17:19 ` Mike Snitzer
2024-01-29 17:42 ` Mike Snitzer
2024-01-29 22:12 ` Dave Chinner
1 sibling, 2 replies; 13+ messages in thread
From: Mike Snitzer @ 2024-01-29 17:19 UTC (permalink / raw)
To: Ming Lei, Dave Chinner, Matthew Wilcox
Cc: Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, Don Dutile,
Raghavendra K T, Alexander Viro, Christian Brauner, linux-block
On Mon, Jan 29 2024 at 8:26P -0500, Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > Here a single .ra_pages may not work; that is why this patch stores the
> > willneed range in a maple tree. Please see the following words from the
> > original RH report:
> >
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
>
> It is really frustrating having to drag this important information out of
> you. Please send the full bug report (stripping identifying information
> if that's what the customer needs). We seem to be covering the same
> ground again in public that apparently you've already done in private.
> This is no way to work with upstream.
We are all upstream developers here. And Ming is among the best of us,
so please try to stay constructive.
You now have the full bug report (I provided the reproducer and
additional context, and now Ming shared the "Increasing read ahead is
not an option..." detail from the original report).
BUT Ming did mention the potential for mixed workload in his original
RFC patch header:
On Sun, Jan 28 2024 at 9:25P -0500, Ming Lei <ming.lei@redhat.com> wrote:
> Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> only tries to read ahead 512 pages, and the remaining part of the advised
> range falls back on normal readahead.
>
> If bdi->ra_pages is set small, readahead will not be efficient enough.
> Increasing read ahead may not be an option since the workload may have
> mixed random and sequential I/O.
And while both you and Dave rightly seized on the seemingly "Doctor it
hurts when I clamp readahead to be small but then issue larger
sequential reads and want it to use larger readahead" aspect of this
report...
Dave was helpful with his reasoned follow-up responses, culminating
with this one (where the discussion evolved to clearly consider the
fact that an integral part of addressing the reported issue is the
need to allow for mixed workloads not stomping on each other when
issuing IO to the same backing block device):
On Mon, Jan 29 2024 at 12:15P -0500,
Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > > Understood. But ... the application is asking for as much readahead as
> > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > > typically we allow the sysadmin to override application requests,
> > > > > don't we?
> > > >
> > > > The application isn't knowingly asking for readahead. It is asking to
> > > > mmap the file (and reporter wants it done as quickly as possible..
> > > > like occurred before).
> > >
> > > ... which we do within the constraints of the given configuration.
> > >
> > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > > request size based on read-ahead setting") -- same logic, just applied
> > > > to callchain that ends up using madvise(MADV_WILLNEED).
> > >
> > > Not really. There is a difference between performing a synchronous
> > > read IO here that we must complete, compared to optimistic
> > > asynchronous read-ahead which we can fail or toss away without the
> > > user ever seeing the data the IO returned.
> >
> > Yeah, the big readahead in this patch happens when the user starts to read
> > over the mmapped buffer, not when madvise() is called.
>
> Yes, that's how it is intended to work :/
>
> > > We want required IO to be done in as few, larger IOs as possible,
> > > and not be limited by constraints placed on background optimistic
> > > IOs.
> > >
> > > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > > complete the data reads successfully. If the data is actually
> > > required, we'll guarantee completion when the user accesses it, not
> > > when madvise() is called. IOWs, madvise is async readahead, and so
> > > really should be constrained by readahead bounds and not user IO
> > > bounds.
> > >
> > > We could change this behaviour for madvise of large ranges that we
> > > force into the page cache by ignoring device readahead bounds, but
> > > I'm not sure we want to do this in general.
> > >
> > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > > value in this situation to override the device limit for large
> > > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > > restore it once the readahead operation is done. This would make it
> > > behave less like readahead and more like a user read from an IO
> > > perspective...
> >
> > ->ra_pages is just one hint, 128KB by default, and either the device or
> > userspace can override it.
> >
> > fadvise/madvise(willneed) already reads ahead up to bdi->io_pages bytes,
> > which is the maximum device I/O size (often 10x ->ra_pages); please see
> > force_page_cache_ra().
>
> Yes, but if we also change vma->file->f_ra->ra_pages during the
> WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a
> larger readahead window for the demand-paged access portion of the
> WILLNEED access...
>
> >
> > Here is the current report:
> >
> > 1) userspace calls madvise(willneed, 1G)
> >
> > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> >
> > 3) the other part (2M ~ 1G) is read ahead in units of bdi->ra_pages, which
> > was set to 64KB by userspace, when userspace reads the mmapped buffer;
> > the whole application then becomes slower.
>
> It gets limited by file->f_ra->ra_pages being initialised to
> bdi->ra_pages and then never changed as the advice for access
> methods to the file are changed.
>
> But the problem here is *not the readahead code*. The problem is
> that the user has configured the device readahead window to be far
> smaller than is optimal for the storage. Hence readahead is slow.
> The fix for that is to either increase the device readahead windows,
> or to change the specific readahead window for the file that has
> sequential access patterns.
>
> Indeed, we already have that - FADV_SEQUENTIAL will set
> file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> larger IOs for that access.
>
> That's what should happen here - MADV_WILLNEED does not imply a
> specific access pattern so the application should be running
> MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> to start the readahead, and then the rest of the on-demand readahead
> will get the higher readahead limits.
>
> > This patch changes 3) to use bdi->io_pages as the readahead unit.
>
> I think it really should be changing MADV/FADV_SEQUENTIAL to set
> file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> mem.load() implementation in the application converted to use
> > MADV_SEQUENTIAL to properly indicate its access pattern to the
> readahead algorithm.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
Anyway, if we could all keep cool and just finish the discussion about
how to address the reported issue in a long-term supported way that'd
be _awesome_.
While I'm sure this legacy application would love to not have to change
its code at all, I think we can all agree that we need to focus on how
best to advise applications with mixed workloads so they accomplish
efficient mmap+read of both sequential and random I/O.
To that end, I heard Dave clearly suggest 2 things:
1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
bdi->io_pages, not bdi->ra_pages * 2
2) Have the application first issue MADV_SEQUENTIAL to convey that for
the following MADV_WILLNEED is for sequential file load (so it is
desirable to use larger ra_pages)
This overrides the default of bdi->io_pages and _should_ provide the
required per-file duality of control for readahead, correct?
Thanks,
Mike
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 17:19 ` Mike Snitzer
@ 2024-01-29 17:42 ` Mike Snitzer
2024-01-29 22:12 ` Dave Chinner
1 sibling, 0 replies; 13+ messages in thread
From: Mike Snitzer @ 2024-01-29 17:42 UTC (permalink / raw)
To: Ming Lei, Dave Chinner, Matthew Wilcox
Cc: Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, Don Dutile,
Raghavendra K T, Alexander Viro, Christian Brauner, linux-block
On Mon, Jan 29 2024 at 12:19P -0500,
Mike Snitzer <snitzer@kernel.org> wrote:
> While I'm sure this legacy application would love to not have to
> change its code at all, I think we can all agree that we need to just
> focus on how best to advise applications that have mixed workloads
> accomplish efficient mmap+read of both sequential and random.
>
> To that end, I heard Dave clearly suggest 2 things:
>
> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
> bdi->io_pages, not bdi->ra_pages * 2
>
> 2) Have the application first issue MADV_SEQUENTIAL to convey that for
> the following MADV_WILLNEED is for sequential file load (so it is
> desirable to use larger ra_pages)
>
> This overrides the default of bdi->io_pages and _should_ provide the
> required per-file duality of control for readahead, correct?
I meant "This overrides the default of bdi->ra_pages ..." ;)
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 8:25 ` Ming Lei
2024-01-29 13:26 ` Matthew Wilcox
@ 2024-01-29 22:07 ` Dave Chinner
2024-01-30 3:13 ` Ming Lei
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-01-29 22:07 UTC (permalink / raw)
To: Ming Lei
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > Here is the current report:
> > >
> > > 1) userspace calls madvise(willneed, 1G)
> > >
> > > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > > read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > >
> > > 3) the other part (2M ~ 1G) is read ahead in units of bdi->ra_pages, which
> > > was set to 64KB by userspace, when userspace reads the mmapped buffer;
> > > the whole application then becomes slower.
> >
> > It gets limited by file->f_ra->ra_pages being initialised to
> > bdi->ra_pages and then never changed as the advice for access
> > methods to the file are changed.
> >
> > But the problem here is *not the readahead code*. The problem is
> > that the user has configured the device readahead window to be far
> > smaller than is optimal for the storage. Hence readahead is slow.
> > The fix for that is to either increase the device readahead windows,
> > or to change the specific readahead window for the file that has
> > sequential access patterns.
> >
> > Indeed, we already have that - FADV_SEQUENTIAL will set
> > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > larger IOs for that access.
> >
> > That's what should happen here - MADV_WILLNEED does not imply a
> > specific access pattern so the application should be running
> > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > to start the readahead, and then the rest of the on-demand readahead
> > will get the higher readahead limits.
> >
> > > This patch changes 3) to use bdi->io_pages as the readahead unit.
> >
> > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > mem.load() implementation in the application converted to use
> > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > readahead algorithm.
>
> Here a single .ra_pages may not work; that is why this patch stores the
> willneed range in a maple tree. Please see the following words from the
> original RH report:
> "
> Increasing read ahead is not an option as it has a mixed I/O workload of
> random I/O and sequential I/O, so that a large read ahead is very counterproductive
> to the random I/O and is unacceptable.
> "
Yes, I've read the bug. There's no triage that tells us what the
root cause of the application performance issue might be. Just an
assertion that "this is how we did it 10 years ago, it's been
unchanged for all this time, and the new kernel we are upgrading
to needs to behave exactly like pre-3.10 era kernels did."
And to be totally honest, my instincts tell me this is more likely a
problem with a root cause in poor IO scheduling decisions than a
problem with the page cache readahead implementation. Readahead has
been turned down to stop the bandwidth it uses via background async
read IO from starving latency-dependent foreground random IO
operations, and then we're being asked to turn readahead back up
in specific situations because it's actually needed for performance
in certain access patterns. This is the sort of thing bfq is
intended to solve.
> Also almost all of these advice flags (SEQUENTIAL, WILLNEED, NORMAL, RANDOM)
> ignore the passed range, and the behavior becomes all or nothing
> instead of applying only to the specified range, which may not
> match the man page; please see 'man posix_fadvise':
The man page says:
The advice is not binding; it merely constitutes an
expectation on behalf of the application.
> It is even worse for readahead() syscall:
>
> ``` DESCRIPTION readahead() initiates readahead on a file
> so that subsequent reads from that file will be satisfied
> from the cache, and not block on disk I/O (assuming the
> readahead was initiated early enough and that other activity
> on the system did not in the meantime flush pages from the
> cache). ```
Yes, that's been "broken" for a long time (since the changes to cap
force_page_cache_readahead() to ra_pages way back when), but the
assumption documented about when readahead(2) will work goes to the
heart of why we don't let user controlled readahead actually do much
in the way of direct readahead. i.e. too much readahead is
typically harmful to IO and system performance and very, very few
applications actually need files preloaded entirely into memory.
----
All said, I'm starting to think that there isn't an immediate
upstream kernel change needed right now. I just did a quick check
through the madvise() man page to see if I'd missed anything, and I
most definitely did miss what is a relatively new addition to it:
MADV_POPULATE_READ (since Linux 5.14)
"Populate (prefault) page tables readable, faulting in all
pages in the range just as if manually reading from each page;
however, avoid the actual memory access that would have been
performed after handling the fault.
In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
errors, can be applied to (parts of) existing mappings and will
always populate (prefault) page tables readable. One
example use case is prefaulting a file mapping, reading all
file content from disk; however, pages won't be dirtied and
consequently won't have to be written back to disk when
evicting the pages from memory.
That's exactly what the application is apparently wanting
MADV_WILLNEED to do.
Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
has some relevant commentary on why MADV_WILLNEED could not be
modified to meet the pre-population requirements of the applications
that required this pre-population behaviour from the kernel.
With this, I suspect that the application needs to be updated to
use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
back and do some analysis of the readahead behaviour of the
application and the MADV_POPULATE_READ operation. We may need to
tweak MADV_POPULATE_READ for large readahead IO, but that's OK
because it's no longer "optimistic speculation" about whether the
data is needed in cache - the operation being performed guarantees
that or it fails with an error. IOWs, MADV_POPULATE_READ is
effectively user data IO at this point, not advice about future
access patterns...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 17:19 ` Mike Snitzer
2024-01-29 17:42 ` Mike Snitzer
@ 2024-01-29 22:12 ` Dave Chinner
2024-01-29 22:46 ` Mike Snitzer
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-01-29 22:12 UTC (permalink / raw)
To: Mike Snitzer
Cc: Ming Lei, Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
linux-kernel, Don Dutile, Raghavendra K T, Alexander Viro,
Christian Brauner, linux-block
On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
> While I'm sure this legacy application would love to not have to
> change its code at all, I think we can all agree that we need to just
> focus on how best to advise applications that have mixed workloads
> accomplish efficient mmap+read of both sequential and random.
>
> To that end, I heard Dave clearly suggest 2 things:
>
> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
> bdi->io_pages, not bdi->ra_pages * 2
>
> 2) Have the application first issue MADV_SEQUENTIAL to convey that for
> the following MADV_WILLNEED is for sequential file load (so it is
> desirable to use larger ra_pages)
>
> > This overrides the default of bdi->ra_pages and _should_ provide the
> required per-file duality of control for readahead, correct?
I just discovered MADV_POPULATE_READ - see my reply to Ming
up-thread about that. The application should use that instead of
MADV_WILLNEED because it gives cache population guarantees that
WILLNEED doesn't. Then we can look at optimising the performance of
MADV_POPULATE_READ (if needed), as there is a constrained scope we can
optimise within, in ways that we cannot with WILLNEED.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 22:12 ` Dave Chinner
@ 2024-01-29 22:46 ` Mike Snitzer
2024-01-30 10:43 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2024-01-29 22:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Ming Lei, Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
linux-kernel, Don Dutile, Raghavendra K T, Alexander Viro,
Christian Brauner, linux-block, David Hildenbrand
On Mon, Jan 29 2024 at 5:12P -0500,
Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
> > While I'm sure this legacy application would love to not have to
> > change its code at all, I think we can all agree that we need to just
> > focus on how best to advise applications that have mixed workloads
> > accomplish efficient mmap+read of both sequential and random.
> >
> > To that end, I heard Dave clearly suggest 2 things:
> >
> > 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
> > bdi->io_pages, not bdi->ra_pages * 2
> >
> > 2) Have the application first issue MADV_SEQUENTIAL to convey that for
> > the following MADV_WILLNEED is for sequential file load (so it is
> > desirable to use larger ra_pages)
> >
> > This overrides the default of bdi->ra_pages and _should_ provide the
> > required per-file duality of control for readahead, correct?
>
> I just discovered MADV_POPULATE_READ - see my reply to Ming
> up-thread about that. The application should use that instead of
> MADV_WILLNEED because it gives cache population guarantees that
> WILLNEED doesn't. Then we can look at optimising the performance of
> MADV_POPULATE_READ (if needed) as there is constrained scope we can
> optimise within in ways that we cannot do with WILLNEED.
Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
Hildenbrand just so he's in the loop.
FYI, I proactively raised feedback and questions to the reporter of
this issue:
CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
sequential vs random, just the range that may be accessed.
Q1: Is your application's sequential vs random (or smaller sequential)
access split on a per-file basis? Or is the same file accessed both
sequentially and randomly?
A1: The same files can be accessed either randomly or sequentially,
depending on certain access patterns and optimizing logic.
Q2: Can the application be changed to use madvise() MADV_SEQUENTIAL
and MADV_RANDOM to indicate its access pattern?
A2: No, the application is a Java application. Java does not expose
the madvise API directly. Our application uses the Java NIO API via
MappedByteBuffer.load()
(https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--)
which calls madvise(MADV_WILLNEED) at the low level. There is no way for us
to switch this behavior, but we take advantage of this behavior to
optimize large file sequential I/O with great success.
So it's looking like it'll be hard to help this reporter avoid
changes... but that's not upstream's problem!
Mike
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 22:07 ` Dave Chinner
@ 2024-01-30 3:13 ` Ming Lei
2024-01-30 5:29 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2024-01-30 3:13 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block, ming.lei
On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > Follows the current report:
> > > >
> > > > 1) userspace calls madvise(willneed, 1G)
> > > >
> > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is
> > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > >
> > > > 3) the other part (2MB ~ 1GB) is read ahead in units of bdi->ra_pages, which
> > > > is set to 64KB by userspace, when userspace reads the mmapped buffer; the
> > > > whole application then becomes slower.
> > >
> > > It gets limited by file->f_ra->ra_pages being initialised to
> > > bdi->ra_pages and then never changed as the advice for access
> > > methods to the file are changed.
> > >
> > > But the problem here is *not the readahead code*. The problem is
> > > that the user has configured the device readahead window to be far
> > > smaller than is optimal for the storage. Hence readahead is slow.
> > > The fix for that is to either increase the device readahead windows,
> > > or to change the specific readahead window for the file that has
> > > sequential access patterns.
> > >
> > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > larger IOs for that access.
> > >
> > > That's what should happen here - MADV_WILLNEED does not imply a
> > > specific access pattern so the application should be running
> > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > to start the readahead, and then the rest of the on-demand readahead
> > > will get the higher readahead limits.
> > >
> > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > >
> > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > mem.load() implementation in the application converted to use
> > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > readahead algorithm.
> >
> > Here the single .ra_pages may not work, that is why this patch stores
> > the willneed range in maple tree, please see the following words from
> > the original RH report:
>
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
>
> Yes, I've read the bug. There's no triage that tells us what the
> root cause of the application performance issue might be. Just an
> assertion that "this is how we did it 10 years ago, it's been
> unchanged for all this time, the new kernel we are upgrading
> to needs to behave exactly like pre-3.10 era kernels did."
>
> And to be totally honest, my instincts tell me this is more likely a
> problem with a root cause in poor IO scheduling decisions than be a
> problem with the page cache readahead implementation. Readahead has
> been turned down to stop the bandwidth it uses via background async
> read IO from starving latency dependent foreground random IO
> operation, and then we're being asked to turn readahead back up
> in specific situations because it's actually needed for performance
> in certain access patterns. This is the sort of thing bfq is
> intended to solve.
Reading the mmapped buffer in userspace is sync IO, and a page fault just
reads ahead 64KB. I don't understand how the block IO scheduler makes a
difference to this single 64KB readahead in case of a cache miss.
>
>
> > Also almost all these advises(SEQUENTIAL, WILLNEED, NORMAL, RANDOM)
> > ignore the passed range, and the behavior becomes all or nothing,
> > instead of something only for the specified range, which may not
> > match with man, please see 'man posix_fadvise':
>
> The man page says:
>
> The advice is not binding; it merely constitutes an
> expectation on behalf of the application.
>
> > It is even worse for readahead() syscall:
> >
> > ``` DESCRIPTION readahead() initiates readahead on a file
> > so that subsequent reads from that file will be satisfied
> > from the cache, and not block on disk I/O (assuming the
> > readahead was initiated early enough and that other activity
> > on the system did not in the meantime flush pages from the
> > cache). ```
>
> Yes, that's been "broken" for a long time (since the changes to cap
> force_page_cache_readahead() to ra_pages way back when), but the
> assumption documented about when readahead(2) will work goes to the
> heart of why we don't let user controlled readahead actually do much
> in the way of direct readahead. i.e. too much readahead is
> typically harmful to IO and system performance and very, very few
> applications actually need files preloaded entirely into memory.
It is true for normal readahead, but I am not sure it is for
madvise(willneed) or readahead().
>
> ----
>
> All said, I'm starting to think that there isn't an immediate
> upstream kernel change needed right now. I just did a quick check
> through the madvise() man page to see if I'd missed anything, and I
> most definitely did miss what is a relatively new addition to it:
>
> MADV_POPULATE_READ (since Linux 5.14)
> "Populate (prefault) page tables readable, faulting in all
> pages in the range just as if manually reading from each page;
> however, avoid the actual memory access that would have been
> performed after handling the fault.
>
> In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
> errors, can be applied to (parts of) existing mappings and will
> always populate (prefault) page tables readable. One
> example use case is prefaulting a file mapping, reading all
> file content from disk; however, pages won't be dirtied and
> consequently won't have to be written back to disk when
> evicting the pages from memory.
>
> That's exactly what the application is apparently wanting
> MADV_WILLNEED to do.
Indeed, it works as expected (all mmapped pages are loaded by
madvise(MADV_POPULATE_READ)) in my test code, except for the 16-page
ra_pages case, but it is less important now.
Thanks for this idea!
>
> Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> has some relevant commentary on why MADV_WILLNEED could not be
> modified to meet the pre-population requirements of the applications
> that required this pre-population behaviour from the kernel.
>
> With this, I suspect that the application needs to be updated to
> use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> back and do some analysis of the readahead behaviour of the
> application and the MADV_POPULATE_READ operation. We may need to
> tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> because it's no longer "optimistic speculation" about whether the
> data is needed in cache - the operation being performed guarantees
> that or it fails with an error. IOWs, MADV_POPULATE_READ is
> effectively user data IO at this point, not advice about future
> access patterns...
BTW, in this report, MADV_WILLNEED is issued via a Java library[1], and I
guess it could be difficult to update it to use MADV_POPULATE_READ.
[1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html
Thanks,
Ming
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-30 3:13 ` Ming Lei
@ 2024-01-30 5:29 ` Dave Chinner
2024-01-30 11:34 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-01-30 5:29 UTC (permalink / raw)
To: Ming Lei
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block
On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > > Follows the current report:
> > > > >
> > > > > 1) userspace calls madvise(willneed, 1G)
> > > > >
> > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is
> > > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > >
> > > > > 3) the other part (2MB ~ 1GB) is read ahead in units of bdi->ra_pages, which
> > > > > is set to 64KB by userspace, when userspace reads the mmapped buffer; the
> > > > > whole application then becomes slower.
> > > >
> > > > It gets limited by file->f_ra->ra_pages being initialised to
> > > > bdi->ra_pages and then never changed as the advice for access
> > > > methods to the file are changed.
> > > >
> > > > But the problem here is *not the readahead code*. The problem is
> > > > that the user has configured the device readahead window to be far
> > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > The fix for that is to either increase the device readahead windows,
> > > > or to change the specific readahead window for the file that has
> > > > sequential access patterns.
> > > >
> > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > larger IOs for that access.
> > > >
> > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > specific access pattern so the application should be running
> > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > to start the readahead, and then the rest of the on-demand readahead
> > > > will get the higher readahead limits.
> > > >
> > > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > >
> > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > mem.load() implementation in the application converted to use
> > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > readahead algorithm.
> > >
> > > Here the single .ra_pages may not work, that is why this patch stores
> > > the willneed range in maple tree, please see the following words from
> > > the original RH report:
> >
> > > "
> > > Increasing read ahead is not an option as it has a mixed I/O workload of
> > > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > > to the random I/O and is unacceptable.
> > > "
> >
> > Yes, I've read the bug. There's no triage that tells us what the
> > root cause of the application performance issue might be. Just an
> > assertion that "this is how we did it 10 years ago, it's been
> > unchanged for all this time, the new kernel we are upgrading
> > to needs to behave exactly like pre-3.10 era kernels did."
> >
> > And to be totally honest, my instincts tell me this is more likely a
> > problem with a root cause in poor IO scheduling decisions than be a
> > problem with the page cache readahead implementation. Readahead has
> > been turned down to stop the bandwidth it uses via background async
> > read IO from starving latency dependent foreground random IO
> > operation, and then we're being asked to turn readahead back up
> > in specific situations because it's actually needed for performance
> > in certain access patterns. This is the sort of thing bfq is
> > intended to solve.
>
> Reading mmaped buffer in userspace is sync IO, and page fault just
> readahead 64KB. I don't understand how block IO scheduler makes a
> difference in this single 64KB readahead in case of cache miss.
I think you've misunderstood what I said. I was referring to the
original customer problem of "too much readahead IO causes problems
for latency sensitive IO" that led to the customer setting
64kB readahead device limits in the first place.
That is, if reducing readahead for sequential IO suddenly makes
synchronous random IO perform a whole lot better and the application
goes faster, then it indicates the problem is IO dispatch
prioritisation, not that there is too much readahead. Deprioritising
readahead will reduce its impact on other IO, without having to
reduce the readahead windows that provide decent sequential IO
performance...
I really think the customer needs to retune their application from
first principles. Start with the defaults, measure where things are
slow, address the worst issue by twiddling knobs. Repeat until
performance is either good enough or they hit on actual problems
that need code changes.
> > > It is even worse for readahead() syscall:
> > >
> > > ``` DESCRIPTION readahead() initiates readahead on a file
> > > so that subsequent reads from that file will be satisfied
> > > from the cache, and not block on disk I/O (assuming the
> > > readahead was initiated early enough and that other activity
> > > on the system did not in the meantime flush pages from the
> > > cache). ```
> >
> > Yes, that's been "broken" for a long time (since the changes to cap
> > force_page_cache_readahead() to ra_pages way back when), but the
> > assumption documented about when readahead(2) will work goes to the
> > heart of why we don't let user controlled readahead actually do much
> > in the way of direct readahead. i.e. too much readahead is
> > typically harmful to IO and system performance and very, very few
> > applications actually need files preloaded entirely into memory.
>
> It is true for normal readahead, but not sure if it is for
> advise(willneed) or readahead().
If we allowed unbounded readahead via WILLNEED or readahead(2), then
a user could DoS the storage and/or the memory allocation subsystem
very easily.
In a previous attempt to revert the current WILLNEED readahead
bounding behaviour changes, Linus said this:
"It's just that historically we've had
some issues with people over-doing readahead (because it often helps
some made-up microbenchmark), and then we end up with latency issues
when somebody does a multi-gigabyte readahead... Iirc, we had exactly
that problem with the readahead() system call at some point (long
ago)."
https://lore.kernel.org/linux-mm/CA+55aFy8kOomnL-C5GwSpHTn+g5R7dY78C9=h-J_Rb_u=iASpg@mail.gmail.com/
Elsewhere in a different thread for a different patchset to try to
revert this readahead behaviour, Linus ranted about how it allowed
unbound, unkillable user controlled readahead for 64-bit data
lengths.
Fundamentally, readahead is not functionality we want to expose
directly to user control. MADV_POPULATE_* is different in that it
isn't actually readahead - it works more like normal sequential user
page fault access. It is interruptable, it can fail due to ENOMEM or
OOM-kill, it can fail on IO errors, etc. IOWs, the MADV_POPULATE
functions are what the application should be using, not trying to
hack WILLNEED to do stuff that MADV_POPULATE* already does in a
better way...
> > Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> > introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> > has some relevant commentary on why MADV_WILLNEED could not be
> > modified to meet the pre-population requirements of the applications
> > that required this pre-population behaviour from the kernel.
> >
> > With this, I suspect that the application needs to be updated to
> > use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> > back and do some analysis of the readahead behaviour of the
> > application and the MADV_POPULATE_READ operation. We may need to
> > tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> > because it's no longer "optimistic speculation" about whether the
> > data is needed in cache - the operation being performed guarantees
> > that or it fails with an error. IOWs, MADV_POPULATE_READ is
> > effectively user data IO at this point, not advice about future
> > access patterns...
>
> BTW, in this report, MADV_WILLNEED is used by java library[1], and I
> guess it could be difficult to update to MADV_POPULATE_READ.
Yes, but that's not an upstream kernel code development problem.
That's a problem for the people paying $$$$$ to their software
vendor to sort out.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-29 22:46 ` Mike Snitzer
@ 2024-01-30 10:43 ` David Hildenbrand
0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-01-30 10:43 UTC (permalink / raw)
To: Mike Snitzer, Dave Chinner
Cc: Ming Lei, Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
linux-kernel, Don Dutile, Raghavendra K T, Alexander Viro,
Christian Brauner, linux-block
On 29.01.24 23:46, Mike Snitzer wrote:
> On Mon, Jan 29 2024 at 5:12P -0500,
> Dave Chinner <david@fromorbit.com> wrote:
>
>> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
>>> While I'm sure this legacy application would love to not have to
>>> change its code at all, I think we can all agree that we need to just
>>> focus on how best to advise applications that have mixed workloads
>>> accomplish efficient mmap+read of both sequential and random.
>>>
>>> To that end, I heard Dave clearly suggest 2 things:
>>>
>>> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
>>> bdi->io_pages, not bdi->ra_pages * 2
>>>
>>> 2) Have the application first issue MADV_SEQUENTIAL to convey that for
>>> the following MADV_WILLNEED is for sequential file load (so it is
>>> desirable to use larger ra_pages)
>>>
>>> This overrides the default of bdi->ra_pages and _should_ provide the
>>> required per-file duality of control for readahead, correct?
>>
>> I just discovered MADV_POPULATE_READ - see my reply to Ming
>> up-thread about that. The application should use that instead of
>> MADV_WILLNEED because it gives cache population guarantees that
>> WILLNEED doesn't. Then we can look at optimising the performance of
>> MADV_POPULATE_READ (if needed) as there is constrained scope we can
>> optimise within in ways that we cannot do with WILLNEED.
>
> Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
> Hildenbrand just so he's in the loop.
Thanks for CCing me.
MADV_POPULATE_READ is indeed different; it doesn't give hints (not
"might be a good idea to read some pages" like MADV_WILLNEED documents),
it forces swapin/read/.../.
In a sense, MADV_POPULATE_READ is similar to simply reading one byte
through each PTE, triggering page faults, but without actually reading
from the target pages.
MADV_POPULATE_READ has a conceptual benefit: we know exactly how much
memory user space wants to have populated (which range). In contrast,
page faults contain no such hints and we have to guess based on
historical behavior. One could use that range information to *not* do
any faultaround/readahead when we come via MADV_POPULATE_READ, and
really only populate the range of interest.
Further, one can use that range information to allocate larger folios,
without having to guess where placement of a large folio is reasonable,
and which size we should use.
>
> FYI, I proactively raised feedback and questions to the reporter of
> this issue:
>
> CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
> sequential vs random, just the range that may be accessed.
Indeed. The "problem" with MADV_SEQUENTIAL/MADV_RANDOM is that they
fragment/split VMAs. So applying them to smaller chunks (like one would
do with MADV_WILLNEED) is likely not a good option.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
2024-01-30 5:29 ` Dave Chinner
@ 2024-01-30 11:34 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2024-01-30 11:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel,
linux-mm, linux-kernel, Don Dutile, Raghavendra K T,
Alexander Viro, Christian Brauner, linux-block, ming.lei
On Tue, Jan 30, 2024 at 04:29:35PM +1100, Dave Chinner wrote:
> On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> > On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > > > Follows the current report:
> > > > > >
> > > > > > 1) userspace calls madvise(willneed, 1G)
> > > > > >
> > > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is
> > > > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > > >
> > > > > > 3) the other part (2MB ~ 1GB) is read ahead in units of bdi->ra_pages, which
> > > > > > is set to 64KB by userspace, when userspace reads the mmapped buffer; the
> > > > > > whole application then becomes slower.
> > > > >
> > > > > It gets limited by file->f_ra->ra_pages being initialised to
> > > > > bdi->ra_pages and then never changed as the advice for access
> > > > > methods to the file are changed.
> > > > >
> > > > > But the problem here is *not the readahead code*. The problem is
> > > > > that the user has configured the device readahead window to be far
> > > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > > The fix for that is to either increase the device readahead windows,
> > > > > or to change the specific readahead window for the file that has
> > > > > sequential access patterns.
> > > > >
> > > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > > larger IOs for that access.
> > > > >
> > > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > > specific access pattern so the application should be running
> > > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > > to start the readahead, and then the rest of the on-demand readahead
> > > > > will get the higher readahead limits.
> > > > >
> > > > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > > >
> > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > > mem.load() implementation in the application converted to use
> > > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > > readahead algorithm.
> > > >
> > > > Here the single .ra_pages may not work; that is why this patch stores
> > > > the willneed range in a maple tree. Please see the following words from
> > > > the original RH report:
> > >
> > > > "
> > > > Increasing read ahead is not an option as it has a mixed I/O workload of
> > > > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > > > to the random I/O and is unacceptable.
> > > > "
> > >
> > > Yes, I've read the bug. There's no triage that tells us what the
> > > root cause of the application performance issue might be. Just an
> > > assertion that "this is how we did it 10 years ago, it's been
> > > unchanged for all this time, the new kernel we are upgrading
> > > to needs to behave exactly like pre-3.10 era kernels did."
> > >
> > > And to be totally honest, my instincts tell me this is more likely a
> > > problem with a root cause in poor IO scheduling decisions than be a
> > > problem with the page cache readahead implementation. Readahead has
> > > been turned down to stop the bandwidth it uses via background async
> > > read IO from starving latency dependent foreground random IO
> > > operation, and then we're being asked to turn readahead back up
> > > in specific situations because it's actually needed for performance
> > > in certain access patterns. This is the sort of thing bfq is
> > > intended to solve.
> >
> > Reading an mmapped buffer in userspace is sync IO, and a page fault only
> > reads ahead 64KB. I don't understand how the block IO scheduler makes a
> > difference to this single 64KB readahead in case of a cache miss.
>
> I think you've misunderstood what I said. I was referring to the
> original customer problem of "too much readahead IO causes problems
> for latency sensitive IO" issue that lead to the customer setting
> 64kB readahead device limits in the first place.
It looks like we are not on the same page; I never saw the words "latency
sensitive IO" in this report (RHEL-22476).
>
> That is, if reducing readahead for sequential IO suddenly makes
> synchronous random IO perform a whole lot better and the application
> goes faster, then it indicates the problem is IO dispatch
> prioritisation, not that there is too much readahead. Deprioritising
> readahead will reduce its impact on other IO, without having to
> reduce the readahead windows that provide decent sequential IO
> performance...
>
> I really think the customer needs to retune their application from
> first principles. Start with the defaults, measure where things are
> slow, address the worst issue by twiddling knobs. Repeat until
> performance is either good enough or they hit on actual problems
> that need code changes.
IO priority is set at the blkcg/process level; we don't even know whether the
random IO and the sequential IO are submitted from different processes.
Also, IO priority only applies when IOs with different priorities are
submitted concurrently.
The main input from the report is that iostat shows the read request size
reduced to 64K from 1MB, which isn't something IO priority can deal with.
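For reference, the 64KB limit in the report is the kind of setting produced by
tuning like the following (the device name /dev/sda is just a placeholder, and
the iostat column name depends on the sysstat version):

```shell
# Per-device readahead window, in KiB (what the customer set to 64):
cat /sys/block/sda/queue/read_ahead_kb

# The same knob via blockdev, in 512-byte sectors (64 KiB == 128 sectors):
blockdev --getra /dev/sda
blockdev --setra 128 /dev/sda

# Watch the average read request size while the workload runs; the report
# shows it dropping from ~1MB to 64K (areq-sz, or avgrq-sz on older sysstat):
iostat -x 1
```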
From my understanding, the problem here is that madvise(MADV_RANDOM,
MADV_SEQUENTIAL, MADV_WILLNEED) is basically applied at the file level instead
of the range level, even though a range is passed in to these syscalls.
So sequential and random advice are effectively exclusive over the whole file.
That is why the customer doesn't want to set a bigger ra_pages: they are
afraid (or have tested) that a bigger ra_pages hurts the performance of the
random IO workload because unnecessary data may be read ahead. The readahead
algorithm is supposed to be capable of dealing with that, though maybe still
not well enough.
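As a rough illustration of the access pattern in the report, here is a minimal
userspace sketch (file size and path are hypothetical stand-ins; the real
workload maps a ~1G file):

```python
import mmap
import os
import tempfile

# Stand-in for the customer's data file (the real one is ~1G; 1 MiB here).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    # Step 1 of the report: advise the kernel the whole range will be needed.
    if hasattr(mmap, "MADV_WILLNEED"):  # Linux, Python >= 3.8
        mm.madvise(mmap.MADV_WILLNEED)
    # Steps 2/3: the application then touches the pages; on the affected
    # kernels only the first bdi->io_pages chunk is read ahead by the
    # madvise, and the remaining faults fall back to the 64KB ra_pages.
    total = sum(mm[i] for i in range(0, size, 4096))
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```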
But yes, more details are needed to understand the issue further.
thanks,
Ming
Thread overview: 13+ messages
[not found] <20240128142522.1524741-1-ming.lei@redhat.com>
[not found] ` <ZbbPCQZdazF7s0_b@casper.infradead.org>
[not found] ` <ZbbfXVg9FpWRUVDn@redhat.com>
[not found] ` <ZbbvfFxcVgkwbhFv@casper.infradead.org>
[not found] ` <CAH6w=aw_46Ker0w8HmSA41vUUDKGDGC3gxBFWAhd326+kEtrNg@mail.gmail.com>
[not found] ` <ZbcDvTkeDKttPfJ4@dread.disaster.area>
2024-01-29 3:57 ` [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range Ming Lei
2024-01-29 5:15 ` Dave Chinner
2024-01-29 8:25 ` Ming Lei
2024-01-29 13:26 ` Matthew Wilcox
2024-01-29 22:07 ` Dave Chinner
2024-01-30 3:13 ` Ming Lei
2024-01-30 5:29 ` Dave Chinner
2024-01-30 11:34 ` Ming Lei
2024-01-29 17:19 ` Mike Snitzer
2024-01-29 17:42 ` Mike Snitzer
2024-01-29 22:12 ` Dave Chinner
2024-01-29 22:46 ` Mike Snitzer
2024-01-30 10:43 ` David Hildenbrand