[PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
@ 2016-07-25 14:39 Kyle Walker
  2016-07-25 20:47 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Kyle Walker @ 2016-07-25 14:39 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kyle Walker, Andrew Morton, Michal Hocko,
	Geliang Tang, Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov

Java workloads using the MappedByteBuffer library result in the fadvise()
and madvise() syscalls being used extensively. Following recent readahead
limiting alterations, such as 600e19af ("mm: use only per-device readahead
limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages"), application performance
suffers in instances where small readahead is configured.

By moving this limit outside of the syscall codepaths, the syscalls are
able to advise an inordinately large amount of readahead when desired.
With a cap being imposed based on the half of NR_INACTIVE_FILE and
NR_FREE_PAGES. In essence, allowing performance tuning efforts to define a
small readahead limit, but then benefiting from large sequential readahead
values selectively.

Signed-off-by: Kyle Walker <kwalker@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/readahead.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 65ec288..6f8bb44 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
+	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
+				     (global_page_state(NR_FREE_PAGES)) / 2));
+
 	while (nr_to_read) {
 		int err;
 
@@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 
 	/* be dumb */
 	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
 		force_page_cache_readahead(mapping, filp, offset, req_size);
 		return;
 	}
-- 
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
  2016-07-25 14:39 [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls Kyle Walker
@ 2016-07-25 20:47 ` Andrew Morton
  2016-07-26  9:31   ` Michal Hocko
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Andrew Morton @ 2016-07-25 20:47 UTC (permalink / raw)
  To: Kyle Walker
  Cc: linux-mm, linux-kernel, Michal Hocko, Geliang Tang,
	Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov,
	Linus Torvalds

On Mon, 25 Jul 2016 10:39:25 -0400 Kyle Walker <kwalker@redhat.com> wrote:

> Java workloads using the MappedByteBuffer library result in the fadvise()
> and madvise() syscalls being used extensively. Following recent readahead
> limiting alterations, such as 600e19af ("mm: use only per-device readahead
> limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
> memoryless NUMA nodes and limit readahead pages"), application performance
> suffers in instances where small readahead is configured.

Can this suffering be quantified please?

> By moving this limit outside of the syscall codepaths, the syscalls are
> able to advise an inordinately large amount of readahead when desired.
> With a cap being imposed based on the half of NR_INACTIVE_FILE and
> NR_FREE_PAGES. In essence, allowing performance tuning efforts to define a
> small readahead limit, but then benefiting from large sequential readahead
> values selectively.
> 
> ...
>
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
>  		return -EINVAL;
>  
> -	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
> +	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
> +				     (global_page_state(NR_FREE_PAGES)) / 2));
> +
>  	while (nr_to_read) {
>  		int err;
>  
> @@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
>  
>  	/* be dumb */
>  	if (filp && (filp->f_mode & FMODE_RANDOM)) {
> +		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
>  		force_page_cache_readahead(mapping, filp, offset, req_size);
>  		return;
>  	}

Linus probably has opinions ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
  2016-07-25 20:47 ` Andrew Morton
@ 2016-07-26  9:31   ` Michal Hocko
  2016-07-26 19:23   ` Kyle Walker
  2016-08-03 15:24   ` Rafael Aquini
  2 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2016-07-26  9:31 UTC (permalink / raw)
  To: Kyle Walker
  Cc: Andrew Morton, linux-mm, linux-kernel, Geliang Tang,
	Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov,
	Linus Torvalds

On Mon 25-07-16 13:47:32, Andrew Morton wrote:
> On Mon, 25 Jul 2016 10:39:25 -0400 Kyle Walker <kwalker@redhat.com> wrote:
> 
> > Java workloads using the MappedByteBuffer library result in the fadvise()
> > and madvise() syscalls being used extensively. Following recent readahead
> > limiting alterations, such as 600e19af ("mm: use only per-device readahead
> > limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead pages"), application performance
> > suffers in instances where small readahead is configured.
> 
> Can this suffering be quantified please?
> 
> > By moving this limit outside of the syscall codepaths, the syscalls are
> > able to advise an inordinately large amount of readahead when desired.
> > With a cap being imposed based on the half of NR_INACTIVE_FILE and
> > NR_FREE_PAGES. In essence, allowing performance tuning efforts to define a
> > small readahead limit, but then benefiting from large sequential readahead
> > values selectively.
> > 
> > ...
> >
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >  	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
> >  		return -EINVAL;
> >  
> > -	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
> > +	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
> > +				     (global_page_state(NR_FREE_PAGES)) / 2));
> > +
> >  	while (nr_to_read) {
> >  		int err;
> >  
> > @@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
> >  
> >  	/* be dumb */
> >  	if (filp && (filp->f_mode & FMODE_RANDOM)) {
> > +		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
> >  		force_page_cache_readahead(mapping, filp, offset, req_size);
> >  		return;
> >  	}
> 
> Linus probably has opinions ;)

Just for the reference a similar patch has been discussed already [1] or
from a different angle [2]

[1] http://lkml.kernel.org/r/1440087598-27185-1-git-send-email-klamm@yandex-team.ru
[2] http://lkml.kernel.org/r/1456277927-12044-1-git-send-email-hannes@cmpxchg.org
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
  2016-07-25 20:47 ` Andrew Morton
  2016-07-26  9:31   ` Michal Hocko
@ 2016-07-26 19:23   ` Kyle Walker
  2016-08-03 15:24   ` Rafael Aquini
  2 siblings, 0 replies; 6+ messages in thread
From: Kyle Walker @ 2016-07-26 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Geliang Tang,
	Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov,
	Linus Torvalds

On Mon, Jul 25, 2016 at 4:47 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Can this suffering be quantified please?
>

The observed suffering is primarily visible within an IBM Qradar
installation. From a high level, the lower limit to the amount of advisory
readahead pages results in a 3-5x increase in time necessary to complete
an identical query within the application.

Note, all of the below values are with Readahead configured to 64Kib.

Baseline behaviour - Prior to:
 600e19af ("mm: use only per-device readahead limit")
 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
           nodes and limit readahead pages")

Result:
 Qradar - Command: "username equals root" - 57.3s to complete search

New performance - With:
 600e19af ("mm: use only per-device readahead limit")
 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
           nodes and limit readahead pages")

Result:
 Qradar - "username equals root" query - 245.7s to complete search

Proposed behaviour - With the proposed patch in place.

Result:
 Qradar - "username equals root" query - 57s to complete search

In narrowing the source of the performance deficit, it was observed that
the amount of data loaded into pagecache via madvise was quite a bit lower
following the noted commits. As simply reverting those lower limits were
not accepted previously, the proposed alternative strategy seemed like the
most beneficial path forwards.

>
> Linus probably has opinions ;)
>

I understand that changes to readahead that are very similar have been
proposed quite a bit lately. If there are any changes or testing needed,
I'm more than happy to tackle that.

Thank you in advance!
-- 
Kyle Walker

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
  2016-07-25 20:47 ` Andrew Morton
  2016-07-26  9:31   ` Michal Hocko
  2016-07-26 19:23   ` Kyle Walker
@ 2016-08-03 15:24   ` Rafael Aquini
  2016-08-25 14:59     ` Kyle Walker
  2 siblings, 1 reply; 6+ messages in thread
From: Rafael Aquini @ 2016-08-03 15:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kyle Walker, linux-mm, linux-kernel, Michal Hocko, Geliang Tang,
	Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov,
	Linus Torvalds

On Mon, Jul 25, 2016 at 01:47:32PM -0700, Andrew Morton wrote:
> On Mon, 25 Jul 2016 10:39:25 -0400 Kyle Walker <kwalker@redhat.com> wrote:
> 
> > Java workloads using the MappedByteBuffer library result in the fadvise()
> > and madvise() syscalls being used extensively. Following recent readahead
> > limiting alterations, such as 600e19af ("mm: use only per-device readahead
> > limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead pages"), application performance
> > suffers in instances where small readahead is configured.
> 
> Can this suffering be quantified please?
> 
> > By moving this limit outside of the syscall codepaths, the syscalls are
> > able to advise an inordinately large amount of readahead when desired.
> > With a cap being imposed based on the half of NR_INACTIVE_FILE and
> > NR_FREE_PAGES. In essence, allowing performance tuning efforts to define a
> > small readahead limit, but then benefiting from large sequential readahead
> > values selectively.
> > 
> > ...
> >
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >  	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
> >  		return -EINVAL;
> >  
> > -	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
> > +	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
> > +				     (global_page_state(NR_FREE_PAGES)) / 2));
> > +
> >  	while (nr_to_read) {
> >  		int err;
> >  
> > @@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
> >  
> >  	/* be dumb */
> >  	if (filp && (filp->f_mode & FMODE_RANDOM)) {
> > +		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
> >  		force_page_cache_readahead(mapping, filp, offset, req_size);
> >  		return;
> >  	}
> 
> Linus probably has opinions ;)
>

IIRC one of the issues Linus had with previous attempts was because 
they were utilizing/bringing back a node-memory state based heuristic. 

Since Kyle patch is using a global state counter for that matter,
I think that issue condition might now be sorted out.

-- Rafael

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
  2016-08-03 15:24   ` Rafael Aquini
@ 2016-08-25 14:59     ` Kyle Walker
  0 siblings, 0 replies; 6+ messages in thread
From: Kyle Walker @ 2016-08-25 14:59 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Andrew Morton, linux-mm, lkml, Michal Hocko, Geliang Tang,
	Vlastimil Babka, Roman Gushchin, Kirill A. Shutemov,
	Linus Torvalds

On Wed, Aug 3, 2016 at 11:24 AM, Rafael Aquini <aquini@redhat.com> wrote:
> IIRC one of the issues Linus had with previous attempts was because
> they were utilizing/bringing back a node-memory state based heuristic.
>
> Since Kyle patch is using a global state counter for that matter,
> I think that issue condition might now be sorted out.

It's been a few weeks since the last feedback. Are there any further
questions or concerns I can help out with?

--
Kyle Walker

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-08-25 14:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-25 14:39 [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls Kyle Walker
2016-07-25 20:47 ` Andrew Morton
2016-07-26  9:31   ` Michal Hocko
2016-07-26 19:23   ` Kyle Walker
2016-08-03 15:24   ` Rafael Aquini
2016-08-25 14:59     ` Kyle Walker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).