* [RFC] mm: bail out in shrin_inactive_list
@ 2016-07-25 7:51 Minchan Kim
2016-07-25 9:29 ` Mel Gorman
2016-07-29 14:11 ` Johannes Weiner
0 siblings, 2 replies; 7+ messages in thread
From: Minchan Kim @ 2016-07-25 7:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Johannes Weiner, linux-mm, linux-kernel, Minchan Kim
With node-lru, if there are enough reclaimable pages in highmem
but nothing in lowmem, VM can try to shrink inactive list although
the requested zone is lowmem.
The problem is direct reclaimer scans inactive list is fulled with
highmem pages to find a victim page at a reqested zone or lower zones
but the result is that VM should skip all of pages. It just burns out
CPU. Even, many direct reclaimers are stalled by too_many_isolated
if lots of parallel reclaimer are going on although there are no
reclaimable memory in inactive list.
I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine
to get elapsed time.
hackbench 500 process 2
= Old =
1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
= Now =
1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
I believe proper fix is to modify get_scan_count. IOW, I think
we should introduce lruvec_reclaimable_lru_size with proper
classzone_idx but I don't know how we can fix it with memcg
which doesn't have zone stat now. should introduce zone stat
back to memcg? Or, it's okay to ignore memcg?
mm/vmscan.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5af357..3d285cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1652,6 +1652,31 @@ static int current_may_throttle(void)
bdi_write_congested(current->backing_dev_info);
}
+static inline bool inactive_reclaimable_pages(struct lruvec *lruvec,
+ struct scan_control *sc,
+ enum lru_list lru)
+{
+ int zid;
+ struct zone *zone;
+ bool file = is_file_lru(lru);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+ if (!global_reclaim(sc))
+ return true;
+
+ for (zid = sc->reclaim_idx; zid >= 0; zid--) {
+ zone = &pgdat->node_zones[zid];
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE +
+ LRU_FILE * file) >= SWAP_CLUSTER_MAX)
+ return true;
+ }
+
+ return false;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -1674,6 +1699,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ if (!inactive_reclaimable_pages(lruvec, sc, lru))
+ return 0;
+
while (unlikely(too_many_isolated(pgdat, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
--
1.9.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-25 7:51 [RFC] mm: bail out in shrin_inactive_list Minchan Kim
@ 2016-07-25 9:29 ` Mel Gorman
2016-07-26 1:21 ` Minchan Kim
2016-07-29 14:11 ` Johannes Weiner
1 sibling, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2016-07-25 9:29 UTC (permalink / raw)
To: Minchan Kim; +Cc: Andrew Morton, Johannes Weiner, linux-mm, linux-kernel
There is a typo in the subject line.
On Mon, Jul 25, 2016 at 04:51:59PM +0900, Minchan Kim wrote:
> With node-lru, if there are enough reclaimable pages in highmem
> but nothing in lowmem, VM can try to shrink inactive list although
> the requested zone is lowmem.
>
> The problem is direct reclaimer scans inactive list is fulled with
> highmem pages to find a victim page at a reqested zone or lower zones
> but the result is that VM should skip all of pages.
Rephrase -- The problem is that if the inactive list is full of highmem
pages then a direct reclaimer searching for a lowmem page wastes CPU
scanning uselessly.
> CPU. Even, many direct reclaimers are stalled by too_many_isolated
> if lots of parallel reclaimer are going on although there are no
> reclaimable memory in inactive list.
>
> I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine
> to get elapsed time.
>
> hackbench 500 process 2
>
> = Old =
>
> 1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
>
> = Now =
>
> 1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> I believe proper fix is to modify get_scan_count. IOW, I think
> we should introduce lruvec_reclaimable_lru_size with proper
> classzone_idx but I don't know how we can fix it with memcg
> which doesn't have zone stat now. should introduce zone stat
> back to memcg? Or, it's okay to ignore memcg?
>
I think it's ok to ignore memcg in this case as a memcg shrink is often
going to be for pages that can use highmem anyway.
> mm/vmscan.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e5af357..3d285cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1652,6 +1652,31 @@ static int current_may_throttle(void)
> bdi_write_congested(current->backing_dev_info);
> }
>
> +static inline bool inactive_reclaimable_pages(struct lruvec *lruvec,
> + struct scan_control *sc,
> + enum lru_list lru)
inline is unnecessary. The function is long but only has one caller so
it'll be inlined automatically.
> +{
> + int zid;
> + struct zone *zone;
> + bool file = is_file_lru(lru);
It's more appropriate to use int for file in this case as it's used as a
multiplier. It'll work either way.
Otherwise;
Acked-by: Mel Gorman <mgorman@techsingularity.net>
--
Mel Gorman
SUSE Labs
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-25 9:29 ` Mel Gorman
@ 2016-07-26 1:21 ` Minchan Kim
2016-07-26 7:46 ` Mel Gorman
0 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2016-07-26 1:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Johannes Weiner, linux-mm, linux-kernel,
Michal Hocko, Vladimir Davydov
On Mon, Jul 25, 2016 at 10:29:09AM +0100, Mel Gorman wrote:
> There is a typo in the subject line.
>
> On Mon, Jul 25, 2016 at 04:51:59PM +0900, Minchan Kim wrote:
> > With node-lru, if there are enough reclaimable pages in highmem
> > but nothing in lowmem, VM can try to shrink inactive list although
> > the requested zone is lowmem.
> >
> > The problem is direct reclaimer scans inactive list is fulled with
>
>
> > highmem pages to find a victim page at a reqested zone or lower zones
> > but the result is that VM should skip all of pages.
>
> Rephrase -- The problem is that if the inactive list is full of highmem
> pages then a direct reclaimer searching for a lowmem page wastes CPU
> scanning uselessly.
It's better. Thanks.
>
> > CPU. Even, many direct reclaimers are stalled by too_many_isolated
> > if lots of parallel reclaimer are going on although there are no
> > reclaimable memory in inactive list.
> >
> > I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine
> > to get elapsed time.
> >
> > hackbench 500 process 2
> >
> > = Old =
> >
> > 1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
> >
> > = Now =
> >
> > 1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > I believe proper fix is to modify get_scan_count. IOW, I think
> > we should introduce lruvec_reclaimable_lru_size with proper
> > classzone_idx but I don't know how we can fix it with memcg
> > which doesn't have zone stat now. should introduce zone stat
> > back to memcg? Or, it's okay to ignore memcg?
> >
>
> I think it's ok to ignore memcg in this case as a memcg shrink is often
> going to be for pages that can use highmem anyway.
So, you mean it's okay to ignore kmemcg case?
If memcg guys agree it, I want to make get_scan_count consider
reclaimable lru size under the reclaim constraint, instead.
>
> > mm/vmscan.c | 28 ++++++++++++++++++++++++++++
> > 1 file changed, 28 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e5af357..3d285cc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1652,6 +1652,31 @@ static int current_may_throttle(void)
> > bdi_write_congested(current->backing_dev_info);
> > }
> >
> > +static inline bool inactive_reclaimable_pages(struct lruvec *lruvec,
> > + struct scan_control *sc,
> > + enum lru_list lru)
>
> inline is unnecessary. The function is long but only has one caller so
> it'll be inlined automatically.
>
> > +{
> > + int zid;
> > + struct zone *zone;
> > + bool file = is_file_lru(lru);
>
> It's more appropriate to use int for file in this case as it's used as a
> multiplier. It'll work either way.
>
> Otherwise;
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
>
> --
> Mel Gorman
> SUSE Labs
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-26 1:21 ` Minchan Kim
@ 2016-07-26 7:46 ` Mel Gorman
2016-07-26 8:27 ` Minchan Kim
0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2016-07-26 7:46 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Johannes Weiner, linux-mm, linux-kernel,
Michal Hocko, Vladimir Davydov
On Tue, Jul 26, 2016 at 10:21:57AM +0900, Minchan Kim wrote:
> > > I believe proper fix is to modify get_scan_count. IOW, I think
> > > we should introduce lruvec_reclaimable_lru_size with proper
> > > classzone_idx but I don't know how we can fix it with memcg
> > > which doesn't have zone stat now. should introduce zone stat
> > > back to memcg? Or, it's okay to ignore memcg?
> > >
> >
> > I think it's ok to ignore memcg in this case as a memcg shrink is often
> > going to be for pages that can use highmem anyway.
>
> So, you mean it's okay to ignore kmemcg case?
> If memcg guys agree it, I want to make get_scan_count consider
> reclaimable lru size under the reclaim constraint, instead.
>
For now, I believe so. My understanding is that the primary use case
for kmemcg is systems running large numbers of containers. I consider
it extremely unlikely that large 32-bit systems are being used for large
numbers of containers and require kmemcg.
--
Mel Gorman
SUSE Labs
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-26 7:46 ` Mel Gorman
@ 2016-07-26 8:27 ` Minchan Kim
0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2016-07-26 8:27 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Johannes Weiner, linux-mm, linux-kernel,
Michal Hocko, Vladimir Davydov
On Tue, Jul 26, 2016 at 08:46:50AM +0100, Mel Gorman wrote:
> On Tue, Jul 26, 2016 at 10:21:57AM +0900, Minchan Kim wrote:
> > > > I believe proper fix is to modify get_scan_count. IOW, I think
> > > > we should introduce lruvec_reclaimable_lru_size with proper
> > > > classzone_idx but I don't know how we can fix it with memcg
> > > > which doesn't have zone stat now. should introduce zone stat
> > > > back to memcg? Or, it's okay to ignore memcg?
> > > >
> > >
> > > I think it's ok to ignore memcg in this case as a memcg shrink is often
> > > going to be for pages that can use highmem anyway.
> >
> > So, you mean it's okay to ignore kmemcg case?
> > If memcg guys agree it, I want to make get_scan_count consider
> > reclaimable lru size under the reclaim constraint, instead.
> >
>
> For now, I believe so. My understanding is that the primary use case
> for kmemcg is systems running large numbers of containers. I consider
> it extremely unlikely that large 32-bit systems are being used for large
> numbers of containers and require kmemcg.
Okay. Then how about this?
I didn't test it, but I guess it should work.
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-25 7:51 [RFC] mm: bail out in shrin_inactive_list Minchan Kim
2016-07-25 9:29 ` Mel Gorman
@ 2016-07-29 14:11 ` Johannes Weiner
2016-08-01 23:46 ` Minchan Kim
1 sibling, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2016-07-29 14:11 UTC (permalink / raw)
To: Minchan Kim; +Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel
On Mon, Jul 25, 2016 at 04:51:59PM +0900, Minchan Kim wrote:
> With node-lru, if there are enough reclaimable pages in highmem
> but nothing in lowmem, VM can try to shrink inactive list although
> the requested zone is lowmem.
>
> The problem is direct reclaimer scans inactive list is fulled with
> highmem pages to find a victim page at a reqested zone or lower zones
> but the result is that VM should skip all of pages. It just burns out
> CPU. Even, many direct reclaimers are stalled by too_many_isolated
> if lots of parallel reclaimer are going on although there are no
> reclaimable memory in inactive list.
>
> I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine
> to get elapsed time.
>
> hackbench 500 process 2
>
> = Old =
>
> 1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
>
> = Now =
>
> 1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> I believe proper fix is to modify get_scan_count. IOW, I think
> we should introduce lruvec_reclaimable_lru_size with proper
> classzone_idx but I don't know how we can fix it with memcg
> which doesn't have zone stat now. should introduce zone stat
> back to memcg? Or, it's okay to ignore memcg?
You can fully ignore memcg and kmemcg. They only care about the
balance sheet - page in, page out - never mind the type of page.
If you are allocating a slab object and there is no physical memory,
you'll wake kswapd or enter direct reclaim with the restricted zone
index. If you then try to charge the freshly allocated page or object
but hit the limit, kmem or otherwise, you'll enter memcg reclaim that
is not restricted and only cares about getting usage + pages < limit.
I agree that it might be better to put this logic in get_scan_count()
and set both nr[lru] as well as *lru_pages according to the pages that
are eligible for the given reclaim index.
if (global_reclaim(sc))
add zone stats from 0 to sc->reclaim_idx
else
use lruvec_lru_size()
It's a bit unfortunate that abstractions like the lruvec fall apart
when we have to reconstruct zones ad-hoc now, but I don't see any
obvious way around it...
* Re: [RFC] mm: bail out in shrin_inactive_list
2016-07-29 14:11 ` Johannes Weiner
@ 2016-08-01 23:46 ` Minchan Kim
0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2016-08-01 23:46 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel
On Fri, Jul 29, 2016 at 10:11:30AM -0400, Johannes Weiner wrote:
> On Mon, Jul 25, 2016 at 04:51:59PM +0900, Minchan Kim wrote:
> > With node-lru, if there are enough reclaimable pages in highmem
> > but nothing in lowmem, VM can try to shrink inactive list although
> > the requested zone is lowmem.
> >
> > The problem is direct reclaimer scans inactive list is fulled with
> > highmem pages to find a victim page at a reqested zone or lower zones
> > but the result is that VM should skip all of pages. It just burns out
> > CPU. Even, many direct reclaimers are stalled by too_many_isolated
> > if lots of parallel reclaimer are going on although there are no
> > reclaimable memory in inactive list.
> >
> > I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine
> > to get elapsed time.
> >
> > hackbench 500 process 2
> >
> > = Old =
> >
> > 1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
> >
> > = Now =
> >
> > 1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > I believe proper fix is to modify get_scan_count. IOW, I think
> > we should introduce lruvec_reclaimable_lru_size with proper
> > classzone_idx but I don't know how we can fix it with memcg
> > which doesn't have zone stat now. should introduce zone stat
> > back to memcg? Or, it's okay to ignore memcg?
>
> You can fully ignore memcg and kmemcg. They only care about the
> balance sheet - page in, page out - never mind the type of page.
>
> If you are allocating a slab object and there is no physical memory,
> you'll wake kswapd or enter direct reclaim with the restricted zone
> index. If you then try to charge the freshly allocated page or object
> but hit the limit, kmem or otherwise, you'll enter memcg reclaim that
> is not restricted and only cares about getting usage + pages < limit.
Thanks. Now I understand.
>
> I agree that it might be better to put this logic in get_scan_count()
> and set both nr[lru] as well as *lru_pages according to the pages that
> are eligible for the given reclaim index.
>
> if (global_reclaim(sc))
> add zone stats from 0 to sc->reclaim_idx
> else
> use lruvec_lru_size()
Yep, I already sent it.
http://lkml.kernel.org/r/1469604588-6051-2-git-send-email-minchan@kernel.org
Thanks for the review, Johannes!
>
> It's a bit unfortunate that abstractions like the lruvec fall apart
> when we have to reconstruct zones ad-hoc now, but I don't see any
> obvious way around it...